MarketBranch predictor
Company Profile

Branch predictor

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures.

Implementation
Static branch prediction Static prediction is the simplest branch prediction technique because it does not rely on information about the dynamic history of code executing. Instead, it predicts the outcome of a branch based solely on the branch instruction. Early implementations of SPARC and MIPS (two of the first commercial RISC architectures) used single-direction static branch prediction: they always predict that a conditional jump will not be taken, so they always fetch the next sequential instruction. Only when the branch or jump is evaluated and found to be taken, does the instruction pointer get set to a non-sequential address. Both CPUs fetch instructions in one cycle and evaluate branches in the decode stage. As a result, the branch target recurrence is two cycles long, and the machine always fetches the instruction immediately after any taken branch. Both architectures define branch delay slots in order to utilize these fetched instructions. A more-advanced form of static prediction assumes that backward branches will be taken and forward branches will not. (A backward branch is one that has a target address that is lower than its own address.) This rule increases prediction accuracy in loops, which can be constructed with a backward branch at the end that is mostly taken, or a forward branch at the beginning that is mostly not taken. With processors that use this prediction method, ordering the instructions can maximize branch-prediction accuracy. The RISC-V ISA recommends that software written for RISC-V harts (hardware threads), or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not. (Even when the processor incorporates a more-advanced predictor that diminishes the relative value of static prediction.) Some processors accept branch-prediction hints that override the static prediction. The Intel Pentium 4 and Pentium 4E accept branch-prediction hints as prefixes. In the presence of dynamic prediction, static prediction offers almost no benefit, and overriding the static prediction helps even less. Intel's subsequent Pentium M and Core2 processors ignore branch-prediction hint prefixes. No x86 manufacturer has re-introduced prediction hints. Dynamic branch predictors naturally fall back on static branch prediction when they have no cached information (such as the first time a given branch is encountered). Both the Motorola MPC7450 (G4e) and the Intel Pentium 4 fall back on static prediction. In static prediction, all decisions are made at compile time, before the execution of the program. Dynamic branch prediction Dynamic branch prediction The predictor table is indexed with the instruction address bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded. Two-level predictor The Two-Level Branch Predictor, also referred to as Correlation-Based Branch Predictor, uses a two-dimensional table of counters, also called "Pattern History Table". The table entries are two-bit counters. Two-level adaptive predictor If an if statement is executed three times, the decision made on the third execution might depend upon whether the previous two were taken or not. In such scenarios, a two-level adaptive predictor works more efficiently than a saturation counter. Conditional jumps that are taken every second time or have some other regularly recurring pattern are not predicted well by the saturating counter. A two-level adaptive predictor remembers the history of the last n occurrences of the branch and uses one saturating counter for each of the possible 2n history patterns. This method is illustrated in figure 3. Consider the example of n = 2. This means that the last two occurrences of the branch are stored in a two-bit shift register. This branch history register can have four different binary values, 00, 01, 10, and 11, where zero means "not taken" and one means "taken". A pattern history table contains four entries per branch, one for each of the 22 = 4 possible branch histories, and each entry in the table contains a two-bit saturating counter of the same type as in figure 2 for each branch. The branch history register is used for choosing which of the four saturating counters to use. If the history is 00, then the first counter is used; if the history is 11, then the last of the four counters is used. Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a zero. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones. The general rule for a two-level adaptive predictor with an n-bit history is that it can predict any repetitive sequence with any period if all n-bit sub-sequences are different. Since the initial publication in 1991, this method has become very popular. Variants of this prediction method are used in most modern microprocessors. Two-level neural predictor A two-level branch predictor where the second level is replaced with a neural network has been proposed. Local branch prediction A local branch predictor has a separate history buffer for each conditional jump instruction. It may use a two-level adaptive predictor. The history buffer is separate for each conditional jump instruction, while the pattern history table may be separate as well or it may be shared between all conditional jumps. The Intel Pentium MMX, Pentium II, and Pentium III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump. On the SPEC'89 benchmarks, very large local predictors saturate at 97.1% correct. combines the local and global prediction principles by concatenating local and global branch histories, possibly with some bits from the program counter as well. Tests indicate that the VIA Nano processor may be using this technique. Hybrid predictor A hybrid predictor, also called combined predictor, implements more than one prediction mechanism. The final prediction is based either on a meta-predictor that remembers which of the predictors has made the best predictions in the past, or a majority vote function based on an odd number of different predictors. Scott McFarling proposed combined branch prediction in his 1993 paper. Newer processors from Intel and AMD can predict indirect branches by using a two-level adaptive predictor. This kind of instruction contributes more than one bit to the history buffer. The zEC12 and later z/Architecture processors from IBM support a instruction that can preload the branch predictor entry for a given instruction with a branch target address constructed by adding the contents of a general-purpose register to an immediate displacement value. Processors without this mechanism will simply predict an indirect jump to go to the same target as it did last time. Neural branch prediction Machine learning for branch prediction using LVQ and multilayer perceptrons, called "neural branch prediction", was proposed by Lucian Vintan (Lucian Blaga University of Sibiu). One year later he developed the perceptron branch predictor. The neural branch predictor research was developed much further by Daniel Jimenez. In 2001, The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor. He also used a gshare/perceptron overriding hybrid predictors.). Intel already implements this idea in one of the IA-64's simulators (2003). The AMD Ryzen multi-core processor's Infinity Fabric and the Samsung Exynos processors include perceptron-based neural branch predictors. ==History==
History
The IBM 7030 Stretch, designed in the late 1950s, pre-executes all unconditional branches and any conditional branches that depended on the index registers. For other conditional branches, the first two production models implemented predict untaken; subsequent models were changed to implement predictions based on the current values of the indicator bits (corresponding to today's condition codes). The Stretch designers had considered static hint bits in the branch instructions early in the project but decided against them. Misprediction recovery was provided by the lookahead unit on Stretch, and part of Stretch's reputation for less-than-stellar performance was blamed on the time required for misprediction recovery. Subsequent IBM large computer designs did not use branch prediction with speculative execution until the IBM 3090 in 1985. Two-bit predictors were introduced by Tom McWilliams and Curt Widdoes in 1977 for the Lawrence Livermore National Lab S-1 supercomputer and independently by Jim Smith in 1979 at CDC. Microprogrammed processors, popular from the 1960s to the 1980s and beyond, took multiple cycles per instruction, and generally did not require branch prediction. However, in addition to the IBM 3090, there are several other examples of microprogrammed designs that incorporated branch prediction. The Burroughs B4900, a microprogrammed COBOL machine released around 1982, was pipelined and used branch prediction. The B4900 branch prediction history state is stored back into the in-memory instructions during program execution. The B4900 implements 4-state branch prediction by using 4 semantically equivalent branch opcodes to represent each branch operator type. The opcode used indicated the history of that particular branch instruction. If the hardware determines that the branch prediction state of a particular branch needs to be updated, it rewrites the opcode with the semantically equivalent opcode that hinted the proper history. This scheme obtains a 93% hit rate. and others were granted on this scheme. The DEC VAX 9000, announced in 1989, is both microprogrammed and pipelined, and performs branch prediction. The first commercial RISC processors, the MIPS R2000 and R3000 and the earlier SPARC processors, do only trivial "not-taken" branch prediction. Because they use branch delay slots, fetched just one instruction per cycle, and execute in-order, there is no performance loss. The later R4000 uses the same trivial "not-taken" branch prediction, and loses two cycles to each taken branch because the branch resolution recurrence is four cycles long. Branch prediction became more important with the introduction of pipelined superscalar processors like the Intel Pentium, DEC Alpha 21064, the MIPS R8000, and the IBM POWER series. These processors all rely on one-bit or simple bimodal predictors. The DEC Alpha 21264 (EV6) uses a next-line predictor overridden by a combined local predictor and global predictor, where the combining choice is made by a bimodal predictor. The AMD K8 has a combined bimodal and global predictor, where the combining choice is another bimodal predictor. This processor caches the base and choice bimodal predictor counters in bits of the L2 cache otherwise used for ECC. As a result, it has effectively very large base and choice predictor tables, and parity rather than ECC on instructions in the L2 cache. The parity design is sufficient, since any instruction suffering a parity error can be invalidated and refetched from memory. The Alpha 21464 ==See also==
tickerdossier.comtickerdossier.substack.com