The Alpha 21264 is a four-issue superscalar microprocessor with out-of-order execution and speculative execution. It has a peak execution rate of six instructions per cycle and can sustain four instructions per cycle. It has a seven-stage instruction pipeline.
=== Out-of-order execution ===
At any given time, the microprocessor can have up to 80 instructions in various stages of execution, surpassing any other contemporary microprocessor. Decoded instructions are held in instruction queues and are issued when their operands are available. The integer queue contains 20 entries and the floating-point queue 15. Each queue can issue as many instructions per cycle as there are pipelines attached to it.
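The queue behaviour described above can be illustrated with a small simulation. This is a minimal sketch, not the 21264's actual issue logic: the `Instr` and `IssueQueue` classes, the `run` helper, the oldest-first selection policy, the single-cycle result latency and the register names are all assumptions made for the example. Only the queue capacity (20 integer entries) and the issue width (one instruction per integer pipeline, four in total) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    dest: str        # register written by the instruction
    sources: tuple   # registers read by the instruction


@dataclass
class IssueQueue:
    capacity: int     # 20 for the integer queue, 15 for the floating-point queue
    issue_width: int  # number of pipelines the queue feeds
    entries: list = field(default_factory=list)

    def insert(self, instr):
        """Add a decoded instruction; return False if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(instr)
        return True

    def issue(self, ready_regs):
        """Issue up to issue_width instructions whose sources are all ready.

        Oldest-first selection is an assumption of this sketch.
        """
        issued, waiting = [], []
        for instr in self.entries:
            can_issue = (len(issued) < self.issue_width
                         and all(src in ready_regs for src in instr.sources))
            (issued if can_issue else waiting).append(instr)
        self.entries = waiting
        return issued


def run(queue, ready_regs):
    """Issue cycle by cycle until the queue drains; return names issued per cycle."""
    ready = set(ready_regs)
    schedule = []
    while queue.entries:
        issued = queue.issue(ready)
        if not issued:
            raise RuntimeError("no queued instruction has all operands ready")
        # Assumed single-cycle latency: results are visible in the next cycle.
        ready.update(instr.dest for instr in issued)
        schedule.append([instr.name for instr in issued])
    return schedule


queue = IssueQueue(capacity=20, issue_width=4)
for instr in [
    Instr("add1", "r3", ("r1", "r2")),
    Instr("add2", "r4", ("r3", "r1")),  # depends on add1
    Instr("sub1", "r5", ("r1", "r2")),  # independent, so it overtakes add2
]:
    queue.insert(instr)
print(run(queue, {"r1", "r2"}))  # [['add1', 'sub1'], ['add2']]
```

In the example, `sub1` issues in the first cycle ahead of the older `add2`, because `add2` is still waiting for the result of `add1`.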
=== Ebox ===
The Ebox executes integer, load and store instructions. It has two integer units, two load store units and two integer register files. Each integer register file contains 80 entries, of which 32 are architectural registers, 40 are rename registers and 8 are PAL shadow registers. There is no entry for register R31 because in the Alpha architecture, R31 is hardwired to zero and is read-only.

Each register file serves an integer unit and a load store unit, and a register file together with its two units is referred to as a "cluster". The two clusters are designated U0 and U1. This scheme was used because it reduced the number of read and write ports required to serve operands and receive results, which reduced the physical size of the register file and enabled the microprocessor to operate at higher clock frequencies. As a consequence, a write to either register file has to be synchronized with the other, which requires a clock cycle to complete and reduces performance by one percent. This loss was compensated for in two ways. Firstly, the higher clock frequency achievable offset it. Secondly, the instruction issue logic avoided situations that required synchronization by issuing, where possible, instructions that did not depend on data held in the other register file.

The clusters are nearly identical, with two differences: U1 has a seven-cycle pipelined multiplier, while U0 has a three-cycle pipeline for executing Motion Video Instructions (MVI), an extension to the Alpha architecture that defines single instruction, multiple data (SIMD) instructions for multimedia.

The load store units are simple arithmetic logic units used to calculate virtual addresses for memory access. They can also execute simple arithmetic and logic instructions. The instruction issue logic uses this capability, issuing such instructions to the load store units when they are not performing address arithmetic. The Ebox therefore has four 64-bit adders, four logic units, two barrel shifters, byte-manipulation logic, and two sets of conditional branch logic, divided equally between U0 and U1.
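The idea of steering an instruction towards the cluster that already holds its operands can be sketched as follows. This is a hypothetical heuristic written for illustration, not the documented 21264 issue policy: the function names, the tie-breaking rule and the representation of operands as a list of producing clusters are assumptions. Only the one-cycle synchronization delay and the cluster names U0 and U1 come from the text.

```python
CROSS_CLUSTER_DELAY = 1  # cycles to synchronize a result into the other register file


def operand_delay(cluster, producer_clusters):
    """Extra cycles before every operand is visible in `cluster`.

    `producer_clusters` lists the cluster that has just produced each source
    operand; operands that are already synchronized are simply left out.
    """
    return max(
        (CROSS_CLUSTER_DELAY for producer in producer_clusters if producer != cluster),
        default=0,
    )


def pick_cluster(producer_clusters, free_clusters=("U0", "U1")):
    """Return the free cluster with the smallest operand delay.

    Ties go to the first cluster listed (an arbitrary choice for this sketch).
    """
    return min(free_clusters, key=lambda c: operand_delay(c, producer_clusters))


print(pick_cluster(["U1"]))                        # U1
print(pick_cluster(["U1"], free_clusters=("U0",))) # U0
```

The second call shows the case the text describes as unavoidable: when only the other cluster is free, the instruction is issued there and pays the synchronization cycle.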
=== Fbox ===
The Fbox is responsible for executing floating-point instructions. It consists of two floating-point pipelines and a floating-point register file. The pipelines are not identical: one, the adder pipeline, executes the majority of instructions, and the other executes only multiply instructions. The adder pipeline has two non-pipelined units connected to it, a divide unit and a square root unit. Adds, multiplies and most other instructions have a 4-cycle latency, a double-precision divide has a 16-cycle latency, and a double-precision square root has a 33-cycle latency. The floating-point register file contains 72 entries, of which 32 are architectural registers and 40 are rename registers.
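The practical effect of these latencies, and of the divide and square root units not being pipelined, can be shown with a little arithmetic. The following is a rough sketch: the latency values come from the text, while the function names and the assumption that a pipelined unit accepts one new operation per cycle are illustrative. Issue restrictions and other hazards are ignored.

```python
# Latencies in cycles, as given in the text.
LATENCY = {"add": 4, "mul": 4, "div_double": 16, "sqrt_double": 33}


def chain_latency(ops):
    """Cycles until the final result when each operation needs the previous result."""
    return sum(LATENCY[op] for op in ops)


def back_to_back_cycles(op, count, pipelined):
    """Cycles to finish `count` independent operations of one kind on one unit.

    A pipelined unit is assumed to start one operation per cycle; a
    non-pipelined unit must finish each operation before starting the next.
    """
    if count == 0:
        return 0
    latency = LATENCY[op]
    return latency + (count - 1) if pipelined else latency * count


# Dependent chain multiply -> add -> square root, e.g. the critical path of sqrt(x*x + y*y).
print(chain_latency(["mul", "add", "sqrt_double"]))          # 41
# Eight independent adds on the pipelined adder.
print(back_to_back_cycles("add", 8, pipelined=True))          # 11
# Eight independent divides on the non-pipelined divide unit.
print(back_to_back_cycles("div_double", 8, pipelined=False))  # 128
```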
=== Cache ===
The Alpha 21264 has two levels of cache, a primary cache and a secondary cache. The level three (L3, or "victim") cache of the Alpha 21164 was not used due to problems with bandwidth.
==== Primary caches ====
The primary cache is split into separate caches for instructions and data ("modified Harvard architecture"), the I-cache and D-cache, respectively. Both caches have a capacity of 64 KB. The D-cache is dual-ported by transferring data on both the rising and falling edges of the clock signal. This method of dual-porting allows any combination of reads or writes to the cache every processor cycle. It also avoided duplicating the cache, as was done in the Alpha 21164. Duplicating the cache restricts its capacity, because more transistors are needed to provide the same amount of capacity, which in turn increases the area required and the power consumed.
==== B-cache ====
The secondary cache, termed the B-cache, is an external cache with a capacity of 1 to 16 MB. It is controlled by the microprocessor and is implemented with synchronous static random access memory (SSRAM) chips that operate at two-thirds, one-half, one-third or one-fourth of the internal clock frequency, or 125 to 333 MHz for a 500 MHz processor. The B-cache is accessed over a dedicated 128-bit bus that operates at the same clock frequency as the SSRAM, or at twice that frequency if double data rate SSRAM is used. The B-cache is direct-mapped.
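The clock ratios and bus width translate into SSRAM clock rates and peak transfer rates as in the sketch below. The ratios and the 128-bit width are taken from the text; the function names are illustrative, and the bandwidth figure is a theoretical peak that assumes one full-width transfer per bus clock and ignores access latency and protocol overhead.

```python
from fractions import Fraction

# Supported SSRAM-to-core clock ratios, as listed in the text.
RATIOS = [Fraction(2, 3), Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]
BUS_WIDTH_BITS = 128


def bcache_clock_mhz(core_mhz, ratio):
    """SSRAM clock frequency in MHz for a given core clock and ratio."""
    return float(core_mhz * ratio)


def peak_bandwidth_mb_s(core_mhz, ratio, ddr=False):
    """Theoretical peak B-cache bus transfer rate in MB/s (10**6 bytes per second).

    With double data rate SSRAM the bus runs at twice the SSRAM clock.
    """
    bus_mhz = core_mhz * ratio * (2 if ddr else 1)
    return float(bus_mhz * BUS_WIDTH_BITS / 8)


for ratio in RATIOS:
    print(ratio, round(bcache_clock_mhz(500, ratio), 1),
          round(peak_bandwidth_mb_s(500, ratio), 1))
```

For a 500 MHz processor with the SSRAM at half the core clock, this gives a 250 MHz cache clock and a peak of 4,000 MB/s, or 8,000 MB/s with double data rate parts.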
=== Branch prediction ===
Branch prediction is performed by a tournament branch prediction algorithm. The algorithm was developed by Scott McFarling at Digital's Western Research Laboratory (WRL) and was described in a 1993 paper. This predictor was used because the Alpha 21264 has a minimum branch misprediction penalty of seven cycles. Due to the instruction cache's two-cycle latency and the instruction queues, the average branch misprediction penalty is 11 cycles. The algorithm maintains two history tables, Local and Global, and the table used to predict the outcome of a branch is determined by a Choice predictor.

The local predictor is a two-level table which records the history of individual branches. The first level is a 1,024-entry by 10-bit branch history table. The second level is a 1,024-entry branch prediction table in which each entry is a 3-bit saturating counter; the value of the counter determines whether the current branch is predicted taken or not taken. A two-level table was used because its prediction accuracy is similar to that of a larger single-level table while requiring fewer bits of storage.

The global predictor is a single-level, 4,096-entry branch history table. Each entry is a 2-bit saturating counter; the value of this counter determines whether the current branch is predicted taken or not taken.

The choice predictor records the history of the local and global predictors to determine which predictor is the best for a particular branch. It has a 4,096-entry branch history table. Each entry is a 2-bit saturating counter; the value of the counter determines whether the local or the global predictor is used.
=== External interface ===
The external interface consists of a bidirectional 64-bit double data rate (DDR) data bus and two 15-bit unidirectional time-multiplexed address and control buses, one for signals originating from the Alpha 21264 and one for signals originating from the system. Digital licensed the bus to Advanced Micro Devices (AMD), and it was subsequently used in their Athlon microprocessors, where it was known as the EV6 bus. Later, the EV6 bus evolved into HyperTransport.

== Memory addressing ==