POWER3

The POWER3 was based on the PowerPC 620, an earlier 64-bit PowerPC implementation that was late, under-performing and commercially unsuccessful. Like the PowerPC 620, the POWER3 has three fixed-point units, but the single floating-point unit (FPU) was replaced with two floating-point fused multiply–add units, and an extra load-store unit was added (for a total of two) to improve floating-point performance. The POWER3 is a superscalar design that executed instructions out of order. It has a seven-stage integer pipeline, a minimal eight-stage load/store pipeline and a ten-stage floating-point pipeline. The front end consists of two stages: fetch and decode. During the first stage, eight instructions were fetched from a 32 KB instruction cache and placed in a 12-entry instruction buffer. During the second stage, four instructions were taken from the instruction buffer, decoded, and issued to instruction queues. Restrictions on instruction issue are few: of the two integer instruction queues, only one can accept one instruction, the other can accept up to four, as does the floating-point instruction queue. If the queues do not have enough unused entries, instructions cannot be issued. The front end has a short pipeline, resulting in a small three-cycle branch misprediction penalty. In stage three, instructions in the instruction queues that are ready for execution have their operands read from the register files. The general-purpose register file contains 48 registers, of which 32 are general-purpose registers and 16 are rename registers for register renaming. To reduce the number of ports required to provide data and receive results, the general purpose register file is duplicated so that there are two copies, the first supporting three integer execution units and the second supporting the two load/store units. This scheme was similar to a contemporary microprocessor, the DEC Alpha 21264, but was simpler as it did not require an extra clock cycle to synchronize the two copies due to the POWER3's higher cycle times. The floating-point register file contains 56 registers, of which 32 are floating-point registers and 24 rename registers. Compared to the PowerPC 620, there were more rename registers, which allowed more instructions to be executed out of order, improving performance. Execution begins in stage four. The instruction queues dispatch up to eight instructions to the execution units. Integer instructions are executed in three integer execution units (termed "fixed-point units" by IBM). Two of the units are identical and execute all integer instructions except for multiply and divide. All instructions executed by them have a one-cycle latency. The third unit executes multiply and divide instructions. These instructions are not pipelined and have multi-cycle latencies. 64-bit multiply has a nine-cycle latency and 64-bit divide has a 37-cycle latency. Floating-point instructions are executed in two floating-point units (FPUs). The FPUs are capable of fused multiply–add, where multiplication and addition is performed simultaneously. Such instructions, along with individual add and multiply, have a four-cycle latency. Divide and square-root instructions are executed in the same FPUs, but are assisted by specialized hardware. Single-precision (32-bit) divide and square-root instructions have a 14-cycle latency, whereas double-precision (64-bit) divide and square-root instructions have an 18-cycle and a 22-cycle latency, respectively. After execution is completed, the instructions are held in buffers before being committed and made visible to software. Execution finishes in stage five for integer instructions and stage eight for floating-point. Committing occurs during stage six for integers, stage nine for floating-point. Writeback occurs in the stage after commit. The POWER3 can retire up to four instructions per cycle. The PowerPC 620 data cache was optimized for technical and scientific applications. Its capacity was doubled to 64 KB, to improve the cache-hit rate; the cache was dual-ported, implemented by interleaving eight banks, to enable two loads or two stores to be performed in one cycle in certain cases; and the line-size was increased to 128-bytes. The L2 cache bus was doubled in width to 256 bits to compensate for the larger cache line size and to retain a four-cycle latency for cache refills. The POWER3 contained 15 million transistors on a 270 mm2 die. It was fabricated in IBM's CMOS-6S2 process, a complementary metal–oxide–semiconductor process that is a hybrid of 0.25 μm feature sizes and 0.35 μm metal layers. The process features five layers of aluminium. It was packaged in the same 1,088-column ceramic column grid array as the P2SC, but with a different pin out. == POWER3-II ==