Pentium Pro

package package Pentium Pro 256 KB The lead architect of Pentium Pro was Fred Pollack who was specialized in superscalarity and had also worked as the lead engineer of the Intel iAPX 432. Summary The Pentium Pro incorporated a new microarchitecture, different from the Pentium's P5 microarchitecture. It has a decoupled, 14-stage superpipelined architecture which used an instruction pool. The Pentium Pro (P6) implemented many radical architectural differences mirroring other contemporary x86 designs such as the NexGen Nx586 and Cyrix 6x86. The Pentium Pro pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one execution unit at once. The Pentium Pro thus featured out-of-order execution, including speculative execution via register renaming. It also had a wider 36-bit address bus, usable by Physical Address Extension (PAE), allowing it to access up to 64 GB ( of memory. The Pentium Pro has an 8 KB instruction cache, from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders. There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are reduced instruction set computer (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on the memory (e.g., add this register to this location in the memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more micro-ops than four are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro was the first processor in the x86 family to support upgradeable microcode under BIOS and/or operating system (OS) control. Micro-ops exit the re-order buffer (ROB) and enter a reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one floating-point unit (FPU), a load unit, store address unit, and a store data unit. One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops per a cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has the full complement of functions such as a barrel shifter, multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses. Specific use of partial registers was then a common performance optimization, as it incurred no performance penalty on pre-P6 Intel processors; also, the dominant operating systems at the time of the Pentium Pro's release were 16-bit MS-DOS, and mixed 16/32-bit Windows 3.1x and Windows 95 (although the latter requires a 32-bit 80386 CPU as a minimum, much of its code is still 16-bit for performance reasons, such as the 16-bit Windows USER dynamic link library, user.exe). This, along with the high cost of Pentium Pro systems, led to tepid sales among PC buyers at the time. To fully use the Pentium Pro's P6 microarchitecture, a fully 32-bit operating system is needed, such as Windows NT, Linux, Unix, or OS/2. The performance issues on legacy code were later partly mitigated by Intel with the Pentium II. Compared to RISC microprocessors, the Pentium Pro, when introduced, slightly outperformed the fastest RISC microprocessors on integer performance when running the SPECint95 benchmark, but floating-point performance was significantly lower, half that of some RISC microprocessors. The Pentium Pro's integer performance lead disappeared rapidly, first overtaken by the MIPS Technologies R10000 in January 1996, and then by Digital Equipment Corporation's EV56 variant of the Alpha 21164. Reviewers quickly noted the very slow writes to video memory as the weak spot of the P6 platform, with performance here being as low as 10% of an identically clocked Pentium system in benchmarks such as VIDSPEED. Methods to circumvent this included setting VESA drawing to system memory instead of video memory in games such as Quake, and later on utilities such as FASTVID emerged, which could double performance in certain games by enabling the write combining features of the CPU. Memory type range registers (MTRRs) are set automatically by Windows video drivers starting from 1997, and from there the improved cache/memory subsystem and FPU performance caused it to outclass the Pentium clock-for-clock in the emerging 3D games of the mid–to–late 1990s, particularly when using Windows NT 4.0. However, its lack of MMX implementation (which came out a year after the Pentium Pro's release) reduces performance in multimedia applications that made use of those instructions. Caching Likely Pentium Pro's most noticeable addition was its on-package L2 cache, which ranged from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not feasibly allow a large L2 cache to be integrated into the processor core. Intel instead placed the L2 die(s) separately in the package which still allowed it to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared the main system bus with the CPU, the Pentium Pro's cache had its own back-side bus (called dual independent bus by Intel). Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to 4), reducing cache-miss penalties; an example of memory-level parallelism (MLP). These properties combined to produce an L2 cache that was immensely faster than the motherboard-based caches of older processors. This cache alone gave the CPU an advantage in input/output performance over older x86 CPUs. In multiprocessor configurations, Pentium Pro's integrated cache skyrocketed performance in comparison to architectures which had each CPU sharing a central cache. However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard the entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost. All versions of the chip were expensive, those with 1024 KB being particularly so, since it required two 512 KB cache dies as well as the processor die. ==Available models==