The lead architect of the Pentium Pro was Fred Pollack, who specialized in superscalarity and had also worked as the lead engineer of the Intel iAPX 432.
==Summary==
The Pentium Pro incorporated a new
microarchitecture, different from the Pentium's
P5 microarchitecture. It had a decoupled, 14-stage superpipelined architecture which used an instruction pool. The Pentium Pro (
P6) implemented many radical architectural changes, mirroring other contemporary x86 designs such as the NexGen Nx586 and Cyrix 6x86. The Pentium Pro pipeline had extra decode stages to dynamically translate
IA-32 instructions into buffered
micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that could be issued to more than one
execution unit at once. The Pentium Pro thus featured
out-of-order execution, including
speculative execution via
register renaming. It also had a wider 36-bit
address bus, usable by
Physical Address Extension (PAE), allowing it to access up to 64 GB of memory.

The Pentium Pro has an 8 KB
instruction cache, from which up to 16 bytes are
fetched on each cycle and sent to the
instruction decoders. There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit
micro-operations (micro-ops). The micro-ops are
reduced instruction set computer (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on memory (e.g., add this register to this location in memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more than four micro-ops are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles.

The Pentium Pro was the first processor in the x86 family to support upgradeable
microcode under
BIOS and/or
operating system (OS) control. Micro-ops exit the
re-order buffer (ROB) and enter a
reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one
floating-point unit (FPU), a load unit, a store address unit, and a store data unit. One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops, per cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has the full complement of functions such as a
barrel shifter, multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.

The use of partial registers was a common performance optimization at the time, as it incurred no performance penalty on pre-P6 Intel processors; also, the dominant operating systems at the time of the Pentium Pro's release were 16-bit
MS-DOS, and mixed 16/32-bit
Windows 3.1x and
Windows 95 (although the latter requires a 32-bit
80386 CPU as a minimum, much of its code is still 16-bit for performance reasons, such as the 16-bit
Windows USER dynamic link library,
user.exe). This, along with the high cost of Pentium Pro systems, led to tepid sales among PC buyers at the time. To fully use the Pentium Pro's
P6 microarchitecture, a fully 32-bit operating system is needed, such as
Windows NT,
Linux,
Unix, or
OS/2. Intel later partly mitigated the performance issues on legacy code with the Pentium II.

When introduced, the Pentium Pro slightly outperformed the fastest RISC microprocessors on integer performance when running the
SPECint95 benchmark, but floating-point performance was significantly lower, half that of some RISC microprocessors. The Pentium Pro's integer performance lead disappeared rapidly, first overtaken by the
MIPS Technologies R10000 in January 1996, and then by
Digital Equipment Corporation's EV56 variant of the
Alpha 21164.

Reviewers quickly noted very slow writes to video memory as the weak spot of the P6 platform, with performance as low as 10% of that of an identically clocked Pentium system in benchmarks such as VIDSPEED. Methods to circumvent this included setting VESA drawing to system memory instead of video memory in games such as
Quake; later, utilities such as FASTVID emerged, which could double performance in certain games by enabling the
write combining features of the CPU.
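
As a rough illustration of why write combining helps with writes to video memory, the following Python sketch counts bus transactions for a run of one-byte writes with and without combining. It is a toy model under stated assumptions, not a description of the actual hardware: it assumes a single combining buffer, a 32-byte line (`LINE_BYTES`), and a flush whenever a write falls outside the currently open line. The function name `bus_transactions` is hypothetical.

```python
LINE_BYTES = 32  # assumed size of one combining buffer for this sketch


def bus_transactions(addresses, combine):
    """Count bus transactions for a sequence of one-byte writes.

    Without combining, every write is its own transaction. With combining,
    consecutive writes that fall in the same LINE_BYTES-sized line are merged
    and sent as one burst when a write to a different line arrives.
    """
    if not combine:
        return len(addresses)

    transactions = 0
    open_line = None  # index of the line currently held in the buffer
    for addr in addresses:
        line = addr // LINE_BYTES
        if line != open_line:
            if open_line is not None:
                transactions += 1  # flush the previously open line
            open_line = line
    if open_line is not None:
        transactions += 1  # flush whatever is still buffered at the end
    return transactions


sequential = list(range(640))   # e.g. one row of byte-sized pixels
scattered = [0, 64, 0, 64]      # alternating between two distant lines

print(bus_transactions(sequential, combine=False))
print(bus_transactions(sequential, combine=True))
print(bus_transactions(scattered, combine=True))
```

For 640 sequential byte writes this model gives 640 transactions without combining and 20 with it, while the scattered pattern gains nothing, which matches the intuition that linear frame-buffer fills benefit most.
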
Memory type range registers (MTRRs) were set automatically by Windows video drivers starting in 1997, and from then on the improved cache/memory subsystem and FPU performance allowed the Pentium Pro to outclass the Pentium clock-for-clock in the emerging 3D games of the mid-to-late 1990s, particularly when using Windows NT 4.0. However, its lack of an MMX implementation (MMX came out a year after the Pentium Pro's release) reduced performance in multimedia applications that made use of those instructions.
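
The decoding rules described earlier (one general decoder emitting up to four micro-ops per cycle, two simple decoders emitting one micro-op each, and a sequencer for longer instructions) can be illustrated with a toy model. The Python sketch below is not a description of the actual hardware: the function name `decode_cycles`, the strictly in-order assignment of instructions to decoders, and the assumption that the sequencer emits four micro-ops per cycle while the other decoders wait are simplifications made for illustration.

```python
from math import ceil

GENERAL_MAX_UOPS = 4  # general decoder: up to four micro-ops per cycle (from the text)
SIMPLE_DECODERS = 2   # two simple decoders, one micro-op each (from the text)


def decode_cycles(uops_per_instruction):
    """Return the cycles a toy decoder front end needs for an instruction stream.

    Each list entry is the number of micro-ops one x86 instruction expands to.
    Instructions are handed to the decoders strictly in program order.
    """
    cycles = 0
    i = 0
    n = len(uops_per_instruction)
    while i < n:
        uops = uops_per_instruction[i]
        cycles += 1
        i += 1  # the general decoder (or the sequencer) takes this instruction
        if uops > GENERAL_MAX_UOPS:
            # Assumption: the sequencer emits four micro-ops per cycle and
            # nothing else is decoded until it has finished.
            cycles += ceil(uops / GENERAL_MAX_UOPS) - 1
            continue
        # The simple decoders take the following instructions in the same
        # cycle, but only if each translates to exactly one micro-op.
        for _ in range(SIMPLE_DECODERS):
            if i < n and uops_per_instruction[i] == 1:
                i += 1
            else:
                break
    return cycles


print(decode_cycles([3, 1, 1]))  # complex instruction first
print(decode_cycles([1, 3, 1]))  # complex instruction in the middle
```

In this model `decode_cycles([3, 1, 1])` takes one cycle whereas `decode_cycles([1, 3, 1])` takes two, which shows why placing a multi-micro-op instruction ahead of two simple ones suits this kind of decoder arrangement.
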
==Caching==
Likely the Pentium Pro's most noticeable addition was its on-package
L2 cache, which ranged from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not feasibly allow a large L2 cache to be integrated into the processor core. Intel instead placed the L2 die(s) separately in the package, which still allowed the cache to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared the main system bus with the CPU, the Pentium Pro's cache had its own
back-side bus (called
dual independent bus by Intel). Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to 4), reducing cache-miss penalties; an example of
memory-level parallelism (MLP). These properties combined to produce an L2 cache that was immensely faster than the motherboard-based caches of older processors. This cache alone gave the CPU an advantage in input/output performance over older
x86 CPUs. In multiprocessor configurations, the Pentium Pro's integrated cache greatly improved performance in comparison to architectures in which the CPUs shared a central cache.

However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard the entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost. All versions of the chip were expensive, those with 1024 KB of cache particularly so, since they required two 512 KB cache dies as well as the processor die.

==Available models==