Out-of-order execution is a restricted form of
dataflow architecture, which was a major research area in
computer architecture in the 1970s and early 1980s.
=== Early use in supercomputers ===
Arguably the first machine to use out-of-order execution is the
CDC 6600 (1964), which used a
scoreboard to resolve conflicts. The 6600, however, lacked
WAW conflict handling, choosing instead to stall. Thornton termed this situation a "First Order Conflict". The 6600 had both
RAW conflict resolution (termed "Second Order Conflict") and
WAR conflict resolution (termed "Third Order Conflict"), which together are sufficient to declare it capable of full out-of-order execution, but it did not have precise exception handling. An early and limited form of branch prediction was possible as long as the branch target lay within what was termed the "Instruction Stack", which was limited to a depth of seven words from the program counter. About two years later, the
IBM System/360 Model 91 (1966) introduced
register renaming with
Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction that writes a register
rn can be executed before an earlier instruction that uses
rn has executed, by actually writing into an alternative (renamed) register
alt-rn. The renamed register
alt-rn becomes the normal register
rn only when all the earlier instructions addressing
rn have been executed; until then, earlier instructions are given
rn and later instructions addressing
rn are given
alt-rn. In the Model 91 the register renaming is implemented by a
bypass termed
Common Data Bus (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out-of-order in the same
execution unit, not just between the units like the 6600. This is accomplished by
reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. The Model 91 is also capable of reordering loads and stores to execute before the preceding loads and stores. Only the floating-point registers of the Model 91 are renamed, however, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations. The Model 91 and the 6600 both also suffer from
imprecise exceptions, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers.
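The effect of renaming on false dependencies can be illustrated with a minimal Python sketch. It is not a model of the Model 91's Common Data Bus mechanism; it assumes an unlimited pool of hypothetical names (`p0`, `p1`, ...) that are never recycled. The `Instr` tuple and the `hazards` and `rename` functions are illustrative names, not taken from any real design.

```python
from collections import namedtuple

# Hypothetical three-address instruction: dest <- op(src1, src2)
Instr = namedtuple("Instr", "dest src1 src2")


def hazards(program):
    """Return the set of (kind, earlier_index, later_index) register hazards."""
    found = set()
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            # A true dependency exists only if nothing in between overwrote the value.
            overwritten = any(program[k].dest == earlier.dest for k in range(i + 1, j))
            if earlier.dest in (later.src1, later.src2) and not overwritten:
                found.add(("RAW", i, j))
            if later.dest in (earlier.src1, earlier.src2):
                found.add(("WAR", i, j))
            if later.dest == earlier.dest:
                found.add(("WAW", i, j))
    return found


def rename(program):
    """Give every write a fresh name; assumes an unlimited supply of names."""
    current = {}  # architectural register -> most recent renamed register
    renamed = []
    for n, ins in enumerate(program):
        # Sources read whichever name currently holds the architectural value.
        src1 = current.get(ins.src1, ins.src1)
        src2 = current.get(ins.src2, ins.src2)
        current[ins.dest] = f"p{n}"
        renamed.append(Instr(f"p{n}", src1, src2))
    return renamed


program = [
    Instr("r1", "r2", "r3"),  # 0
    Instr("r4", "r1", "r5"),  # 1: reads r1 written by 0 (RAW)
    Instr("r1", "r6", "r7"),  # 2: rewrites r1 (WAW with 0, WAR with 1)
    Instr("r8", "r1", "r4"),  # 3: reads r1 from 2 and r4 from 1 (RAW)
]

print(sorted(hazards(program)))
print(sorted(hazards(rename(program))))
```

In this example only the RAW entries remain after renaming, so instruction 2 no longer has to wait for instructions 0 and 1.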
=== Precise exceptions ===
To have
precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by
James E. Smith and Andrew R. Pleszkun. The
CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an
invisible exchange package, so that it can resume at the same state of execution. However, to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions. In the 1980s many early
RISC microprocessors had out-of-order writeback to the registers, invariably resulting in imprecise exceptions. The
Motorola 88100 was one of the few early microprocessors that did not suffer from imprecise exceptions despite out-of-order writes, although it did allow both precise and imprecise floating-point exceptions. Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.
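The history buffer idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the Cyber 990's actual design: the class name `HistoryBufferRegs`, its methods, and the use of a program-order sequence number are all hypothetical, and the sketch assumes that writes to the same register happen in program order.

```python
class HistoryBufferRegs:
    """Register file that logs overwritten values so writes can be undone."""

    def __init__(self, initial):
        self.regs = dict(initial)
        self.history = []  # (seq, reg, old_value) in the order writes happened

    def write(self, seq, reg, value):
        """Apply a result immediately, remembering the value it replaces."""
        self.history.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def retire(self, seq):
        """Forget log entries of instructions up to `seq`; they are now final."""
        self.history = [h for h in self.history if h[0] > seq]

    def rollback(self, faulting_seq):
        """Undo, newest write first, every write made by the faulting
        instruction or by any instruction after it in program order."""
        kept = []
        for seq, reg, old in reversed(self.history):
            if seq >= faulting_seq:
                self.regs[reg] = old       # restore the overwritten value
            else:
                kept.append((seq, reg, old))
        kept.reverse()
        self.history = kept


rf = HistoryBufferRegs({"r1": 10, "r2": 20, "r3": 30})
rf.write(3, "r3", 33)  # a younger instruction finishes first
rf.write(1, "r1", 11)  # an older instruction finishes later
rf.rollback(2)         # instruction 2 raises an exception
print(rf.regs)
```

After the rollback the write by instruction 3 has been reverted while the write by instruction 1 is kept, which is the in-order state that a precise exception requires.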
=== Decoupling ===
Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching. He implemented those ideas in the
Astronautics ZS-1 (1988), featuring a decoupling of the integer/load/store
pipeline from the floating-point pipeline, allowing inter-pipeline reordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper, he opined that enforcing the precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, as it even permits
virtual memory. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent the stalling of the front end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable
queues that effectively performed limited register renaming. A similar decoupled architecture had been used a bit earlier in the Culler 7. The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches.
=== Research comes to fruition ===
With the
POWER1 (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a
physical register file (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named
program counter stack by IBM) to undo changes to count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; because POWER1 cannot reorder floating-point arithmetic instructions (their results become available in order), their destination registers are not renamed. POWER1 also lacks the
reservation stations needed for out-of-order use of the same execution unit. The next year IBM's
ES/9000 model 900 had register renaming added for the general-purpose registers. It also has
reservation stations with six entries for the dual integer unit (each cycle, up to two of the six instructions can be selected and executed) and six entries for the FPU. Other units have simple FIFO queues. The reordering distance is up to 32 instructions. The A19 of
Unisys'
A-series of mainframes was also released in 1991 and was claimed to have out-of-order execution, and one analyst called the A19's technology three to five years ahead of the competition.
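The difference between issuing from a reservation station and from a FIFO queue can be shown with a small Python sketch. It is only an illustration under stated assumptions: the oldest-first selection policy, the function names, and the instruction and register names are hypothetical and are not taken from any of the machines described here.

```python
def is_ready(entry, ready_regs):
    """An entry can issue once all of its source registers hold results."""
    _, srcs = entry
    return all(s in ready_regs for s in srcs)


def select_from_station(entries, ready_regs, width=2):
    """Pick up to `width` ready entries, oldest first, skipping stalled ones."""
    ready_names = [e[0] for e in entries if is_ready(e, ready_regs)]
    return ready_names[:width]


def select_from_fifo(entries, ready_regs, width=2):
    """Issue strictly from the head; a stalled head blocks everything behind it."""
    picked = []
    for entry in entries:
        if len(picked) == width or not is_ready(entry, ready_regs):
            break
        picked.append(entry[0])
    return picked


# Six waiting instructions, oldest first: (name, source registers).
entries = [
    ("i0", ["r1"]),
    ("i1", ["r2"]),
    ("i2", ["r1", "r3"]),
    ("i3", ["r3"]),
    ("i4", ["r2", "r3"]),
    ("i5", ["r1"]),
]
ready = {"r2", "r3"}  # r1 is still being computed

print(select_from_station(entries, ready))
print(select_from_fifo(entries, ready))
```

With `r1` unavailable, the station still issues two younger instructions, whereas the FIFO issues nothing because its oldest entry is stalled.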
=== Wide adoption ===
The first superscalar single-chip processors (Intel i960CA in 1989) used simple scoreboard scheduling, as the CDC 6600 had a quarter of a century earlier. Between 1992 and 1996, rapid advancement of techniques, enabled by increasing transistor counts, brought out-of-order execution down to personal computers. The
Motorola 88110 (1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance. The
PowerPC 601 (1993) was an evolution of the
RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched instruction queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count registers could be renamed. In the fall of 1994
NexGen and
IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first
x86 processor capable of out-of-order execution and featured a reordering distance of up to 14
micro-operations. The
PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry
reorder buffer lets no more than four instructions overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store.
PowerPC 604 (1995) was the first single-chip processor with
execution unit-level reordering, as three out of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The reorder buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the reordering of loads and stores upon cache misses.
HAL SPARC64 (1995) exceeded the reordering capacity of the
ES/9000 model 900 by having three 8-entry reservation stations for integer, floating-point, and
address generation unit, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a reordered state at a time.
Pentium Pro (1995) introduced a
unified reservation station, whose 20-micro-op capacity permitted very flexible reordering, backed by a 40-entry reorder buffer. Loads can be reordered ahead of both loads and stores. The practically attainable
per-cycle rate of execution rose further as full out-of-order execution was also adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year
Cyrix 6x86 and
AMD K5 brought advanced reordering techniques into mainstream personal computers. Since
DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/Intel Itanium 2 and IBM POWER6, though the latter had an out-of-order floating-point unit. The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III/IV, and IBM's mainframes, which had lost the out-of-order execution capability for the second time and remained in-order into the
z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the
SPARC T series and
Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively. Almost all processors for phones and other lower-end applications remained in-order long after personal computers had adopted out-of-order execution. First,
Qualcomm's
Scorpion (reordering distance of 32) shipped in
Snapdragon, and a bit later
Arm's
A9 succeeded
A8. For low-end
x86 personal computers, the in-order
Bonnell microarchitecture in early
Intel Atom processors was first challenged by
AMD's
Bobcat microarchitecture, and in 2013 was succeeded by the out-of-order
Silvermont microarchitecture. Because the complexity of out-of-order execution precludes achieving the lowest power consumption, cost, and size, in-order execution is still prevalent in
microcontrollers and
embedded systems, as well as in phone-class cores such as Arm's
A55 and
A510 in
big.LITTLE configurations.

== Basic concept ==