A traditional pipelined CPU has a register fetch stage near the beginning of the pipeline, which reads the instruction's register operands from the physical register file, and a write-back stage near the end, where results are written back to the physical register file. Because several clock cycles separate those stages, an instruction cannot enter the pipeline immediately behind another instruction that produces a value for one of its source registers. Processors that account for
pipeline hazards can automatically insert
bubbles into the pipeline, holding an instruction at the register fetch stage until all of its inputs have been written back. In a pipeline design with one execute stage, a bypass bus can be added so that data produced by the execute stage in one cycle can be used directly as an input to the execute stage on the next cycle, eliminating the latency penalty for back-to-back
dependent instructions. More elaborate bypass buses can be designed for more complex pipelines, and superscalar processors can use forwarding networks to route data between
execution units. Not all instruction latencies are known at the time instructions are scheduled; a memory read, for example, may complete quickly or slowly depending on whether it hits in the cache. A scheduler can predict that a load will hit in the L1 cache and speculatively issue dependent instructions so that they receive the loaded value from the bypass bus at the L1 latency. If the prediction is incorrect and the read is a miss, the results of the dependent instruction are discarded and the instruction is rescheduled after the read is complete. If the L2 cache latency is known, the instruction can instead be rescheduled to attempt to use the bypass bus again at the L2 latency. Processors that decode instructions into multiple
micro-operations and schedule them separately can replay only the μops that are dependent on the mispredicted instruction.

== History ==