The simplest type of multithreading occurs when one thread runs until it is blocked by an event that would normally create a long-latency stall. Such a stall might be a cache miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor switches execution to another thread that is ready to run. Only when the data for the previous thread has arrived is that thread placed back on the list of ready-to-run threads.

For example:
• Cycle i: instruction j from thread A is issued.
• Cycle i + 1: instruction j + 1 from thread A is issued.
• Cycle i + 2: instruction j + 2 from thread A is issued, which is a load instruction that misses in all caches.
• Cycle i + 3: thread scheduler invoked, switches to thread B.
• Cycle i + 4: instruction k from thread B is issued.
• Cycle i + 5: instruction k + 1 from thread B is issued.

Conceptually, it is similar to cooperative multitasking used in
real-time operating systems, in which tasks voluntarily give up execution time when they need to wait upon some type of event. This type of multithreading is known as block, cooperative or coarse-grained multithreading.

The goal of multithreading hardware support is to allow quick switching between a blocked thread and another thread that is ready to run. Switching from one thread to another means the hardware switches from using one register set to another. To achieve this goal, the hardware for the program-visible registers, as well as some processor control registers (such as the program counter), is replicated. For example, to quickly switch between two threads, the processor is built with two sets of registers.

Additional hardware support for multithreading allows thread switching to be done in one CPU cycle, bringing performance improvements. This hardware also allows each thread to behave as if it were executing alone and not sharing any hardware resources with other threads, minimizing the software changes needed within the application and the operating system to support multithreading.

Many families of microcontrollers and embedded processors have multiple register banks to allow quick context switching for interrupts. Such schemes can be considered a type of block multithreading among the user program thread and the interrupt threads.
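
The switch-on-stall behaviour of block multithreading can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the function name `run_coarse_grained`, the `"op"`/`"miss"` instruction labels, and the four-cycle `MISS_LATENCY` are hypothetical (a real off-chip miss would cost far more cycles), and the thread switch is modelled as taking exactly one cycle.

```python
from collections import deque

MISS_LATENCY = 4  # assumed stall length in cycles; real misses cost hundreds


def run_coarse_grained(programs, cycles):
    """Issue from one thread until it misses, then switch to a ready thread.

    programs maps a thread name to a list of "op" / "miss" labels.
    Returns one (cycle, thread, event) tuple per simulated cycle.
    """
    pc = {t: 0 for t in programs}   # one program counter per thread (replicated state)
    ready = deque(programs)         # list of ready-to-run threads
    blocked = {}                    # thread -> cycle at which its data arrives
    current = ready.popleft() if ready else None
    log = []
    for cycle in range(cycles):
        # Threads whose data has arrived go back on the ready list.
        for t in [t for t, wake in blocked.items() if wake <= cycle]:
            del blocked[t]
            ready.append(t)
        if current is None:
            if ready:
                current = ready.popleft()
                log.append((cycle, current, "switch"))  # one-cycle switch
            else:
                log.append((cycle, None, "idle"))
            continue
        kind = programs[current][pc[current]]
        pc[current] += 1
        log.append((cycle, current, "miss" if kind == "miss" else "issue"))
        if kind == "miss":
            blocked[current] = cycle + MISS_LATENCY     # stall: give up the pipeline
            current = None
        elif pc[current] == len(programs[current]):
            current = None                              # thread has finished


    return log


if __name__ == "__main__":
    demo = {"A": ["op", "op", "miss", "op"], "B": ["op", "op", "op"]}
    for entry in run_coarse_grained(demo, 6):
        print(entry)
```

With the demo input, thread A issues for three cycles, its third instruction misses, the fourth cycle is spent switching, and thread B issues in the remaining two cycles, mirroring the cycle-by-cycle example above.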
===Fine-grained multithreading===
The purpose of fine-grained multithreading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there is less chance of an instruction in one pipeline stage needing an output from an older instruction in the pipeline. Conceptually, it is similar to preemptive multitasking used in operating systems; an analogy would be that the time slice given to each active thread is one CPU cycle.

For example:
• Cycle i + 1: an instruction from thread B is issued.
• Cycle i + 2: an instruction from thread C is issued.

This type of multithreading was first called barrel processing, in which the staves of a barrel represent the pipeline stages and their executing threads. Interleaved, preemptive, fine-grained or time-sliced multithreading are more modern terminology.

In addition to the hardware costs discussed in the block type of multithreading, interleaved multithreading has the additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since more threads are being executed concurrently in the pipeline, shared resources such as caches and TLBs need to be larger to avoid thrashing between the different threads.
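
The per-cycle interleaving and the per-stage thread ID tracking described above can be sketched as follows. This is a minimal Python illustration, not a model of any particular processor: the names `run_barrel` and `distinct_threads_in_flight`, the strict round-robin issue order, and the assumption that every instruction advances one stage per cycle are all choices made for the example.

```python
def run_barrel(thread_names, stages, cycles):
    """Issue from a different thread each cycle in round-robin order.

    Returns a list of (cycle, pipeline) pairs, where pipeline[s] is the
    thread ID tracked by pipeline stage s (None if the stage is empty).
    """
    pipeline = [None] * stages
    history = []
    for cycle in range(cycles):
        issuing = thread_names[cycle % len(thread_names)]
        # Every in-flight instruction moves one stage; the new one enters stage 0.
        pipeline = [issuing] + pipeline[:-1]
        history.append((cycle, list(pipeline)))
    return history


def distinct_threads_in_flight(pipeline):
    """True if no two occupied stages hold instructions from the same thread."""
    occupied = [t for t in pipeline if t is not None]
    return len(occupied) == len(set(occupied))


if __name__ == "__main__":
    for cycle, pipe in run_barrel(["A", "B", "C", "D"], stages=4, cycles=5):
        print(cycle, pipe, distinct_threads_in_flight(pipe))
```

When there are at least as many threads as pipeline stages, every in-flight instruction in this sketch belongs to a different thread, which is why an instruction rarely needs a result from an older instruction still in the pipeline. With fewer threads than stages, the same thread appears in more than one stage and dependency stalls become possible again.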
===Simultaneous multithreading===
The most advanced type of multithreading applies to superscalar processors. Whereas a normal superscalar processor issues multiple instructions from a single thread every CPU cycle, in simultaneous multithreading (SMT) a superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction-level parallelism, this type of multithreading tries to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots.

For example:
• Cycle i: instructions j and j + 1 from thread A and instruction k from thread B are simultaneously issued.
• Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, and instruction m from thread C are all simultaneously issued.
• Cycle i + 2: instruction j + 3 from thread A and instructions m + 1 and m + 2 from thread C are all simultaneously issued.

To distinguish the other types of multithreading from SMT, the term "temporal multithreading" is used to denote when instructions from only one thread can be issued at a time.

In addition to the hardware costs discussed for interleaved multithreading, SMT has the additional cost of each pipeline stage tracking the thread ID of each instruction being processed. Again, shared resources such as caches and TLBs have to be sized for the large number of active threads being processed.

Implementations include
DEC (later Compaq) EV8 (not completed), Intel Hyper-Threading Technology, IBM POWER5/POWER6/POWER7/POWER8/POWER9, IBM z13/z14/z15, Sun Microsystems UltraSPARC T2, Cray XMT, and the AMD Bulldozer and Zen microarchitectures.

==Implementation specifics==
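
As a concrete illustration of the simultaneous issue described for SMT, where several threads share the issue slots of a single cycle, the following Python sketch fills a fixed number of slots from the instructions each thread has ready. It is a simplified model under stated assumptions: the function name `smt_issue`, the `ready_counts` input, and the round-robin slot allocation are hypothetical and do not describe the issue policy of any of the processors listed above.

```python
def smt_issue(ready_counts, width):
    """Fill up to `width` issue slots in one cycle from several threads.

    ready_counts maps a thread name to the number of independent
    instructions that thread could issue this cycle (its available
    instruction-level parallelism). Returns the thread name occupying
    each filled slot; unfilled slots are simply absent from the list.
    """
    remaining = dict(ready_counts)
    slots = []
    while len(slots) < width and any(remaining.values()):
        for thread in remaining:  # assumed policy: round-robin across threads
            if remaining[thread] > 0 and len(slots) < width:
                slots.append(thread)
                remaining[thread] -= 1
    return slots


if __name__ == "__main__":
    # One thread alone leaves issue slots unused on a 4-wide machine.
    print(smt_issue({"A": 2}, width=4))
    # A second thread fills part of the gap in the same cycle.
    print(smt_issue({"A": 2, "B": 1}, width=4))
```

In this model a single thread with two ready instructions uses only two of four slots, while adding a second thread raises the number of filled slots to three in the same cycle, which is the reduction in wasted issue slots that SMT aims for.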