Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back). Intel's
Crystalwell variant of its
Haswell processors introduced an on-package 128 MiB
eDRAM Level 4 cache which serves as a victim cache to the processors' Level 3 cache. In the
Skylake microarchitecture the Level 4 cache no longer works as a victim cache.
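The parameters listed above fully determine a cache's geometry. As a concrete illustration, the following sketch derives the number of sets from the cache size, block size, and associativity, and splits an address into its tag, index, and offset fields; the specific sizes are invented for the example, not those of any particular processor.

```python
# Hypothetical cache: 32 KiB total, 64-byte blocks, 4-way set associative.
CACHE_SIZE = 32 * 1024   # bytes
BLOCK_SIZE = 64          # bytes per cache block
WAYS = 4                 # blocks per set (associativity)

NUM_BLOCKS = CACHE_SIZE // BLOCK_SIZE        # total blocks in the cache
NUM_SETS = NUM_BLOCKS // WAYS                # sets = blocks / associativity

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1    # bits selecting a byte in a block
INDEX_BITS = NUM_SETS.bit_length() - 1       # bits selecting a set

def split_address(addr: int) -> tuple[int, int, int]:
    """Split an address into (tag, set index, block offset)."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

On a lookup, the index selects a set, the tags of all blocks in that set are compared against the address tag, and the offset selects the byte within a hitting block.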
Trace cache One of the more extreme examples of cache specialization is the
trace cache (also known as
execution trace cache) found in the
Intel Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of
instructions that have already been fetched and decoded. A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual
basic blocks or dynamic instruction traces. The Pentium 4's trace cache stores
micro-operations resulting from decoding x86 instructions, also providing the functionality of a micro-operation cache. As a result, the next time an instruction is needed, it does not have to be decoded into micro-ops again.
Write coalescing cache The write coalescing cache (WCC) is a special cache that is part of the L2 cache in
AMD's
Bulldozer microarchitecture. Stores from both L1D caches in the module go through the WCC, where they are buffered and coalesced. The WCC's task is to reduce the number of writes to the L2 cache.
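The effect of buffering and coalescing can be sketched with a toy model: stores to the same cache line are merged in the buffer, so the L2 cache sees one line write rather than one write per store. The line size and interface here are illustrative assumptions, not the Bulldozer design.

```python
# Toy model of a write-coalescing buffer: stores to the same 64-byte line
# are merged, so L2 sees one write per dirty line rather than one per store.
LINE_SIZE = 64

class WriteCoalescingBuffer:
    def __init__(self):
        self.pending = {}    # line base address -> {offset within line: byte}
        self.l2_writes = 0   # number of line writes that reached L2

    def store(self, addr: int, value: int) -> None:
        line = addr // LINE_SIZE * LINE_SIZE
        self.pending.setdefault(line, {})[addr % LINE_SIZE] = value

    def drain(self) -> None:
        """Flush each buffered line to L2 as a single coalesced write."""
        self.l2_writes += len(self.pending)
        self.pending.clear()

wcc = WriteCoalescingBuffer()
for offset in range(8):          # eight stores, all within one line
    wcc.store(0x1000 + offset, offset)
wcc.store(0x2000, 0xFF)          # one store to a second line
wcc.drain()                      # nine stores become two L2 writes
```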
Micro-operation (μop or uop) cache A
micro-operation cache (
μop cache,
uop cache or
UC) is a specialized cache that stores
micro-operations of decoded instructions, as received directly from the
instruction decoders or from the instruction cache. When an instruction needs to be decoded, the μop cache is checked for its decoded form, which is re-used if cached; if it is not available, the instruction is decoded and then cached. One of the early works describing the μop cache as an alternative frontend for the Intel
P6 processor family is the 2001 paper
"Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA". Later, Intel included μop caches in its
Sandy Bridge processors and in successive microarchitectures like
Ivy Bridge and
Haswell. AMD implemented a μop cache in their
Zen microarchitecture. Fetching complete pre-decoded instructions eliminates the need to repeatedly decode variable length complex instructions into simpler fixed-length micro-operations, and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. A μop cache effectively offloads the fetch and decode hardware, thus decreasing
power consumption and improving the frontend supply of decoded micro-operations. The μop cache also increases performance by more consistently delivering decoded micro-operations to the backend and eliminating various bottlenecks in the CPU's fetch and decode logic.
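The behavior described above amounts to memoizing the decoders: decode an instruction once, cache the resulting micro-operations, and reuse them on later fetches. The sketch below models this with a dictionary; the instruction and micro-op names are made up for illustration.

```python
# Toy model of a μop cache: the decoded form of an instruction is cached
# the first time it is seen, so later fetches bypass the decoders entirely.
decode_count = 0

def decode(instruction: str) -> list[str]:
    """Stand-in for the hardware decoders (names are hypothetical)."""
    global decode_count
    decode_count += 1
    return [f"{instruction}.uop{i}" for i in range(2)]

uop_cache: dict[str, list[str]] = {}

def fetch_uops(instruction: str) -> list[str]:
    if instruction not in uop_cache:       # μop cache miss: decode and fill
        uop_cache[instruction] = decode(instruction)
    return uop_cache[instruction]          # hit: reuse the decoded form

for _ in range(3):                         # a hot loop re-executes the same code
    fetch_uops("add")
    fetch_uops("mul")
```

Each instruction is decoded only once even though it is fetched three times, which is exactly why a μop cache saves frontend power on loops.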
Branch target instruction cache A
branch target cache or
branch target instruction cache, the name used on
ARM microprocessors, is a specialized cache which holds the first few instructions at the destination of a taken branch. This is used by low-powered processors which do not need a normal instruction cache because the memory system is capable of delivering instructions fast enough to satisfy the CPU without one. However, this only applies to consecutive instructions in sequence; it still takes several cycles of latency to restart instruction fetch at a new address, causing a few cycles of pipeline bubble after a control transfer. A branch target cache provides instructions for those few cycles, avoiding a delay after most taken branches. This allows full-speed operation with a much smaller cache than a traditional full-time instruction cache.
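The mechanism can be sketched as follows: on the first taken branch to a target, the pipeline stalls while fetch restarts, and the first few target instructions are captured; on later takes of the same branch, those buffered instructions cover the refetch latency. The latency value and memory layout are assumptions for the example.

```python
# Toy model of a branch target instruction cache (BTIC): for a taken branch,
# the first few target instructions come from the BTIC, hiding the refetch delay.
REFETCH_LATENCY = 3      # assumed cycles to restart fetch at a new address

# A tiny instruction memory: addresses of 4-byte instructions (hypothetical).
memory = {0x100 + 4 * i: f"insn@{0x100 + 4 * i:#x}" for i in range(8)}
btic = {}                # branch address -> first instructions at its target

def take_branch(branch_addr: int, target: int) -> int:
    """Return the pipeline bubble (stall cycles) caused by this taken branch."""
    if branch_addr in btic:
        return 0         # BTIC hit: buffered instructions cover the latency
    # BTIC miss: stall, then capture the first instructions at the target.
    btic[branch_addr] = [memory[target + 4 * i] for i in range(REFETCH_LATENCY)]
    return REFETCH_LATENCY

first = take_branch(0x40, 0x100)   # cold branch: pipeline bubble
second = take_branch(0x40, 0x100)  # warm branch: supplied from the BTIC
```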
Smart cache Smart cache is a
level 2 or
level 3 caching method for multiple execution cores, developed by
Intel. Smart Cache shares the actual cache memory between the cores of a
multi-core processor. In comparison to a dedicated per-core cache, the overall
cache miss rate decreases when cores do not require equal parts of the cache space. Consequently, a single core can use the full level 2 or level 3 cache while the other cores are inactive. Furthermore, the shared cache makes it faster to share memory among different execution cores.
Multi-level caches Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the fastest but smallest cache,
level 1 (
L1), first; if it hits, the processor proceeds at high speed. If that cache misses, the slower but larger next level cache,
level 2 (
L2), is checked, and so on, before accessing external memory. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. Price-sensitive designs used this to pull the entire cache hierarchy on-chip, but by the 2010s some of the highest-performance designs returned to having large off-chip caches, which are often implemented in
eDRAM and mounted on a
multi-chip module, as a fourth cache level. In rare cases, such as in the mainframe CPU
IBM z15 (2019), all levels down to L1 are implemented by eDRAM, replacing
SRAM entirely for cache (SRAM is still used for registers).
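The level-by-level check described above can be sketched directly: each level adds its latency, and the first hit ends the search. The latencies, hit contents, and three-level structure here are illustrative assumptions, not the figures of any real CPU.

```python
# Sketch of a multi-level lookup: levels are tried in order, and the access
# time is the sum of the latencies of every level that had to be checked.
hierarchy = [
    ("L1", 4,  {0x100, 0x140}),                  # (name, cycles, cached lines)
    ("L2", 12, {0x100, 0x140, 0x180}),
    ("L3", 40, {0x100, 0x140, 0x180, 0x1C0}),
]
MEMORY_LATENCY = 200                             # assumed main-memory latency

def access(line_addr: int) -> int:
    """Return total cycles to fetch a line, checking L1, then L2, then L3."""
    cycles = 0
    for name, latency, contents in hierarchy:
        cycles += latency
        if line_addr in contents:
            return cycles                        # hit at this level
    return cycles + MEMORY_LATENCY               # missed every level

l1_hit = access(0x100)   # found immediately in L1
l3_hit = access(0x1C0)   # misses L1 and L2, hits L3
miss = access(0x400)     # misses all levels, goes to memory
```

The widening gap between the L1 hit case and the full miss case is the tradeoff that motivates adding levels: each extra level adds a little latency to deep misses but catches many accesses that would otherwise pay the full memory penalty.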
Apple's ARM-based Apple silicon series, starting with the
A14 and
M1, have a 192 KiB L1i cache for each of the high-performance cores, an unusually large amount; however, the high-efficiency cores only have 128 KiB. Since then, other processors such as
Intel's
Lunar Lake and
Qualcomm's
Oryon have also implemented similar L1i cache sizes. The benefits of L3 and L4 caches depend on the application's access patterns. Examples of products incorporating L3 and L4 caches include the following:
• Alpha 21164 (1995) had 1 to 64 MiB of off-chip L3 cache.
• AMD K6-III (1999) had motherboard-based L3 cache.
• IBM POWER4 (2001) had off-chip L3 caches of 32 MiB per processor, shared among several processors.
• Itanium 2 (2003) had a 6 MiB unified level 3 (L3) cache on-die; the Itanium 2 MX 2 module incorporated two Itanium 2 processors along with a shared 64 MiB L4 cache on a multi-chip module that was pin-compatible with a Madison processor.
• Intel's Xeon MP product codenamed "Tulsa" (2006) featured 16 MiB of on-die L3 cache shared between two processor cores.
• AMD Phenom (2007) had 2 MiB of L3 cache.
• AMD Phenom II (2008) had up to 6 MiB of on-die unified L3 cache.
• Intel Core i7 (2008) had an 8 MiB on-die unified L3 cache that is inclusive, shared by all cores.
• Intel Haswell CPUs with integrated Intel Iris Pro Graphics had 128 MiB of eDRAM acting essentially as an L4 cache.
Finally, at the other end of the memory hierarchy, the CPU
register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory for, as an example,
loop nest optimization. However, with
register renaming, most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus ease pipeline hazards. Register files sometimes also have a hierarchy: The
Cray-1 (circa 1976) had eight address "A" and eight scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)
Multi-core chips When considering a chip with
multiple cores, there is a question of whether the caches should be shared or local to each core. Implementing a shared cache inevitably introduces more wiring and complexity. On the other hand, having one cache per
chip, rather than
core, greatly reduces the amount of space needed, and thus one can include a larger cache. Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the highest-level cache (usually L3, the last one called before accessing memory), having a global cache is desirable for several reasons, such as allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of the cache coherency protocols used. For example, an eight-core chip with three levels may include an L1 cache for each core, one intermediate L2 cache for each pair of cores, and one L3 cache shared between all cores. A shared highest-level cache, which is consulted before accessing memory, is usually referred to as a
last-level cache (LLC). Additional techniques are used to increase the level of parallelism when the LLC is shared between multiple cores, including slicing it into multiple pieces, each covering a certain range of memory addresses and accessible independently.
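Slicing can be sketched as a hash from the line address to a slice number, so that requests to different slices proceed in parallel. The slice count and the simple modulo hash below are assumptions for illustration; real designs typically use stronger hash functions to spread traffic evenly across slices.

```python
# Sketch of a sliced last-level cache: the line address selects one of
# several independent slices, which can be accessed in parallel.
NUM_SLICES = 4
LINE_SIZE = 64

def llc_slice(addr: int) -> int:
    """Map an address to a slice (simple modulo hash, for illustration)."""
    return (addr // LINE_SIZE) % NUM_SLICES

slices = [set() for _ in range(NUM_SLICES)]   # each slice holds its own lines

for addr in (0x0, 0x40, 0x80, 0xC0, 0x100):
    slices[llc_slice(addr)].add(addr // LINE_SIZE * LINE_SIZE)
```

Because every address maps to exactly one slice, no coordination between slices is needed on a lookup; only the selected slice is probed.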
Separate versus unified In a separate cache structure, instructions and data are cached separately, meaning that a cache line is used to cache either instructions or data, but not both; various benefits have been demonstrated with separate data and instruction
translation lookaside buffers. In a unified structure, this constraint is not present, and cache lines can be used to cache both instructions and data.
Exclusive versus inclusive Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called
strictly inclusive. Other processors (like the
AMD Athlon) have
exclusive caches: data are guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel
Pentium II,
III, and
4) do not require that data in the L1 cache also reside in the L2 cache, although it often does. There is no universally accepted name for this intermediate policy; two common names are "non-exclusive" and "partially-inclusive". The advantage of exclusive caches is that they store more data in total. This advantage is larger when the exclusive L1 cache is comparable in size to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than simply copying a line from L2 to L1, which is what an inclusive cache does.
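The exchange on an L1-miss/L2-hit can be sketched with two sets standing in for the caches; the invariant of an exclusive pair is that a line is never in both at once. Capacities and eviction choice here are toy assumptions (no replacement policy is modeled).

```python
# Sketch of an exclusive L1/L2 pair: a line lives in at most one of the two.
# On an L1 miss that hits in L2, the lines are exchanged rather than copied.
l1 = {0xA0}          # line addresses currently in L1 (toy capacity of one)
l2 = {0xB0, 0xC0}    # line addresses currently in L2

def exclusive_access(line: int) -> str:
    if line in l1:
        return "L1 hit"
    if line in l2:
        # Exchange: the L2 line moves up to L1 and an L1 victim moves down
        # to L2, preserving the invariant that each line is in one cache.
        victim = l1.pop()
        l2.remove(line)
        l1.add(line)
        l2.add(victim)
        return "L2 hit (exchanged)"
    l1.add(line)      # miss everywhere: fill into L1 only
    return "miss"

result = exclusive_access(0xB0)
```

An inclusive pair would instead copy the line into L1 and leave the L2 copy in place, which is less work per hit but stores the line twice.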
Scratchpad memory Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress.
Example: the K8 To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD
Athlon 64 CPU. The K8 has four specialized caches: an instruction cache, an instruction
TLB, a data TLB, and a data cache. Each of these caches is specialized:
• The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than eight, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only parity protection rather than ECC, because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions).
• The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either four or eight bytes in memory. Because the K8 has a variable page size, each of the TLBs is split into two sections, one to keep PTEs that map 4 KiB pages, and one to keep PTEs that map 4 MiB or 2 MiB pages. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with different size PTEs.
• The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of entries.
• The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64-byte line is spread among all eight banks. Each tag copy handles one of the two accesses per cycle.
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large
unified L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated. The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex
branch prediction, with tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache. The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an
alpha particle strike) by either
ECC or
parity, depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy.
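The banked data-cache rule described above (two accesses per cycle only when they fall in different banks) can be sketched as follows. The 8-byte bank width and eight banks follow the K8 description; the conflict check itself is a simplified toy model.

```python
# Sketch of banked data-cache access: two loads may proceed in the same
# cycle only if their addresses fall in different banks.
NUM_BANKS = 8
BANK_WIDTH = 8   # bytes per bank access

def bank_of(addr: int) -> int:
    """Select a bank from consecutive 8-byte chunks of the address."""
    return (addr // BANK_WIDTH) % NUM_BANKS

def can_dual_issue(addr_a: int, addr_b: int) -> bool:
    """True if the two accesses use different banks and so avoid a conflict."""
    return bank_of(addr_a) != bank_of(addr_b)

ok = can_dual_issue(0x00, 0x08)        # banks 0 and 1: no conflict
conflict = can_dual_issue(0x00, 0x40)  # 0x40 wraps around to bank 0: conflict
```

When a conflict occurs, one of the two accesses must wait a cycle, which is why spreading simultaneously accessed data across banks matters for this design.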
More hierarchies Other processors have other kinds of predictors (e.g., the store-to-load bypass predictor in the
DEC Alpha 21264). These predictors are caches in that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a
hit in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy. The K8 keeps the instruction and data caches
coherent in hardware, which means that a store that modifies an instruction closely following the store instruction will change the instruction that is subsequently executed. Other processors, like those in the Alpha and MIPS family, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency.
Tag RAM In computer engineering, a
tag RAM is used to specify which of the possible memory locations is currently stored in a CPU cache. For a simple, direct-mapped design, fast SRAM can be used. Caches with higher associativity usually employ content-addressable memory.
==Implementation==