Cache prefetching

Cache prefetching is a technique used by central processing units (CPUs) to boost execution performance by fetching instructions or data from their original storage in slower memory into a faster local memory before they are actually needed. Most modern CPUs have fast, local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, cache memories are typically much faster to access than main memory. Prefetching can be done with non-blocking cache control instructions, and it relies on the principle of data locality.

Data vs. instruction cache prefetching

Cache prefetching can fetch either data or instructions into the cache.

• Data prefetching fetches data before it is needed. Because data access patterns show less regularity than instruction patterns, accurate data prefetching is generally more challenging than instruction prefetching.
• Instruction prefetching fetches instructions before they need to be executed. The first mainstream microprocessors to use some form of instruction prefetch were the Intel 8086 (six bytes) and the Motorola 68000 (four bytes). In recent years, all high-performance processors have used prefetching techniques.

Hardware vs. software cache prefetching

Cache prefetching can be accomplished either by hardware or by software.

• Software-based prefetching is typically accomplished by having the compiler analyze the code and insert additional "prefetch" instructions into the program during compilation.

Methods of hardware prefetching

Stream buffers

• Stream buffers were developed based on the "one block lookahead" (OBL) scheme proposed by Alan Jay Smith, and many variations of this method have been developed since. The basic idea is that the cache miss address (and k subsequent addresses) are fetched into a separate buffer of depth k. This buffer is called a stream buffer and is separate from the cache. The processor then consumes data or instructions from the stream buffer if the address associated with the prefetched blocks matches the requested address generated by the program executing on the processor.

Figure: A typical stream buffer setup as originally proposed by Norman Jouppi in 1990.

For each new miss, a new stream buffer would be allocated, and it would operate in the same way as described above.

• The ideal depth of the stream buffer is subject to experimentation against various benchmarks.

Strided prefetching

This type of prefetching monitors the delta between the addresses of the memory accesses and looks for patterns within it.

Regular strides

In this pattern, consecutive memory accesses are made to blocks that are s addresses apart. In this case, the prefetcher calculates s and uses it to compute the memory address for prefetching. For example, if s is 4 and the current access is to address A, the address to be prefetched would be A+4.

Irregular spatial strides

In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetcher designs exploit this property to predict and prefetch for future accesses.

Irregular temporal prefetching

This class of prefetchers looks for memory access streams that repeat over time. For example, in the stream of memory accesses N, A, B, C, E, G, H, A, B, C, I, J, K, A, B, C, L, M, N, O, A, B, C, ..., the stream A, B, C repeats over time. Other design variations have tried to provide more efficient implementations.
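
The regular-stride case described above can be illustrated with a small software model. The following C sketch is only an illustration under stated assumptions: the type stride_detector and the functions stride_init and stride_observe are hypothetical names, the model tracks a single access stream, and it issues a prediction only after the same non-zero delta has been observed twice in a row. Real hardware prefetchers typically keep many such entries and use more elaborate confidence logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-stream stride detector (illustrative model only). */
typedef struct {
    uint64_t last_addr;   /* most recent address observed */
    int64_t  last_stride; /* delta between the two most recent addresses */
    int      seen;        /* accesses observed so far, saturating at 2 */
} stride_detector;

static void stride_init(stride_detector *d)
{
    assert(d != NULL);
    d->last_addr = 0;
    d->last_stride = 0;
    d->seen = 0;
}

/*
 * Record one memory access. Returns true and stores addr + s in
 * *prefetch_addr when the current delta s is non-zero and equals the
 * previous delta, i.e. a regular stride has been confirmed.
 */
static bool stride_observe(stride_detector *d, uint64_t addr,
                           uint64_t *prefetch_addr)
{
    bool predict = false;

    if (d->seen >= 1) {
        int64_t stride = (int64_t)(addr - d->last_addr);
        if (d->seen >= 2 && stride != 0 && stride == d->last_stride) {
            *prefetch_addr = addr + (uint64_t)stride;
            predict = true;
        }
        d->last_stride = stride;
    }
    d->last_addr = addr;
    if (d->seen < 2) {
        d->seen++;
    }
    return predict;
}
```

With accesses to addresses 100, 104 and 108, the first two calls only train the detector; the third call confirms a stride of 4 and proposes address 112 as the prefetch candidate, matching the A+4 example above.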
Collaborative prefetching

Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further disambiguate the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend on the application and the architectures used to execute it. Recent research has focused on building collaborative mechanisms that combine multiple prefetching schemes to achieve better prefetching coverage and accuracy.

Methods of software prefetching

Compiler-directed prefetching

Compiler-directed prefetching is widely used within loops with a large number of iterations. In this technique, the compiler predicts future cache misses and inserts a prefetch instruction based on the miss penalty and the execution time of the instructions.

These prefetches are non-blocking memory operations; that is, they do not interfere with actual memory accesses. They do not change the state of the processor or cause page faults.

One main advantage of software prefetching is that it reduces the number of compulsory cache misses.

The following example shows how a prefetch instruction can be added to code to improve cache performance. Consider a for-loop with index variable size_t i, starting at i = 0, in which the i-th element of the array array1 is accessed at each iteration. The system can prefetch the elements that will presumably be accessed in future iterations by inserting into the loop body a prefetch instruction for the element k positions ahead of the current one.

Here, the prefetch stride k depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the for-loop. For instance, if one iteration of the loop takes 7 cycles to execute and the cache miss penalty is 49 cycles, then k = 49/7 = 7, which means that the system should prefetch 7 elements ahead. In the first iteration, i is 0, so the system prefetches the element at index 7. With this arrangement, the first 7 accesses (i = 0 → 6) will still be misses (under the simplifying assumption that each element of array1 is in a separate cache line of its own).
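
The loop transformation and the choice of k described above can be sketched in C as follows. This is a minimal sketch under several assumptions that do not come from the text: it uses the GCC/Clang builtin __builtin_prefetch as the prefetch instruction, the loop body (doubling each element) and the function names prefetch_distance and double_all are hypothetical, and the distance is rounded up when the miss penalty is not an exact multiple of the iteration time.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Prefetch distance k: miss penalty divided by the cycles per loop
 * iteration, rounded up (the rounding is an assumption of this sketch).
 */
static size_t prefetch_distance(size_t miss_penalty_cycles,
                                size_t cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);
    return (miss_penalty_cycles + cycles_per_iteration - 1)
           / cycles_per_iteration;
}

/*
 * Hypothetical loop body: double every element of array1, requesting
 * the element k iterations ahead before working on the current one.
 */
static void double_all(int *array1, size_t n, size_t k)
{
    for (size_t i = 0; i < n; ++i) {
        if (k < n - i) {
            /* GCC/Clang builtin; 1 = prefetch for write, 3 = keep in cache. */
            __builtin_prefetch(&array1[i + k], 1, 3);
        }
        array1[i] *= 2;
    }
}
```

For the figures used above, prefetch_distance(49, 7) yields 7, so a call such as double_all(array1, n, 7) requests array1[i + 7] while processing array1[i]. The bounds check simply skips prefetches that would point past the end of the array; the prefetch is only a hint and does not change the result of the loop.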

Comparison of hardware and software prefetching

• While software prefetching requires programmer or compiler intervention, hardware prefetching requires special hardware mechanisms. However, software prefetching can mitigate certain constraints of hardware prefetching, leading to improvements in performance. (Lee, Jaekyu; Kim, Hyesoon; Vuduc, Richard. "When Prefetching Works, When It Doesn't, and Why". ACM Trans. Archit. Code Optim. doi:10.1145/2133382.2133384. https://faculty.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

Metrics of cache prefetching

Cache prefetching may be judged by three main metrics.

Coverage

Coverage is the fraction of total misses that are eliminated because of prefetching, i.e.

$$\text{Coverage} = \frac{\text{cache misses eliminated by prefetching}}{\text{total cache misses}},$$

where

$$\text{total cache misses} = (\text{cache misses eliminated by prefetching}) + (\text{cache misses not eliminated by prefetching}).$$

Accuracy

Accuracy is the fraction of total prefetches that were useful; that is, the ratio of the number of memory addresses prefetched that were actually referenced by the program to the total number of prefetches done.

$$\text{Prefetch accuracy} = \frac{\text{cache misses eliminated by prefetching}}{(\text{useless cache prefetches}) + (\text{cache misses eliminated by prefetching})}$$

While it might appear that perfect accuracy implies that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses observed without any prefetching, the number is non-zero.

Timeliness

The qualitative definition of timeliness is the amount of time elapsed from the prefetch to the actual reference. For example, for prefetching to be useful in a for-loop where each iteration takes three cycles to execute and the prefetch operation takes twelve cycles, the system must start the prefetch 12/3 = 4 iterations prior to its use to maintain timeliness.
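
As a worked illustration of the coverage and accuracy definitions above, the two ratios can be computed from simple event counts. The C sketch below is only an example: the function names prefetch_coverage and prefetch_accuracy and the counter parameters are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch rather than part of the definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Coverage: misses eliminated by prefetching divided by total misses,
 * where total = eliminated + not eliminated.
 */
static double prefetch_coverage(size_t misses_eliminated,
                                size_t misses_not_eliminated)
{
    size_t total = misses_eliminated + misses_not_eliminated;
    if (total == 0) {
        return 0.0; /* convention of this sketch: no misses at all */
    }
    return (double)misses_eliminated / (double)total;
}

/*
 * Accuracy: useful prefetches (misses eliminated) divided by all
 * prefetches issued, useful plus useless.
 */
static double prefetch_accuracy(size_t misses_eliminated,
                                size_t useless_prefetches)
{
    size_t total = misses_eliminated + useless_prefetches;
    if (total == 0) {
        return 0.0; /* convention of this sketch: no prefetches issued */
    }
    return (double)misses_eliminated / (double)total;
}
```

For instance, if prefetching eliminates 75 misses, leaves 25 misses in place and issues 25 useless prefetches, both prefetch_coverage(75, 25) and prefetch_accuracy(75, 25) evaluate to 0.75. The two metrics coincide here only because the counts were chosen that way; in general they vary independently.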

Source: Wikipedia
