In many instruction sets, a loop must be implemented by using instructions to increment or decrement a counter, check whether the end of the loop has been reached, and if not jump to the beginning of the loop so it can be repeated. Although this typically only represents around 3–16 bytes of space for each loop, even that small amount could be significant depending on the size of the
CPU caches. More significant is that those instructions each take time to execute, time which is not spent doing useful work. The overhead of such a loop is apparent compared to a completely
unrolled loop, in which the body of the loop is duplicated exactly as many times as it will execute. In that case, no space or execution time is wasted on instructions to repeat the body of the loop. However, the duplication caused by loop unrolling can significantly increase code size, and the larger size can even impact execution time due to
cache misses. (For this reason, it's common to only partially unroll loops, such as transforming it into a loop which performs the work of four iterations in one step before repeating. This balances the advantages of unrolling with the overhead of repeating the loop.) Moreover, completely unrolling a loop is only possible for a limited number of loops: those whose number of iterations is known at
compile time. For example, the following C code could be compiled and optimized into the following x86 assembly code: ==Implementation==