Most CPUs dating from the late 1980s onward have bit operators for ffs or equivalent, but a few modern ones like some of the ARM-Mx series do not. In lieu of hardware operators for ffs, clz and ctz, software can emulate them with shifts, integer arithmetic and bitwise operators. There are several approaches, depending on the architecture of the CPU and, to a lesser extent, the programming language semantics and compiler code-generation quality. The approaches may be loosely described as
linear search,
binary search, search+table lookup, de Bruijn multiplication, floating point conversion/exponent extract, and bit operator (branchless) methods. There are tradeoffs between execution time and storage space as well as portability and efficiency. Software emulations are usually deterministic. They return a defined result for all input values; in particular, the result for an input of all zero bits is usually 0 for ffs, and the bit length of the operand for the other operations. If one has a hardware clz or equivalent, ctz can be efficiently computed with bit operations, but the converse is not true: clz is not efficient to compute in the absence of a hardware operator.
The function 2^⌈log2(x)⌉ (round up to the nearest power of two) is not efficient to compute using shifts and bitwise ORs, as in this 32-bit example, and is even more inefficient for a 64-bit or 128-bit operand:

function pow2(x):
    if x = 0 return invalid  // invalid is implementation defined (not in [0,63])
    x ← x - 1
    for each y in {1, 2, 4, 8, 16}:
        x ← x | (x >> y)
    return x + 1
===FFS===
Since ffs = ctz + 1 (POSIX) or ffs = ctz (other implementations), the applicable algorithms for ctz may be used, with a possible final step of adding 1 to the result, and returning 0 instead of the operand length for input of all zero bits.
===CTZ===
The canonical algorithm is a loop counting zeros starting at the LSB until a 1-bit is encountered:
function ctz1 (x)
    if x = 0 return w
    t ← 1
    r ← 0
    while (x & t) = 0
        t ← t << 1
        r ← r + 1
    return r

A lookup table can replace most of the iterations by examining n bits at a time; it is initialized as:

table[0..2^n - 1] = ctz(i) for i in 1..2^n - 1
function ctz2 (x)
    if x = 0 return w
    r ← 0
    while (x & (2^n - 1)) = 0
        x ← x >> n
        r ← r + n
    return r + table[x & (2^n - 1)]

The parameter n is fixed (typically 8) and represents a time–space tradeoff. The loop may also be fully unrolled. But as a linear lookup, this approach is still O(n) in the number of bits in the operand. If n = 4 is chosen, the table of 16 2-bit entries can be encoded in a single 32-bit constant using SIMD-within-a-register techniques:
table ← 0x12131210  // binary: 00 01 00 10 00 01 00 11 00 01 00 10 00 01 00 00

function ctz2a (x)
    if x = 0 return w
    r ← 0
    while (x & 15) = 0
        x ← x >> 4
        r ← r + 4
    return r + ((table >> 2*(x & 15)) & 3)

This technique is quite practical in applications such as the binary GCD algorithm, where the x values are uniformly distributed, so the iterative loop is rarely entered (for a uniformly distributed operand the low four bits are nonzero 15/16 of the time) and suffers minimal branch misprediction penalty. A binary search implementation takes a logarithmic number of operations and branches, as in this 32-bit version:
function ctz3 (x)
    if x = 0 return 32
    n ← 0
    if (x & 0x0000FFFF) = 0: n ← n + 16, x ← x >> 16
    if (x & 0x000000FF) = 0: n ← n + 8,  x ← x >> 8
    if (x & 0x0000000F) = 0: n ← n + 4,  x ← x >> 4
    if (x & 0x00000003) = 0: n ← n + 2,  x ← x >> 2
    if (x & 0x00000001) = 0: n ← n + 1
    // Equivalently, n ← n + 1 - (x & 1)
    return n

This algorithm can be assisted by a table as well, replacing the last two or three if statements with a 16- or 256-entry lookup table indexed by the least-significant bits of x. As mentioned above, if the hardware has a clz operator, the most efficient approach to computing ctz is:
function ctz4 (x)
    if x = 0 return w
    x ← x & -x  // isolates the LSB
    return w - 1 - clz(x)

A similar technique can take advantage of a
population count instruction:
function ctz4a (x)
    if x = 0 return w
    x ← x ^ (x - 1)  // makes a mask of the least-significant bits
    return popcount(x) - 1

An algorithm for 32-bit ctz uses a
de Bruijn sequence to construct a
minimal perfect hash function that eliminates all branches. This algorithm assumes that the result of the multiplication is truncated to 32 bits.

for i from 0 to 31: table[ (0x077CB531 << i) >> 27 & 31 ] ← i  // table[0..31] initialized

function ctz5 (x)
    if x = 0 return 32
    return table[((x & -x) * 0x077CB531) >> 27 & 31]

The expression x & -x again isolates the least-significant 1 bit. There are then only 32 possible words, which the unsigned multiplication and shift hash to the correct position in the table. This algorithm is branch-free if it does not need to handle the zero input. The technique can be extended to 64-bit words. A minor variant uses the expression x ^ (x - 1) to compute a mask of
n trailing ones, as in the
popcount method, and a different minimal perfect hash multiplier. This allows the lookup table to be shared with a
count leading zeros implementation. This also permits a
folded implementation using a half-width multiply. After computing the mask, the
bitwise xor of the top and bottom halves is sufficient to uniquely identify the original mask. A suitable multiplier produces a minimal perfect hash which may be decoded by a lookup table:
for i from 0 to 31: table[ ((0xffff0000 >> i) * 0x70a7) >> 11 & 31 ] ← 31 - i  // table[0..31] initialized

function ctz5a (x)
    if x = 0 return 32
    x ← x ^ (x - 1)    // mask of the least-significant bits
    x ← x ^ (x >> 16)  // fold the top half onto the bottom half
    return table[(x * 0x70a7) >> 11 & 31]  // 0x70d3 also works

This is advantageous if the narrower multiply saves more time than the folding step adds. The technique can also be extended to 64-bit words using suitable 32-bit multipliers.
===CLZ===
The canonical algorithm examines one bit at a time starting from the MSB until a non-zero bit is found, as shown in this example. It executes in O(n) time where n is the bit length of the operand, and is not a practical algorithm for general use.
function clz1 (x)
    if x = 0 return w
    t ← 1 << (w - 1)
    r ← 0
    while (x & t) = 0
        t ← t >> 1
        r ← r + 1
    return r

An improvement on the previous looping approach examines eight bits at a time, then uses a 256-entry (2^8) lookup table for the first non-zero byte. This approach, however, is still O(n) in execution time.
function clz2 (x)
    if x = 0 return w
    t ← 0xff << (w - 8)
    r ← 0
    while (x & t) = 0
        t ← t >> 8
        r ← r + 8
    return r + table[x >> (w - 8 - r)]

Binary search can reduce execution time to O(log2 n):
function clz3 (x)
    if x = 0 return 32
    n ← 0
    if (x & 0xFFFF0000) = 0: n ← n + 16, x ← x << 16
    if (x & 0xFF000000) = 0: n ← n + 8,  x ← x << 8
    if (x & 0xF0000000) = 0: n ← n + 4,  x ← x << 4
    if (x & 0xC0000000) = 0: n ← n + 2,  x ← x << 2
    if (x & 0x80000000) = 0: n ← n + 1
    return n

As with CTZ, a lookup table (2^8 = 256 1-byte entries) can replace the bottom three branches of the binary search. 64-bit operands require an additional branch. A larger-width lookup can be used, but the maximum practical table size is limited by the size of the L1 data cache on modern processors; saving a branch is more than offset by the latency of an L1 cache miss. An algorithm similar to de Bruijn multiplication for CTZ works for CLZ, but rather than isolating the most-significant bit, it rounds up to the nearest integer of the form 2^n - 1 using shifts and bitwise ORs:

table[0..31] = {0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31}
function clz4 (x)
    for each y in {1, 2, 4, 8, 16}:
        x ← x | (x >> y)
    return 31 - table[((x * 0x07C4ACDD) >> 27) % 32]

For processors with deep pipelines, like Prescott and later Intel processors, it may be faster to replace branches by bitwise AND and OR operators (even though many more instructions are required) to avoid pipeline flushes for mispredicted branches (and these types of branches are inherently unpredictable):
function clz5 (x)
    r = (x > 0xFFFF) << 4; x >>= r;
    q = (x > 0xFF)   << 3; x >>= q; r |= q;
    q = (x > 0xF)    << 2; x >>= q; r |= q;
    q = (x > 0x3)    << 1; x >>= q; r |= q;
    r |= (x >> 1);   // r is now floor(log2(x))
    return 31 - r;

On platforms that provide hardware conversion of integers to floating point, the exponent field can be extracted and subtracted from a constant to compute the count of leading zeros. Corrections are needed to account for rounding errors. Floating-point conversion can have substantial latency. This method is highly non-portable and not usually recommended.

unsigned int x;
int r;
union { unsigned int u[2]; double d; } t;

t.u[LE] = 0x43300000;         // LE is 1 for little-endian
t.u[!LE] = x;
t.d -= 4503599627370496.0;    // 2^52
r = (t.u[LE] >> 20) - 0x3FF;  // log2
r = 31 - r;                   // CLZ for a 32-bit operand

==Applications==