Although the abstract Cooley–Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an
in-place algorithm that overwrites its input with its output data using only O(1) auxiliary storage. The best-known reordering technique involves explicit
bit reversal for in-place radix-2 algorithms.
Bit reversal is the
permutation where the data at an index
n, written in
binary with digits
b4
b3
b2
b1
b0 (e.g. 5 digits for
N=32 inputs), is transferred to the index with reversed digits
b0
b1
b2
b3
b4 . Consider the last stage of a radix-2 DIT algorithm like the one presented above, where the output is written in-place over the input: when E_k and O_k are combined with a size-2 DFT, those two values are overwritten by the outputs. However, the two output values should go in the first and second
halves of the output array, corresponding to the
most significant bit
b4 (for
N=32); whereas the two inputs E_k and O_k are interleaved in the even and odd elements, corresponding to the
least significant bit
b0. Thus, in order to get the output in the correct place,
b0 should take the place of
b4 and the index becomes
b0
b4
b3
b2
b1. And for next recursive stage, those 4 least significant bits will become
b1
b4
b3
b2, If you include all of the recursive stages of a radix-2 DIT algorithm,
all the bits must be reversed and thus one must pre-process the input (
or post-process the output) with a bit reversal to get in-order output. (If each size-
N/2 subtransform is to operate on contiguous data, the DIT
input is pre-processed by bit-reversal.) Correspondingly, if you perform all of the steps in reverse order, you obtain a radix-2 DIF algorithm with bit reversal in post-processing (or pre-processing, respectively). The
logarithm (log) used in this algorithm is a base 2 logarithm. The following is pseudocode for iterative radix-2 FFT algorithm implemented using bit-reversal permutation.
algorithm iterative-fft
is input: Array
a of
n complex values where n is a power of 2.
output: Array
A the DFT of a. bit-reverse-copy(a, A)
n ←
a.length
for s = 1
to log(
n)
do m ← 2
s ωm ← exp(−2π
i/
m)
for k = 0
to n-1
by m do ω ← 1
for j = 0
to m/
2 – 1
do t ←
ω A[
k +
j +
m/2]
u ←
A[
k +
j]
A[
k +
j] ←
u +
t A[
k +
j +
m/2] ←
u –
t ω ←
ω ωm return A The bit-reverse-copy procedure can be implemented as follows.
algorithm bit-reverse-copy(
a,
A)
is input: Array
a of
n complex values where n is a power of 2.
output: Array
A of size
n.
n ←
a.length
for k = 0
to n – 1
do A[rev(k)] :=
a[k] Alternatively, some applications (such as the computation of convolution products) work equally well on bit-reversed data, so one can perform forward transforms, processing, and then inverse transforms all without bit reversal to produce final results in the natural order. Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time, Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages. The problem is greatly simplified if it is
out-of-place: the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available. The algorithm Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the
Pease algorithm, which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(
N log
N) storage. One can also directly apply the Cooley–Tukey factorization definition with explicit (
depth-first) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step (as in the pseudocode above) and can be argued to have
cache-oblivious locality benefits on systems with
hierarchical memory. A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data. == References ==