Texas Instruments TMS320C64X User Manual
DSP_fft16x16t
C64x+ DSPLIB Reference
The following statements can be made based on the above observations:
1) Inner loop “i0” iterates a variable number of times. In particular, the number
of iterations quadruples from one stage to the next, from 1 up to N/4. Software
pipelining a loop whose trip count varies this widely is not profitable.
2) Outer loop “j” iterates a variable number of times as well. However, the
number of iterations is quartered from one stage to the next, from N/4 down
to 1. Hence, the behavior described in (1) and (2) is exactly opposite.
3) If the two loops “i0” and “j” are coalesced, the resulting single loop
iterates a fixed number of times, namely N/4. Optimized implementations
make use of this fact.
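The trip-count argument in the list above can be checked with a small sketch.
The helper name check_coalesced_trip_count is hypothetical and is not part of
DSPLIB; the sketch assumes N is a power of 4 and only models the loop counts,
not the butterflies themselves.

```c
/* Hypothetical helper, not a DSPLIB routine.
 * Walks the stages of a radix-4 loop nest and compares the product of the
 * outer "j" trip count and the inner "i0" trip count against n/4.
 * Assumes n is a power of 4.
 * Returns the number of stages visited, or -1 if any stage disagrees. */
static int check_coalesced_trip_count(int n)
{
    int stages = 0;
    int j_trips;        /* outer loop: N/4, N/16, ..., 1 */
    int i0_trips = 1;   /* inner loop: 1, 4, ..., N/4    */

    for (j_trips = n / 4; j_trips >= 1; j_trips /= 4, i0_trips *= 4) {
        if (j_trips * i0_trips != n / 4)
            return -1;  /* coalesced trip count would not be constant */
        stages++;
    }
    return stages;
}
```

For N = 256 the stage pairs are (64, 1), (16, 4), (4, 16) and (1, 64), so the
coalesced loop runs 64 = N/4 times in each of the four stages.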
In addition, the Cooley-Tukey FFT accesses three twiddle factors per iteration
of the inner loop, as the butterflies that re-use twiddle factors are lumped
together. This leads to accessing the twiddle factor array at three points, each
separated by “ie”. Note that “ie” is initially 1 and is quadrupled at every
stage. Therefore, these three twiddle factors are not even contiguous in the
array.
To vectorize the FFT, it is desirable to access the twiddle factor array using
double-word wide loads that fetch the twiddle factors needed. To do this, a
modified twiddle factor array is created, in which the factors WN/4, WN/2,
W3N/4 are arranged to be contiguous. This eliminates the separation between
twiddle factors within a butterfly. However, it means that a redundant
version of the twiddle factor array must be maintained as the loop is
traversed from one stage to another. Hence, the twiddle factor array is larger
than in the normal Cooley-Tukey FFT: the modified twiddle factor array is of
size “2 * N”, whereas the conventional Cooley-Tukey FFT uses an array of size
“3N/4”, where N is the number of complex points to be transformed. The routine
that generates the modified twiddle factor array was presented earlier. With
the above transformation of the FFT, both the input data and the twiddle
factor array can be accessed using double-word wide loads to enable packed
data processing.
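The layout idea can be illustrated with a minimal sketch that records which
twiddle exponent each slot of a stage-ordered table would hold. This is not
the DSPLIB generator presented earlier: the function name twiddle_layout, the
slot ordering, and the use of plain exponents in place of 16-bit cos/sin
values are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch, not the DSPLIB twiddle generator.
 * For each stage (stride ie = 1, 4, 16, ...) and each butterfly group j,
 * stores the three twiddle exponents j*ie, 2*j*ie, 3*j*ie in adjacent
 * slots of exp_out[]. The last stage needs no twiddles and is skipped.
 * Returns the number of complex twiddle slots written. */
static size_t twiddle_layout(int *exp_out, int n)
{
    size_t k = 0;
    int ie, j, m;

    for (ie = 1; ie < n / 4; ie *= 4) {
        int groups = n / (4 * ie);      /* distinct twiddle triples in this stage */
        for (j = 0; j < groups; j++) {
            for (m = 1; m <= 3; m++)
                exp_out[k++] = m * j * ie;   /* the triple is stored contiguously */
        }
    }
    return k;
}
```

For N = 64 this writes 60 complex slots (48 for the first stage, 12 for the
second). If each slot were stored as a real/imaginary pair of 16-bit values,
that would be 120 values, which fits in an array of 2 * N = 128 elements.
Exponents repeat across stages, which is the redundancy described above, and
every exponent stays below 3N/4.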
The final stage is optimized to remove the multiplication, as w0 = 1. This stage
also performs digit reversal on the data, so the final output is in natural order.
In addition, if the number of points to be transformed is a power of 2 but not
a power of 4, the final stage applies a DSP_radix2 pass instead of a radix-4
pass. In either case, the outputs are returned in natural order.
The code shown here performs the bulk of the computation in place. However,
because digit-reversal cannot be performed in-place, the final result is written
to a separate array, y[].
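As a rough illustration of the out-of-place reordering, the sketch below
copies digit-reversed data from x[] into a separate y[]. It assumes N is a
power of 4 and that complex samples are stored as interleaved real/imaginary
16-bit values. The function names are hypothetical, and the optimized routine
fuses this step into its final butterfly stage instead of running it as a
separate pass.

```c
/* Hypothetical helpers, not DSPLIB routines. */

/* Reverses the base-4 digits of index i. Assumes n is a power of 4. */
static unsigned digit_reverse4(unsigned i, unsigned n)
{
    unsigned r = 0;
    unsigned span;

    for (span = 1; span < n; span *= 4) {
        r = (r << 2) | (i & 3u);   /* move the lowest base-4 digit of i into r */
        i >>= 2;
    }
    return r;
}

/* Copies n complex points from x[] to y[] in digit-reversed order.
 * x and y must not overlap; each holds 2*n interleaved values. */
static void reorder_out_of_place(const short *x, short *y, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++) {
        unsigned d = digit_reverse4(i, n);
        y[2 * d]     = x[2 * i];       /* real part      */
        y[2 * d + 1] = x[2 * i + 1];   /* imaginary part */
    }
}
```

With N = 16, index 1 maps to 4 and index 6 maps to 9, while indices whose two
base-4 digits are equal (0, 5, 10, 15) stay where they are.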