Motorola DSP96002 User Manual
move d2.s,x:(r4)+ ;save lower 2, point to next
_bfly
move x:(r0)+n0,d0.s y:(r4)+n4,d1.s ;adjust r0,r4
_grp
lsr d6 d6.l,n0 ;bflys/2, make old value new offset
lsl d7 n0,n4 ;ngroups*2, move new offset
lea (r0)+n0,r4 ;new lower leg pointer
_stage
move #3,n0 ;offset between 2 butterflies-1
move n0,n4 ;same
move (r4)+ ;point r4 to second bfly
do #n/4,_laststage ;do last stage, 2 bflys at a time
move x:(r0)+,d0.s ;get upper of bfly 1
move x:(r0)-,d1.s ;get lower of bfly 1, point to upper
faddsub.s d0,d1 x:(r4)+,d2.s ;get upper of bfly 2
move x:(r4)-,d3.s ;get lower of bfly 2, point to upper
faddsub.s d2,d3 d1.s,x:(r0)+ ;save upper 1
move d0.s,x:(r0)+n0 ;save lower 1, point to next group
move d3.s,x:(r4)+ ;save upper 2
move d2.s,x:(r4)+n4 ;save lower 2, point to next group
_laststage
end
B.1.45.2 Out-of-place WHT
Since the WHT requires 2 loads and 2 stores per butterfly, a WHT butterfly takes at least 4 cycles
when all of the data is in a single memory. However, if the data is split between two memories, the
2 loads and 2 stores can be performed in 2 cycles, so each butterfly can execute in 2 cycles. This
implementation takes the input data in a single memory space and, on the first stage of the
transform, splits the data into X and Y memory. The middle stages then perform 4 WHT butterflies in
8 cycles. The last stage is split out and also performs 4 WHT butterflies in 8 cycles. Thus, except
for the first stage, all WHT butterflies are performed in 2 cycles.
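The cycle accounting above can be sketched as a small calculation. This is only a rough model, not
a figure from the listing: the function name is hypothetical, loop and setup overhead is ignored,
and the cost of 4 cycles per butterfly for the first stage is an assumption (the text only says the
first stage is the exception to the 2-cycle rate).

```python
def wht_butterfly_cycles(n, first_stage_cycles=4, other_stage_cycles=2):
    """Estimate butterfly cycles for an n-point WHT (n a power of two).

    first_stage_cycles is an ASSUMPTION: the text states only that the
    first stage does not reach the 2-cycle rate. Loop overhead is ignored.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1          # log2(n) stages
    butterflies_per_stage = n // 2       # each stage has n/2 butterflies
    first = butterflies_per_stage * first_stage_cycles
    rest = (stages - 1) * butterflies_per_stage * other_stage_cycles
    return first + rest


# 16-point transform, data split between X and Y after the first stage
split_memory = wht_butterfly_cycles(16)
# 16-point transform, every stage at the single-memory rate of 4 cycles
single_memory = wht_butterfly_cycles(16, 4, 4)
```

Under these assumptions the 16-point case comes to 80 butterfly cycles with split memory against 128
with a single memory.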
In this example, a 16-point transform is performed. The input data are in x:0-f and the output is
split between X and Y memory. The first 8 output values are at x:0-7 and the next 8 output values
are at y:0-7, in bit-reversed order starting at x:0. To increase execution speed, an extra block of
memory is used at y:0-7. Thus, this algorithm requires an extra block of Y memory equal to one-half
of the transform data size in X memory.
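A host-side reference model can help when checking results against this layout. The following is a
minimal sketch, not the manual's routine: the function names are hypothetical, the transform is
assumed to be the unnormalized WHT in natural (Hadamard) order, and "bit-reversed order" is assumed
to mean that slot i of the concatenated X-then-Y output holds the coefficient whose index is i with
its bits reversed.

```python
def wht(data):
    """Unnormalized fast Walsh-Hadamard transform, natural (Hadamard) order."""
    out = list(data)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    span = 1
    while span < n:                      # one pass per stage
        for start in range(0, n, 2 * span):
            for i in range(start, start + span):
                a, b = out[i], out[i + span]
                out[i], out[i + span] = a + b, a - b   # one butterfly
        span *= 2
    return out


def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def split_bit_reversed(values):
    """Permute into bit-reversed order, then split into (X half, Y half).

    ASSUMPTION: slot i of the X-then-Y output holds values[bit_reverse(i)].
    """
    n = len(values)
    bits = n.bit_length() - 1
    permuted = [values[bit_reverse(i, bits)] for i in range(n)]
    return permuted[: n // 2], permuted[n // 2 :]


# Feeding the coefficient indices 0..15 shows which coefficient lands where.
x_half, y_half = split_bit_reversed(list(range(16)))
```

With this interpretation, x:0-7 would hold coefficients 0, 8, 4, 12, 2, 10, 6, 14 and y:0-7 would
hold coefficients 1, 9, 5, 13, 3, 11, 7, 15.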
If both X and Y memory are on the same port (A or B), then all X and Y memory references are
performed on the same port, and the WHT butterfly executes in 4 cycles. This gives an execution
time of 1.64 milliseconds at 13.5 MIPS. However, if X memory is on port A and Y memory is on port B,
then the memory bandwidth is doubled and an X memory access and a Y memory access can occur in a
single cycle. This gives an execution time of 0.939 milliseconds at 13.5 MIPS.
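The two timings can be converted to instruction counts to compare them directly. This is a small
arithmetic sketch; it assumes that 13.5 MIPS means 13.5 million single-cycle instructions per
second, and the variable names are illustrative only.

```python
INSTRUCTIONS_PER_SECOND = 13.5e6   # 13.5 MIPS, assumed one instruction per cycle


def instruction_cycles(seconds, rate=INSTRUCTIONS_PER_SECOND):
    """Convert an execution time in seconds to instruction cycles."""
    return seconds * rate


same_port = instruction_cycles(1.64e-3)     # X and Y on one port: about 22140
split_ports = instruction_cycles(0.939e-3)  # X on port A, Y on port B: about 12676.5
speedup = 1.64e-3 / 0.939e-3                # ratio of the two execution times
```

The ratio works out to roughly 1.75 rather than 2, which is consistent with the first stage not
running at the 2-cycle rate.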