Texas Instruments TMS320C64X User Manual
DSP_mat_mul
    for (i = 0; i < r1; i++)
        for (j = 0; j < c2; j++)
        {
            sum = 0;
            for (k = 0; k < c1; k++)
                sum += x[k + i*c1] * y[j + k*c2];
            r[j + i*c2] = sum >> qs;
        }
}
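The fragment above is only the tail of the reference routine, so a complete version may be easier to experiment with. The following is a minimal sketch, not the library's code: the name mat_mul_ref, the parameter order, and the use of an int accumulator are assumptions made for illustration. It treats x as an r1-by-c1 matrix, y as a c1-by-c2 matrix, and r as the r1-by-c2 result, all stored row by row, which is what the index expressions in the fragment imply.

```c
/* Hypothetical portable reference for the loop nest shown above.
 * Name, parameter order, and accumulator type are assumptions.
 * x: r1 x c1 input, y: c1 x c2 input, r: r1 x c2 output, all row-major.
 * qs: right shift applied to each accumulated sum before it is stored.
 */
void mat_mul_ref(const short *x, int r1, int c1,
                 const short *y, int c2,
                 short *r, int qs)
{
    int i, j, k;

    for (i = 0; i < r1; i++) {
        for (j = 0; j < c2; j++) {
            int sum = 0;

            /* Dot product of row i of x with column j of y. */
            for (k = 0; k < c1; k++)
                sum += x[k + i*c1] * y[j + k*c2];

            /* Scale the sum down by qs bits and narrow to 16 bits. */
            r[j + i*c2] = (short)(sum >> qs);
        }
    }
}
```

With qs = 0, multiplying the 2-by-2 matrices {1, 2, 3, 4} and {5, 6, 7, 8} in this sketch yields {19, 22, 43, 50}. Note that the sketch does nothing to guard against accumulator overflow for large c1 or large input values.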
Special Requirements
- The matrices x[], y[], and r[] must be stored in distinct arrays. That is,
  in-place processing is not allowed.
- Each input matrix must have at least 1 row and 1 column, and at most
  32767 rows and 32767 columns.
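The requirements above can be checked by the caller before invoking the routine. The sketch below is one possible way to do so and is not part of the library: the names mat_mul_args_ok, overlaps, dim_ok, and MAT_DIM_MAX are hypothetical. It assumes the matrix shapes implied by the reference code (x is r1 by c1, y is c1 by c2, r is r1 by c2) and compares addresses through uintptr_t, which relies on a flat address space.

```c
#include <stddef.h>
#include <stdint.h>

#define MAT_DIM_MAX 32767   /* largest row or column count stated above */

/* Nonzero if the element ranges [a, a+na) and [b, b+nb) share storage.
 * Assumes a flat address space so that uintptr_t values are comparable. */
static int overlaps(const short *a, size_t na, const short *b, size_t nb)
{
    uintptr_t a0 = (uintptr_t)a, a1 = a0 + na * sizeof(short);
    uintptr_t b0 = (uintptr_t)b, b1 = b0 + nb * sizeof(short);
    return a0 < b1 && b0 < a1;
}

/* Nonzero if n is a legal row or column count. */
static int dim_ok(int n)
{
    return n >= 1 && n <= MAT_DIM_MAX;
}

/* Hypothetical pre-call check: returns 1 if the arguments satisfy the
 * special requirements, 0 otherwise. */
int mat_mul_args_ok(const short *x, int r1, int c1,
                    const short *y, int c2, const short *r)
{
    size_t nx, ny, nr;

    if (!dim_ok(r1) || !dim_ok(c1) || !dim_ok(c2))
        return 0;

    nx = (size_t)r1 * (size_t)c1;   /* elements in x */
    ny = (size_t)c1 * (size_t)c2;   /* elements in y */
    nr = (size_t)r1 * (size_t)c2;   /* elements in r */

    return !overlaps(x, nx, r, nr) &&
           !overlaps(y, ny, r, nr) &&
           !overlaps(x, nx, y, ny);
}
```

A check like this is best suited to debug builds, since it adds overhead to every call that the optimized routine itself does not incur.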
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: This code blocks interrupts during its innermost loop.
  Interrupts are not blocked otherwise. As a result, interrupts can be blocked
  for up to 0.25*c1’ + 16 cycles at a time, where c1’ is c1 rounded up to the
  next even number (see Benchmarks).
- The ‘i’ and ‘k’ loops are unrolled 2x. The ‘j’ loop is unrolled 4x. For
  dimensions that are not multiples of the corresponding unroll factors, this
  code calculates extra results beyond the edges of the matrix. These extra
  results are ultimately discarded. This allows the loops to be unrolled for
  efficient operation on large matrices without losing flexibility.
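Both the interrupt-latency bound and the unrolling behaviour in the notes above depend on the dimensions rounded up to the unroll factors. The sketch below shows that arithmetic. The function names are hypothetical, and it assumes the padded sizes are the r1’, c1’, and c2’ values defined under Benchmarks (r1 and c1 rounded up to even, c2 rounded up to a multiple of 4).

```c
/* Round n up to the next multiple of m (n >= 0, m > 0). */
static int round_up(int n, int m)
{
    return ((n + m - 1) / m) * m;
}

/* Hypothetical helper: dimensions after rounding up to the unroll factors
 * (i and k loops by 2, j loop by 4). */
void mat_mul_padded_dims(int r1, int c1, int c2,
                         int *r1p, int *c1p, int *c2p)
{
    *r1p = round_up(r1, 2);   /* i loop, unrolled 2x */
    *c1p = round_up(c1, 2);   /* k loop, unrolled 2x */
    *c2p = round_up(c2, 4);   /* j loop, unrolled 4x */
}

/* Hypothetical helper: upper bound, in cycles, on how long interrupts
 * stay blocked, using the 0.25*c1' + 16 figure from the notes. */
double mat_mul_max_blocked_cycles(int c1)
{
    return 0.25 * round_up(c1, 2) + 16.0;
}
```

For example, a 3-by-5 matrix times a 5-by-6 matrix is treated as 4 by 6 times 6 by 8 under these assumptions, and with c1 = 20 the bound on blocked time is 21 cycles.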
Benchmarks
Cycles
0.25 * ( r1’ * c2’ * c1’ ) + 2.25 * ( r1’ * c2’ ) + 11, where:
r1’ = 2 * ceil(r1/2.0) (r1 rounded up to next even)
c1’ = 2 * ceil(c1/2.0) (c1 rounded up to next even)
c2’ = 4 * ceil(c2/4.0) (c2 rounded up to next mult of 4)
For r1= 1, c1= 1, c2= 1: 33 cycles
For r1= 8, c1=20, c2= 8: 475 cycles
Codesize
416 bytes
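The cycle formula under Benchmarks can be evaluated directly when budgeting processing time. The following sketch does so in integer arithmetic; the function name mat_mul_cycles is hypothetical, and the result is only the manual's estimate, not a measurement. Because r1’ * c2’ is always a multiple of 8, both divisions by 4 are exact.

```c
/* Round n up to the next multiple of m (n >= 0, m > 0). */
static int round_up(int n, int m)
{
    return ((n + m - 1) / m) * m;
}

/* Hypothetical helper: evaluates
 *   0.25 * (r1' * c2' * c1') + 2.25 * (r1' * c2') + 11
 * with r1', c1' rounded up to even and c2' rounded up to a multiple of 4. */
long long mat_mul_cycles(int r1, int c1, int c2)
{
    long long r1p = round_up(r1, 2);
    long long c1p = round_up(c1, 2);
    long long c2p = round_up(c2, 4);
    long long rc  = r1p * c2p;        /* multiple of 8, so /4 is exact */

    return rc * c1p / 4 + 9 * rc / 4 + 11;
}
```

This reproduces the two figures quoted above: 33 cycles for r1 = 1, c1 = 1, c2 = 1, and 475 cycles for r1 = 8, c1 = 20, c2 = 8.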