Intel ARCHITECTURE IA-32 User Manual

Optimizing Cache Usage
In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsic pairs are
used per loop iteration so that all of the prefetched data (a 128-byte
cache line) is written back; each pair moves 16 bytes. The prefetches
and the streaming stores are executed in separate loops to minimize the
number of transitions between reading and writing data. This
significantly improves the bandwidth of the memory accesses.
    // copy 128 byte per loop
    for (j=kk; j ...) {
      _mm_stream_ps((float*)&b[j],    _mm_load_ps((float*)&a[j]));
      _mm_stream_ps((float*)&b[j+2],  _mm_load_ps((float*)&a[j+2]));
      _mm_stream_ps((float*)&b[j+4],  _mm_load_ps((float*)&a[j+4]));
      _mm_stream_ps((float*)&b[j+6],  _mm_load_ps((float*)&a[j+6]));
      _mm_stream_ps((float*)&b[j+8],  _mm_load_ps((float*)&a[j+8]));
      _mm_stream_ps((float*)&b[j+10], _mm_load_ps((float*)&a[j+10]));
      _mm_stream_ps((float*)&b[j+12], _mm_load_ps((float*)&a[j+12]));
      _mm_stream_ps((float*)&b[j+14], _mm_load_ps((float*)&a[j+14]));
    } // finished copying one block
  } // finished copying N elements
  _mm_sfence();

Example 6-11 A Memory Copy Routine Using Software Prefetch
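
The two-loop structure described above can be tried in isolation with the following minimal sketch. It is not the manual's listing. The function name stream_copy, the constants BLOCK_DOUBLES and LINE_DOUBLES, and the 4-KByte block size are assumptions made for illustration, and the eight load/store pairs are written as a short inner loop rather than unrolled by hand. The sketch assumes an x86 compiler that provides the SSE header xmmintrin.h, 16-byte-aligned non-overlapping buffers, and an element count that is a multiple of the block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical block size: 512 doubles = 4096 bytes. */
#define BLOCK_DOUBLES 512
/* 16 doubles = 128 bytes, the cache-line size the text refers to. */
#define LINE_DOUBLES 16

/* Copies n doubles from src to dst, one block at a time.
 * Assumptions (checked with assert): n is a multiple of BLOCK_DOUBLES and
 * both pointers are 16-byte aligned. The buffers must not overlap. */
void stream_copy(double *dst, const double *src, size_t n)
{
    assert(n % BLOCK_DOUBLES == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);

    for (size_t kk = 0; kk < n; kk += BLOCK_DOUBLES) {
        /* Loop 1 (reads only): prefetch the block, one 128-byte line per
         * iteration, using the non-temporal hint. */
        for (size_t j = kk; j < kk + BLOCK_DOUBLES; j += LINE_DOUBLES)
            _mm_prefetch((const char *)&src[j], _MM_HINT_NTA);

        /* Loop 2 (load then streaming store): copy 128 bytes per outer
         * iteration as eight 16-byte load/store pairs. */
        for (size_t j = kk; j < kk + BLOCK_DOUBLES; j += LINE_DOUBLES)
            for (size_t i = 0; i < LINE_DOUBLES; i += 2)
                _mm_stream_ps((float *)&dst[j + i],
                              _mm_load_ps((const float *)&src[j + i]));
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

Keeping the prefetches in their own loop means each block is first read in a burst and then written in a burst, which is the separation of reads and writes that the paragraph above credits for the bandwidth gain. Whether a 4-KByte block and a 128-byte line are the right values depends on the target processor.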