Intel ARCHITECTURE IA-32 User Manual


Optimizing Cache Usage


In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte cache line) is written back. The prefetch and streaming-stores are executed in separate loops to minimize the number of transitions between reading and writing data. This significantly improves the bandwidth of the memory accesses.

// copy 128 bytes per loop

for (j=kk; j

  _mm_stream_ps((float*)&b[j],    _mm_load_ps((float*)&a[j]));
  _mm_stream_ps((float*)&b[j+2],  _mm_load_ps((float*)&a[j+2]));
  _mm_stream_ps((float*)&b[j+4],  _mm_load_ps((float*)&a[j+4]));
  _mm_stream_ps((float*)&b[j+6],  _mm_load_ps((float*)&a[j+6]));
  _mm_stream_ps((float*)&b[j+8],  _mm_load_ps((float*)&a[j+8]));
  _mm_stream_ps((float*)&b[j+10], _mm_load_ps((float*)&a[j+10]));
  _mm_stream_ps((float*)&b[j+12], _mm_load_ps((float*)&a[j+12]));
  _mm_stream_ps((float*)&b[j+14], _mm_load_ps((float*)&a[j+14]));
} // finished copying one block
} // finished copying N elements

_mm_sfence();

Example 6-11 A Memory Copy Routine Using Software Prefetch
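
The listing above shows only the copy loop. The two-loop structure described in the text (prefetch a block first, then copy it with streaming stores) can be sketched as a self-contained function. This is an illustration under stated assumptions, not the manual's exact routine: the function name stream_copy, the block size NUMPERPAGE (512 doubles, i.e. 4 KBytes), the constant CHUNK_ELEMS, and the _MM_HINT_NTA prefetch hint are all choices made for this sketch. It also assumes that both arrays are 16-byte aligned and that the element count is a multiple of the block size. The eight load/store pairs are written as a short inner loop rather than unrolled by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_prefetch, _mm_sfence */

#define NUMPERPAGE  512  /* assumption: doubles per block (4 KBytes) */
#define CHUNK_ELEMS 16   /* doubles per 128-byte chunk */

/* Copy n doubles from a to b using prefetch plus streaming stores.
 * Assumptions of this sketch: a and b are 16-byte aligned and
 * n is a multiple of NUMPERPAGE. */
static void stream_copy(double *b, const double *a, size_t n)
{
    assert(((uintptr_t)a % 16) == 0 && ((uintptr_t)b % 16) == 0);
    assert(n % NUMPERPAGE == 0);

    for (size_t kk = 0; kk < n; kk += NUMPERPAGE) {
        /* Loop 1: read phase. Prefetch the whole block,
         * one 128-byte chunk per iteration. */
        for (size_t j = kk; j < kk + NUMPERPAGE; j += CHUNK_ELEMS)
            _mm_prefetch((const char *)&a[j], _MM_HINT_NTA);

        /* Loop 2: write phase. Copy 128 bytes per iteration as
         * eight 16-byte load / streaming-store pairs. */
        for (size_t j = kk; j < kk + NUMPERPAGE; j += CHUNK_ELEMS) {
            for (size_t i = 0; i < CHUNK_ELEMS; i += 2)
                _mm_stream_ps((float *)&b[j + i],
                              _mm_load_ps((const float *)&a[j + i]));
        }
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

A caller would allocate both buffers with 16-byte (or stricter) alignment, for example with _mm_malloc, and pass an element count that is a multiple of NUMPERPAGE. Whether the separate prefetch loop pays off depends on the processor and the data size, so the block size is worth measuring rather than taking from this sketch.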