Figure 6-7 – Intel ARCHITECTURE IA-32 User Manual

IA-32 Intel® Architecture Optimization
Figure 6-7 shows how prefetch instructions and strip-mining can be
applied to increase performance in both of these scenarios.
For Pentium 4 processors, the left scenario illustrates using
prefetchnta to prefetch data into selected ways of the second-level
cache only (SM1 denotes strip-mining one way of the second-level
cache), minimizing second-level cache pollution. Use prefetchnta if the
data is touched only once during the entire execution pass, in order to
minimize cache pollution in the higher-level caches. Provided the
prefetch was issued far enough ahead, the data is immediately available
when the read access is issued.
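The left scenario can be sketched in C as follows. This is a minimal
illustration under stated assumptions, not code from the manual: the
function name strip_mined_stats, the strip size, the prefetch distance,
and the 64-byte line size are all hypothetical values chosen for the
example and would need tuning for a real processor. Each strip is read
once with prefetchnta issued ahead of the loads (pass 1), then reused
immediately while it is still cached (pass 2).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
/* _mm_prefetch with _MM_HINT_NTA emits the prefetchnta instruction. */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
/* Fallback for non-x86 builds (GCC/Clang): locality 0 = non-temporal. */
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#endif

/* All three values below are assumptions for illustration only. */
#define LINE_FLOATS    ((size_t)16)   /* 64-byte line / 4-byte float  */
#define PREFETCH_AHEAD ((size_t)64)   /* prefetch distance, in floats */
#define STRIP_FLOATS   ((size_t)4096) /* strip size: 16 KB of floats  */

/* Two temporally adjacent passes over each strip of data:
 * pass 1 sums the strip while prefetching ahead with prefetchnta,
 * pass 2 reuses the same strip to accumulate the sum of squares. */
void strip_mined_stats(const float *data, size_t n,
                       double *sum, double *sum_sq)
{
    double s = 0.0, sq = 0.0;

    assert(sum != NULL && sum_sq != NULL);

    for (size_t start = 0; start < n; start += STRIP_FLOATS) {
        size_t end = (n - start < STRIP_FLOATS) ? n : start + STRIP_FLOATS;

        /* Pass 1: first touch; one prefetch per assumed cache line,
         * never reaching past the end of the current strip. */
        for (size_t i = start; i < end; i++) {
            if (i % LINE_FLOATS == 0 && i + PREFETCH_AHEAD < end)
                PREFETCH_NTA(&data[i + PREFETCH_AHEAD]);
            s += data[i];
        }

        /* Pass 2: reuse the strip before moving on to the next one. */
        for (size_t i = start; i < end; i++)
            sq += (double)data[i] * data[i];
    }

    *sum = s;
    *sum_sq = sq;
}
```

The prefetch is only a hint and does not change the results; whether it
helps depends on the strip size fitting the targeted cache ways and on
the prefetch distance covering the memory latency.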
Figure 6-7  Examples of Prefetch and Strip-mining for Temporally
Adjacent and Non-Adjacent Passes Loops
[Figure: two panels.
Temporally adjacent passes: Prefetchnta Dataset A, Reuse Dataset A,
Prefetchnta Dataset B, Reuse Dataset B; strips labeled SM1.
Temporally non-adjacent passes: Prefetcht0 Dataset A, Prefetcht0
Dataset B, Reuse Dataset A, Reuse Dataset B; strip labeled SM2.]