AMD ATHLON 64 User Manual
Page 24

Analysis and Recommendations
Chapter 3
40555
Rev. 3.00
June 2006
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ 
ccNUMA Multiprocessor Systems
afterwards no longer needs the data structure and if only one of the worker threads needs the data 
structure. In other words, the data structure is not truly shared between the worker threads.
It is best in this case to use a data initialization scheme that avoids incorrect data placement due to 
first touch. This is done by allowing each worker thread to first touch its own data or by explicitly 
pinning the data associated with each worker thread on the node where the worker thread runs.
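The first of these schemes, in which each worker thread touches its own data first, can be sketched as follows. This is a minimal Python illustration and not code from this guide: the names worker, run_workers, and PAGE_SIZE are hypothetical, and the sketch assumes that each worker thread is already running on its intended node (thread pinning is not shown). A language runtime may also reuse pages that were touched earlier, so the sketch shows the initialization pattern and does not guarantee placement.

```python
import threading

PAGE_SIZE = 4096  # assumed normal page size in bytes (hypothetical constant)


def worker(index, size, results):
    """Allocate and initialize this worker's private data inside the worker."""
    buf = bytearray(size)
    # Write one byte per page so that the pages are touched by this thread
    # rather than by the startup thread.
    for offset in range(0, size, PAGE_SIZE):
        buf[offset] = 1
    results[index] = buf


def run_workers(num_workers, size):
    """Startup thread: create the workers but never initialize their data."""
    results = [None] * num_workers
    threads = [
        threading.Thread(target=worker, args=(i, size, results))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The design point is that the startup thread only creates the threads; allocation and the first write both happen in the thread that will use the data.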
Certain OSs provide memory placement tools and APIs that also permit data migration. A worker 
thread can use these to migrate the data from the node where the startup thread did the first touch to 
the node where the worker thread needs it. Migration has a cost, however, so it is less efficient than 
using the correct data initialization scheme in the first place.
If it is not possible to modify the application to use a correct data initialization scheme or if data is 
truly being shared by the various worker threads—as in a database application—then a technique 
called node interleaving can be used to improve performance. Node interleaving allows for memory 
to be interleaved across any subset of nodes in the multiprocessor system. When the node interleaving 
policy is used, it overrides the default local allocation policy used by the OS on first touch.
Let us assume that the data structure shared between the worker threads in this case is of size 16 KB. 
If the default policy of local allocation is used, then the entire 16-KB data structure resides on the node 
where the startup thread does first touch. However, using the policy of node interleaving, the 16-KB 
data structure can be interleaved on first touch such that the first 4 KB ends up on node 0, the next 
4 KB ends up on node 1, the next 4 KB ends up on node 2, and so on. This assumes that there is 
enough physical memory available on each node. Thus, instead of having all memory resident on a 
single node and making that the bottleneck, memory is now spread out across all nodes.
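The 16-KB example can be worked through with a small placement model. This is a simplified sketch, not the allocator of any particular OS: it assumes four nodes numbered 0 through 3, 4-KB pages, a page-aligned data structure, and round-robin interleaving that starts at node 0. The function names are hypothetical.

```python
PAGE_SIZE = 4 * 1024  # 4-KB normal pages


def num_pages(size_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // page_size)


def interleaved_placement(size_bytes, nodes, page_size=PAGE_SIZE):
    """Node assigned to each page under round-robin node interleaving."""
    return [nodes[page % len(nodes)]
            for page in range(num_pages(size_bytes, page_size))]


def local_placement(size_bytes, first_touch_node, page_size=PAGE_SIZE):
    """Node assigned to each page when one thread first touches everything."""
    return [first_touch_node] * num_pages(size_bytes, page_size)


print(local_placement(16 * 1024, 0))                    # [0, 0, 0, 0]
print(interleaved_placement(16 * 1024, [0, 1, 2, 3]))   # [0, 1, 2, 3]
```

Under local allocation all four pages land on the node of the startup thread; under interleaving each node holds one page.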
The tools and APIs that support explicit thread and memory placement mentioned in the previous 
sections can also be used by an application to use the node interleaving policy for its memory. For 
additional details refer to Section A.8 on page 46.
By default, the granularity of interleaving offered by the tools/APIs is usually set to the size of the 
virtual page supported by the hardware, which is 4 KB (when the system is configured for normal pages, 
which is the default) or 2 MB (when the system is configured for large pages). Therefore, any benefit 
from node interleaving is obtained only if the data being accessed is significantly larger than the 
virtual page size.
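The effect of page granularity can be checked with simple arithmetic. The sketch below is an illustrative model under stated assumptions: the data is page-aligned, interleaving is round-robin at page granularity, and the system has four nodes. The function name nodes_spanned is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def nodes_spanned(size_bytes, page_size, num_nodes):
    """Number of distinct nodes holding at least one page of the data."""
    pages = -(-size_bytes // page_size)  # ceiling division
    return min(pages, num_nodes)


# 16 KB of data on a four-node system:
print(nodes_spanned(16 * KB, 4 * KB, 4))  # 4: one 4-KB page per node
print(nodes_spanned(16 * KB, 2 * MB, 4))  # 1: fits inside one large page
# 8 MB of data with large pages is again spread over all four nodes:
print(nodes_spanned(8 * MB, 2 * MB, 4))   # 4
```

Data that fits within a single page stays on one node regardless of the interleaving policy, which is why the benefit appears only for data much larger than the page size.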
If data is being accessed by three or more cores, then it is better to interleave data across the nodes 
that access the data than to leave it resident on a single node. We anticipate that using this rule of 
thumb could give a significant performance improvement. However, developers are advised to 
experiment with their applications to measure any performance change.
A good example of the use of node interleaving is observed with SPECjbb2005 using Sun JVM 
1.5.0_04-GA. Using node interleaving improved the peak throughput score reported by SPECjbb2005 
by 8%. We observe that, as this benchmark starts with a single thread and then ramps up to eight 
threads, all threads end up accessing memory resident on a single node by virtue of first touch. 
