AMD ATHLON 64 User Manual

Chapter 3
Analysis and Recommendations
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™
ccNUMA Multiprocessor Systems
40555
Rev. 3.00
June 2006
A ccNUMA-aware OS keeps data local to the node where first touch occurs, as long as enough 
physical memory is available on that node. If the node does not have enough physical memory, 
the OS falls back on its own placement techniques, which differ from one OS to another, to 
decide where to place the data.
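As a minimal sketch of what "first touch" means in practice, the Python fragment below reserves an anonymous mapping and then writes one byte into each page. It assumes a Linux-like OS that backs anonymous mappings lazily, so that the write, not the allocation, is the point at which a physical page is chosen. The helper names `allocate_untouched` and `first_touch` are hypothetical and do not come from this guide.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size reported by the OS


def allocate_untouched(num_pages):
    """Reserve an anonymous mapping of num_pages pages.

    Assumption: the OS assigns physical pages lazily, so nothing has been
    placed on any node yet when this returns.
    """
    return mmap.mmap(-1, num_pages * PAGE_SIZE)


def first_touch(buf, start_page, end_page):
    """Write one byte into each page in [start_page, end_page).

    Each write is the first touch of that page. A ccNUMA-aware OS would
    normally back the page with memory local to the node on which the
    calling thread is running at that moment.
    Returns the number of pages touched.
    """
    for page in range(start_page, end_page):
        buf[page * PAGE_SIZE] = 0
    return end_page - start_page
```

The thread that calls `first_touch`, rather than the thread that calls `allocate_untouched`, is the one that determines where the pages are expected to land.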
This migration has the effect of moving the thread farther from its data. Some schedulers try to bring 
the thread back to a core on a node where the data is in local memory, but this is never guaranteed. 
Furthermore, the thread could first touch more data on the node to which it was moved before it is 
moved back. This is a difficult problem for the OS to resolve, since it has no prior information as to 
how long the thread will run and, hence, whether migrating it back is desirable or not.
If an application shows that the scheduler is moving threads away from their associated memory, 
it is typically useful to set thread placement explicitly. By explicitly pinning a thread to 
a node, the application can tell the OS to keep the thread on that node and thus keep data accessed by 
the thread local to it by virtue of first touch.
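The following sketch shows one way such pinning might look on Linux. It is an illustration under stated assumptions, not an API described in this guide. It assumes the node-to-CPU mapping is exposed in sysfs at `/sys/devices/system/node/nodeN/cpulist`, and it uses Python's `os.sched_setaffinity`, which is only available on some platforms. The helper names `parse_cpulist`, `cpus_of_node` and `pin_calling_thread_to_node` are hypothetical.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node that has memory but no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node):
    """Return the CPUs of a node as listed in sysfs (assumed Linux layout).

    Returns an empty set if the file is missing, for example on a
    non-NUMA or non-Linux system.
    """
    path = f"/sys/devices/system/node/node{node}/cpulist"
    try:
        with open(path) as f:
            return parse_cpulist(f.read())
    except OSError:
        return set()


def pin_calling_thread_to_node(node):
    """Restrict the caller to the CPUs of `node`.

    On Linux, pid 0 refers to the calling thread. Returns the CPU set that
    was applied, or an empty set if nothing was changed.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    target = cpus_of_node(node) & os.sched_getaffinity(0)
    if target:
        os.sched_setaffinity(0, target)
    return target
```

A worker would call `pin_calling_thread_to_node(n)` before it allocates and touches its data, so that first touch happens on the node where the thread is meant to stay.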
The performance improvement obtained by explicit thread placement varies with factors such as whether 
the application is multithreaded, whether it needs more memory than is available on a node, and whether 
threads are being moved away from their data.
In some cases, where threads are scheduled from the outset on a core that is remote from their data, it 
might be useful to explicitly control the data placement. This is discussed in detail in Section 
3.2.2.
The previously discussed tools and APIs for explicitly controlling thread placement can also be used 
for explicitly controlling data placement. For additional details on thread and memory placement 
tools and APIs in various operating systems, refer to Section A.7 on page 44.
3.2.2 Data Placement Techniques to Alleviate Unnecessary Data Sharing Between Nodes Due to First Touch
When data is shared between threads running on different nodes, the default policy of local allocation 
by first touch used by the OS can become non-optimal.
For example, a multithreaded application may have a startup thread that sets up the environment, 
allocates and initializes a data structure, and forks off worker threads. Under the default local 
allocation policy, the data structure is placed in the physical memory of the node where the startup 
thread did the first touch. The scheduler spreads the forked worker threads across all nodes and 
their cores to balance the load. Each worker thread running on another node then accesses the data 
structure remotely, from the memory of the node where the first touch occurred. This can lead to significant memory 
and HyperTransport traffic in the system, and it makes the node where the data resides the bottleneck. 
This situation is especially bad for performance if the startup thread only does the initialization and 
