AMD ATHLON 64 User Manual

Chapter 3

Analysis and Recommendations

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

40555 Rev. 3.00 June 2006

A ccNUMA-aware OS keeps data local to the node where first touch occurs, as long as there is
enough physical memory available on that node. If that node does not have enough physical memory
available, the OS uses other techniques to determine where to place the data; these techniques
vary from one OS to another.

Once data is placed on a node by first touch, it normally resides on that node for its lifetime.
However, the OS scheduler can migrate the thread that first touched the data from one core to
another, even to a core on a different node, for the purpose of load balancing [3].

This migration has the effect of moving the thread farther from its data. Some schedulers try to bring
the thread back to a core on a node where the data is in local memory, but this is never guaranteed.
Furthermore, the thread could first touch more data on the node to which it was moved before it is
moved back. This is a difficult problem for the OS to resolve, since it has no prior information as to
how long the thread will run and, hence, whether migrating it back is desirable or not.
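
The migration described above can be observed from inside the application. The following is a
minimal sketch, assuming Linux with glibc, where sched_getcpu() reports the logical CPU the calling
thread is running on. The cpu_trace structure and its two functions are hypothetical names
introduced only for illustration. The sketch counts changes of logical CPU; mapping a CPU number to
its node is left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes sched_getcpu() in glibc */
#endif
#include <sched.h>

/* Hypothetical bookkeeping structure; not part of any OS API. */
struct cpu_trace {
    int last_cpu;    /* CPU seen at the previous sample, or -1 if unknown */
    int migrations;  /* number of samples at which the CPU had changed */
};

void cpu_trace_init(struct cpu_trace *t)
{
    t->last_cpu = sched_getcpu();  /* returns -1 on failure */
    t->migrations = 0;
}

/* Call periodically from the thread under observation. */
void cpu_trace_sample(struct cpu_trace *t)
{
    int cpu = sched_getcpu();

    if (cpu < 0)
        return;                    /* CPU could not be determined; skip */
    if (t->last_cpu >= 0 && cpu != t->last_cpu)
        t->migrations++;
    t->last_cpu = cpu;
}
```

A thread would call cpu_trace_init() once and cpu_trace_sample() at convenient points in its work
loop. A nonzero migration count only shows that the thread changed cores between samples; it does
not by itself show that the thread left its node.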

If an application shows evidence that the scheduler is moving threads away from their associated
memory, it is typically useful to set thread placement explicitly. By explicitly pinning a thread to
a node, the application can tell the OS to keep the thread on that node and, thus, keep data accessed
by the thread local to it by virtue of first touch.
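
As an illustration of explicit pinning, the following is a minimal sketch that assumes Linux with
glibc and uses sched_setaffinity() and sched_getaffinity(). The function names
pin_current_thread_to_cpu and first_allowed_cpu are hypothetical. The sketch pins to a single
logical CPU; pinning to a node would instead set every CPU that belongs to that node, and which
CPUs those are depends on the system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t macros and affinity calls in glibc */
#endif
#include <sched.h>

/* Pin the calling thread to one logical CPU (Linux-specific).
 * Returns 0 on success and -1 on failure. */
int pin_current_thread_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

A thread that calls pin_current_thread_to_cpu() before it first touches its data stays on that core,
so on a first-touch OS the data it then touches is expected to be placed on that core's node.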

The performance improvement obtained by explicit thread placement may vary depending on whether
the application is multithreaded, whether it needs more memory than available on a node, whether
threads are being moved away from their data, etc.

In some cases, where threads are scheduled from the outset on a core that is remote from their data,
it might be useful to explicitly control data placement. This is discussed in detail in Section
3.2.2.

The previously discussed tools and APIs for explicitly controlling thread placement can also be used
for explicitly controlling data placement. For additional details on thread and memory placement
tools and APIs in various operating systems, refer to Section A.7 on page 44.
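
One way thread placement can be used to steer data placement is sketched below, assuming Linux
with glibc and an OS that allocates by first touch. The worker pins itself first, then allocates and
touches its own buffer, so the touch happens on the intended core. The struct worker type and the
functions first_touch and worker_main are hypothetical names. Note that malloc() may hand back
memory that was touched earlier by another thread, so this is only a sketch of the idea, not a
guarantee of placement.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t macros and sched_setaffinity() */
#endif
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-worker descriptor; the field names are illustrative. */
struct worker {
    int cpu;       /* logical CPU this worker should run on */
    size_t bytes;  /* size of the worker's private buffer */
    char *buf;     /* allocated and first-touched by the worker itself */
};

/* Write one byte per page so that each page is touched by the caller. */
static void first_touch(char *buf, size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        page = 4096;  /* assumed fallback page size */
    for (size_t off = 0; off < bytes; off += (size_t)page)
        buf[off] = 0;
}

/* Thread entry point with the signature that pthread_create() expects.
 * Returns arg on success and NULL on failure. */
void *worker_main(void *arg)
{
    struct worker *w = arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(w->cpu, &set);
    /* Pin before allocating so the thread cannot migrate between
     * allocation and first touch. A pid of 0 means the calling thread. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return NULL;

    w->buf = malloc(w->bytes);
    if (w->buf == NULL)
        return NULL;

    first_touch(w->buf, w->bytes);
    return w;
}
```

Passing worker_main to pthread_create() once per worker, with a different cpu value in each
descriptor, lets every worker place its own buffer rather than inheriting the placement chosen by
whichever thread touched the data first.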

3.2.2 Data Placement Techniques to Alleviate Unnecessary Data Sharing Between Nodes Due to First Touch

When data is shared between threads running on different nodes, the default policy of local allocation
by first touch used by the OS can become non-optimal.

For example, a multithreaded application may have a startup thread that sets up the environment,
allocates and initializes a data structure, and forks off worker threads. Under the default local
allocation policy, the data structure is placed in the physical memory of the node where the startup
thread did the first touch. The scheduler spreads the forked worker threads across all nodes and
their cores to balance the load. Each worker thread on another node then accesses the data structure
remotely, from the memory of the node where the first touch occurred. This can lead to significant
memory and HyperTransport traffic in the system, and it makes the node where the data resides the
bottleneck.
This situation is especially bad for performance if the startup thread only does the initialization and