NUMA ("non-uniform memory access") architectures have been brought into the mainstream, first by AMD with its Opteron processors, and now by Intel with its Nehalem (Core i7) processors. To reduce latency and increase performance, these processors integrate memory controllers into the CPU. Instead of having the CPU read and write memory via the so-called "north bridge" of the chipset, the CPU directly reads and writes the attached memory.

This article offers an excellent summary of the software issues introduced by NUMA architectures and of how to deal with them.

On multi-CPU systems, interconnects such as HyperTransport or QPI (QuickPath Interconnect) enable CPUs to access memory that is physically attached to other CPUs. Once a "nonlocal" memory access has been performed, the CPU that performed it may operate on the data freely, and cache coherency protocols ensure that the memory is written back to the "owning" CPU when evicted from the nonlocal CPU's cache. But applications must take care to avoid "false sharing," where threads on different CPUs repeatedly modify distinct data that happen to reside in the same cache line, forcing the coherency protocol to bounce that line between CPUs and degrading performance.

Due to this false sharing concern, and because nonlocal memory accesses generally are much more expensive than accesses to the memory directly attached to a NUMA-capable CPU, applications running on NUMA platforms must take special care with memory placement and thread affinity. APIs such as libnuma enable applications to control how memory is placed on specific NUMA nodes and how threads are bound to CPUs.

Archaea Software, LLC has expertise in optimizing applications to take advantage of (or guard against performance pitfalls on) NUMA architectures. If you would like us to do an evaluation, fill out our questionnaire to get started.