NUMA, memory binding and how swap can dilute your locality

On the hwloc-devel mailing list, Jeff Squyres from Cisco posted a message saying:

Someone just made a fairly disturbing statement to me in an Open MPI bug ticket: if you bind some memory to a particular NUMA node, and that memory later gets paged out, then it loses its memory binding information — meaning that it can effectively get paged back in at any physical location. Possibly even on a different NUMA node.
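
For background, this is the kind of binding hwloc provides. A minimal sketch using the hwloc 2.x API (the 64 MiB size is arbitrary, and error handling is trimmed):

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Pick the first NUMA node on the machine. */
    hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, 0);
    if (!node) {
        fprintf(stderr, "no NUMA node found\n");
        return 1;
    }

    /* Allocate 64 MiB bound to that node's memory. */
    size_t len = 64 * 1024 * 1024;
    void *buf = hwloc_alloc_membind(topology, len, node->nodeset,
                                    HWLOC_MEMBIND_BIND,
                                    HWLOC_MEMBIND_BYNODESET);
    if (!buf) {
        perror("hwloc_alloc_membind");
        return 1;
    }

    /* ... use buf; the binding constrains where pages are allocated,
       but, per the statement above, not where they come back from swap ... */

    hwloc_free(topology, buf, len);
    hwloc_topology_destroy(topology);
    return 0;
}
```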

This sparked off an interesting thread on the issue, with the best explanation coming from David Singleton of ANU/NCI:

Unless it has changed very recently, Linux swapin_readahead is the main culprit in messing with NUMA locality on that platform. Faulting a single page causes 8 or 16 or whatever contiguous pages to be read from swap. An arbitrary contiguous range of pages in swap may not even come from the same process, far less the same NUMA node. My understanding is that since there is no NUMA info with the swap entry, the only policy that can be applied is that of the faulting vma in the faulting process. The faulted page will have the desired NUMA placement but possibly not the rest. So swapping mixes different processes’ NUMA policies, leading to a “NUMA diffusion process”.
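
One way to observe this diffusion is to ask the kernel where your pages actually ended up. On Linux, move_pages(2) with a NULL nodes argument queries placement instead of migrating anything; the helper below (report_page_nodes is my name for it, not anything from the thread) prints the node each page of a buffer currently resides on:

```c
#include <numaif.h>   /* move_pages(2); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the NUMA node each page of buf currently resides on. */
static void report_page_nodes(void *buf, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page_size - 1) / page_size;
    void **pages = malloc(npages * sizeof(void *));
    int *status = malloc(npages * sizeof(int));

    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * page_size;

    /* With nodes == NULL, move_pages() moves nothing and instead fills
       status[i] with the node of page i, or a negative errno value
       (e.g. -ENOENT if the page is currently swapped out). */
    if (move_pages(0, npages, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < npages; i++)
            printf("page %zu: node %d\n", i, status[i]);

    free(pages);
    free(status);
}
```

Run something like this on a bound buffer before and after heavy memory pressure and you can watch pages drift off the node you bound them to.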

So when your page gets swapped back in, it drags in a heap of pages that may have nothing to do with it, and those pages may end up misplaced. Worse still, your page could be one of the bunch dragged back in when a process on a different NUMA node faults something in from swap, in which case it lands on the wrong node itself. Either way, the locality of your pages gets diluted. Not fun!
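
If this bites you, two blunt countermeasures come to mind: mlock(2) the buffer so its pages are never swapped out in the first place, or re-assert the binding afterwards with mbind(2) and MPOL_MF_MOVE, which asks the kernel to migrate misplaced pages back. A sketch of the latter, assuming buf is page-aligned (e.g. from mmap) and with rebind_to_node being just an illustrative name:

```c
#include <stddef.h>
#include <numaif.h>    /* mbind(2), MPOL_*; link with -lnuma */

/* Re-assert a bind-to-node policy over an existing, page-aligned
   buffer.  MPOL_MF_MOVE migrates this process's pages that currently
   sit on the wrong node back to where the policy says they belong.
   (Assumes node < 64 so a single mask word suffices.) */
static int rebind_to_node(void *buf, size_t len, int node)
{
    unsigned long nodemask = 1UL << node;
    return mbind(buf, len, MPOL_BIND, &nodemask,
                 sizeof(nodemask) * 8, MPOL_MF_MOVE);
}
```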