One of the things that we HPC folks tend to get hot under the collar about is hardware locality: making sure that your memory accesses are as fast as possible by optimising where on the system your memory comes from, and making sure your process doesn’t get moved further away from it. Just binding your processes to the cores they are running on can make for a significant speed-up, so it’s well worth doing. If you’ve just got a single socket, or a pre-Nehalem Intel x86 system, then your path to RAM is pretty much identical wherever you are, so the only benefit is from not moving away from your CPU cache lines. On AMD Opteron, Nehalem, Itanic, Alpha, etc. you really should care a lot more about locality for best performance.
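To make the binding idea concrete, here is a minimal sketch in Python, assuming a Linux box (os.sched_getaffinity and os.sched_setaffinity only exist on some Unix platforms, hence the guard). The helper name pin_to_one_cpu is made up for illustration, and a real job launcher would choose the CPU far more carefully than "the lowest one allowed".

```python
import os


def pin_to_one_cpu(pid=0):
    """Pin process `pid` (0 means the caller) to the lowest-numbered CPU it may use.

    Returns the chosen CPU, or None where the platform lacks the affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are not exposed on this platform
    allowed = os.sched_getaffinity(pid)   # set of CPU ids we are allowed to run on
    target = min(allowed)
    os.sched_setaffinity(pid, {target})   # from now on the scheduler keeps us there
    return target


if __name__ == "__main__":
    print("pinned to CPU", pin_to_one_cpu())
```

Once pinned, the process can no longer be migrated away from the caches (and, on a NUMA box, the memory) it has been using.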
The open source Torque queuing system (which I help out with) does some of this already: if you compile it with --enable-cpuset and have the /dev/cpuset virtual filesystem mounted, then before it starts a job on a node it will create a cpuset for that job (based on which cores have been allocated on the node) and then put the HPC processes into that cpuset. If you’re using Open-MPI 1.4.x and have the environment variable OMPI_MCA_orte_process_binding set to core then each of the MPI ranks will bind itself to one of the cores within that cpuset.
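As a rough illustration of those two steps, the sketch below expands a cpuset's CPU list and hands each rank on the node one core from it. This is not Torque's or Open-MPI's actual code: parse_cpulist and core_for_rank are hypothetical helpers, the input is assumed to be the plain comma-and-range list the kernel writes into a cpuset's cpus file (e.g. 0-3,8), and the round-robin policy is just an assumption for the example.

```python
def parse_cpulist(text):
    """Expand a list such as "0-3,8,10-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


def core_for_rank(local_rank, cpus):
    """Assumed policy: node-local rank N takes the Nth allowed core, wrapping around."""
    return cpus[local_rank % len(cpus)]


if __name__ == "__main__":
    allowed = parse_cpulist("4-7")
    for rank in range(4):
        print("rank", rank, "-> core", core_for_rank(rank, allowed))
```

The point is that the rank only ever picks from the cores the batch system put in the cpuset, so two jobs sharing a node can't end up fighting over the same core.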
All good? Well, not quite: Torque is reliant on /dev/cpuset being there and on being able to parse its contents, and Open-MPI 1.4.x uses the Portable Linux Process Affinity (PLPA) library which, as its name suggests, is only for Linux. So the good Open-MPI people looked at their PLPA library, decided it needed extending, and teamed up with the INRIA libtopology team, who were working on how you discover the topology of various architectures. Together they decided to merge the two projects under the banner of the Portable Hardware Locality (hwloc) library.
The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, …) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information. It primarily aims at helping applications with gathering information about modern computing hardware so as to exploit it accordingly and efficiently.
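To give a feel for what that abstraction saves you from, here is a sketch of doing a small part of the job by hand on Linux only: grouping logical CPUs by socket and core from the sysfs topology files. This is not hwloc's API, and I'm making no claim about how hwloc does it internally. The function name cpus_by_package_and_core is invented for the example, and it assumes the kernel exposes physical_package_id and core_id under each cpuN/topology directory.

```python
import glob
import os
from collections import defaultdict


def read_int(path):
    """Read a single integer from a sysfs-style text file."""
    with open(path) as f:
        return int(f.read().strip())


def cpus_by_package_and_core(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return {package id: {core id: [logical CPU numbers]}} from Linux sysfs."""
    tree = defaultdict(lambda: defaultdict(list))
    # cpu[0-9]* matches cpu0, cpu17, ... but not cpufreq or cpuidle
    pattern = os.path.join(sysfs_cpu_root, "cpu[0-9]*", "topology")
    for topo in glob.glob(pattern):
        cpu = int(os.path.basename(os.path.dirname(topo))[3:])  # "cpu12" -> 12
        package = read_int(os.path.join(topo, "physical_package_id"))
        core = read_int(os.path.join(topo, "core_id"))
        tree[package][core].append(cpu)
    return {p: {c: sorted(v) for c, v in cores.items()} for p, cores in tree.items()}


if __name__ == "__main__":
    for package, cores in sorted(cpus_by_package_and_core().items()):
        print("socket", package, cores)
```

Two logical CPUs listed under the same core are hardware threads of that core. Note that this says nothing about caches or NUMA nodes, and none of it works on anything other than Linux, which is rather the point of having a library do it for you.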
The portable bit of the name comes from the fact that it works on Linux, Solaris, AIX, Darwin, FreeBSD, Tru64, HP-UX and Windows (though with limitations on some platforms, e.g. Windows, which don’t expose all the info it needs) and can be extended for other OSes if people feel they need to scratch that itch (OpenVMS anyone?). This release is also embeddable into projects (such as Open-MPI 1.5) and I have an interest in Torque picking it up to improve and extend its cpuset support.