NUMA, memory binding and how swap can dilute your locality

On the hwloc-devel mailing list Jeff Squyres from Cisco posted a message saying:

Someone just made a fairly disturbing statement to me in an Open MPI bug ticket: if you bind some memory to a particular NUMA node, and that memory later gets paged out, then it loses its memory binding information — meaning that it can effectively get paged back in at any physical location. Possibly even on a different NUMA node.

Now this sparked off an interesting thread on the issue, with the best explanation for it being provided by David Singleton from ANU/NCI:

Unless it has changed very recently, Linux swapin_readahead is the main culprit in messing with NUMA locality on that platform. Faulting a single page causes 8 or 16 or whatever contiguous pages to be read from swap. An arbitrary contiguous range of pages in swap may not even come from the same process far less the same NUMA node. My understanding is that since there is no NUMA info with the swap entry, the only policy that can be applied to is that of the faulting vma in the faulting process. The faulted page will have the desired NUMA placement but possibly not the rest. So swapping mixes different process’ NUMA policies leading to a “NUMA diffusion process”.

So when your page gets swapped back in it will drag in a heap of pages that may have nothing to do with it, and hence those pages may be misplaced. Or worse still, your page could be one of the bunch dragged back in when a process on a different NUMA node swaps something back in. This will dilute the locality of the pages. Not fun!
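To get a feel for how quickly this dilutes locality, here is a toy Python model of the behaviour David describes. It is only a sketch under loud assumptions, not the kernel's algorithm: swap is a flat list of slots, each remembering the NUMA node its page was originally bound to (real swap entries carry no such information, which is the whole problem), the readahead window is a fixed, aligned 8 slots, and every page read is placed on the node wanted by the faulting page. The names `swap_in`, `misplaced_fraction` and `READAHEAD` are made up for illustration.

```python
import random

READAHEAD = 8  # assumed window size; the quote says "8 or 16 or whatever"


def swap_in(swap, slot):
    """Fault in one slot; return (wanted_node, placed_node) for each page read.

    swap[i] is the NUMA node the page in slot i was originally bound to.
    The whole aligned window around `slot` is read and every page in it is
    placed on the faulting page's node, since in this toy model the faulting
    process's policy is the only one available.
    """
    start = (slot // READAHEAD) * READAHEAD
    window = swap[start:start + READAHEAD]
    faulting_node = swap[slot]
    return [(wanted, faulting_node) for wanted in window]


def misplaced_fraction(num_nodes=2, num_slots=4096, faults=200, seed=0):
    """Fraction of swapped-in pages that land on a node other than their own."""
    rng = random.Random(seed)
    # Assumption: pages from different nodes are mixed randomly in swap.
    swap = [rng.randrange(num_nodes) for _ in range(num_slots)]
    placed = []
    for _ in range(faults):
        placed.extend(swap_in(swap, rng.randrange(num_slots)))
    wrong = sum(1 for wanted, got in placed if wanted != got)
    return wrong / len(placed)


if __name__ == "__main__":
    print(f"misplaced after swap-in: {misplaced_fraction():.0%}")
```

The faulting page itself always ends up where it should; it is the other seven pages in each window that take pot luck, and the more NUMA nodes are mixed together in swap the worse the odds get.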

Iran’s New GPU Powered Supercomputer(s) ? (Updated x 2)

ComputerWorld has an article about Iran claiming to have two new supercomputers, fairly modest by Top500 standards, but laments the lack of details:

But Iran’s latest supercomputer announcement appears to have no details about the components used to build the systems. Iranian officials have not yet responded to request for details about them.

However, looking at the Iranian photo spread that they link to (which appears to be slashdot’ed now), the boxes in question are SuperMicro based systems (and so could be sourced from just about anywhere). There are some of their 2U storage boxes with heaps of disk, plus both 2U and 1U boxes which are presumably the compute nodes. The odd thing is that they’re spaced out quite a bit in the rack, and the 1U systems have two fans on the left hand side (which indicates something unusual about the layout of the box). Here’s an image from that Iranian news story:

Iranian Supermicro GPU node

The nice thing is that it’s pretty easy to find a slew of boxes on SuperMicro’s website that match the picture: it’s their 1U GPU node range, which are dual GPU beasts, for example:

SuperMicro SuperServer 6016GT-TF-TM1

The problem is that this range goes back a few years, appearing for example in an nVidia presentation on “the worlds fastest 1U server” from 2009. HPC-Wire describe these original nodes as:

Inside the SS6016T-GF Supermicro box, the two M1060 GPUs modules are on opposite sides of the server chassis in a mirror image configuration, where one is facing up, the other facing down, allowing the heat to be distributed more evenly. The NVIDIA M1060 part uses a passive heat sink, and is cooled in conjunction with the rest of the server, which contains a total of eight counter-rotating fans. Supermicro also builds a variant of this model, in which it uses a Tesla C1060 card in place of the M1060. The C1060 has the same technical specs as the M1060, the principle difference being that the C1060 has an active fan heat sink of its own. In both instances though, the servers require plenty of juice. Supermicro uses a 1,400 watt power supply to drive these CPU-GPU hybrids.

According to the HPC-Wire article the whole system (2 x CPUs, 2 x GPUs) is rated at about 2TF for single-precision FP. nVidia rate the M1060 card at 933 GF (SP) and 78 GF (DP) so I’d reckon for DP FP you’re looking at maybe 180 GF per node. But now that range includes ones with the newer M2070 Fermi GPUs which can do 1030 GF (SP) and 515 GF (DP) and would get you up to just over 2TF SP and (more importantly for Top500 rankings) over 1TF DP per node.

Now if we assume the claimed 89TF for the larger system is correct, that it is indeed double precision (to be valid for the Top500), they measured it with HPL and assume an efficiency of about 0.5 (which seems about what a top ranked GPU cluster achieves with Infiniband) we can play some number games. Numbers below invalidated by Jeff Johnson’s observation, see the “updated” section for more!

If we assume these are older M1060 GPUs then you are looking at something in the order of 1000 compute nodes to reach that Rmax in Linpack, drawing something of the order of 1MW. From the photos though I didn’t get the sense that it was that large; the way they spaced them out you’d need maybe 200 racks for the whole thing and that would have made an impressive photo (plus an awful lot of switches). Now if they’ve managed to get their hands on the newer M2070 based nodes then you could be looking at maybe 200 nodes, a more reasonable 280kW and maybe 40 racks. But I still didn’t get the sense that the datacentre was that crowded…

So I’d guess that instead of actually running Linpack on the system they’ve just totted up the Rpeaks, in which case you would get away with about 90 nodes, so maybe 15 or 20 racks, which feels to me more like the scale depicted in the images. That would still give them a system that would hit about 40TF in Linpack and give it a respectable ranking around 250 on the Top500 IF they used Infiniband to connect it up. If it was gigabit ethernet then you’d be looking at maybe another 50% hit to its Rmax, and that would drop it out of the Top500 altogether as you’d need at least 31TF to qualify in last November’s list.
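For anyone who wants to replay the number games above, here is the back-of-envelope arithmetic as a small Python sketch. The per-node figures are the rough double-precision estimates from the text (about 0.18TF for an M1060 node, about 1TF for an M2070 node), the 1.4kW per node comes from the power supply rating in the HPC-Wire quote, and 5 nodes per rack is my guess at the spacing implied by "1000 nodes, 200 racks". The function name `size_system` is made up for illustration.

```python
import math

CLAIMED_TF = 89.0      # claimed figure for the larger system
HPL_EFFICIENCY = 0.5   # assumed Rmax/Rpeak for a GPU cluster on Infiniband
NODE_KW = 1.4          # from the 1,400 watt power supply, assumed fully used
NODES_PER_RACK = 5     # assumption: sparse spacing as seen in the photos

# Rough double-precision TF per node, as estimated in the text.
NODE_DP_TF = {"M1060": 0.18, "M2070": 1.0}


def size_system(node_tf, target_tf=CLAIMED_TF, efficiency=HPL_EFFICIENCY):
    """Return (nodes, kW, racks) needed to deliver target_tf at this efficiency."""
    nodes = math.ceil(target_tf / (node_tf * efficiency))
    return nodes, nodes * NODE_KW, math.ceil(nodes / NODES_PER_RACK)


if __name__ == "__main__":
    for gpu, tf in NODE_DP_TF.items():
        nodes, kw, racks = size_system(tf)
        print(f"{gpu}: ~{nodes} nodes, ~{kw:.0f}kW, ~{racks} racks for 89TF Rmax")
    # If the 89TF is really a summed Rpeak, efficiency drops out entirely.
    nodes, kw, racks = size_system(NODE_DP_TF["M2070"], efficiency=1.0)
    print(f"M2070, Rpeak only: ~{nodes} nodes, ~{racks} racks")
```

The results land close to the rounded figures quoted above; the differences are just me rounding to friendly numbers in the prose.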

It’ll be certainly interesting to see what the system is if/when any more info emerges!

Update In the comments Jeff Johnson has pointed out the 2U boxes are actually 4-in-2U dual socket nodes, i.e. will likely have either 32 or 48 cores depending on whether they contain quad core or six core chips. You can see that best from this rear shot of a rack:

Rear view of a 4-in-2U rack

There are mostly 8 of those units in a rack (though the rack at one end of the block has just 7, but there are 2 extra in a central rack below what may be an IB switch), so that’s 256 cores a rack if they’re quad core. There are 8 racks in a block and two blocks to a row so we’ve got 4,096 cores in that one row – or 6,144 if they’re 6 core chips!
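The core count is just a chain of multiplications, sketched here in Python. It assumes every rack holds exactly 8 of the 4-in-2U units (the photos show one rack with 7 and another with 2 extra, which roughly cancel out); the constant and function names are mine.

```python
NODES_PER_CHASSIS = 4   # 4-in-2U boxes
SOCKETS_PER_NODE = 2    # dual socket nodes
CHASSIS_PER_RACK = 8    # "mostly 8 of those units in a rack"
RACKS_PER_BLOCK = 8
BLOCKS_PER_ROW = 2


def cores_in_row(cores_per_chip):
    """Return (cores per 2U chassis, cores per rack, cores per row)."""
    per_chassis = NODES_PER_CHASSIS * SOCKETS_PER_NODE * cores_per_chip
    per_rack = per_chassis * CHASSIS_PER_RACK
    per_row = per_rack * RACKS_PER_BLOCK * BLOCKS_PER_ROW
    return per_chassis, per_rack, per_row


if __name__ == "__main__":
    for cores_per_chip in (4, 6):
        print(cores_per_chip, "core chips:", cores_in_row(cores_per_chip))
```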

The row with the GPU nodes is harder to make out as we cannot see the whole row from any combination of the photos, but in the front view we can see 2 populated racks of a block of 5, with 8 to a rack. The first rear view shows that the block next to it also has at least 2 racks populated with 8 GPU nodes. The second rear view is handy because whilst it shows the same racks as the first it demonstrates that these two blocks coincide with a single block of traditional nodes, raising the possibility of another pair of blocks to pair with the other half of the row of traditional nodes.

Front view of GPU node racks
Rear view of GPU nodes
Different rear view of GPU node racks

Assuming M2070 GPUs then you’re looking at 8TF a rack, or 32TF for the row (assuming no other racks populated outside of our view). If the visible nodes are duplicated on the other side then you’re looking at 64TF.

If we assume that the Intel nodes have quad core Nehalem-EP 2.93GHz chips then that would compare nicely to a Dell system at Saudi Aramco which is rated at an Rpeak of 48TF. Adding the 32TF for the visible GPU nodes gets us up to 80TF, which is close to their reported number, but still short (and still for Rpeak, not Rmax). So it’s likely that there are more GPU nodes, either in the 4 racks we cannot see into in the visible block of GPU racks or in another part of that row – or both! That would make a real life Top500 run of 89TF feasible, with Infiniband.
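As a sanity check on those Rpeak numbers, here is the sum as a short Python sketch. It assumes 4,096 quad core Nehalem-EP cores at 2.93GHz doing 4 double-precision flops per cycle (the usual SSE figure for that generation), and roughly 1TF DP per M2070 GPU node in 4 visible racks of 8; the function names are mine.

```python
DP_FLOPS_PER_CYCLE = 4  # Nehalem: 2 adds + 2 multiplies per cycle via SSE


def cpu_rpeak_tf(cores, ghz, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Theoretical peak in TF: cores x clock (GHz) x flops per cycle."""
    return cores * ghz * flops_per_cycle / 1000.0


def gpu_rpeak_tf(racks, nodes_per_rack=8, tf_per_node=1.0):
    """Peak of the GPU racks, assuming ~1TF DP per dual M2070 node."""
    return racks * nodes_per_rack * tf_per_node


if __name__ == "__main__":
    cpu = cpu_rpeak_tf(4096, 2.93)
    gpu = gpu_rpeak_tf(4)
    print(f"CPU {cpu:.0f}TF + GPU {gpu:.0f}TF = {cpu + gpu:.0f}TF Rpeak")
```

That the CPU side comes out at 48TF is presumably why the Saudi Aramco system is such a neat match.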

Be great to have a floor plan and parts list! 🙂

Update 2

Going through the Iranian news reports they say that the two computers are located at the Amirkabir University of Technology (AUT) in Tehran, and Esfahan University of Technology (in Esfahan). The one in Tehran is the smaller of the two, but the figures they give appear, umm, unreliable. It seems like the usual journalists-not-understanding-tech issue compounded by translation issues – so for instance you get things like:

The supercomputer of Amirkabir University of Technology with graphic power of 22000 billion operations per second is able to process 34000 billion operations per second with the speed of 40GB. […] The supercomputer manufactured in Isfahan University of Technology has calculation ability of 34000 billion operations per second and its Graphics Processing Units (GPU) are able to do more than 32 billion operations per 100 seconds.


Amirkabir University’s supercomputer has a power of 34,000 billion operations per second, and a speed of 40 gigahertz. […] The other supercomputer project carried out by Esfahan University of Technology is among the world’s top 500 supercomputers.

So is the AUT (Tehran) system 22TF or 34TF? Could it be Rpeak 34TF and Rmax 22TF ? Is the Esfahan one 34TF (which would just creep onto the Top500) or higher ?

Unfortunately it’s the Tehran system in the photos, not the Esfahan one (the giveaway is the HPCRC on the racks). So my estimate of 80TF Rpeak for it could well give a measured 32TF if it’s using ethernet as the interconnect (or if the CPUs are a slower clock in the 2U nodes). Or perhaps the GPU nodes are part of something else ? That and slower clocked CPUs could bring the Rpeak down to 34TF.

Need more data!

VLSCI: Systems Administrator – High Performance Computing, Storage & Infrastructure

* Please note: enquiry and application information via URL below, no agencies please!
* Must be Australian permanent resident or citizen.

Executive summary

Want to work with hundreds of TB of storage, HPC clusters and a Blue Gene supercomputer and have an aptitude for handling storage and data ?


VLSCI currently has in production as stage 1:

  • 2048 node, 8192 core IBM Blue Gene/P supercomputer
  • 80 node, 640 core IBM iDataplex cluster (Intel Nehalem CPUs)
  • ~300TB usable of DDN based IBM GPFS storage plus tape libraries
  • 136 node, 1088 core SGI Altix XE cluster (Intel Nehalem CPUs)
  • ~110TB usable of Panasas storage

There is a refresh to a much larger HPC installation planned for 2012.

Both Intel clusters are CentOS/RHEL 5, the front end and service nodes for the Blue Gene are SuSE SLES 10. The GPFS servers are RHEL5. Panasas runs FreeBSD under the covers.

Job advert

Position no.: 0022139
Employment type: Full-time Fixed Term
Campus: Parkville

Close date: February 3rd, 2011

Victorian Life Sciences Computation Initiative, Melbourne Research

Salary: HEW 7: $69,608 – $75,350 p.a. or HEW 8: $78,313 – $84,765 p.a. plus 17% superannuation.

The Victorian Life Sciences Computation Initiative (VLSCI) is a Victorian Government project hosted at The University of Melbourne which aims to establish a world class Life Sciences Compute Facility for Victorian researchers. The Facility operates a number of supercomputers and storage systems dedicated to High Performance Computing. The VLSCI wishes to recruit a Linux Systems Administrator with knowledge of file systems and an interest in working with technologies such as GPFS, TSM, HSM, NFS.

This position is an opportunity to become involved in leading science and computing fields and work as part of a small but self-contained team. Expect to find yourself learning new skills and developing new and innovative solutions to problems that have not yet been identified. You have every opportunity to make a real difference and will need to contribute to a high level of service and creativity.

More Details

Selection criteria and more details are in the Position Description (PDF) here:

Apologies for the URL shortener, the original URL is a horribly long one.. 🙁

Happy New Year

I hope everyone has a happy, fun and safe new year!

2010 was a good year for me, I changed jobs back in January leaving the great people of VPAC for a new challenge, working with a new group of great people to bring up a new HPC centre for life sciences from scratch based at the University of Melbourne (but open to all life scientists across the state). Since then we’ve brought up a 1088 core SGI Altix XE cluster (Intel Nehalem based), a 640 core IBM iDataplex cluster (again Nehalem) and an IBM Blue Gene/P with 8192 cores. This is just stage 1, the big systems are due to arrive in 2012! It’s been great fun and great to just be able to concentrate on running the HPC systems and not have to worry about anything else.

On a personal level life has also been great, my wonderful wife and I celebrated 10 years of marriage this December and look forward to many more! We’ve just had our longest amount of time off together since 2008 thanks to the university shutting down between Xmas and New Year and have spent it quietly pottering around the house and local area. Very relaxing!

IBM Research Collaboratory for Life Sciences – Summer Internships in Computational Biology 2010/11

As part of the Collaboratory between the University of Melbourne and IBM there is the opportunity for three PhD students (or masters students intending to convert) to work for IBM over the summer on computational biology.

Successful candidates will work with researchers on state-of-the-art interdisciplinary projects in areas such as structural biology, precision medicine, neuroscience and imaging, with an emphasis on high performance computing. Assignments could include implementing existing models or algorithms, porting existing codes, running scientific simulation software, developing mathematical and computational models and performing scientific research.

See the article on the VLSCI website for further details and how to apply (I know nothing more than this!). 🙂

Beware of Corsair CMFSSD-32D1 SSD Drives

Meant to blog this a while back, but work has been keeping me busy. A friend of mine in the US, Joe Landman, runs a business making serious HPC storage gear and has found a rather disturbing problem with Corsair CMFSSD-32D1 SSD drives. Here is how he describes it after Corsair went silent on him about this issue (ellipses are his):

We are experiencing about a 70% failure rate, within 3 months of acquisition. In many different chassis, in many different parts of the world, with many different power supplies, many different motherboards. This is a time correlated failure. I have never … ever … in 25+ years doing this stuff … ever … seen anything like this. Its either a really … really bad silicon error in a controller chip or a firmware bug … or some other crappy part.

It came right out of the blue and the failure mode is pretty scary:

Imagine for a moment, you have these in a RAID 1 configuration. And because of the the failure, the unit refuses to get past the POST section. So there you are, with a remote machine, say, I dunno, 6000 miles away from you, and an SSD, with a putative 100+ year MTBF fails, and fails in a way that stops POST. So the system on reboot, freezes at the drive detection phase.

Remember that with a 2 drive RAID1 mirror and a 70% failure rate (plus Murphy) you’re looking at a real risk of a double disk failure, which Joe has seen at some of his customers. He’s got a neat way to use a loopback device on a spinning disk as an extra member of the RAID1 set to at least have a copy of the data where it can be recovered from.
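To put a rough number on that risk, here is a tiny Python sketch. It assumes each drive fails independently with the quoted 70% probability over the period, which is almost certainly optimistic given Joe describes the failures as time correlated. The 5% figure for the spinning disk is purely hypothetical, and `mirror_loss_probability` is a name I made up.

```python
import math


def mirror_loss_probability(member_fail_probs):
    """Chance that every member of a RAID1 set fails, assuming independence."""
    return math.prod(member_fail_probs)


if __name__ == "__main__":
    two_ssds = mirror_loss_probability([0.7, 0.7])
    # Hypothetical 5% failure chance for the spinning disk holding the
    # loopback device that acts as a third mirror member.
    with_disk = mirror_loss_probability([0.7, 0.7, 0.05])
    print(f"two SSDs: {two_ssds:.0%}, plus spinning disk: {with_disk:.1%}")
```

Even under those generous assumptions a plain two drive mirror loses its data about half the time, which is why the extra copy on a spinning disk is such a good idea.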

So tell your friends, just say “NO” to Corsair CMFSSD-32D1 SSDs.

Oracle HPC Going Going Gone ?

After El Reg’s article on HPC going down the gurgler at Oracle/Sun now HPC Wire are suggesting the same:

I, myself, have spoken with two credible sources that told me HPC engineering talent is also being axed. Although this has been rumored to have been going on for some time, the recent RIF last week was said to cut particularly deep.

One thing I hadn’t noticed though was:

If I still haven’t convinced you that Oracle is cutting HPC from its lineup, consider that the company has no exhibit at the Supercomputing Conference (SC10) in November, and as far as I can tell, is offering no presentations. Given that this is the largest HPC exhibition of the year, this should be a clear signal that Oracle is going to be leaving the teraflopping and petaflopping to others.

Now back at SC’09 in Portland I asked the Sun folks (whilst the whole Sun/Oracle deal was going through) what they thought, and they said they reckoned it would be OK because Oracle had already told them they would have a booth at SC10. Well sadly it seems that’s not the case and to me that is the clearest indication that Oracle are exiting the HPC market. Of course they won’t say that (Oracle don’t seem to say much at all, even to the OpenSolaris folks, and when they do it doesn’t seem to be very nice).

Geeks versus Lawyers, or, China versus the US

Interesting take on why China may well dominate technology in the near future at BusinessWeek:

In China, eight of the nine members of the Standing Committee of the Political Bureau, including the Chinese president, Hu Jintao, have engineering degrees; one has a degree in geology.

Contrast that with the US:

Of the 15 U.S. cabinet members, six have law degrees. Only one cabinet member has a hard-science degree — Secretary of Energy Steven Chu, who won the Nobel Prize in physics in 1997, has a doctorate in physics. President Barack Obama and Vice President Joe Biden have law degrees.

Basically it comes down to political will and understanding on the part of the people with the power.

(Via the ever excellent InsideHPC)

VLSCI: Systems Administrator – High Performance Computing, Storage & Infrastructure

Please note: enquiry and application information at the URL below, no agencies please!

Executive summary

Want to work with hundreds of TB of storage, HPC clusters and a Blue Gene supercomputer and have an aptitude for storage and data ?


To give you an idea of what this job relates to, VLSCI currently has in production:

  • 136 node, 1088 core SGI cluster (Intel Nehalem CPUs)
  • ~110TB usable of Panasas storage

Shortly arriving (< 1 month away):

  • 2048 node, 8192 core IBM Blue Gene/P supercomputer
  • 80 node, 640 core IBM iDataplex cluster (Intel Nehalem CPUs)
  • ~300TB usable of IBM GPFS storage plus tape libraries
  • 2012 – more! 😉

    Both Intel clusters are CentOS 5, the front end and service nodes for the Blue Gene are SuSE SLES 10. The GPFS servers are RHEL5. Panasas runs FreeBSD under the covers.

Moving from VPAC to VLSCI

After almost six and a half years working at VPAC it’s time to move on, in January I’ll be taking up the position of Senior Systems Administrator at the University of Melbourne for the Victorian Life Sciences Computation Initiative (VLSCI). For those who’ve not come across the VLSCI it describes itself thus:

Under the Victorian Life Sciences Computation Initiative, The University of Melbourne will host a $100 million supercomputing program and facility, with $50 million provided by the State Government. The goal of the initiative is for Victoria to retain its standing and enhance its leadership in world life sciences. This will lead to major improvements in public health outcomes in the areas such as cancer, cardiovascular and neurological disease, chronic inflammatory diseases, bone diseases and diabetes.

Their ambitions aren’t what you could call small, they want to be a supercomputing facility ranking in the top 5 in life sciences world-wide. It’s going to be a fun ride and a lot more than just going from a 4 letter acronym to a 5 letter one. 😉

I’ve really enjoyed my time at VPAC over the years and I’m really going to miss the people there, but it’s gotten to the point where I want to be able to focus on running large HPC systems without distraction and the opportunity at VLSCI was too good to ignore!