ARM v8 (64-bit) developer boxes

Looks like things are moving along in the world of 64-bit ARM: systems aimed at early-adopting developers are now around. For instance, APM have their X-C1 Development Kit Plus, which has 8 x 2.4GHz ARMv8 cores, 16GB RAM, 500GB HDD, 1x10gigE, 3x1gigE for ~US$2,500 (or a steep discount if you qualify as a developer). Oh, and it ships with Linux by default of course.

Found via a blog post by Steve McIntyre about bringing up Debian Jessie on ARMv8 (it’ll be a release architecture for it) which has the interesting titbit that (before ARM had their Juno developer boxes):

Then Chen Baozi and the folks running the Tianhe-2 supercomputer project in Guangzhou, China contacted us to offer access to some arm64 hardware

So it looks like (I presume) NUDT are paying it some attention & building/acquiring their own ARMv8 systems.

IBM Pays GlobalFoundries to take Microprocessor Business

Interesting times for IBM: having already divested themselves of the x86 business by selling it on to Lenovo, they’ve now announced that they’re paying GlobalFoundries $1.5bn to take pretty much that entire side of the business!

IBM (NYSE: IBM) and GLOBALFOUNDRIES today announced that they have signed a Definitive Agreement under which GLOBALFOUNDRIES plans to acquire IBM’s global commercial semiconductor technology business, including intellectual property, world-class technologists and technologies related to IBM Microelectronics, subject to completion of applicable regulatory reviews. GLOBALFOUNDRIES will also become IBM’s exclusive server processor semiconductor technology provider for 22 nanometer (nm), 14nm and 10nm semiconductors for the next 10 years.

It includes IBM’s IP and patents, though IBM will continue to do research for 5 years and GlobalFoundries will get access to that. Now what happens to those researchers (one of whom happens to be a friend of mine) after that isn’t clear.

When I heard the rumours yesterday I was wondering if IBM was aiming to do an ARM and become a fab-less CPU designer but this is much more like exiting the whole processor business altogether. The fact that they seem to be paying GlobalFoundries to take this off their hands also makes it sound pretty bad.

What this all means for their Power CPU is uncertain, and if I was nVidia or Mellanox in the OpenPOWER alliance I would be hoping I’d known about this before joining up!

Update: I’ve spoken to some IBM’ers about this and they assert they’re not leaving the chip business; they are offloading the fabs and the manufacturing IP to GlobalFoundries but not the chip design side of things. In my opinion, though, it does mean that should they decide to exit the chip business at some point it’ll be easier for them to do so.

IBM Finally Leaving the Intel Server Market

After selling off the consumer side of their PC business to Lenovo, the other shoe has dropped for IBM’s x86 business: they are going to (try and) sell the rest of the x86 offerings (and the BNT network switches they bought recently) to Lenovo as well.

The Lenovo announcement is here (the IBM website is silent on the matter at the moment), but basically it boils down to:

Subject to the terms and conditions of the Master Asset Purchase Agreement, the Company will acquire certain assets (the “Transferred Assets”) which include, among other things, the following assets at the Initial Closing and Subsequent Closings:

  1. certain enumerated hardware products (including “System x”, “BladeCenter”, “Blade”, “Flex System”, “Pure Flex” products and system networking products including “Blade Network Technology” and certain other related tangible properties);
  2. certain intellectual property rights in connection with the Business;
  3. certain transferred contracts that are related to the Business; and
  4. inventory of the Business consisting of System x products, including the “System x”, “BladeCenter”, “Blade”, “Flex System” and “Pure Flex” products and certain other system networking products including “Blade Network Technology”.

By my reckoning that’s all their Intel related gear (and the network switches they recently bought when they acquired BNT).

So with no more BlueGene series and now no x86 that really leaves IBM with just Power for HPC workloads. I wonder how that will work out for them?

HPC sysadmin job in Melbourne, Australia

No, not where I work for once, but a friend of mine is looking for an HPC sysadmin in his group in the Victoria State Government:

This role requires advanced skills in system and network administration and scripting, clustered computer systems, security, virtualisation and Petabyte-scale storage. It is highly desirable that you have acquired these skills in a Life sciences environment. The heterogeneous environment requires both Linux and Windows skills. You should have the ability to design and implement solutions for automated transfer of data within and between systems and to ensure the security of both internal and Internet-facing systems. In this complex environment, working closely in teams of multi-disciplinary scientists to deliver computing solutions, including advanced troubleshooting and diagnostic skills, will be required. Supervision of other members of the team will also be necessary.

They’ve got a 1500+ core Linux cluster.. 😉

Bandwidth of Tape

As part of the ongoing Stage 2 upgrade at VLSCI we received an extra 1,000 LTO5 tapes, each rated at 1.5TB uncompressed, for an additional raw, uncompressed storage of 1.5PB. Now we typically get around 2X compression so that’s about 3PB usable. It took our team about 3 hours to uncrate, unpack, transfer and load the 1,000 tapes, effectively shifting 1.5PB of tape in those 3 hours. That’s about 139GB/s or, if you are a network person, 1.1Tb/s. Not bad! 🙂
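As a quick sanity check of that figure, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1TB = 10^12 bytes) and takes the full 3 hours as the transfer time; the variable names are just for illustration.

```python
# Effective "bandwidth" of carrying 1,000 LTO5 tapes into the library.
TAPES = 1000
TB_PER_TAPE = 1.5   # uncompressed capacity per tape
HOURS = 3           # time to uncrate, unpack, transfer and load

total_bytes = TAPES * TB_PER_TAPE * 1e12   # assumes 1TB = 10^12 bytes
seconds = HOURS * 3600

gbytes_per_sec = total_bytes / seconds / 1e9
tbits_per_sec = gbytes_per_sec * 8 / 1000

print(f"{gbytes_per_sec:.0f} GB/s, {tbits_per_sec:.1f} Tb/s")
```

Using the compressed capacity (about 3PB) instead would simply double both numbers.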

Patch for Modules to use shell functions with BASH, not aliases

Whilst the Modules system is awesome at making it easy to maintain multiple versions of packages and their dependencies (and is heavily used in HPC centres like VLSCI) it can have some annoyances (and seems to be fairly half-heartedly maintained, looking at the bugtracker on SourceForge). One thing that’s bitten us from time to time is that you can’t really use its “set-alias” functionality, as the bash shell does not expand aliases in non-interactive shells, and that includes jobs that are launched from an HPC queuing system like Torque, PBSPro, etc.
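The alias behaviour is easy to see outside of Modules. The following is only a sketch: it drives bash from Python with `bash -c`, which starts a non-interactive shell, and the `hi` alias and function are made-up names for the demonstration. It assumes bash is on the PATH and is not running in POSIX mode.

```python
import shutil
import subprocess

# Two tiny scripts: one defines an alias, the other a shell function.
ALIAS_SCRIPT = "alias hi='echo hello'\nhi\n"
FUNC_SCRIPT = "hi() { echo hello; }\nhi\n"


def run_noninteractive(script):
    """Run a script with `bash -c` (non-interactive); return (status, stdout)."""
    proc = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


alias_result = func_result = None
if shutil.which("bash"):  # skip quietly on systems without bash
    alias_result = run_noninteractive(ALIAS_SCRIPT)
    func_result = run_noninteractive(FUNC_SCRIPT)
    print("alias:   ", alias_result)
    print("function:", func_result)
```

The alias run should fail with a non-zero status ("command not found"), while the function run prints hello. That is the reason for wanting Modules to emit shell functions rather than aliases for batch jobs.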

It does have the compile time option “--disable-shell-alias” but annoyingly the condition is only applied when your shell is “sh“, not “bash“, so I’ve ended up having to patch Modules to make this work for bash as well. This patch is against 3.2.9c:

--- utility.c.orig      2011-11-29 08:27:13.000000000 +1100
+++ utility.c   2012-05-16 15:08:34.012038000 +1000
@@ -1422,7 +1422,7 @@
         **  Shells supporting extended bourne shell syntax ....
        if( (!strcmp( shell_name, "sh") && bourne_alias)
-               ||  !strcmp( shell_name, "bash")
+               || ( !strcmp( shell_name, "bash") && bourne_alias )
                 ||  !strcmp( shell_name, "zsh" )
                 ||  !strcmp( shell_name, "ksh")) {
@@ -1471,7 +1471,7 @@
            fprintf( aliasfile, "'%c", alias_separator);
-        } else if( !strcmp( shell_name, "sh")
+        } else if( ( !strcmp( shell_name, "sh") || !strcmp( shell_name, "bash") )
                &&   bourne_funcs) {

Hopefully this patch will be of use to people..

Japan knocks China off the #1 spot of the Top500 by 3X – a GRAPE machine ?

According to the NYT the new Top500 list (due out in the next few hours) will list the Japanese ‘K’ machine at the #1 spot of the Top500 at 8.2 PF.

The computer, known as “K Computer”, is three times faster than a Chinese rival that previously held the top position, said Jack Dongarra, a professor of electrical engineering and computer science at the University of Tennessee at Knoxville who keeps the official rankings of computer performance.
K is made up of 672 cabinets filled with system boards. Although considered energy-efficient, it still uses enough electricity to power nearly 10,000 homes at a cost of around $10 million annually, Mr. Dongarra said.

The research lab that houses K plans to increase the computer’s size to 800 cabinets. That will raise its speed, which already exceeds that of its five closest competitors combined, Mr. Dongarra said.

The excellent @HPC_Guru on Twitter said:

K Supercomputer Technical details: 80k+ SPARC64 VIIIfx CPUs, 640K+ cores, 1PB+ RAM, 6-dimensional Mesh/Torus interconnect

But I have a reliable source who claims that this is using GRAPE cards as APUs to reach its performance without causing (another) meltdown in Japan..

The press release for the new Top500 says:

Unlike the Chinese system it displaced from the No. 1 slot and other recent very large system, the K Computer does not use graphics processors or other accelerators.

New open source compiler on the way

Some interesting news overnight, the company PathScale have announced their EKOPath4 compiler will be released under open source licenses:

PathScale announced today that the EKOPath 4 Compiler Suite is now available as an open source project and free download for Linux, FreeBSD and Solaris. This release includes documentation and the complete development stack, including compiler, debugger, assembler, runtimes and standard libraries. EKOPath is the product of years of ongoing development, representing one of the industries highest performance Intel 64 and AMD C, C++ and Fortran compilers.

This is the compiler that, when first launched, the company said that if another compiler generated faster code then you should submit a bug report to get it fixed. 🙂

In conversation on the Beowulf list I asked about the licenses and the code and their CTO replied, saying:

Main compiler is GPLv3, v2+ and LGPL – Other parts are a mix of permissive licenses.

and that the code itself was going to be available from the PathScale GitHub account, alongside their existing open source projects.

NUMA, memory binding and how swap can dilute your locality

On the hwloc-devel mailing list Jeff Squyres from Cisco posted a message saying:

Someone just made a fairly disturbing statement to me in an Open MPI bug ticket: if you bind some memory to a particular NUMA node, and that memory later gets paged out, then it loses its memory binding information — meaning that it can effectively get paged back in at any physical location. Possibly even on a different NUMA node.

Now this sparked off an interesting thread on the issue, with the best explanation for it being provided by David Singleton from ANU/NCI:

Unless it has changed very recently, Linux swapin_readahead is the main culprit in messing with NUMA locality on that platform. Faulting a single page causes 8 or 16 or whatever contiguous pages to be read from swap. An arbitrary contiguous range of pages in swap may not even come from the same process far less the same NUMA node. My understanding is that since there is no NUMA info with the swap entry, the only policy that can be applied to is that of the faulting vma in the faulting process. The faulted page will have the desired NUMA placement but possibly not the rest. So swapping mixes different process’ NUMA policies leading to a “NUMA diffusion process”.

So when your page gets swapped back in it will drag in a heap of pages that may have nothing to do with it and hence their pages may be misplaced. Or worse still, your page could be one of the bunch dragged back in when a process on a different NUMA node swaps something back in. This will dilute the locality of the pages. Not fun!

Iran’s New GPU Powered Supercomputer(s) ? (Updated x 2)

ComputerWorld has an article about Iran claiming to have two new supercomputers, fairly modest by Top500 standards, but laments the lack of details:

But Iran’s latest supercomputer announcement appears to have no details about the components used to build the systems. Iranian officials have not yet responded to request for details about them.

However, looking at the Iranian photo spread that they link to (which appears to be slashdot’ed now) the boxes in question are SuperMicro based systems (and so could be sourced from just about anywhere), with some of their 2U storage based boxes with heaps of disk and both 2U and some 1U boxes which are presumably the compute nodes. The odd thing is that they’re spaced out quite a bit in the rack, and the 1U systems have two fans on the left hand side (which indicates something unusual about the layout of the box). Here’s an image from that Iranian news story:

Iranian Supermicro GPU node

The nice thing is that it’s pretty easy to find a slew of boxes on SuperMicro’s website that match the picture: it’s their 1U GPU node range, which are dual GPU beasts, for example:

SuperMicro SuperServer 6016GT-TF-TM1

The problem is that this range goes back a few years, for example in an nVidia presentation on “the worlds fastest 1U server” from 2009. HPC-Wire describe these original nodes as:

Inside the SS6016T-GF Supermicro box, the two M1060 GPUs modules are on opposite sides of the server chassis in a mirror image configuration, where one is facing up, the other facing down, allowing the heat to be distributed more evenly. The NVIDIA M1060 part uses a passive heat sink, and is cooled in conjunction with the rest of the server, which contains a total of eight counter-rotating fans. Supermicro also builds a variant of this model, in which it uses a Tesla C1060 card in place of the M1060. The C1060 has the same technical specs as the M1060, the principle difference being that the C1060 has an active fan heat sink of its own. In both instances though, the servers require plenty of juice. Supermicro uses a 1,400 watt power supply to drive these CPU-GPU hybrids.

According to the HPC-Wire article the whole system (2 x CPUs, 2 x GPUs) is rated at about 2TF for single-precision FP. nVidia rate the M1060 card at 933 GF (SP) and 78 GF (DP) so I’d reckon for DP FP you’re looking at maybe 180 GF per node. But now that range includes ones with the newer M2070 Fermi GPUs which can do 1030 GF (SP) and 515 GF (DP) and would get you up to just over 2TF SP and (more importantly for Top500 rankings) over 1TF DP per node.

Now if we assume the claimed 89TF for the larger system is correct, that it is indeed double precision (to be valid for the Top500), they measured it with HPL and assume an efficiency of about 0.5 (which seems about what a top ranked GPU cluster achieves with Infiniband) we can play some number games. Numbers below invalidated by Jeff Johnson’s observation, see the “updated” section for more!

If we assume these are older M1060 GPUs then you are looking at something in the order of 1000 compute nodes to be able to get that number Rmax in Linpack – something of the order of 1MW. From the photos though I didn’t get the sense that it was that large; the way they spaced them out you’d need maybe 200 racks for the whole thing and that would have made an impressive photo (plus an awful lot of switches). Now if they’ve managed to get their hands on the newer M2070 based nodes then you could be looking at maybe 200 nodes, a more reasonable 280KW and maybe 40 racks. But I still didn’t get the sense that the datacentre was that crowded…
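The number game above can be written down as a rough sketch. The per-node double-precision peaks are the approximate figures from the text (about 0.18TF for an M1060 node, about 1TF for an M2070 node). Treating the 1,400 watt power supply rating as the per-node draw is an assumption and gives an upper bound at best.

```python
import math

RMAX_TF = 89.0      # claimed figure for the larger system
EFFICIENCY = 0.5    # assumed Rmax/Rpeak for a GPU cluster on Infiniband
NODE_KW = 1.4       # PSU rating per node, used as a rough upper bound on draw

# Approximate double-precision peak per dual-GPU node, in TF.
NODE_DP_TF = {"M1060": 0.18, "M2070": 1.0}


def nodes_needed(rmax_tf, efficiency, node_peak_tf):
    """Nodes required so that efficiency * total Rpeak reaches rmax_tf."""
    return math.ceil(rmax_tf / (efficiency * node_peak_tf))


estimates = {}
for gpu, peak in NODE_DP_TF.items():
    count = nodes_needed(RMAX_TF, EFFICIENCY, peak)
    estimates[gpu] = (count, count * NODE_KW)
    print(f"{gpu}: ~{count} nodes, up to ~{count * NODE_KW:.0f} kW")
```

That comes out just under 1,000 nodes for the M1060 case and a little under 200 for the M2070 case, in line with the round numbers used above.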

So I’d guess that instead of actually running Linpack on the system they’ve just totted up the Rpeaks, in which case you would get away with 90 of them, so maybe 15 or 20 racks, which feels to me more like the scale depicted in the images. That would still give them a system that would hit about 40TF and give it a respectable ranking around 250 on the Top500 IF they used Infiniband to connect it up. If it was gigabit ethernet then you’d be looking at maybe another 50% hit to its Rmax, and that would drop it out of the Top500 altogether as you’d need at least 31TF to qualify in last November’s list.

It’ll certainly be interesting to see what the system is if/when any more info emerges!

Update: In the comments Jeff Johnson has pointed out the 2U boxes are actually 4-in-2U dual socket nodes, i.e. will likely have either 32 or 48 cores depending on whether they contain quad core or six core chips. You can see that best from this rear shot of a rack:

Rear view of a 4-in-2U rack

There are mostly 8 of those units in a rack (though the rack at one end of the block has just 7, but there are 2 extra in a central rack below what may be an IB switch), so that’s 256 cores a rack if they’re quad core. There are 8 racks in a block and two blocks to a row so we’ve got 4,096 cores in that one row – or 6,144 if they’re 6 core chips!
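The core count is just a product of what can be seen in the photos. A small sketch, assuming a uniform 8 units per rack (which the uneven racks mentioned above average out to) and with constant names of my own choosing:

```python
NODES_PER_CHASSIS = 4    # 4-in-2U units
SOCKETS_PER_NODE = 2     # dual socket
CHASSIS_PER_RACK = 8     # assumed uniform across the block
RACKS_PER_BLOCK = 8
BLOCKS_PER_ROW = 2


def cores_per_row(cores_per_socket):
    """Total CPU cores in one row of 4-in-2U racks."""
    per_chassis = NODES_PER_CHASSIS * SOCKETS_PER_NODE * cores_per_socket
    per_rack = per_chassis * CHASSIS_PER_RACK
    return per_rack * RACKS_PER_BLOCK * BLOCKS_PER_ROW


for cores in (4, 6):
    print(f"{cores}-core chips: {cores_per_row(cores):,} cores in the row")
```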

The row with the GPU nodes is harder to make out as we cannot see the whole row from any combination of the photos, but in the front view we can see 2 populated racks of a block of 5, with 8 to a rack. The first rear view shows that the block next to it also has at least 2 racks populated with 8 GPU nodes. The second rear view is handy because whilst it shows the same racks as the first it demonstrates that these two blocks coincide with a single block of traditional nodes, raising the possibility of another pair of blocks to pair with the other half of the row of traditional nodes.

Front view of GPU node racks
Rear view of GPU nodes
Different rear view of GPU node racks

Assuming M2070 GPUs then you’re looking at 8TF a rack, or 32TF for the row (assuming no other racks populated outside of our view). If the visible nodes are duplicated on the other side then you’re looking at 64TF.

If we assume that the Intel nodes have quad core Nehalem-EP 2.93GHz then that would compare nicely to a Dell system at Saudi Aramco which is rated at an Rpeak of 48TF. Adding the 32TF for the visible GPU nodes gets us up to 80TF, which is close to their reported number, but still short (and still for Rpeak, not Rmax). So it’s likely that there are more GPU nodes, either in the 4 racks we cannot see into in the visible block of GPU racks or in another part of that row – or both! That would make a real life Top500 run of 89TF feasible, with Infiniband.
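The 48TF figure can be cross-checked from first principles as cores x clock x floating point operations per cycle. This sketch assumes 4 double-precision operations per cycle per core for Nehalem-EP (a 2-wide SSE add plus a 2-wide SSE multiply) and takes the 32TF GPU figure from above as given.

```python
CORES = 4096              # quad core case for the row of 4-in-2U nodes
CLOCK_GHZ = 2.93
DP_FLOPS_PER_CYCLE = 4    # assumption: SSE add + multiply, 2 doubles each
GPU_TF = 32.0             # visible GPU nodes, figure from the text

# cores * GHz * flops/cycle gives GFLOPS; divide by 1000 for TF.
cpu_rpeak_tf = CORES * CLOCK_GHZ * DP_FLOPS_PER_CYCLE / 1000.0
total_rpeak_tf = cpu_rpeak_tf + GPU_TF

print(f"CPU Rpeak ~{cpu_rpeak_tf:.0f} TF, CPU + GPU ~{total_rpeak_tf:.0f} TF")
```

Both are theoretical peaks; a measured Linpack Rmax would come in lower, by an amount that depends heavily on the interconnect.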

Be great to have a floor plan and parts list! 🙂

Update 2

Going through the Iranian news reports they say that the two computers are located at the Amirkabir University of Technology (AUT) in Tehran, and Esfahan University of Technology (in Esfahan). The one in Tehran is the smaller of the two, but the figures they give appear, umm, unreliable. It seems like the usual journalists-not-understanding-tech issue compounded by translation issues – so for instance you get things like:

The supercomputer of Amirkabir University of Technology with graphic power of 22000 billion operations per second is able to process 34000 billion operations per second with the speed of 40GB. […] The supercomputer manufactured in Isfahan University of Technology has calculation ability of 34000 billion operations per second and its Graphics Processing Units (GPU) are able to do more than 32 billion operations per 100 seconds.


Amirkabir University’s supercomputer has a power of 34,000 billion operations per second, and a speed of 40 gigahertz. […] The other supercomputer project carried out by Esfahan University of Technology is among the world’s top 500 supercomputers.

So is the AUT (Tehran) system 22TF or 34TF? Could it be Rpeak 34TF and Rmax 22TF ? Is the Esfahan one 34TF (which would just creep onto the Top500) or higher ?

Unfortunately it’s the Tehran system in the photos, not the Esfahan one (the giveaway is the HPCRC on the racks). So my estimate of 80TF Rpeak for it could well give a measured 32TF if it’s using ethernet as the interconnect (or if the CPUs are a slower clock in the 2U nodes). Or perhaps the GPU nodes are part of something else ? That and slower clocked CPUs could bring the Rpeak down to 34TF..

Need more data!