Playing with Shifter Part 2 – converted Docker containers inside Slurm

This is continuing on from my previous blog about NERSC’s Shifter which lets you safely use Docker containers in an HPC environment.

Getting Shifter to work in Slurm is pretty easy: it includes a plugin that you install and tell Slurm about (via its plugin stack configuration). My test config was just:

required /usr/lib64/shifter/shifter_slurm.so shifter_config=/etc/shifter/udiRoot.conf

as I was installing by building RPMs (our preferred method is to install the plugin into our shared filesystem for the cluster so we don’t need to have it in the RAM disk of our diskless nodes). Once that is done you can add the Shifter arguments to your Slurm batch script and then just call shifter inside it to run a process, for instance:

#!/bin/bash

#SBATCH -p debug
#SBATCH --image=debian:wheezy

shifter cat /etc/issue

results in the following on our RHEL compute nodes:

[samuel@bruce Shifter]$ cat slurm-1734069.out 
Debian GNU/Linux 7 \n \l

simply demonstrating that it works. The advantage of using the plugin and this way of specifying the image is that the plugin will prep the container for us at the start of the batch job and keep it around until the job ends, so you can keep running commands inside the container from your script without the overhead of creating and destroying it each time. If you need to run something in a different image you just pass the --image option to shifter itself; it will then have to set up and tear down that container, but the one you specified for your batch job is still there.
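For example, adding a couple more lines to the batch script above (the ubuntu:16.04 image here is just an illustration; any image you’ve pulled with shifterimg would do):

# Still inside the debian:wheezy container that the plugin prepped for the job
shifter uname -a

# A one-off command in a different image; shifter sets this container up and
# tears it down itself, while debian:wheezy stays prepped for the rest of the job
shifter --image=ubuntu:16.04 cat /etc/lsb-release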

That’s great for single CPU jobs, but what about parallel applications? Well, it turns out that’s easy too: you just request the configuration you need and slap srun in front of the shifter command. You can even run MPI applications this way successfully. I grabbed the dispel4py/docker.openmpi Docker container with shifterimg pull dispel4py/docker.openmpi and tried its Python version of the MPI hello world program:

#!/bin/bash
#SBATCH -p debug
#SBATCH --image=dispel4py/docker.openmpi
#SBATCH --ntasks=3
#SBATCH --tasks-per-node=1

shifter cat /etc/issue

srun shifter python /home/tutorial/mpi4py_benchmarks/helloworld.py

This prints the MPI rank to demonstrate that the MPI wire-up was successful; I forced the tasks onto separate nodes and had them print their hostnames to show they were communicating over the network, not via shared memory on the same node. But the output bemused me a little:

[samuel@bruce Python]$ cat slurm-1734135.out
Ubuntu 14.04.4 LTS \n \l

libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[30199,2],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: bruce001

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
Hello, World! I am process 0 of 3 on bruce001.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[30199,2],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: bruce002

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello, World! I am process 1 of 3 on bruce002.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[30199,2],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: bruce003

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello, World! I am process 2 of 3 on bruce003.

It successfully demonstrates that it is using an Ubuntu container on 3 nodes, but the warnings are triggered because Open-MPI in Ubuntu is built with Infiniband support and it is detecting the presence of the IB cards on the host nodes. This is because Shifter is (as designed) exposing the system’s /sys directory to the container. The problem is that this container doesn’t include the Mellanox userspace library needed to make use of the IB cards, so you get warnings that they aren’t working and that it will fall back to a different mechanism (in this case TCP/IP over gigabit Ethernet).

Open-MPI allows you to specify what transports to use, so adding one line to my batch script:

export OMPI_MCA_btl=tcp,self,sm

cleans up the output a lot:

Ubuntu 14.04.4 LTS \n \l

Hello, World! I am process 0 of 3 on bruce001.
Hello, World! I am process 2 of 3 on bruce003.
Hello, World! I am process 1 of 3 on bruce002.

This also raises the question: what does this do for latency? The image contains a Python version of the OSU latency test, which uses a range of message sizes between 2 MPI ranks to show how latency varies with message size. Running this over TCP/IP is trivial with the dispel4py/docker.openmpi container, but of course it’s lacking the Mellanox library I need, and as the whole point of Shifter is security I can’t get root access inside the container to install the package. Fortunately the author of dispel4py/docker.openmpi has their implementation published on GitHub, so I forked their repo, signed up for Docker Hub and pushed a version which simply adds the libmlx4-1 package I needed.
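For reference, the change in my fork boils down to installing that one package on top of the original image; in the Dockerfile it amounts to a single RUN instruction along these lines (a sketch, the actual Dockerfile is in the forked repo):

# Added on top of dispel4py/docker.openmpi: the Ubuntu package that provides
# the Mellanox mlx4 userspace verbs driver the IB cards need
apt-get update && apt-get install -y libmlx4-1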

Running the test over TCP/IP is simply a matter of submitting this batch script which forces it onto 2 separate nodes:

#!/bin/bash
#SBATCH -p debug
#SBATCH --image=chrissamuel/docker.openmpi:latest
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1

export OMPI_MCA_btl=tcp,self,sm

srun shifter python /home/tutorial/mpi4py_benchmarks/osu_latency.py

giving these latency results:

[samuel@bruce MPI]$ cat slurm-1734137.out
# MPI Latency Test
# Size [B]        Latency [us]
0                        16.19
1                        16.47
2                        16.48
4                        16.55
8                        16.61
16                       16.65
32                       16.80
64                       17.19
128                      17.90
256                      19.28
512                      22.04
1024                     27.36
2048                     64.47
4096                    117.28
8192                    120.06
16384                   145.21
32768                   215.76
65536                   465.22
131072                  926.08
262144                 1509.51
524288                 2563.54
1048576                5081.11
2097152                9604.10
4194304               18651.98

To run that same test over Infiniband I just modified the export in the batch script to force it to use IB (and thus fail if it couldn’t talk between the two nodes):

#!/bin/bash
#SBATCH -p debug
#SBATCH --image=chrissamuel/docker.openmpi:latest
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1

export OMPI_MCA_btl=openib,self,sm

srun shifter python /home/tutorial/mpi4py_benchmarks/osu_latency.py

which then gave these latency numbers:

[samuel@bruce MPI]$ cat slurm-1734138.out
# MPI Latency Test
# Size [B]        Latency [us]
0                         2.52
1                         2.71
2                         2.72
4                         2.72
8                         2.74
16                        2.76
32                        2.73
64                        2.90
128                       4.03
256                       4.23
512                       4.53
1024                      5.11
2048                      6.30
4096                      7.29
8192                      9.43
16384                    19.73
32768                    29.15
65536                    49.08
131072                   75.19
262144                  123.94
524288                  218.21
1048576                 565.15
2097152                 811.88
4194304                1619.22

So you can see that’s basically an order of magnitude improvement in latency using Infiniband compared to TCP/IP over gigabit Ethernet (which is what you’d expect).
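A quick sanity check on that claim, using the numbers from the two tables above (roughly 6x for tiny messages, rising to about 12x for the largest):

# Latency ratio, TCP/IP over GigE vs native IB (values copied from the tables)
awk 'BEGIN {
    printf "0 byte message: %.1fx\n", 16.19 / 2.52         # ~6x
    printf "4MB message:    %.1fx\n", 18651.98 / 1619.22   # ~12x
}'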

Because there’s no virtualisation going on here there is no extra penalty to pay when doing this: no need to configure any fancy device pass-through, no loss of CPU MSR access, and so I’d argue that Shifter makes Docker containers far more useful for HPC than virtualisation, or even Docker itself, for the majority of use cases.

Am I excited about Shifter? Yup! The potential to allow users to build an application stack themselves, right down to the OS libraries, and (with a little careful thought) have something that could get native interconnect performance, is fantastic. Throw in the complexities of dealing with conflicting dependencies between Python modules, system libraries, bioinformatics tools, etc., and the need to provide simple methods for handling these, and the advantages seem clear.

So the plan is to roll this out into production at VLSCI in the near future. Fingers crossed! 🙂

Playing with Shifter – NERSC’s tool to use Docker containers in HPC

Early days yet, but playing with NERSC’s Shifter to let us use Docker containers safely on our test RHEL6 cluster is looking really interesting (given you can’t use Docker itself under RHEL6, and if you could the security concerns would cancel it out anyway).

To use a pre-built Ubuntu Xenial image, for instance, you tell it to pull the image:

[samuel@bruce ~]$ shifterimg pull ubuntu:16.04

There’s a number of steps it goes through, first retrieving the container from the Docker Hub:

2016-08-01T18:19:57 Pulling Image: docker:ubuntu:16.04, status: PULLING

Then disarming the Docker container by removing any setuid/setgid bits, etc, and repacking as a Shifter image:

2016-08-01T18:20:41 Pulling Image: docker:ubuntu:16.04, status: CONVERSION

…and then it’s ready to go:

2016-08-01T18:21:04 Pulling Image: docker:ubuntu:16.04, status: READY

Using the image from the command line is pretty easy:

[samuel@bruce ~]$ cat /etc/lsb-release
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch

[samuel@bruce ~]$ shifter --image=ubuntu:16.04 cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"

The Shifter runtime will also copy in site-specified /etc/passwd, /etc/group and /etc/nsswitch.conf files so that user and group lookups just work, as well as map in site-specified filesystems, so your home directory is just where it would normally be on the cluster:

[samuel@bruce ~]$ shifter --image=debian:wheezy bash --login
samuel@bruce:~$ pwd
/vlsci/VLSCI/samuel

I’ve not yet got to the point of configuring the Slurm plugin so you can queue up a Slurm job that will execute inside a Docker container, but very promising so far!

Correction: a misconception on my part – Shifter doesn’t put a Slurm batch job inside the container. It could, but there are good reasons why it’s better to leave that to the user (soon to be documented on the Shifter wiki page for Slurm integration).

ARM v8 (64-bit) developer boxes

Looks like things are moving along in the world of 64-bit ARM: systems aimed at early-adopting developers are now appearing. For instance APM have their X-C1 Development Kit Plus, which has 8 x 2.4GHz ARMv8 cores, 16GB RAM, 500GB HDD, 1x10gigE and 3x1gigE for ~US$2,500 (or a steep discount if you qualify as a developer). Oh, and it ships with Linux by default of course.

Found via a blog post by Steve McIntyre about bringing up Debian Jessie on ARMv8 (it’ll be a release architecture for it) which has the interesting titbit that (before ARM had their Juno developer boxes):

Then Chen Baozi and the folks running the Tianhe-2 supercomputer project in Guangzhou, China contacted us to offer access to some arm64 hardware

So it looks like (I presume) NUDT are paying it some attention & building/acquiring their own ARMv8 systems.

IBM Pays GlobalFoundries to take Microprocessor Business

Interesting times for IBM: having already divested themselves of the x86 business by selling it on to Lenovo, they’ve now announced that they’re paying GlobalFoundries $1.5bn to take pretty much that entire side of the business!

IBM (NYSE: IBM) and GLOBALFOUNDRIES today announced that they have signed a Definitive Agreement under which GLOBALFOUNDRIES plans to acquire IBM’s global commercial semiconductor technology business, including intellectual property, world-class technologists and technologies related to IBM Microelectronics, subject to completion of applicable regulatory reviews. GLOBALFOUNDRIES will also become IBM’s exclusive server processor semiconductor technology provider for 22 nanometer (nm), 14nm and 10nm semiconductors for the next 10 years.

It includes IBM’s IP and patents, though IBM will continue to do research for 5 years and GlobalFoundries will get access to that. Now what happens to those researchers (one of whom happens to be a friend of mine) after that isn’t clear.

When I heard the rumours yesterday I was wondering if IBM was aiming to do an ARM and become a fab-less CPU designer but this is much more like exiting the whole processor business altogether. The fact that they seem to be paying GlobalFoundries to take this off their hands also makes it sound pretty bad.

What this all means for their Power CPU is uncertain, and if I were nVidia or Mellanox in the OpenPOWER alliance I would be hoping I’d known about this before joining up!

Update: I’ve spoken to some IBMers about this and they assert they’re not leaving the chip business; they are offloading the fabs and the manufacturing IP to GlobalFoundries, but not the chip design side of things. In my opinion, though, it does mean that should they decide to exit the chip business at some point it’ll be easier for them to do so.

IBM Finally Leaving the Intel Server Market

After selling off the consumer side of their PC business to Lenovo, the other shoe has dropped for IBM’s x86 business: they are going to (try and) sell the rest of their x86 offerings (and the BNT network switches they bought recently) to Lenovo as well.

The Lenovo announcement is here: http://www.lenovo.com/ww/lenovo/pdf/announcement/E_099220140123a.pdf (the IBM website is silent on the matter at the moment) but basically it boils down to:

Subject to the terms and conditions of the Master Asset Purchase Agreement, the Company will acquire certain assets (the “Transferred Assets”) which include, among other things, the following assets at the Initial Closing and Subsequent Closings:

  1. certain enumerated hardware products (including “System x”, “BladeCenter”, “Blade”, “Flex System”, “Pure Flex” products and system networking products including “Blade Network Technology” and certain other related tangible properties);
  2. certain intellectual property rights in connection with the Business;
  3. certain transferred contracts that are related to the Business; and
  4. inventory of the Business consisting of System x products, including the “System x”, “BladeCenter”, “Blade”, “Flex System” and “Pure Flex” products and certain other system networking products including “Blade Network Technology”.

By my reckoning that’s all their Intel related gear (and the network switches they recently bought when they acquired BNT).

So with no more BlueGene series and now no x86 that really leaves IBM with just Power for HPC workloads. I wonder how that will work out for them?

HPC sysadmin job in Melbourne, Australia

No, not where I work for once, but a friend of mine is looking for an HPC sysadmin in his group in the Victoria State Government:

This role requires advanced skills in system and network administration and scripting, clustered computer systems, security, virtualisation and Petabyte-scale storage. It is highly desirable that you have acquired these skills in a Life sciences environment. The heterogeneous environment requires both Linux and Windows skills. You should have the ability to design and implement solutions for automated transfer of data within and between systems and to ensure the security of both internal and Internet-facing systems. In this complex environment, working closely in teams of multi-disciplinary scientists to deliver computing solutions, including advanced troubleshooting and diagnostic skills, will be required. Supervision of other members of the team will also be necessary.

They’ve got a 1500+ core Linux cluster.. 😉

Bandwidth of Tape

As part of the ongoing Stage 2 upgrade at VLSCI we received an extra 1,000 LTO5 tapes, each rated at 1.5TB uncompressed, for an additional raw, uncompressed, storage of 1.5PB. We typically get around 2x compression, so that’s about 3PB usable. It took our team about 3 hours to uncrate, unpack, transfer and load the 1,000 tapes, effectively shifting 0.5PB of tape an hour. That’s about 139GB/s or, if you are a network person, around 1.1Tb/s. Not bad! 🙂
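As a back-of-the-envelope check (decimal units throughout):

# 1,000 LTO5 tapes x 1,500GB each, loaded over roughly 3 hours
awk 'BEGIN { gb = 1000 * 1500; secs = 3 * 3600;
             printf "%.0f GB/s = %.1f Tb/s\n", gb/secs, gb*8/secs/1000 }'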

Patch for Modules to use shell functions with BASH, not aliases

Whilst the Modules system is awesome at making it easy to maintain multiple versions of packages and their dependencies (and is heavily used in HPC centres like VLSCI) it can have some annoyances (and seems to be fairly half-heartedly maintained, looking at the bug tracker on SourceForge). One thing that’s bitten us from time to time is that you can’t really use its “set-alias” functionality, as bash does not expand aliases in non-interactive shells, and that includes jobs that are launched from an HPC queuing system like Torque, PBSPro, etc.
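The underlying bash behaviour is easy to demonstrate outside of Modules (a minimal illustration; the filenames are just examples):

# Aliases are not expanded in non-interactive bash, which is what a batch job gets...
printf 'alias hello="echo hi"\nhello\n' > with-alias.sh
bash with-alias.sh            # fails: hello: command not found

# ...but shell functions work in interactive and non-interactive shells alike
printf 'hello() { echo hi; }\nhello\n' > with-function.sh
bash with-function.sh         # prints "hi"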

It does have the compile-time option “--disable-shell-alias”, but annoyingly the condition is only applied when your shell is “sh”, not “bash”, so I’ve ended up having to patch Modules to make this work for bash as well. This patch is against 3.2.9c:

--- utility.c.orig      2011-11-29 08:27:13.000000000 +1100
+++ utility.c   2012-05-16 15:08:34.012038000 +1000
@@ -1422,7 +1422,7 @@
         **  Shells supporting extended bourne shell syntax ....
         **/
        if( (!strcmp( shell_name, "sh") && bourne_alias)
-               ||  !strcmp( shell_name, "bash")
+               || ( !strcmp( shell_name, "bash") && bourne_alias )
                 ||  !strcmp( shell_name, "zsh" )
                 ||  !strcmp( shell_name, "ksh")) {
            /**
@@ -1471,7 +1471,7 @@
 
            fprintf( aliasfile, "'%c", alias_separator);
 
-        } else if( !strcmp( shell_name, "sh")
+        } else if( ( !strcmp( shell_name, "sh") || !strcmp( shell_name, "bash") )
                &&   bourne_funcs) {
        /**

Hopefully this patch will be of use to people..

Japan knocks China off the #1 spot of the Top500 by 3X – a GRAPE machine?

According to the NYT, the new Top500 list (due out in the next few hours) will have the Japanese ‘K’ machine in the #1 spot at 8.2 PF.

The computer, known as “K Computer”, is three times faster than a Chinese rival that previously held the top position, said Jack Dongarra, a professor of electrical engineering and computer science at the University of Tennessee at Knoxville who keeps the official rankings of computer performance.
[…]
K is made up of 672 cabinets filled with system boards. Although considered energy-efficient, it still uses enough electricity to power nearly 10,000 homes at a cost of around $10 million annually, Mr. Dongarra said.

The research lab that houses K plans to increase the computer’s size to 800 cabinets. That will raise its speed, which already exceeds that of its five closest competitors combined, Mr. Dongarra said.

The excellent @HPC_Guru on Twitter said:

K Supercomputer Technical details: 80k+ SPARC64 VIIIfx CPUs, 640K+ cores, 1PB+ RAM, 6-dimensional Mesh/Torus interconnect

But I have a reliable source who claims that this is using GRAPE cards as APUs to reach its performance without causing (another) meltdown in Japan..

The press release for the new Top500 says:

Unlike the Chinese system it displaced from the No. 1 slot and other recent very large system, the K Computer does not use graphics processors or other accelerators.

New open source compiler on the way

Some interesting news overnight: PathScale have announced that their EKOPath 4 compiler suite will be released under open source licenses:

PathScale announced today that the EKOPath 4 Compiler Suite is now available as an open source project and free download for Linux, FreeBSD and Solaris. This release includes documentation and the complete development stack, including compiler, debugger, assembler, runtimes and standard libraries. EKOPath is the product of years of ongoing development, representing one of the industries highest performance Intel 64 and AMD C, C++ and Fortran compilers.

This is the compiler for which, when it was first launched, the company said that if another compiler generated faster code you should submit a bug report to get it fixed. 🙂

In conversation on the Beowulf list I asked about the licenses and the code and their CTO replied, saying:

Main compiler is GPLv3, v2+ and LGPL – Other parts are a mix of permissive licenses.

and that the code itself was going to be available from the PathScale GitHub account, alongside their existing open source projects.