This continues on from my previous blog post about NERSC’s Shifter, which lets you safely use Docker containers in an HPC environment.
Getting Shifter to work in Slurm is pretty easy: it includes a plugin that you must install and tell Slurm about. My test config was just:
required /usr/lib64/shifter/shifter_slurm.so shifter_config=/etc/shifter/udiRoot.conf
as I was installing by building RPMs (our preferred method is to install the plugin into our shared filesystem for the cluster so we don’t need to have it in the RAM disk of our diskless nodes). Once that is done you can add the shifter program’s arguments to your Slurm batch script and then just call shifter inside it to run a process, for instance:
#!/bin/bash
#SBATCH -p debug
#SBATCH --image=debian:wheezy
shifter cat /etc/issue
results in the following on our RHEL compute nodes:
[samuel@bruce Shifter]$ cat slurm-1734069.out
Debian GNU/Linux 7 \n \l
simply demonstrating that it works. The advantage of using the plugin and this way of specifying the images is that the plugin will prep the container for us at the start of the batch job and keep it around until it ends so you can keep running commands in your script inside the container without the overhead of having to create/destroy it each time. If you need to run something in a different image you just pass the --image option to shifter and then it will need to set up & tear down that container, but the one you specified for your batch job is still there.
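If you end up writing a lot of these scripts, the pattern of one image directive followed by several shifter calls is easy to template. Here is a minimal Python sketch, assuming only the directives that appear in this post; render_batch_script is a hypothetical helper of my own naming, not something Shifter or Slurm provides, and the second command is just an illustration.

```python
def render_batch_script(image, commands, partition="debug",
                        ntasks=None, tasks_per_node=None):
    """Return the text of a Slurm batch script that runs commands via Shifter.

    Hypothetical helper: it only strings together the directives shown
    in this post and does not talk to Slurm or Shifter itself.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH -p {partition}",
        f"#SBATCH --image={image}",
    ]
    if ntasks is not None:
        lines.append(f"#SBATCH --ntasks={ntasks}")
    if tasks_per_node is not None:
        lines.append(f"#SBATCH --tasks-per-node={tasks_per_node}")
    # Every command reuses the container the plugin prepared for the job
    lines.extend(f"shifter {command}" for command in commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_batch_script("debian:wheezy", ["cat /etc/issue", "uname -r"]))
```

The output can be written to a file and submitted with sbatch as usual.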
That’s great for single CPU jobs, but what about parallel applications? Well, it turns out that’s easy too: you just request the configuration you need and slap srun in front of the shifter command. You can even run MPI applications this way successfully. I grabbed the dispel4py/docker.openmpi Docker container with shifterimg pull dispel4py/docker.openmpi and tried its Python version of the MPI hello world program:
#!/bin/bash
#SBATCH -p debug
#SBATCH --image=dispel4py/docker.openmpi
#SBATCH --ntasks=3
#SBATCH --tasks-per-node=1
shifter cat /etc/issue
srun shifter python /home/tutorial/mpi4py_benchmarks/helloworld.py
This prints the MPI rank to demonstrate that the MPI wire-up was successful, and I forced it to run the tasks on separate nodes and print the hostnames to show it’s communicating over a network, not via shared memory on the same node. But the output bemused me a little:
[samuel@bruce Python]$ cat slurm-1734135.out
Ubuntu 14.04.4 LTS \n \l
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[30199,2],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: bruce001
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
Hello, World! I am process 0 of 3 on bruce001.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[30199,2],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: bruce002
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello, World! I am process 1 of 3 on bruce002.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[30199,2],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: bruce003
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello, World! I am process 2 of 3 on bruce003.
It successfully demonstrates that it is using an Ubuntu container on 3 nodes, but the warnings are triggered because Open MPI in Ubuntu is built with InfiniBand support and it is detecting the presence of the IB cards on the host nodes. This is because Shifter is (as designed) exposing the system’s /sys directory to the container. The problem is that this container doesn’t include the Mellanox user-space library needed to make use of the IB cards, so you get warnings that they aren’t working and that it will fall back to a different mechanism (in this case TCP/IP over gigabit Ethernet).
Open MPI allows you to specify which transports to use, so adding one line to my batch script to restrict it to TCP, loopback (self) and shared memory (sm):
export OMPI_MCA_btl=tcp,self,sm
cleans up the output a lot:
Ubuntu 14.04.4 LTS \n \l
Hello, World! I am process 0 of 3 on bruce001.
Hello, World! I am process 2 of 3 on bruce003.
Hello, World! I am process 1 of 3 on bruce002.
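With the noise gone it is also easy to check mechanically that every rank really did land on its own node. This is a small Python sketch of my own, assuming the hello world line format shown in the output above; ranks_by_host and on_separate_nodes are hypothetical helper names.

```python
import re

# Matches lines like: Hello, World! I am process 0 of 3 on bruce001.
HELLO = re.compile(r"Hello, World! I am process (\d+) of (\d+) on (\S+)\.")


def ranks_by_host(output):
    """Map each MPI rank found in the job output to the host it reported."""
    return {int(m.group(1)): m.group(3) for m in HELLO.finditer(output)}


def on_separate_nodes(placement):
    """True if at least one rank was found and no two ranks share a host."""
    return bool(placement) and len(set(placement.values())) == len(placement)


sample = r"""Ubuntu 14.04.4 LTS \n \l
Hello, World! I am process 0 of 3 on bruce001.
Hello, World! I am process 2 of 3 on bruce003.
Hello, World! I am process 1 of 3 on bruce002.
"""

if __name__ == "__main__":
    placement = ranks_by_host(sample)
    for rank in sorted(placement):
        print(rank, placement[rank])
    print("separate nodes:", on_separate_nodes(placement))
```

Because the regular expression ignores anything that is not a hello line, the same check should also work on the noisier output with the libibverbs warnings in it.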
This also raises the question: what does this do for latency? The image contains a Python version of the OSU latency testing program, which sends messages of different sizes between 2 MPI ranks and reports the latency for each size. Running this over TCP/IP is trivial with the dispel4py/docker.openmpi container, but of course it’s lacking the Mellanox library I need, and as the whole point of Shifter is security I can’t get root access inside the container to install the package. Fortunately the author of the dispel4py/docker.openmpi container has published their implementation on GitHub, so I forked their repo, signed up for Docker and pushed a version which simply adds the libmlx4-1 package I needed.
Running the test over TCP/IP is simply a matter of submitting this batch script which forces it onto 2 separate nodes:
#!/bin/bash
#SBATCH -p debug
#SBATCH --image=chrissamuel/docker.openmpi:latest
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1
export OMPI_MCA_btl=tcp,self,sm
srun shifter python /home/tutorial/mpi4py_benchmarks/osu_latency.py
giving these latency results:
[samuel@bruce MPI]$ cat slurm-1734137.out
# MPI Latency Test
# Size [B]    Latency [us]
0             16.19
1             16.47
2             16.48
4             16.55
8             16.61
16            16.65
32            16.80
64            17.19
128           17.90
256           19.28
512           22.04
1024          27.36
2048          64.47
4096          117.28
8192          120.06
16384         145.21
32768         215.76
65536         465.22
131072        926.08
262144        1509.51
524288        2563.54
1048576       5081.11
2097152       9604.10
4194304       18651.98
To run that same test over InfiniBand I just modified the export in the batch script to force it to use IB (and thus fail if it couldn’t talk between the two nodes):
#!/bin/bash
#SBATCH -p debug
#SBATCH --image=chrissamuel/docker.openmpi:latest
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1
export OMPI_MCA_btl=openib,self,sm
srun shifter python /home/tutorial/mpi4py_benchmarks/osu_latency.py
which then gave these latency numbers:
[samuel@bruce MPI]$ cat slurm-1734138.out
# MPI Latency Test
# Size [B]    Latency [us]
0             2.52
1             2.71
2             2.72
4             2.72
8             2.74
16            2.76
32            2.73
64            2.90
128           4.03
256           4.23
512           4.53
1024          5.11
2048          6.30
4096          7.29
8192          9.43
16384         19.73
32768         29.15
65536         49.08
131072        75.19
262144        123.94
524288        218.21
1048576       565.15
2097152       811.88
4194304       1619.22
So you can see that InfiniBand cuts latency by about 6x for small messages (up to 64 bytes) and by roughly 9x to 12x for messages of 64 KiB and above, compared to TCP/IP over gigabit Ethernet (which is what you’d expect).
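The ratios behind that comparison are easy to check. Here is a short Python sketch using a handful of rows copied from the two result tables above; the speedups helper is a hypothetical name of mine, and the row selection is arbitrary.

```python
# Latency in microseconds keyed by message size in bytes,
# copied from the TCP/IP and InfiniBand runs above.
tcp = {0: 16.19, 1024: 27.36, 4096: 117.28, 1048576: 5081.11, 4194304: 18651.98}
ib = {0: 2.52, 1024: 5.11, 4096: 7.29, 1048576: 565.15, 4194304: 1619.22}


def speedups(slow, fast):
    """Return slow/fast latency ratios for every size present in both runs."""
    return {size: slow[size] / fast[size] for size in sorted(slow) if size in fast}


if __name__ == "__main__":
    for size, ratio in speedups(tcp, ib).items():
        print(f"{size:>8} B  {ratio:5.1f}x")
```

Feeding it the full tables instead of this subset gives the per-size picture for every message size measured.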
Because there’s no virtualisation going on here there is no extra penalty to pay when doing this, no need to configure any fancy device pass-through, and no loss of any CPU MSR access, so I’d argue that Shifter makes Docker containers way more useful for HPC than virtualisation or even Docker itself for the majority of use cases.
Am I excited about Shifter? Yup! The potential to allow users to build an application stack themselves right down to the OS libraries and (with a little careful thought) to have something that could get native interconnect performance is fantastic. Throw in the complexities of dealing with conflicting dependencies between Python modules, system libraries, bioinformatics tools, etc., and the need to provide simple methods for handling these, and the advantages seem clear.
So the plan is to roll this out into production at VLSCI in the near future. Fingers crossed! 🙂