Playing with Shifter Part 2 – converted Docker containers inside Slurm

This is continuing on from my previous blog about NERSC’s Shifter which lets you safely use Docker containers in an HPC environment.

Getting Shifter to work in Slurm is pretty easy: it includes a plugin that you install and then tell Slurm about. My test config was just:

required /usr/lib64/shifter/ shifter_config=/etc/shifter/udiRoot.conf

as I was installing by building RPMs (our preferred method is to install the plugin into our shared filesystem for the cluster so we don’t need to have it in the RAM disk of our diskless nodes). Once that is done you can add the shifter program’s arguments to your Slurm batch script and then just call shifter inside it to run a process, for instance:


#SBATCH -p debug
#SBATCH --image=debian:wheezy

shifter cat /etc/issue

results in the following on our RHEL compute nodes:

[samuel@bruce Shifter]$ cat slurm-1734069.out 
Debian GNU/Linux 7 \n \l

simply demonstrating that it works. The advantage of using the plugin and specifying the image this way is that the plugin preps the container for us at the start of the batch job and keeps it around until the job ends, so you can keep running commands in your script inside the container without the overhead of creating and destroying it each time. If you need to run something in a different image you just pass the --image option to shifter itself; it will then need to set up and tear down that container, but the one you specified for your batch job is still there.
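For instance, a batch script along these lines (a sketch – the commands are illustrative, and ubuntu:16.04 here stands in for any second image you have pulled) keeps reusing the job’s container and only pays the setup/teardown cost for the one ad-hoc image:

```shell
#!/bin/bash
#SBATCH -p debug
#SBATCH --image=debian:wheezy

# These all run in the job's pre-built debian:wheezy container,
# with no per-command setup/teardown cost.
shifter cat /etc/issue
shifter uname -a

# This one-off command names a different image, so shifter has to
# create and destroy that container just for this invocation.
shifter --image=ubuntu:16.04 cat /etc/issue
```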

That’s great for single CPU jobs, but what about parallel applications? Well, it turns out that’s easy too – you just request the configuration you need and slap srun in front of the shifter command. You can even run MPI applications successfully this way. I grabbed the dispel4py/docker.openmpi Docker container with shifterimg pull dispel4py/docker.openmpi and tried its Python version of the MPI hello world program:

#SBATCH -p debug
#SBATCH --image=dispel4py/docker.openmpi
#SBATCH --ntasks=3
#SBATCH --tasks-per-node=1

shifter cat /etc/issue

srun shifter python /home/tutorial/mpi4py_benchmarks/

This prints the MPI rank to demonstrate that the MPI wire-up was successful. I forced it to run the tasks on separate nodes and print the hostnames to show it’s communicating over a network, not via shared memory on the same node. But the output bemused me a little:

[samuel@bruce Python]$ cat slurm-1734135.out
Ubuntu 14.04.4 LTS \n \l

libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[[30199,2],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: bruce001

Another transport will be used instead, although this may result in
lower performance.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
Hello, World! I am process 0 of 3 on bruce001.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[[30199,2],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: bruce002

Another transport will be used instead, although this may result in
lower performance.
Hello, World! I am process 1 of 3 on bruce002.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[[30199,2],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: bruce003

Another transport will be used instead, although this may result in
lower performance.
Hello, World! I am process 2 of 3 on bruce003.

It successfully demonstrates that it is using an Ubuntu container on 3 nodes, but the warnings are triggered because Open-MPI in Ubuntu is built with Infiniband support and it is detecting the presence of the IB cards on the host nodes. This is because Shifter is (as designed) exposing the system’s /sys directory to the container. The problem is that this container doesn’t include the Mellanox user-space library needed to make use of the IB cards, so you get warnings that they aren’t working and that it will fall back to a different mechanism (in this case TCP/IP over gigabit Ethernet).

Open-MPI allows you to specify what transports to use, so adding one line to my batch script:

export OMPI_MCA_btl=tcp,self,sm

cleans up the output a lot:

Ubuntu 14.04.4 LTS \n \l

Hello, World! I am process 0 of 3 on bruce001.
Hello, World! I am process 2 of 3 on bruce003.
Hello, World! I am process 1 of 3 on bruce002.

This also raises the question – what does this do for latency? The image contains a Python version of the OSU latency test, which uses different message sizes between 2 MPI ranks to profile performance. Running this over TCP/IP is trivial with the dispel4py/docker.openmpi container, but of course it’s lacking the Mellanox library I need, and as the whole point of Shifter is security I can’t get root access inside the container to install the package. Fortunately the author of dispel4py/docker.openmpi has their implementation published on GitHub, so I forked their repo, signed up for Docker Hub and pushed a version which simply adds the libmlx4-1 package I needed.
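The change amounts to a Dockerfile of roughly this shape (a sketch of the idea rather than necessarily the exact file I pushed):

```dockerfile
# Build on the original image and add the Mellanox user-space
# driver so Open-MPI's openib transport can use the IB cards.
FROM dispel4py/docker.openmpi
RUN apt-get update && \
    apt-get install -y libmlx4-1 && \
    rm -rf /var/lib/apt/lists/*
```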

Running the test over TCP/IP is simply a matter of submitting this batch script which forces it onto 2 separate nodes:

#SBATCH -p debug
#SBATCH --image=chrissamuel/docker.openmpi:latest
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1

export OMPI_MCA_btl=tcp,self,sm

srun shifter python /home/tutorial/mpi4py_benchmarks/

giving these latency results:

[samuel@bruce MPI]$ cat slurm-1734137.out
# MPI Latency Test
# Size [B]        Latency [us]
0                        16.19
1                        16.47
2                        16.48
4                        16.55
8                        16.61
16                       16.65
32                       16.80
64                       17.19
128                      17.90
256                      19.28
512                      22.04
1024                     27.36
2048                     64.47
4096                    117.28
8192                    120.06
16384                   145.21
32768                   215.76
65536                   465.22
131072                  926.08
262144                 1509.51
524288                 2563.54
1048576                5081.11
2097152                9604.10
4194304               18651.98

To run that same test over Infiniband I just modified the export in the batch script to force it to use IB (and thus fail if it couldn’t talk between the two nodes):

#SBATCH -p debug
#SBATCH --image=chrissamuel/docker.openmpi:latest
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1

export OMPI_MCA_btl=openib,self,sm

srun shifter python /home/tutorial/mpi4py_benchmarks/

which then gave these latency numbers:

[samuel@bruce MPI]$ cat slurm-1734138.out
# MPI Latency Test
# Size [B]        Latency [us]
0                         2.52
1                         2.71
2                         2.72
4                         2.72
8                         2.74
16                        2.76
32                        2.73
64                        2.90
128                       4.03
256                       4.23
512                       4.53
1024                      5.11
2048                      6.30
4096                      7.29
8192                      9.43
16384                    19.73
32768                    29.15
65536                    49.08
131072                   75.19
262144                  123.94
524288                  218.21
1048576                 565.15
2097152                 811.88
4194304                1619.22

So you can see that’s basically an order of magnitude improvement in latency using Infiniband compared to TCP/IP over gigabit Ethernet (which is what you’d expect).
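For a rough sense of the gap, dividing the corresponding rows of the two tables (just an awk one-liner over the numbers above):

```shell
# Speed-up of Infiniband over TCP/IP at the smallest and largest
# message sizes, using the latencies from the two runs above.
awk 'BEGIN {
    printf "0 B: %.1fx  4 MB: %.1fx\n", 16.19 / 2.52, 18651.98 / 1619.22
}'
# prints: 0 B: 6.4x  4 MB: 11.5x
```

So it’s about 6x for tiny messages and over 11x for 4MB ones.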

Because there’s no virtualisation going on here there is no extra penalty to pay: no need to configure any fancy device pass-through, no loss of CPU MSR access. So I’d argue that Shifter makes Docker containers far more useful for HPC than virtualisation, or even Docker itself, for the majority of use cases.

Am I excited about Shifter? Yup! The potential to let users build an application stack themselves, right down to the OS libraries, and (with a little careful thought) have something that could get native interconnect performance, is fantastic. Throw in the complexities of dealing with conflicting dependencies between Python modules, system libraries, bioinformatics tools, etc., and the need to provide simple methods for handling these, and the advantages seem clear.

So the plan is to roll this out into production at VLSCI in the near future. Fingers crossed! 🙂

Playing with Shifter – NERSC’s tool to use Docker containers in HPC

Early days yet, but playing with NERSC’s Shifter to let us use Docker containers safely on our test RHEL6 cluster is looking really interesting (given you can’t use Docker itself under RHEL6, and if you could the security concerns would cancel it out anyway).

To use a pre-built Ubuntu Xenial image, for instance, you tell it to pull the image:

[samuel@bruce ~]$ shifterimg pull ubuntu:16.04

There’s a number of steps it goes through, first retrieving the container from the Docker Hub:

2016-08-01T18:19:57 Pulling Image: docker:ubuntu:16.04, status: PULLING

Then disarming the Docker container by removing any setuid/setgid bits, etc, and repacking as a Shifter image:

2016-08-01T18:20:41 Pulling Image: docker:ubuntu:16.04, status: CONVERSION

…and then it’s ready to go:

2016-08-01T18:21:04 Pulling Image: docker:ubuntu:16.04, status: READY

Using the image from the command line is pretty easy:

[samuel@bruce ~]$ cat /etc/lsb-release

[samuel@bruce ~]$ shifter --image=ubuntu:16.04 cat /etc/lsb-release

and the shifter runtime will copy in site-specified /etc/passwd, /etc/group and /etc/nsswitch.conf files so that you can do user/group lookups easily, as well as map in site-specified filesystems, so your home directory is just where it would normally be on the cluster.

[samuel@bruce ~]$ shifter --image=debian:wheezy bash --login
samuel@bruce:~$ pwd

I’ve not yet got to the point of configuring the Slurm plugin so you can queue up a Slurm job that will execute inside a Docker container, but very promising so far!

Correction: a misconception on my part – Shifter doesn’t put a Slurm batch job inside the container. It could, but there are good reasons why it’s better to leave that to the user (soon to be documented on the Shifter wiki page for Slurm integration).

Let’s Encrypt – getting your own (free) SSL certificates

For those who’ve not been paying attention the Let’s Encrypt project entered public beta recently so that anyone could get their own SSL certificates. So I jumped right in with the simp_le client (as the standard client tries to configure Apache for you, and I didn’t want that as my config is pretty custom) and used this tutorial as inspiration.

My server is running Debian Squeeze LTS (for long, painful reasons that I won’t go into here) but the client installation was painless; I just patched out a warning about Python 2.6 no longer being supported in venv/lib/python2.6/site-packages/cryptography/ 🙂

It worked well until I got rate limited for creating more than 10 certificates in a day (yeah, I host a number of domains).

Very happy with the outcome, A+ would buy again.. 🙂

UniFi systemd unit file for Ubuntu 15.04

At work we’ve started using some UniFi wireless gear, and the system I’ve managed to commandeer as the control system for it is running Kubuntu 15.04, which uses systemd. The UniFi Debian packages don’t supply systemd unit files, so I went hunting and found a blog post by Derek Horn about getting it running on CentOS 7, so I nabbed his and adapted it for Ubuntu (which wasn’t that hard).

The file lives in /etc/systemd/system/unifi.service and was enabled with systemctl enable unifi.service (from memory, the other step involved may have been a systemctl daemon-reload to get systemd to rescan unit files and pick up the new one, but I don’t remember for sure).

Here is the unit file:

# Systemd unit file for unifi-rapid

[Unit]
Description=UniFi Wireless AP Control System

[Service]
#ExecStart=/usr/bin/java -Xmx1024M -jar /usr/lib/unifi/lib/ace.jar start
ExecStart=/usr/bin/jsvc -nodetach -home /usr/lib/jvm/java-7-openjdk-amd64 -cp /usr/share/java/commons-daemon.jar:/usr/lib/unifi/lib/ace.jar -pidfile /var/run/unifi/ -procname unifi -outfile SYSLOG -errfile SYSLOG -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Xmx1024M com.ubnt.ace.Launcher start
#ExecStop=/usr/bin/java -jar /usr/lib/unifi/lib/ace.jar stop
ExecStop=/usr/bin/jsvc -home /usr/lib/jvm/java-7-openjdk-amd64 -cp /usr/share/java/commons-daemon.jar:/usr/lib/unifi/lib/ace.jar -pidfile /var/run/unifi/ -procname unifi -outfile SYSLOG -errfile SYSLOG -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Xmx1024M -stop com.ubnt.ace.Launcher stop

[Install]
WantedBy=multi-user.target


ARM v8 (64-bit) developer boxes

Looks like things are moving along in the world of 64-bit ARM: systems aimed at early-adopting developers are now around. For instance, APM have their X-C1 Development Kit Plus, which has 8 x 2.4GHz ARMv8 cores, 16GB RAM, 500GB HDD, 1x10gigE and 3x1gigE for ~US$2,500 (or a steep discount if you qualify as a developer). Oh, and it ships with Linux by default, of course.

Found via a blog post by Steve McIntyre about bringing up Debian Jessie on ARMv8 (it’ll be a release architecture for it) which has the interesting titbit that (before ARM had their Juno developer boxes):

Then Chen Baozi and the folks running the Tianhe-2 supercomputer project in Guangzhou, China contacted us to offer access to some arm64 hardware

So it looks like (I presume) NUDT are paying it some attention & building/acquiring their own ARMv8 systems.

An Aide Memoire for Debian Upgrades

Documenting this as it’s the process I followed for upgrading Debian on the VM that runs this blog and others, for the purposes of having notes for next time and in case it helps others. Please do not consider it complete, if it breaks your upgrade you get to keep both parts..

  • Ensure all current updates to Debian are installed
  • Firewall the VM from all external access to quiesce activity and prevent email delivery and web access before it’s all tested
  • cd /etc; etckeeper commit "Commit any uncommitted pre-upgrade changes" – Make sure there are no uncommitted changes in /etc
  • Change /etc/apt/sources.list to point to the new Debian version
  • apt-get install debian-archive-keyring – to ensure current PGP keys are in place to verify the archive
  • postconf -e "soft_bounce = yes" – Set Postfix to do soft bounces in case of issues bringing things back up
  • Stop all web and email services
  • Comment out all crontab entries for amavis
  • Ensure Maia Mailguard has processed all its outstanding quarantined items by running the script by hand.
  • Do a full mysqldump, copy it offline
  • Stop MySQL to quiesce it
  • Use the hosting providers web interface to take a snapshot backup of the current state, just in case..
  • script -f upgrade – log everything to a log file
  • Start Postfix only so apt-listchanges can email news to root
  • apt-get update && apt-get dist-upgrade – do the actual upgrade!
  • Fix all the things. 🙂

First experiences with Dell XPS 12 9Q33

My new work laptop is a Dell XPS 12 9Q33 which has an Intel Haswell CPU in it – I can haz hardware transactional memory! It’s also got 8GB RAM, 256GB Samsung SSD and a multitouch touchscreen which will swivel around to form a tablet type device. 🙂

Dell XPS 12 screen rotation to form tablet - linked from Extreme Tech

Of course the first thing to do with this new machine was to install Kubuntu 13.10 as Dell will not ship these without the Windoze tax. So far I’ve got to say it’s been pretty painless, everything works out of the box and the only niggles I’ve found so far are:

UEFI hangs if I have a Kogan 64GB USB3 stick in one of the USB3 ports

Well the solution is don’t do it then. I’ll see if I can figure out how to report this as an issue to Dell.

The Linux i2c-hid driver currently doesn’t properly handle the touch screen and trackpad – both work but the trackpad is detected as a PS2 mouse

This is fine for me, I’m happy to just use the trackpad as an old-school mouse and be able to use the touchscreen should I need to. If you really want to use the trackpad and don’t care about the touchscreen then you just need to blacklist the i2c-hid kernel module and reboot. There is an existing Ubuntu bug on Launchpad about this, but it’s an upstream kernel issue.

The WLAN kill switch (FN+F2) is not recognised

This is probably the most annoying one, and I’ve not (yet) got around to reporting it as a bug. You can work around it from the command line with: rfkill block all

Things I like:

  • A great screen, 1920×1080 and really sharp
  • A nice fast SSD, bonnie++ measured over 400MB/s block write and over 600MB/s block read using btrfs
  • Small – my previous Dell laptop was a Latitude Z600 which was nice but very wide
  • Haswell CPU – latest Intel goodness
  • Low power – powertop reports it getting down to 5 to 6 Watts when it’s idling (with the screen on at 60% brightness)

Seems quite promising so far!

pv: a handy replacement for cat when piping large amounts of data into commands

If you’re ever in the situation of needing to pipe a large amount of data into a program and would usually use cat or just redirect from a file, but would like some idea of how long it may take, then may I recommend to you the “pv” command (packaged in Debian/Ubuntu/RHEL/etc)?

For instance, here is restoring a 9GB MySQL dump into a MariaDB database:

root@db3:/var/tmp# pv db4.sql | mysql
 570MB 0:02:06 [5.01MB/s] [>                                   ]  5% ETA 0:34:28

Suddenly you’ve got the data rate, the percentage complete and an ETA so you can go off and get a coffee whilst it works..

How to delete lots of programs in MythTV, easily

I realised I had over 60 episodes of Get Smart recorded which I was never going to get around to watching, so I wanted to delete them quickly. I had a quick poke at MythWeb but that didn’t seem to have the functionality; a quick google, however, revealed this forum post which says:

When in select recording to watch, mark the recording with a slash “/”.
Mark all that you want to delete.
Press M to bring up the Recordings list menu.
Select playlist options
Select Delete

Works like a charm!

There’s also Craig’s set of command line tools that can assist with this: