HPC sysadmin job in Melbourne, Australia

No, not where I work for once, but a friend of mine is looking for an HPC sysadmin in his group in the Victoria State Government:

This role requires advanced skills in system and network administration and scripting, clustered computer systems, security, virtualisation and Petabyte-scale storage. It is highly desirable that you have acquired these skills in a Life sciences environment. The heterogeneous environment requires both Linux and Windows skills. You should have the ability to design and implement solutions for automated transfer of data within and between systems and to ensure the security of both internal and Internet-facing systems. In this complex environment, working closely in teams of multi-disciplinary scientists to deliver computing solutions, including advanced troubleshooting and diagnostic skills, will be required. Supervision of other members of the team will also be necessary.

They’ve got a 1500+ core Linux cluster.. ;-)

Locking Down WordPress Admin and Login URLs

For those WordPress admins who are lucky enough to only ever need access from certain defined IP addresses (IPv4 or IPv6), you can lock down access to the wp-admin and wp-login.php URLs in your Apache configuration with just:

<Location /wp-admin/>
    Order deny,allow
    Deny from all
    Allow from 127.0.0.0/255.0.0.0 ::1/128 10.1.2.3/32 1234:5678:90ab:cdef::/64
</Location>

<Files wp-login.php>
    Order deny,allow
    Deny from all
    Allow from 127.0.0.0/255.0.0.0 ::1/128 10.1.2.3/32 1234:5678:90ab:cdef::/64
</Files>
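
Note that Order/Deny/Allow is the Apache 2.2 style of access control; on Apache 2.4 those directives only work via mod_access_compat. A minimal sketch of the equivalent using the newer Require directive (same example addresses as above) would be:

<Location /wp-admin/>
    Require ip 127.0.0.0/8 ::1 10.1.2.3 1234:5678:90ab:cdef::/64
</Location>

<Files wp-login.php>
    Require ip 127.0.0.0/8 ::1 10.1.2.3 1234:5678:90ab:cdef::/64
</Files>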

Hopefully that helps someone!

August 1993 “Preliminary Hardware Configuration for a Main Service Linux Machine”

From 1992 through to 1994 I was working at the Computer Unit at the University of Wales (well, I wrangled an “Employment Training” position there on my own initiative) as a sysadmin and was running Linux on an IBM XT (from very dodgy memory). A friend of mine, Piercarlo Grandi, suggested to me (semi-seriously I suspect) that you could now build a large enough PC to support quite a number of users, and that the Computer Unit could use it as a central server (they were running DEC 5830s with Ultrix), so I knocked up a text file and discussed it with my colleagues. They didn’t take it very seriously – little did any of us suspect how much that would change.

Well tonight I indulged in a bit of computer archaeology and managed to get the data off my Amiga hard disk (from a GVP A530 expansion unit) and, browsing around, happened to stumble over that text file, dated 8:20pm on the 8th August 1993. It’s quite touchingly naive in places, and my numbers are pretty ropey.. :-)

Preliminary Hardware Configuration for a Main Service Linux Machine

Item                            Each    Number          Total

Case                             100    1                 100
Keyboard + Mouse                 100    1                 100
Floppies                         100    1                 100
DAT Drives                       750    4                3000
EISA SCSI Controllers            300    2                 600
Memory (Mb)                       25    256              6400
Pentium EISA Motherboard        1000    1                1000
3.5Gb SCSI-II Disks             1800    5                7000
Screen+SVGA Card                1000    1                1000
EISA Ethernet Card               200    2                 400
CD-ROM Drive                     300    1                 300

                                                        20000

Projected to be able to support between 200-400 users running Linux 0.99.p12
        (Alpha release kernel with patched IP - appears stable)


Notes:

(1) I've seen reports that the ethernet driver code may suffer from a
    memory leak, but I've not seen any evidence for this yet as my
    machine hasn't been turned on for a long enough period for it to
    cause any problems.

(2) As it is so new there is very little commercial software available
    for it, but there is a quite sizeable free software base with many
    of the GNU packages already ported for it, and this is generally of
    high quality.

(3) The Linux kernel is well thought out, and includes support for shared
    libraries (which Ultrix sadly never picked up) which significantly
    reduces the amount of memory applications need.

(4) A Linux box of the size proposed for the service machine has not
    been attempted yet (as far as I know), but ones of the size of the
    proposed testbed machine are already in usage on the Internet. I
    believe that Linux can handle this scaling up with no problem.

(5) There are apparently companies within the UK who sell support services
    for Linux, I will investigate further.

(6) There is already a large amount of Linux expertise on the Internet,
    including the comp.os.linux newsgroup, the linux-activists mailing
    list and even an IRC channel dedicated to Linux users.

This post is dedicated to Rob Ash, my then boss, who took a chance taking me on after my time as a student mucking around on computers when I was meant to be doing my Physics degree, and who was a great mentor for me.

Mount Burnett Observatory (@MBObservatory) now on Twitter

For almost a year now I’ve been a member of the Mount Burnett Observatory, a community project at the old Monash University astronomical observatory at Mount Burnett in the Dandenong Ranges. It’s great fun with both the original 18″ telescope and new 6″ and 8″ Dobsonian telescopes (some thoughtfully sponsored by the Bendigo Bank for education and outreach purposes).

It’s had a Facebook presence for a while, but nothing on Twitter, so after speaking to the webmaster and the president I’ve now set up a Twitter presence as @MBObservatory.

So if you’re into astronomy and around Melbourne (especially the south-eastern suburbs, though we do have people travelling in from quite a way) and use Twitter please do follow us!


Want to help with the Linux kernel?

Greg Kroah-Hartman, the maintainer of the stable releases of the Linux kernel (the point releases that follow a 3.x release, e.g. 3.6.5), is looking for help for about 6 months as he’s getting overwhelmed.

I’m looking for someone to help me out with the stable Linux kernel release process. Right now I’m drowning in trees and patches, and could use someone to help me sanity-check the releases I’m doing.

Specifically, I’m looking for someone to help with:

  • test boot the -rc stable kernels to make sure I didn’t do anything foolish.
  • dig through the Linux kernel distro trees and send me the git commit ids, or the backported patches, of things they are shipping that are not in the stable and longterm kernel releases.
  • do code review of the patches going into the stable releases.

If you can help out with this, I’d really appreciate it.

You’ll need to show you’ve had kernel patches accepted, are running the latest stable release candidate kernel and can find distro patches (details at his website). You’ve got until November 7th to apply!
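
If you fancy having a go at the second of those tasks, a rough sketch of how you might check whether a distro patch’s upstream commit has already landed in the stable tree (the commit ID abc1234 below is just a placeholder):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable

# List any release tags that already contain upstream commit abc1234;
# no output means it is a candidate to send to Greg.
git tag --contains abc1234 | grep '^v3\.6'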

Paying for Freedom (Updated)

There has been much furore over the Microsoft Windows 8 Logo requirements and how they require UEFI Secure Boot to be enabled, forcing the user to reconfigure their UEFI firmware (on x86 platforms) to be able to boot non-Windows 8 operating systems. People are concerned that this may be a slippery slope to systems that are locked down completely (as ARM-powered Windows 8 systems already will be), with vendors not allowed to let Secure Boot be disabled if they want the MS Windows logo tick and thus the valuable marketing dollars from Redmond.

Now to me the solution seems obvious: don’t buy from vendors who sell such locked-down systems, but instead buy from vendors who believe in making systems that are under your control, and who agree that it is you who gets to decide whether or not to turn Secure Boot on. Go to companies like ZaReason (who sell around the world and now have an Asia Pacific setup in New Zealand) and System 76 (who used to be US only, but now apparently ship internationally).

The problem seems to be that people complain their systems tend to be a bit more expensive than the Dells of the world, companies who ship millions of PCs and have huge economies of scale (and power over their OEMs). Because ZaReason and System 76 work at much smaller volumes they don’t get the same deals, so of course their hardware will be more expensive. But that extra cost is actually an investment, a small down payment on having vendors around in the future who care about our freedom to do with our computers as we see fit.

If we don’t make that investment in these companies then we will have no right to complain should we suddenly wake up one morning and find we have a choice between a beige PC that will only boot Windows 8 or later (the ability to get your own code blessed so it will boot having gone away) and a shiny white Apple iProduct that will let you install any of the applications from the App Store, but nothing outside that walled garden.

So I have made my choices: when my desktop PC came a cropper and cooked itself due to the Linux leap second bug I bought a ZaReason Valta desktop, and I then replaced my 9 year old laptop with a shiny new UltraLap 430 ultrabook which, I have to say, absolutely rocks with 8GB of RAM and an i5 Ivy Bridge CPU. :-)

I believe freedom is worth investing in.

Update:

As spufidoo mentions in the comments, the situation for desktops is not too bad at present whilst you can build your own, though there is always the chance that you end up with motherboards shipping with Secure Boot enabled and only Microsoft’s key installed (“why would you want anything else?”).

More of an issue are laptops and tablets, where you can’t really build your own and rely instead on companies to sell you the finished product. This was really the issue I had in mind when I wrote the article, but I failed to articulate it. We’ve already seen examples of the issues around locked-down tablets with the Nook Tablet from Barnes and Noble (though, as the linked article reports, people have since worked around that), so unless we support projects like the ZaTab, where the package includes the source code, we are relying purely on the munificence of companies for whom freedom is not the first thing on their minds.

Problems getting stack traces from a Python program (Kubuntu 12.10 development version)

I’m trying to get to the bottom of this bug on Launchpad which completely breaks Synaptics touchpad configuration under KDE:

https://bugs.launchpad.net/ubuntu/+source/synaptiks/+bug/1039261

The tl;dr version is that the Python interpreter is somehow emitting two calls to the Xorg libXi function XIQueryVersion(); the first call sends a client XInput version number of 2.1 and the second sends 2.0 (as seen using xtrace).

The second call causes a BadValue error, because you’re not meant to send a lower value on any later calls (as can be seen from this Xorg libXi git commit).

This causes the comical error:

The version of the XInput extension installed on your system is too old. Version 2.0 was found, but at least version 2.0 is required

The problem is that the Python code only contains the second call, sending the 2.0 version number; there is no other call in the package that would send anything else, let alone the 2.1 value.

So I want to generate a call trace every time the XIQueryVersion() function is called, but I’m struggling to get it to work.

The killer at the moment is that both ltrace and gdb (when told to trace children) hang when python spawns dash to run ldconfig.real, which blocks, so I never get to the point where the function gets called the first time.

With GDB I’m using:

# Keep both parent and child under gdb, and follow the child, on fork()
set detach-on-fork off
set follow-fork-mode child
# Follow into the new program image on exec()
set follow-exec-mode new
# Stop when a library matching libXi loads, then break on the function
catch load /libXi/
break XIQueryVersion

…and this is what happens:

chris@chris-ultralap:~/Code/Ubuntu$ gdb /usr/bin/python
GNU gdb (GDB) 7.5-ubuntu
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/bin/python...Reading symbols from /usr/lib/debug/usr/bin/python2.7...done.
done.
(gdb) set detach-on-fork off
(gdb) set follow-fork-mode child
(gdb) set follow-exec-mode new
(gdb) catch load /libXi/
Catchpoint 1 (load)
(gdb) break XIQueryVersion
Function "XIQueryVersion" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (XIQueryVersion) pending.
(gdb) run /usr/bin/synaptiks
Starting program: /usr/bin/python /usr/bin/synaptiks
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New process 3788]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Thread 0x7ffff6ccc700 (LWP 3788) is executing new program: /bin/dash
[New process 3789]
process 3789 is executing new program: /bin/dash
process 3789 is executing new program: /sbin/ldconfig.real

…and there it hangs, forever. We never even get to the point where the Python interpreter loads libXi.so, let alone calls the function. :-(

Any ideas?
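
One avenue I’ve not yet tried (so treat this as an untested sketch; the file name and build line are my own invention) is to sidestep the debuggers entirely and interpose XIQueryVersion() with an LD_PRELOAD shim that prints a backtrace on every call:

/* xiqv-trace.c: log a backtrace each time XIQueryVersion() is called.
 * Build:  gcc -shared -fPIC -o xiqv-trace.so xiqv-trace.c -ldl
 * Run:    LD_PRELOAD=$PWD/xiqv-trace.so synaptiks
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <execinfo.h>
#include <stdio.h>

/* Real signature is Status XIQueryVersion(Display *, int *, int *);
 * Display * is kept as void * here to avoid needing the X11 headers. */
typedef int (*xiqv_fn)(void *, int *, int *);

int XIQueryVersion(void *dpy, int *major, int *minor)
{
    void *frames[32];
    int depth;

    /* Look up the real XIQueryVersion further down the link chain */
    xiqv_fn real = (xiqv_fn) dlsym(RTLD_NEXT, "XIQueryVersion");

    fprintf(stderr, "XIQueryVersion() called asking for %d.%d\n",
            *major, *minor);
    depth = backtrace(frames, 32);
    backtrace_symbols_fd(frames, depth, 2);    /* dump the trace to stderr */

    return real(dpy, major, minor);
}

The obvious caveat is that if the Python bindings load libXi with dlopen() and fetch the symbol directly with dlsym() then the shim won’t get a look in, but it’s cheap to try.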

Upgraded to Twitter Tools 3.0 and Social plugin

The latest Twitter Tools upgrade (v3.0) now has a dependency on the Social plugin from MailChimp to take advantage of the open source “don’t reinvent the wheel” philosophy.

Having now installed Social and upgraded Twitter Tools you should now be able to log in with your Twitter account (should you so wish) to leave comments. It also claims comments get tweeted too, but I’ve no idea how that works yet so I’ll use this post as a test.. :-)

Google Disaster Recovery Paper in ACM

Via Tim Freeman (@peakscale) on Twitter comes this very interesting paper on how Google handles disaster recovery planning and testing. Best quote so far:

When the engineers realized that the shortcuts had failed and that no one could get any work done, they all simultaneously decided it was a good time to get dinner, and we ended up DoS’ing our cafes.

They explicitly prevent “critical personnel, area experts, and leaders from participating”, and are prepared to take downtime (and revenue loss) as part of it. They also exposed some interesting issues that wouldn’t otherwise have come to light (as these things inevitably do):

In the same scenario, we tested the use of a documented emergency communications plan. The first DiRT exercise revealed that exactly one person was able to find the plan and show up on the correct phone bridge at the time of the exercise. During the following drill, more than 100 people were able to find it. This is when we learned the bridge wouldn’t hold more than 40 callers. During another call, one of the callers put the bridge on hold. While the hold music was excellent for the soul, we quickly learned we needed ways to boot people from the bridge.

There was also the time they were running low on diesel fuel for a generator and didn’t know how to find the emergency spending procedure, so someone volunteered to put a 6 figure sum on their personal credit card. Probably would do wonders for any air miles they were accruing that way!

On a more whimsical note, there was one comment in the article that attracted my attention, saying:

most operations teams were already continuously testing their systems and cross-training using formats based on popular role-playing games.

That gives pause for thought; if it was Call of Cthulhu I could imagine:

I’m sorry, but your data centre has just been eaten by Shub-Niggurath and your staff have all run away or been consumed by her 1,000 young. Take 5 D6 SAN loss and roll on the permanent insanity table.

Though perhaps Paranoia would have been a more appropriate choice; plenty of Troubleshooters needed there I suspect..

Bandwidth of Tape

As part of the ongoing Stage 2 upgrade at VLSCI we received an extra 1,000 LTO5 tapes, each rated at 1.5TB uncompressed, for an additional 1.5PB of raw, uncompressed storage. Now we typically get around 2X compression, so that’s about 3PB usable. It took our team about 3 hours to uncrate, unpack, transfer and load the 1,000 tapes, effectively shifting 500TB of tape an hour. That’s about 139GB/s, or, if you are a network person, 1.1Tb/s. Not bad! :-)
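
For anyone wanting to check my back-of-the-envelope numbers (decimal units throughout):

    1.5 PB / 3 hours = 1,500,000 GB / 10,800 s ≈ 139 GB/s
    139 GB/s × 8 bits per byte ≈ 1,111 Gb/s ≈ 1.1 Tb/s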