Google Disaster Recovery Paper in ACM

Via Tim Freeman (@peakscale) on Twitter comes this very interesting paper on how Google handles disaster recovery planning and testing. Best quote so far:

When the engineers realized that the shortcuts had failed and that no one could get any work done, they all simultaneously decided it was a good time to get dinner, and we ended up DoS’ing our cafes.

They explicitly prevent “critical personnel, area experts, and leaders from participating”, and are prepared to take downtime (and revenue loss) as part of it. The exercises also exposed some interesting issues that wouldn’t have come to light otherwise (as these things inevitably do):

In the same scenario, we tested the use of a documented emergency communications plan. The first DiRT exercise revealed that exactly one person was able to find the plan and show up on the correct phone bridge at the time of the exercise. During the following drill, more than 100 people were able to find it. This is when we learned the bridge wouldn’t hold more than 40 callers. During another call, one of the callers put the bridge on hold. While the hold music was excellent for the soul, we quickly learned we needed ways to boot people from the bridge.

There was also the time they were running low on diesel fuel for a generator and didn’t know how to find the emergency spending procedure, so someone volunteered to put a six-figure sum on their personal credit card. It probably would have done wonders for any air miles they were accruing that way!

On a more whimsical note, there was one comment in the article that attracted my attention, saying:

most operations teams were already continuously testing their systems and cross-training using formats based on popular role-playing games.

That gives pause for thought. If it were Call of Cthulhu, I could imagine:

I’m sorry, but your data centre has just been eaten by Shub-Niggurath and your staff have all run away or been consumed by her 1,000 young. Take 5 D6 SAN loss and roll on the permanent insanity table.

Though perhaps Paranoia would have been a more appropriate choice; plenty of troubleshooters needed there, I suspect.