Usenix Paper on Hard Disk Failures

Another gem from the Beowulf mailing list, this time courtesy of Justin Moore, who is the Google employee on the list I referred to in an earlier post.

This one is a paper published at the 5th USENIX Conference on File and Storage Technologies looking at failure rates in a disk population of 100,000 drives – a similar scale to the Google paper but this time spread over various disk technologies including SATA, Fibre Channel and SCSI.

Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?

They estimate, from the data sheets, that the nominal annual failure rate should be at most 0.88%, but in the field they found annual replacement rates typically in excess of 1%, with 2-4% being common and some systems ranging all the way up to 13%. They also see something different to the infant mortality that Mark Hahn alluded to when commenting on the Google paper:

We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Their conclusions give numbers to this, saying:

For drives less than five years old, field replacement rates were larger than what the datasheet MTTF suggested by a factor of 2-10. For five to eight year old drives, field replacement rates were a factor of 30 higher than what the datasheet MTTF suggested.
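
To put those numbers in context, here is a rough sketch of the arithmetic behind the 0.88% figure and the "factor of" comparisons. It assumes the usual simplification that the annual failure rate is roughly the number of hours in a year divided by the MTTF, and it takes the 1,000,000 hour MTTF from the paper's title; the function names are my own and not anything from the paper.

```python
# Rough illustration only: converts a datasheet MTTF into a nominal annual
# failure rate (AFR) using the common approximation AFR ~= hours per year / MTTF.
# Function names and the choice of a 1,000,000 hour MTTF are assumptions made
# for this example.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def nominal_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate, as a fraction, implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def excess_factor(observed_rate: float, mttf_hours: float) -> float:
    """How many times larger an observed annual rate is than the nominal one."""
    return observed_rate / nominal_afr(mttf_hours)


if __name__ == "__main__":
    mttf = 1_000_000  # hours, as in the paper's title
    print(f"nominal AFR: {nominal_afr(mttf):.2%}")
    # Observed annual replacement rates quoted above: 2%, 4% and 13%.
    for observed in (0.02, 0.04, 0.13):
        factor = excess_factor(observed, mttf)
        print(f"observed {observed:.0%} is {factor:.1f}x the nominal rate")
```

With an MTTF of 1,000,000 hours this gives a nominal rate of about 0.88% a year, so even the "common" 2-4% replacement rates are already several times what the data sheet would lead you to expect.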

For those interested in the perceived higher reliability of SCSI/FC drives over their SATA brethren, the paper has this to say:

In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component specific factors. However, the only evidence we have of a bad batch of disks was found in a collection of SATA disks experiencing high media error rates. We have too little data on bad batches to estimate the relative frequency of bad batches by type of disk, although there is plenty of anecdotal evidence that bad batches are not unique to SATA disks.

Usenix have published the full text of the paper online, both as HTML and as a PDF document, so take your pick and start reading!