Eugen Leitl posted an interesting paper from Google to the Beowulf list, Failure Trends in a Large Disk Drive Population (PDF), where “large” is in excess of 100,000 drives. The paper abstract says:
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
Some of the Beowulfers have offered constructive criticism of the paper, including an interesting comment from rgb:
How did they look for predictive models on the SMART data? It sounds like they did a fairly linear data decomposition, looking for first order correlations. Did they try to e.g. build a neural network on it, or use fully multivariate methods (ordinary stats can handle it up to 5-10 variables).
and from Mark Hahn:
funny, when I saw figure5, I thought the temperature effect was pretty dramatic. in fact, all the metrics paint a pretty clear picture of infant mortality, then reasonably fit drives surviving their expected operational life (3 years). in senescence, all forms of stress correlate with increased failure. I have to believe that the 4/5th year decreases in AFR are either due to survival effects or sampling bias.
It will be interesting to see whether they take notice of this open source peer review, as there is at least one person from Google on the list.
Update: There is also a USENIX paper on hard disk failures that looks at different hard disk types.