ZFS Bug From Solaris Found in Linux FUSE Version and Fixed

Those who know me in my day job know that I’m pretty good at breaking things, so I guess I shouldn’t be surprised that I found a ZFS bug which came from the OpenSolaris code base and had been sitting there unnoticed for about a year. The ZFS on Linux developer has now fixed the bug and sent a patch back upstream, so hopefully there will be a fix in OpenSolaris because of work done on Linux!

The good thing is that, because I found it on Linux running ZFS under FUSE, the bug didn’t take my system down when the ZFS file system daemon died. 🙂 http://www.csamuel.org/2007/06/19/zfsfuse-makes-it-to-linuxworld-and-lwn/

Must Remember for Future ZFS on Linux Testing..

Support for block device based filesystems in FUSE went into 2.6.20, so it’ll be interesting to see what effect (if any) it will have on ZFS/FUSE, especially given that zfs is named in the commit. 🙂

I never intended this, but people started using fuse to implement block device based “real” filesystems (ntfs-3g, zfs).

It looks like Ubuntu’s Feisty Fawn will ship with this: the 2.6.20 kernels in the development versions show the fuseblk filesystem in /proc/filesystems once you’ve loaded the fuse module, and the fuse-utils package seems to support it too.
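If you want to check for this on your own box without eyeballing the file, something like the following rough sketch should do. It assumes the usual /proc/filesystems layout (an optional "nodev" flag, then the filesystem name), and the function names are just ones I made up for illustration.

```python
from pathlib import Path


def registered_filesystems(proc_text):
    """Return the set of filesystem names in /proc/filesystems-style text."""
    names = set()
    for line in proc_text.splitlines():
        fields = line.split()
        if fields:
            # The name is the last field; a leading "nodev" only marks
            # filesystems that are not backed by a block device.
            names.add(fields[-1])
    return names


def has_fuseblk(proc_text):
    """True if the fuseblk filesystem type is registered with the kernel."""
    return "fuseblk" in registered_filesystems(proc_text)


if __name__ == "__main__":
    path = Path("/proc/filesystems")
    if path.exists():
        print("fuseblk available:", has_fuseblk(path.read_text()))
    else:
        print("/proc/filesystems not found (probably not Linux)")
```

Remember that fuseblk will only show up after the fuse module has been loaded.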

Update: Sadly it appears that this isn’t much use for ZFS. 🙁

Usenix Paper on Hard Disk Failures

Another gem from the Beowulf mailing list, this time courtesy of Justin Moore, who is the Google employee on the list I referred to in an earlier post.

This one is a paper published at the 5th USENIX Conference on File and Storage Technologies looking at failure rates in a disk population of 100,000 drives – a similar scale to the Google paper but this time spread over various disk technologies including SATA, Fibre Channel and SCSI.

Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?

They estimate, from the data sheets, that the nominal annual failure rate should be 0.88% but in reality found it to be in excess of 1% with 2-4% being common and ranging all the way up to 13%. They also see something different to the infant mortality that Mark Hahn alluded to when commenting on the Google paper:

We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Their conclusions give numbers to this, saying:

For drives less than five years old, field replacement rates were larger than what the datasheet MTTF suggested by a factor of 2-10. For five to eight year old drives, field replacement rates were a factor of 30 higher than what the datasheet MTTF suggested.
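To see where a figure like 0.88% comes from, here is a small sketch of the arithmetic. It assumes the simplest possible conversion, hours in a year divided by the datasheet MTTF, which is my reading rather than anything lifted from the paper's text, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def nominal_afr_percent(mttf_hours):
    """Approximate annual failure rate (%) implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours * 100.0


def ratio_to_datasheet(observed_percent, mttf_hours):
    """How many times larger an observed annual rate is than the nominal one."""
    return observed_percent / nominal_afr_percent(mttf_hours)


if __name__ == "__main__":
    mttf = 1_000_000  # hours, as in the paper's title
    print(f"nominal AFR: {nominal_afr_percent(mttf):.2f}%")
    # Observed annual replacement rates mentioned above, in percent.
    for observed in (2.0, 4.0, 13.0):
        factor = ratio_to_datasheet(observed, mttf)
        print(f"observed {observed:.0f}% is {factor:.1f}x the nominal rate")
```

With an MTTF of 1,000,000 hours this works out to roughly the 0.88% quoted above, so even the "common" 2-4% replacement rates are a multiple of what the datasheet implies.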

For those interested in the perceived higher reliability of SCSI/FC drives over their SATA brethren, the paper has this to say:

In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component specific factors. However, the only evidence we have of a bad batch of disks was found in a collection of SATA disks experiencing high media error rates. We have too little data on bad batches to estimate the relative frequency of bad batches by type of disk, although there is plenty of anecdotal evidence that bad batches are not unique to SATA disks.

Usenix have published the full paper text online in either HTML form or as a PDF document, so take your pick and start reading!

Know Your Rights – Satellites Crashing Onto Your Property

After a bit of stochastic web-enabled research (i.e. random searching for the conclusion of this case, triggered by catching up on a story of Rich’s) I found this little piece of information from the UNSW Law Journal that everyone should bookmark just in case they need it..

There’s A Satellite In My Backyard! – Mir And The Convention On International Liability For Damage Caused By Space Objects.

But what is the legal position in relation to damage caused by the return to Earth of a space object such as Mir? Are there any rules in place to cover such an eventuality? Under what circumstances would Russia have been responsible at international law for any such damage? What would be the extent of its liability? How is damage to be measured and what procedures (if any) are in place to facilitate compensation claims and to arrive at a determination of responsibility and its consequences? Once a determination is made, is it a legally binding and enforceable decision?

Just remember where you read it when you need it.. 😎

Google Paper on Hard Disk Failures (Updated)

Eugen Leitl posted an interesting paper from Google to the Beowulf list, Failure Trends in a Large Disk Drive Population (PDF), where “large” is in excess of 100,000 drives. The paper abstract says:

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

Some of the Beowulfers have come up with constructive criticism of the paper, including an interesting comment from rgb:

How did they look for predictive models on the SMART data? It sounds like they did a fairly linear data decomposition, looking for first order correlations. Did they try to e.g. build a neural network on it, or use fully multivariate methods (ordinary stats can handle it up to 5-10 variables).

and from Mark Hahn:

funny, when I saw figure5, I thought the temperature effect was pretty dramatic. in fact, all the metrics paint a pretty clear picture of infant mortality, then reasonably fit drives suriving their expected operational life (3 years). in senescence, all forms of stress correlate with increased failure. I have to believe that the 4/5th year decreases in AFR are either due to survival effects or sampling bias.

It will be interesting to see if they take notice of this open source peer review as there is at least one person from Google on the list.

Update: There is also a Usenix paper on hard disk failures that looks at different hard disk types.

More Scary Australian Copyright Braindeadness..

The Association for Progressive Communications has a really interesting summary of the possible implications of new copyright legislation in Australia. They have a set of PDFs there that give a “risk matrix” for teens, families, small businesses and industry.

If you’ve ever wondered how a bunch of kids singing in a restaurant can turn into a criminal offence under copyright law then this is for you (especially if you own an iPod). Read ’em and weep..


Chillies are old news..

..about 6,000-year-old news in fact!

Archaeologists in Ecuador have found evidence that chillies were used in cooking more than 6,000 years ago. […] The team of scientists who made the discovery in a tropical lowland area say the spice must have been transported over the Andes to what is now Ecuador as the chillies only grew naturally to the east of the mountain range.

The BBC also has a nice chilli recipe site which includes recipes for chocolate and chilli ice cream and chocolate chilli crème brûlée!