SCO can’t count


Groklaw reports that, at their latest court appearance in the US, SCO claimed to have identified "300 million lines of Linux code" that are affected.



OK, so SCO say 300 million lines in the kernel are affected. How much is that as a percentage? That’s easy to find out: the kernel sources are publicly accessible, so we can count the number of lines and do the math.



Consider the 2.6.2 kernel, the latest at the time of writing. We find every file ending in .c or .h (all of it source code, though a sizeable chunk is likely to be comments), cat them together and pipe the result through the word-count program “wc” with the -l option, which tells it to report only the number of lines.



So here we go:



$ find linux-2.6.2 -name '*.[ch]' | xargs cat | wc -l
5298593


Oh dear, that looks very strange. Only 5.3 million lines of text in all the program files (and again, not all of that will be code). Perhaps they were a bit ambitious and counted the lines in all the files?



$ find linux-2.6.2 -type f | xargs cat | wc -l
6008957
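
As an aside, plain xargs splits its input on whitespace, so a file name containing a space or a quote could upset counts like the two above. The following is a minimal sketch of a more defensive version, assuming a find and xargs that support the -print0 and -0 options (GNU and BSD versions do). The function name count_lines is made up for this illustration.

```shell
# count_lines DIR [find tests...]
# Hypothetical helper: prints the total number of lines in all regular
# files under DIR that match the optional find tests.
count_lines() {
    dir=$1
    shift
    # -print0 and -0 pass file names NUL-delimited, so names containing
    # spaces, quotes or newlines reach cat intact.
    find "$dir" -type f "$@" -print0 | xargs -0 cat | wc -l
}
```

Calling count_lines linux-2.6.2 -name '*.[ch]' would repeat the first count, and count_lines linux-2.6.2 the second.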


Oops, that doesn’t look any better: only about 700,000 extra lines, for a total of 6 million! So SCO are out by a factor of 50!
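
For anyone who wants to check the factor, here is a small sketch of the division using awk. The variable names are just for illustration; the two numbers are the claimed figure and the all-files count from above.

```shell
claimed=300000000   # the figure SCO are reported to have claimed
counted=6008957     # lines in every file of linux-2.6.2, counted above

# awk does the floating-point division that plain sh arithmetic cannot.
factor=$(awk -v a="$claimed" -v b="$counted" 'BEGIN { printf "%.1f", a / b }')
echo "The claim is about ${factor} times the size of the whole tree"
```

This comes out at 49.9, so "a factor of 50" is a fair rounding.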



But, if you work it out, there have been 26 releases of the 2.2 series kernel, 25 releases of the 2.4 series and 3 releases of the 2.6 kernel, which gives us 54 releases in total. Now, if we assume that they’re all roughly the same size as 2.6.2 (which they’re not, they’ve been slowly increasing over time), then 54 releases of about 6 million lines each comes to a little over 300 million lines of code from 2.2.0 through to 2.6.2.
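
The back-of-the-envelope sum can be sketched in shell arithmetic, which also serves as a check on the release total. It assumes, generously, that every release is as large as the full 2.6.2 tree, and it uses only the per-series counts quoted above. The variable names are illustrative.

```shell
# Releases per stable series, as quoted in the text.
r22=26
r24=25
r26=3
releases=$((r22 + r24 + r26))

# Assumption: every release is as big as the whole 2.6.2 tree (an overestimate,
# since earlier kernels were smaller).
lines_per_release=6008957
total=$((releases * lines_per_release))

echo "$releases releases, roughly $total lines in total"
```

With these inputs the script reports 54 releases and 324483678 lines, which is in the same region as the 300 million figure.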



However, I’ve not included the 2.3 and 2.5 development series, so it could be they’re just claiming everything from 2.3 onwards. Or maybe not, given that there were 76 releases in the 2.5 series. Or maybe half the code from 2.0.0 to 2.6.0. The problem here is that nobody aside from SCO knows what they’re talking about, and they won’t tell anyone, even under a court order!