This is the third in a series of tests (( the previous ones are ZFS on Linux Works! and ZFS versus XFS with Bonnie++ patched to use random data )), but this time we’re going to test how it handles multiple drives natively, rather than running over an existing software RAID+LVM setup. ZFS can dynamically add disks to a pool for striping (the default), mirroring or RAID-Z (with single or double parity), which are designed to improve speed (striping), reliability (mirroring) or both (RAID-Z).
I can’t use the same hardware as before for this testing, but I do happen to have an old (10+ year) Olivetti Netstrada with four 200MHz Intel Pentium Pro processors, 256MB of RAM and five 4GB SCSI drives. This means it’s a lot slower than the previous 2.6GHz P4 system with 1GB of RAM and dual SATA drives, so the overall times for the runs are not comparable at all – if for no other reason than that Bonnie++ by default works with file sizes twice your RAM size, to eliminate as much of the effect of the OS caches as possible.
So here is a result from this system for XFS on a single drive for comparison later on. No LVM here as the box is set up purely for testing ZFS.
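For reference, every timing below comes from running Bonnie++ under time. The invocation was along these lines (a sketch: the target directory, the -u root switch and the -f fast-mode flag are assumptions, the last suggested by the blank per-character columns in the results).
# -d points at whichever filesystem is under test (the XFS mount here, /test/volume1 later);
# -f skips the per-character tests; -u root is only needed if running as root
time bonnie++ -f -u root -d /path/to/filesystem/under/test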
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M            10067  33  4901  17            9505  14 129.7   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   273  16 +++++ +++   253  12   268  16 +++++ +++   135   7
netstrada,496M,,,10067,33,4901,17,,,9505,14,129.7,3,16,273,16,+++++,+++,253,12,268,16,+++++,+++,135,7

real    9m43.959s
user    0m1.650s
sys     1m20.160s
Note this is not my patched version of Bonnie++ that uses random data rather than zeros for file content, but as I’m not going to test compression here it is unlikely to make much difference.
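(For reference only, since compression isn’t being tested here: it is just a per-filesystem property that could be flipped on the test/volume1 filesystem created below. A sketch, not used in any of these runs.)
# shown purely for reference, not used in these tests
zfs set compression=on test/volume1
zfs get compression test/volume1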
Now we want to set up ZFS. There are 4 drives completely free on the system (sdb, sdc, sdd, sde) so we’ll just use the bare drives, no need for partition tables now.
zpool create test /dev/sdb
Now we’ll create a file system that we’re going to work in. This is all the same as before because we’re not doing anything special. Yet.
zfs create test/volume1
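A quick sanity check that the pool and filesystem are there before benchmarking (a sketch, assuming the default mountpoint of /test/volume1):
# pool size and health
zpool list test
# filesystems in the pool and their mountpoints (the default here is /test/volume1)
zfs list -r test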
Here’s a test result from just that drive.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M             3260   2  1778   2            6402   3  34.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   323   4  1067   7   301   3   329   4  1549   9   313   3
netstrada,496M,,,3260,2,1778,2,,,6402,3,34.5,0,16,323,4,1067,7,301,3,329,4,1549,9,313,3

real    16m37.150s
user    0m2.170s
sys     0m23.330s
So not quite half the speed of XFS, but close. Let’s see what happens if I add another drive as a stripe.
Striping
Striping is the simplest way of adding a drive to a ZFS pool; all it takes is:
zpool add test /dev/sdc
We can check the second drive has been added by doing zpool status, which says:
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          /dev/sdb    ONLINE       0     0     0
          /dev/sdc    ONLINE       0     0     0

errors: No known data errors
OK – now what has that done for performance?
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M             3175   2  1705   2            6104   3  34.9   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   325   4  1069   6   315   3   331   5  1557   6   345   3
netstrada,496M,,,3175,2,1705,2,,,6104,3,34.9,0,16,325,4,1069,6,315,3,331,5,1557,6,345,3

real    16m46.772s
user    0m2.400s
sys     0m23.100s
Nothing at all really – so how about with all 4 drives striped in the pool?
zpool add test /dev/sdd
zpool add test /dev/sde
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M             3136   2  1704   2            5572   3  38.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   331   4  1055   6   330   3   333   4  1576  12   348   3
netstrada,496M,,,3136,2,1704,2,,,5572,3,38.3,0,16,331,4,1055,6,330,3,333,4,1576,12,348,3

real    16m32.234s
user    0m1.990s
sys     0m23.720s
Still nothing – very odd indeed, but this may be one of the areas where work still needs to be done.
Blow it away, start again
At the moment ZFS (on Solaris or Linux) only supports removing drives that are marked as hot spares, so we’ll need to destroy this pool and start again. Once more it’s pretty easy to do (warning: no safety nets here – if you type these commands your data will go away, pronto). First we need to remove any volumes in the pool.
zfs destroy -r test
Then we can destroy the pool itself.
zpool destroy test
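A quick check that nothing is left behind (a sketch):
# should report that no pools are available once the destroy has completed
zpool list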
Now we will start again at the same point as before, with just a single drive.
zpool create test /dev/sdb
zfs create test/volume1
Mirroring
To convert a single-drive ZFS pool into a mirror we cannot use the zpool add command; we have to use zpool attach instead (see the manual page for more information).
zpool attach test /dev/sdb /dev/sdc
If we look at what zpool status says we see:
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Jan 1 22:10:19 2007
config:

        NAME            STATE     READ WRITE CKSUM
        test            ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            /dev/sdb    ONLINE       0     0     0
            /dev/sdc    ONLINE       0     0     0

errors: No known data errors
So that confirms that we now have a mirror for testing – dead easy (( we could have set up the mirror from scratch with zpool create test mirror /dev/sdb /dev/sdc, but I wanted to show it was possible to start from just one drive ))! So does it help with performance?
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M             3069   2  1484   2            5634   3  31.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   331   4  1087   7   338   4   329   4  1638   8   342   3
netstrada,496M,,,3069,2,1484,2,,,5634,3,31.5,0,16,331,4,1087,7,338,4,329,4,1638,8,342,3

real    18m3.939s
user    0m2.130s
sys     0m23.560s
Er, no would appear to be the definitive answer. OK, so what if we add the two remaining drives into the mirror and try again?
zpool attach test /dev/sdb /dev/sdd
zpool attach test /dev/sdb /dev/sde
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M             2475   1  1332   1            5638   3  29.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   324   4  1041   7   296   3   324   4  1424   8   307   3
netstrada,496M,,,2475,1,1332,1,,,5638,3,29.2,0,16,324,4,1041,7,296,3,324,4,1424,8,307,3

real    19m59.974s
user    0m2.500s
sys     0m24.570s
So it appears that ZFS mirroring doesn’t impart any performance benefit here (if anything the four-way mirror is slower), but it should be very reliable.
RAID-Z
To test RAID-Z I’ll destroy the existing pool and then create a new RAID-Z pool using all 4 drives (( It’s late and I’m back at work tomorrow! )).
zfs destroy -r test
zpool destroy test
zpool create test raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
zfs create test/volume1
This is reported by zpool status as:
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        test            ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            /dev/sdb    ONLINE       0     0     0
            /dev/sdc    ONLINE       0     0     0
            /dev/sdd    ONLINE       0     0     0
            /dev/sde    ONLINE       0     0     0

errors: No known data errors
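One aside worth knowing with RAID-Z (not something from the original runs): zpool list reports the raw capacity of the pool, including the space that will go on parity, while zfs list shows the space actually usable by filesystems, so the two numbers will disagree on a raidz1 pool.
# raw pool capacity, counting the space that parity will consume
zpool list test
# usable space as the filesystems see it, after parity overhead
zfs list test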
OK – now let’s see how that performs:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada     496M             3148   2  1475   2            5353   2  19.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   323   4  1064   6   319   3   322   4  1407   6   316   3
netstrada,496M,,,3148,2,1475,2,,,5353,2,19.0,0,16,323,4,1064,6,319,3,322,4,1407,6,316,3

real    21m9.018s
user    0m2.250s
sys     0m23.130s
So slower again, but that’s going to be because it’s got to do parity calculations on top of its usual processing load.
Summary
The fact that you can easily make a RAID array or mirror and extend them from the command line is very nice, but the fact that adding drives to a striped pool doesn’t seem to change the performance at all is a little odd. Now it could be that my test box is underpowered and I was hitting its hardware limits, but the fact that XFS was much faster than ZFS on a single drive seems to contradict that.
Looking at the results from zpool iostat -v whilst the RAID-Z case was doing its rewrite test, the I/O load seemed to be nicely balanced between the drives (see below), but the total throughput never seemed to exceed a little over 2MB/s. I did peek occasionally at zpool iostat (without the -v) and even with all 4 drives striped it didn’t exceed that barrier. It may be that there is a bottleneck further up in the code that needs to be fixed first, after which the drives will hit their proper performance.
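The figures below were captured with something along these lines, run in a second terminal while Bonnie++ was busy (the 5-second interval is an assumption):
# per-device activity for the pool, refreshed every 5 seconds
zpool iostat -v test 5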
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
test           663M  15.0G     12     26  1.58M  1.85M
  raidz1       663M  15.0G     12     26  1.58M  1.85M
    /dev/sdb      -      -      9     25   403K   640K
    /dev/sdc      -      -      9     25   404K   640K
    /dev/sdd      -      -      9     26   403K   640K
    /dev/sde      -      -      9     22   405K   637K
------------  -----  -----  -----  -----  -----  -----
Anyway, this is just an alpha release, so there’s much more to come!
Update: my guess is that it’s this item from the STATUS file, listing what is not yet implemented:
Multi-threaded event loop (relatively easy to implement, should come in the next alpha or so).
Well, I just can’t wait – ZFS go mainstream, NOW!!! 🙂
… I wonder when it will be included in standard GNU/Linux distros, and when it will become a standard Linux filesystem… I am convinced that both things will happen eventually, just when…
… and in case of the question: why the hell should ZFS become the default fs? Dead easy: because of the security of the data. Even if you have bad luck, it is just a broken hard drive. The point here is that even the casual user really does want to minimize the risk of having his/her mailbox/system/pics^H^H^H^H^H destroyed…
Anyway, when it comes to security as in “no one else having access”, then unfortunately, even on Solaris the encryption isn’t supported yet ( http://www.opensolaris.org/os/project/zfs-crypto/ ). When, oh when will it be available on GNU/Linux…
I doubt it will be included as standard in many distros: Debian considers the CDDL to be non-free under the Debian Free Software Guidelines, and Red Hat doesn’t like anything except ext2 and ext3. Ubuntu, Fedora and SuSE may package it up, but as it’s a user-mode filesystem it’s unlikely to be offered by default.
To get in the kernel it either means a complete cleanroom rewrite (i.e. without reference to the original code to avoid possible contamination issues) or for Sun to dual-license OpenSolaris under GPLv2 (or BSD) as well as CDDL.
So, I think it will be something like the current situation with libdvdcss – everyone uses it, but it is not included anywhere :). It is a pity, because that way it isn’t going to become well known among casual users… and it really should be.
On the other hand, the benefits are so obvious that I hope someone, somewhere, will write a clean-room reimplementation of ZFS for the Linux kernel… or maybe both Sun and Linus will go to GPLv3 with OpenSolaris and Linux (and hell will freeze over 🙂 ).
Oh well… hope… anyway – the things I am talking about need years, or at least months, to become reality. Meanwhile, everyone I stumble upon has never heard of ZFS. So BIG THANKS to you for being so actively interested in this. More evangelists are still needed, however… Steve Jobs? 😉
I believe ZFS is due to be included in the next version of Mac OS X…
I’ve been referring exactly to this speculation. Not owning a Mac, I still think this is (or rather will be) good news for all of us, because it will make ZFS more popular among casual users.
It’s more than just speculation, there’s a blog post from someone who’s driven it from the shell.. 🙂
Ricardo has now committed a patch to the trunk that disables -DDEBUG by default, as that was causing ZFS to do extra checksumming for bug finding, resulting in a much higher computational load.
Now this is something, in case you don’t know this yet:
http://www.eweek.com/article2/0,1895,2084284,00.asp
One step closer to the native kernel-based implementation, isn’t it? Now let someone convince Linus to re-release Linux as GPLv3 and then… 😉
You’ll need to convince *all* the contributors who have specified GPLv2 only (and hope that those who have sadly passed away didn’t choose that wording of the license for their work).. I suspect that could be quite some job!
Linus for sure should be the first one convinced… (anyway, that post of mine above contained some irony 🙂 ). I really wonder whether there will be some GPLv3 fork of Linux… it is in the wind… or maybe everyone will move away from Linux and over to Solaris (Sun’s wet dream).
Anyway, for the serious part, Mozilla did it (changed their license, asking all the developers for permissions, http://www.mozilla.org/MPL/relicensing-faq.html ).
Linus’ attitude that “Linux is just for fun” deeply disturbs me. It is not fun for everyone; some other people depend on these “toys”. Also, as soon as the “OpenSolaris on GPLv3” thing becomes official it will be a BIG problem in the Linux world: Solaris contains too many advanced, tempting technologies to just ignore (ZFS and DTrace are very good examples of that).
I wouldn’t rule out Linux under GPLv3 completely, but I suspect it will be quite a while away, if it ever happens. For a start we still don’t have a GPLv3. 🙂
As for “just for fun”, well it was his baby so he can do what he pleases; the best bit is that even if he gets eaten by a killer penguin at Canberra zoo (again), others will be able to carry on with it – it no longer relies entirely on him (though it would not be a painful transition!). Where I work we run large clusters for scientific HPC and they all depend on Linux. 😉
I dunno about Solaris being “more advanced”; the kernel has some interesting bits (ZFS mainly) but it’s still a System V derivative and will have a heap of legacy code in it, not to mention kernel interfaces they can’t break easily, whereas Linux doesn’t have those sorts of constraints and its developers actively don’t want stable kernel interfaces so they can keep improving it. Of course, where the kernel gets exposed to user space is a different matter…
I spent a long time sysadmining lots of commercial Unix variants (including SunOS and Solaris) and always one of the first things I did with them was install bash and various GNU utilities to make life bearable!
However, now that Nexenta exists with a GNU userspace it’s much more interesting to use; it’s just a shame that the Solaris kernel is too bloated to boot on my test machine. 🙁
I’m loving the enthusiasm for ZFS, I have discovered it recently and am feeling all those nasty multi-volume hassles melt away… But I’m not sure that your test results are really valid, and here’s why. Please correct me if I’m wrong.
The slowest part of most disk IO is the actual physical disk and the hard disk head, which moves in concentric circles around the spindle of the disk. The speed that the disk rotates, and the speed that the head seeks to a new location is important here.
When you use a striped arrangement, as in ZFS’s raidz1 – you are taking advantage of the fact that several disk heads are reading and writing your data for you. So the computer can divide up a read request between several disks, and get rid of the physical bottleneck by getting different heads to read (or write) different slices.
I notice that you are using sdb, sdc, sdd, and sde – which are partitions (or slices) of the same physical disk. This means that striping won’t work – since there’s only one physical disk head involved in reading this data. All you’re really doing with ZFS here is combining several partitions into one easily accessible pool. There should be no performance benefit, and in fact it should be slower, since the same data has to be written to several parts of the same disk by the one head.
SO – if you want faster ZFS and striping, use several physical disks. In fact, if you want a more reliable/resilient setup, don’t mirror several partitions from the same disk together either – since the one disk is likely to stop working thereby killing all your mirrors at once. Instead use several actual disks, ideally several physical controllers too.
Or have I missed something here?
—
Shameless plug: Free Travel Advice and Blog at BallOfDirt.com
Hi Joe,
Your message got labelled as spam by Akismet and fortunately I just spotted it now as I was about to kill off everything it had found!
Anyway, your comment about sdb, sdc, sdd and sde being slices of the same physical disk isn’t true for Linux: ‘sd’ refers to a SCSI disk device and each drive is allocated a letter after that. If I were partitioning the drives I would get sdb1, sdb2, etc. to refer to each partition.
So a plain sdb, sdc, etc. each refer to an entire (unpartitioned) drive.
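For anyone who wants to check this on their own box, something like the following shows whole disks versus partitions (a sketch; device names will obviously vary):
# every block device and partition the kernel knows about
cat /proc/partitions
# a drive with no partition table will list no partitions here
fdisk -l /dev/sdb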
Hope that helps!
Ah cheers 🙂 Have you tried the latest ZFS (beta 1)? It seems to work pretty well, I’d be interested in performance comparisons (even if you just ran one striped test).
I’ve been trying to work out how to get ZFS going on my Mac. The next OS (10.5 “Leopard”) will have it, but they’re delaying that, and I have about 2TB of disks lying around in random configurations/partitions. If only I could get a virtual machine to access the disks directly, I could then export the shares via NFS and use them from my main OS.
I’ve documented some ZFS tests of post-beta code from the repository on the Feisty betas, but that wasn’t on the test system I’ve been talking about here. I have been meaning to get around to it, but not quite got there yet. 😉
Time to turn that box back on and dist-upgrade to the new release!
As for the Mac, I’ve no idea how you could go about that I’m afraid..
OK, I’ve now done some updated figures using striping and RAIDZ with the latest ZFS/FUSE under Ubuntu Feisty Fawn (7.04) and seen some useful improvements in ZFS/FUSE.
What was CPU usage during these tests? I can almost guarantee that the ZFS system was pegged at 100% and was in fact your bottleneck.
ZFS performance comes at the very drastic expense of CPU power. The nice trade-off is that it appears to scale well beyond what other file systems can do, using massive amounts of CPU but delivering unheard-of performance.
If you have the CPU usage metrics, please post back.
-=dave
Hi Dave, thanks for the input – whilst ZFS does eat CPU, that doesn’t appear to be the single limiting factor at the moment. Watching “top” on this 4-CPU test rig, the zfs-fuse process runs at anywhere between 50% and 120% with a lot of idle time about, so it’s just not scaling out to that number of CPUs yet.
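For the record those numbers were just eyeballed from an interactive top; something like this would capture them properly over a whole run (a sketch: the interval and sample count are arbitrary).
# log the zfs-fuse process's CPU usage every 5 seconds for the duration of a run
top -b -d 5 -n 200 | grep zfs-fuse > zfs-fuse-cpu.log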
As Ricardo has mentioned elsewhere, he’s still got a lot of work to do on performance under FUSE, and my guess is that this is one of those areas! 🙂
Thanks for the info. Out of curiosity, have you had the opportunity to try the same test with additional RAM? If available CPU is ample and RAM changes do not affect your performance in this test, it surely points to ZFS/FUSE shortcomings. As your anecdotal results seem to go against the grain of what others are boasting, it would be nice to know definitively what the primary factor contributing to the lackluster performance is. But then, time will surely tell. Thanks again for posting the info.
Hi Dave, glad to post the info, it’s all useful stuff!
Sadly I suspect this 10-year-old test box tops out at 256MB, and I believe there is still RAM free whilst it is running. Bonnie++ is a fairly severe test case: it tries to ensure that the system can’t get clever about using RAM to cache, so what other people are testing may not be purely the filesystem (and disk) performance.