ZFS Disk Mirroring, Striping and RAID-Z

This is the third in a series of tests (( the previous ones are ZFS on Linux Works! and ZFS versus XFS with Bonnie++ patched to use random data )), but this time we’re going to test how it handles multiple drives natively, rather than running over an existing software RAID+LVM setup. ZFS can dynamically add disks to a pool for striping (the default), mirroring or RAID-Z (with single or double parity), which are designed to improve speed (striping), reliability (mirroring), or both performance and reliability (RAID-Z).

I can’t use the same hardware as before for this testing, but I do happen to have an old (10+ years) Olivetti Netstrada with four 200MHz Intel Pentium Pro processors, 256MB of RAM and five 4GB SCSI drives. This means it’s a lot slower than the previous 2.6GHz P4 system with 1GB of RAM and dual SATA drives, so the overall times for the runs are not comparable at all – if for no other reason than Bonnie++ by default works with file sizes that are twice your RAM size to eliminate as much as possible of the effect of the OS caches.

So here is a result from this system for XFS on a single drive for comparison later on. No LVM here as the box is set up purely for testing ZFS.
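For reference, each run in this article was timed by wrapping Bonnie++ in time, roughly like this (the test directory and the -u root switch are from memory rather than copied from my shell history, so treat them as assumptions):

# run Bonnie++ in the filesystem under test as root; the default file size is twice RAM
time bonnie++ -d /mnt/xfs -u root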

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M           10067  33  4901  17            9505  14 129.7   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   273  16 +++++ +++   253  12   268  16 +++++ +++   135   7
netstrada,496M,,,10067,33,4901,17,,,9505,14,129.7,3,16,273,16,+++++,+++,253,12,268,16,+++++,+++,135,7

real    9m43.959s
user    0m1.650s
sys     1m20.160s

Note this is not my patched version of Bonnie++ that uses random data rather than 0’s for file content, but as I’m not going to test compression here it is unlikely to make much difference.

Now we want to set up ZFS. There are 4 drives completely free on the system (sdb, sdc, sdd, sde) so we’ll just use the bare drives – no need for partition tables here.

zpool create test /dev/sdb

Now we’ll create a file system that we’re going to work in. This is all the same as before because we’re not doing anything special. Yet.

zfs create test/volume1
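As a quick sanity check you can see the new filesystem and where it ended up with the commands below (the mountpoint here is ZFS’s default of /test/volume1 – an assumption about this build, I didn’t note it down at the time):

zfs list
df -h /test/volume1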

Here’s a test result from just that drive.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3260   2  1778   2            6402   3  34.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   323   4  1067   7   301   3   329   4  1549   9   313   3
netstrada,496M,,,3260,2,1778,2,,,6402,3,34.5,0,16,323,4,1067,7,301,3,329,4,1549,9,313,3

real    16m37.150s
user    0m2.170s
sys     0m23.330s

So not quite half the speed of XFS, but close. Let’s see what happens if I add another drive as a stripe.

Striping

Striping is the simplest way of adding a drive to a ZFS pool; all it takes is:

zpool add test /dev/sdc

We can check that the second drive has been added by running zpool status, which says:

  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          /dev/sdb  ONLINE       0     0     0
          /dev/sdc  ONLINE       0     0     0

errors: No known data errors

OK – now what has that done for performance?

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3175   2  1705   2            6104   3  34.9   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   325   4  1069   6   315   3   331   5  1557   6   345   3
netstrada,496M,,,3175,2,1705,2,,,6104,3,34.9,0,16,325,4,1069,6,315,3,331,5,1557,6,345,3

real    16m46.772s
user    0m2.400s
sys     0m23.100s

Nothing at all really – so how about with all 4 drives striped in the pool?

zpool add test /dev/sdd
zpool add test /dev/sde

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3136   2  1704   2            5572   3  38.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   331   4  1055   6   330   3   333   4  1576  12   348   3
netstrada,496M,,,3136,2,1704,2,,,5572,3,38.3,0,16,331,4,1055,6,330,3,333,4,1576,12,348,3

real    16m32.234s
user    0m1.990s
sys     0m23.720s

Still nothing – very odd indeed, but this may be one of the areas where work still has to be done.

Blow it away, start again

At the moment ZFS (on Solaris or Linux) only supports removing drives that are marked as hot spares, so we’ll need to destroy this pool and start again. Once more it’s pretty easy to do (warning: there are no safety nets here – if you type these commands your data will go away, pronto). First we need to remove any volumes in the pool.

zfs destroy -r test

Then we can destroy the pool itself.

zpool destroy test
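As a sanity check that everything really has gone, zpool list should now report that there are no pools on the system (I’m paraphrasing its output from memory rather than quoting it):

zpool list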

Now we will start again at the same point as before, with just a single drive.

zpool create test /dev/sdb
zfs create test/volume1

Mirroring

To convert a single-drive ZFS pool into a mirror we cannot use the zpool add command; we have to use zpool attach instead (see the manual page for more information).

zpool attach test /dev/sdb /dev/sdc

If we look at what zpool status says we see:

  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Jan  1 22:10:19 2007
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            /dev/sdb  ONLINE       0     0     0
            /dev/sdc  ONLINE       0     0     0

errors: No known data errors

So that confirms that we now have a mirror for testing – dead easy (( We could have set up the mirror from scratch with zpool create test mirror /dev/sdb /dev/sdc but I wanted to show it was possible to start from just one drive ))! So does it help with performance?

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3069   2  1484   2            5634   3  31.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   331   4  1087   7   338   4   329   4  1638   8   342   3
netstrada,496M,,,3069,2,1484,2,,,5634,3,31.5,0,16,331,4,1087,7,338,4,329,4,1638,8,342,3

real    18m3.939s
user    0m2.130s
sys     0m23.560s

Er, no would appear to be the definitive answer. OK, so what if we add the two remaining drives into the mirror and try again?

zpool attach test /dev/sdb /dev/sdd
zpool attach test /dev/sdb /dev/sde

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            2475   1  1332   1            5638   3  29.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   324   4  1041   7   296   3   324   4  1424   8   307   3
netstrada,496M,,,2475,1,1332,1,,,5638,3,29.2,0,16,324,4,1041,7,296,3,324,4,1424,8,307,3

real    19m59.974s
user    0m2.500s
sys     0m24.570s

So it appears that ZFS mirroring doesn’t impart any performance benefit, though a four-way mirror should at least be very reliable.
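One consolation is that, unlike devices added as stripes, a mirror device can be taken out again with zpool detach, so I could have shrunk this back to a smaller mirror without destroying the pool. A sketch of what that would look like (I didn’t actually run this as part of these tests):

zpool detach test /dev/sde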

RAID-Z

To test RAID-Z I’ll destroy the existing pool and then create a new RAID-Z pool using all 4 drives (( It’s late and I’m back at work tomorrow! )).

zfs destroy -r test
zpool destroy test
zpool create test raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
zfs create test/volume1

This is reported by zpool status as:

  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            /dev/sdb  ONLINE       0     0     0
            /dev/sdc  ONLINE       0     0     0
            /dev/sdd  ONLINE       0     0     0
            /dev/sde  ONLINE       0     0     0

errors: No known data errors

OK – now let’s see how that performs:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3148   2  1475   2            5353   2  19.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   323   4  1064   6   319   3   322   4  1407   6   316   3
netstrada,496M,,,3148,2,1475,2,,,5353,2,19.0,0,16,323,4,1064,6,319,3,322,4,1407,6,316,3

real    21m9.018s
user    0m2.250s
sys     0m23.130s

So slower again, but that’s going to be because it’s got to do parity calculations on top of its usual processing load.
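For completeness, the double-parity variant I mentioned at the start is created in exactly the same way but with the raidz2 keyword. I haven’t run this on the test box, so treat it as an untested sketch (and note that with 4 drives you’d only get 2 drives’ worth of usable space):

zpool create test raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde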

Summary

Being able to make a RAID array or mirror and extend it from the command line is very nice, but it is a little odd that adding drives to a striped pool doesn’t seem to change the performance at all. It could be that my test box is underpowered and I was hitting its hardware limits, but XFS being much faster on a single drive seems to contradict that.

Looking at the results from zpool iostat -v whilst the RAID-Z case was doing its re-write test, the I/O load seemed to be nicely balanced between the drives (see below), but it never seemed to get much over 2MB/s. I did peek occasionally at zpool iostat (without the -v) and even with all 4 drives striped it didn’t exceed that barrier. It may be that there is a bottleneck further up in the code that needs to be fixed first, and once that is done the drives will hit their proper performance.

                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
test           663M  15.0G     12     26  1.58M  1.85M
  raidz1       663M  15.0G     12     26  1.58M  1.85M
    /dev/sdb      -      -      9     25   403K   640K
    /dev/sdc      -      -      9     25   404K   640K
    /dev/sdd      -      -      9     26   403K   640K
    /dev/sde      -      -      9     22   405K   637K
------------  -----  -----  -----  -----  -----  -----
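If you want to watch this yourself, zpool iostat will keep printing updated statistics if you give it an interval in seconds, for example (the 5 second interval is just a value I picked for illustration, not necessarily what I used):

zpool iostat -v test 5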

Anyway, this is just an alpha release, so there’s much more to come!

Update: my guess is that it is this issue from the STATUS file about what is not yet implemented:

Multi-threaded event loop (relatively easy to implement, should come in the next alpha or so).

20 thoughts on “ZFS Disk Mirroring, Striping and RAID-Z”

  1. Well, I just can’t wait – ZFS go mainstream, NOW!!! 🙂

    … I wonder when it will be included in standard GNU/Linux distros, and when it will become a standard Linux filesystem… I am convinced that both things will happen eventually, just when…

    … and in case of the question: why the hell should ZFS become the default fs? Dead easy: because of the security of the data. Even if you have bad luck, it is just a broken hard drive. The point here is that even the casual user really does want to minimize the risk of having his/her mailbox/system/pics^H^H^H^H^H destroyed…

    Anyway, when it comes to security as in “no one else having access”, then unfortunately, even on Solaris the encryption isn’t supported yet ( http://www.opensolaris.org/os/project/zfs-crypto/ ). When, oh when will it be available on GNU/Linux…

  2. I doubt if it will be included as standard in many distros; Debian considers the CDDL to be non-free under the Debian Free Software Guidelines, and Red Hat don’t like anything except ext2 and ext3. Ubuntu, Fedora and SuSE may package it up, but as it’s a user-mode filesystem it’s unlikely to get offered by default.

    To get in the kernel it either means a complete cleanroom rewrite (i.e. without reference to the original code to avoid possible contamination issues) or for Sun to dual-license OpenSolaris under GPLv2 (or BSD) as well as CDDL.

  3. So, I think it will be something like the current situation with libdvdcss – everyone uses it, but it is not included anywhere :). It is a pity, because that way it isn’t going to be well known among casual users… and it really should.

    On the other hand, the benefits are so obvious that I hope someone, somewhere, will write a clean-room reimplementation of ZFS for the Linux kernel… or maybe both Sun and Linus will go to GPLv3 with OpenSolaris and Linux (and hell will freeze over 🙂 ).

    Oh well… hope… anyway – the things I am talking about need years, or at least months, to become reality. Meanwhile, everyone I stumble upon has never heard of ZFS. So BIG THANKS to you for being so actively interested in this. More evangelists are still needed, however… Steve Jobs? 😉

  4. I’ve been referring exactly to this speculation. Not owning a Mac, I still think this is (or rather will be) good news for all of us, because it will make ZFS more popular among casual users.

  5. Ricardo has now committed a patch to the trunk that disables -DDEBUG by default, as that was causing ZFS to do extra checksumming for bug finding, resulting in a much higher computational load.

  6. You’ll need to convince *all* the contributors who have specified GPLv2 only (and hope that those who have sadly passed away didn’t choose that wording of the license for their work). I suspect that could be quite some job!

  7. Linus for sure should be the first one convinced… (anyway, that post of mine above contained some irony 🙂 ). I really wonder whether there will be some GPLv3 fork of Linux… it is in the wind… or maybe everyone will go away from Linux and into Solaris (Sun’s wet dream).

    Anyway, for the serious part, Mozilla did it (changed their license, asking all the developers for permissions, http://www.mozilla.org/MPL/relicensing-faq.html ).

    Linus’ attitude that “Linux is just for fun” deeply disturbs me. It is not fun for everyone. Some other people depend on these “toys”. Also, as soon as the “OpenSolaris on GPLv3” thing becomes official, it will be a BIG problem in the Linux world; Solaris contains too many advanced, tempting technologies to just ignore (ZFS and DTrace are very good examples of that).

  8. I wouldn’t rule out Linux under GPLv3 completely, but I suspect it will be quite a while away, if it ever happens. For a start we still don’t have a GPLv3. 🙂

    As for “just for fun”, well it was his baby so he can do what he pleases; the best bit is that even if he gets eaten by a killer penguin at Canberra zoo (again) others will be able to carry on with it – it no longer relies entirely on him (though it would not be a painless transition!). Where I work we run large clusters for scientific HPC and they all depend on Linux. 😉

    I dunno about Solaris being “more advanced”, the kernel has some interesting bits (ZFS mainly) but it’s still a System V derivative and will have a heap of legacy code in it, not to mention kernel interfaces they can’t break easily, whereas Linux doesn’t have those sorts of constraints and they actively don’t want stable kernel interfaces so they can keep improving it. Of course where the kernel gets exposed to user space is a different matter…

    I spent a long time sysadmining lots of commercial Unix variants (including SunOS and Solaris) and always one of the first things I did with them was install bash and various GNU utilities to make life bearable!

    However, now Nexenta exists with a GNU userspace, that makes it much more interesting to use, it’s just a shame that the Solaris kernel is too bloated to boot on my test machine. 🙁

  9. I’m loving the enthusiasm for ZFS, I have discovered it recently and am feeling all those nasty multi-volume hassles melt away… But I’m not sure that your test results are really valid, and here’s why. Please correct me if I’m wrong.

    The slowest part of most disk IO is the actual physical disk and the hard disk head, which moves in concentric circles around the spindle of the disk. The speed that the disk rotates, and the speed that the head seeks to a new location is important here.

    When you use a striped arrangement, as in ZFS’s raidz1 – you are taking advantage of the fact that several disk heads are reading and writing your data for you. So the computer can divide up a read request between several disks, and get rid of the physical bottleneck by getting different heads to read (or write) different slices.

    I notice that you are using sdb, sdc, sdd, and sde – which are partitions (or slices) of the same physical disk. This means that striping won’t work – since there’s only one physical disk head involved in reading this data. All you’re really doing with ZFS here is combining several partitions into one easily accessible pool. There should be no performance benefit, and in fact it should be slower, since the same data has to be written to several parts of the same disk by the one head.

    SO – if you want faster ZFS and striping, use several physical disks. In fact, if you want a more reliable/resilient setup, don’t mirror several partitions from the same disk together either – since the one disk is likely to stop working thereby killing all your mirrors at once. Instead use several actual disks, ideally several physical controllers too.

    Or have I missed something here?


    Shameless plug: Free Travel Advice and Blog at BallOfDirt.com

  10. Hi Joe,

    Your message got labelled as spam by Akismet and fortunately I just spotted it now as I was about to kill off everything it had found!

    Anyway, your comment on the sdb, sdc, sdd, and sde being slices of the same physical disk isn’t true for Linux, ‘sd’ refers to a SCSI disk device and then each drive is allocated a letter after that. Then if I was partitioning the drives I would get sdb1, sdb2, etc to refer to each partition.

    So just sdb, sdc refers to the entire (unpartitioned) drive.
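    For example, a quick listing of the device nodes makes the difference clear (exactly what you see obviously depends on the machine):

    # whole drives appear as sdb, sdc, ...; partitions would be sdb1, sdb2, ...
    ls /dev/sd*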

    Hope that helps!

  11. Ah cheers 🙂 Have you tried the latest ZFS (beta 1)? It seems to work pretty well, I’d be interested in performance comparisons (even if you just ran one striped test).

    I’ve been trying to work out how to get ZFS going on my Mac. The next OS ( 10.5 “Leopard”) will have it, but they’re delaying that, and I have about 2TB of disks lying around in random configurations/partitions. If only I can get a virtual machine to access the disks directly, I could then export the shares via NFS and use them from my main OS.

  12. What was CPU usage during these tests? I can almost guarantee that the ZFS system was pegged at 100% and was in fact your bottleneck.

    ZFS performance comes at the very drastic expense of CPU power. The nice trade-off is that it appears to scale well above what other file systems can do, using massive amounts of CPU but delivering unheard-of performance.

    If you have the CPU usage metrics, please post back.

    -=dave

  13. Hi Dave, thanks for the input – whilst ZFS does eat CPU, that doesn’t appear to be the single limiting factor at the moment. Watching “top” on this 4 CPU test rig, it appears the zfs-fuse process runs at anywhere between 50% and 120% with a lot of idle time about, so it’s just not scaling out to that number of CPUs yet.

    As Ricardo has mentioned elsewhere, he’s still got a lot of work to do on performance under FUSE, and my guess is that this is one of those areas! 🙂

  14. Thanks for the info. Out of curiosity, have you had the opportunity to try the same test with additional RAM? If available CPU is ample and RAM changes do not affect your performance in this test, it surely points to ZFS/FUSE shortcomings. As your anecdotal results seem to go against the grain of what others are boasting, it would be nice to know definitively what the primary factor contributing to the lackluster performance is. But then, time will surely tell. Thanks again for posting the info.

  15. Hi Dave, glad to post the info, it’s all useful stuff!

    Sadly I suspect this 10-year-old test box tops out at 256MB, and I believe there is still RAM free whilst it is running. Bonnie is a fairly severe test case; it tries to ensure that the system can’t get clever about using RAM to cache, so what other people are testing may not be purely the filesystem (and disk) performance.
