ZFS Disk Mirroring, Striping and RAID-Z

This is the third in a series of tests (( the previous ones are ZFS on Linux Works! and ZFS versus XFS with Bonnie++ patched to use random data )), but this time we’re going to test how ZFS handles multiple drives natively, rather than running over an existing software RAID+LVM setup. ZFS can dynamically add disks to a pool for striping (the default), mirroring or RAID-Z (with single or double parity), which are designed to improve speed (striping), reliability (mirroring), or both (RAID-Z).

I can’t use the same hardware as before for this testing, but I do happen to have an old (10+ years) Olivetti Netstrada with four 200MHz Intel Pentium Pro processors, 256MB of RAM and five 4GB SCSI drives. This means it’s a lot slower than the previous test box (a 2.6GHz P4 with 1GB of RAM and dual SATA drives), so the overall run times are not comparable at all – if for no other reason than that Bonnie++ by default works with file sizes twice your RAM size, to eliminate as much as possible of the effect of the OS caches.

So here is a result from this system for XFS on a single drive for comparison later on. No LVM here as the box is set up purely for testing ZFS.
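
The exact Bonnie++ command line isn’t recorded in these posts, but something along these lines – with an illustrative mount point – produces output and timings in this format, Bonnie++ picking the 496MB file size itself as roughly twice the usable RAM:

time bonnie++ -d /mnt/test -u root    # /mnt/test is an illustrative path; -u lets it run as root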

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M           10067  33  4901  17            9505  14 129.7   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   273  16 +++++ +++   253  12   268  16 +++++ +++   135   7
netstrada,496M,,,10067,33,4901,17,,,9505,14,129.7,3,16,273,16,+++++,+++,253,12,268,16,+++++,+++,135,7

real    9m43.959s
user    0m1.650s
sys     1m20.160s

Note this is not my patched version of Bonnie++ that uses random data rather than 0’s for file content, but as I’m not going to test compression here it is unlikely to make much difference.

Now we want to set up ZFS. There are four drives completely free on the system (sdb, sdc, sdd and sde), so we’ll just use the bare drives – no need for partition tables this time.

zpool create test /dev/sdb

Now we’ll create a file system that we’re going to work in. This is all the same as before because we’re not doing anything special. Yet.

zfs create test/volume1
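
A quick sanity check at this point (not part of the timed runs) is to list the pool and the new dataset; by default the dataset should get mounted at /test/volume1:

zpool list    # pool capacity and usage
zfs list      # datasets and their mount points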

Here’s a test result from just that drive.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3260   2  1778   2            6402   3  34.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   323   4  1067   7   301   3   329   4  1549   9   313   3
netstrada,496M,,,3260,2,1778,2,,,6402,3,34.5,0,16,323,4,1067,7,301,3,329,4,1549,9,313,3

real    16m37.150s
user    0m2.170s
sys     0m23.330s

So it’s noticeably slower than XFS – block writes come in at roughly a third of the XFS figure, block reads at about two thirds, and the whole run took around 70% longer. Let’s see what happens if I add another drive as a stripe.

Striping

Striping is the simplest way of adding a drive to a ZFS pool – all it takes is:

zpool add test /dev/sdc

We can check the second drive has been added by doing zpool status, which says:

  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          /dev/sdb  ONLINE       0     0     0
          /dev/sdc  ONLINE       0     0     0

errors: No known data errors

OK – now what has that done for performance?

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3175   2  1705   2            6104   3  34.9   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   325   4  1069   6   315   3   331   5  1557   6   345   3
netstrada,496M,,,3175,2,1705,2,,,6104,3,34.9,0,16,325,4,1069,6,315,3,331,5,1557,6,345,3

real    16m46.772s
user    0m2.400s
sys     0m23.100s

Nothing at all, really – so how about with all four drives striped in the pool?

zpool add test /dev/sdd
zpool add test /dev/sde
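
zpool status takes an optional pool name, so the layout can be checked again at this point – it should now show sdb through sde as four separate top-level devices:

zpool status test    # four plain (striped) devices under the pool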

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3136   2  1704   2            5572   3  38.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   331   4  1055   6   330   3   333   4  1576  12   348   3
netstrada,496M,,,3136,2,1704,2,,,5572,3,38.3,0,16,331,4,1055,6,330,3,333,4,1576,12,348,3

real    16m32.234s
user    0m1.990s
sys     0m23.720s

Still nothing – very odd indeed, but this may be one of the areas where work still has to be done.

Blow it away, start again

At the moment ZFS (on Solaris or Linux) only supports removing drives that are marked as hot spares, so we’ll need to destroy this pool and start again. Once more it’s pretty easy to do (warning: no safety nets here – if you type these commands your data will go away, pronto). First we need to remove any volumes in the pool.

zfs destroy -r test

Then we can destroy the pool itself.

zpool destroy test
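
Once that’s done, zpool list should confirm there are no pools left on the system:

zpool list    # should report that no pools are available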

Now we will start again at the same point as before, with just a single drive.

zpool create test /dev/sdb
zfs create test/volume1

Mirroring

To convert a single-drive ZFS pool into a mirror we cannot use the zpool add command; we have to use zpool attach instead (see the manual page for more information).

zpool attach test /dev/sdb /dev/sdc

If we look at what zpool status says we see:

  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Jan  1 22:10:19 2007
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            /dev/sdb  ONLINE       0     0     0
            /dev/sdc  ONLINE       0     0     0

errors: No known data errors

So that confirms that we now have a mirror for testing – dead easy (( We could have set up the mirror from scratch with zpool create test mirror /dev/sdb /dev/sdc but I wanted to show it was possible to start from just one drive ))! So does it help with performance?

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3069   2  1484   2            5634   3  31.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   331   4  1087   7   338   4   329   4  1638   8   342   3
netstrada,496M,,,3069,2,1484,2,,,5634,3,31.5,0,16,331,4,1087,7,338,4,329,4,1638,8,342,3

real    18m3.939s
user    0m2.130s
sys     0m23.560s

Er, no would appear to be the definitive answer. OK, so what about if we attach the two remaining drives to the mirror and try again?

zpool attach test /dev/sdb /dev/sdd
zpool attach test /dev/sdb /dev/sde

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            2475   1  1332   1            5638   3  29.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   324   4  1041   7   296   3   324   4  1424   8   307   3
netstrada,496M,,,2475,1,1332,1,,,5638,3,29.2,0,16,324,4,1041,7,296,3,324,4,1424,8,307,3

real    19m59.974s
user    0m2.500s
sys     0m24.570s

So it appears that ZFS mirroring doesn’t give any performance benefit – in fact the write figures drop slightly as more mirrors are attached, which makes sense as every block now has to be written to all four drives – but it should be very reliable.

RAID-Z

To test RAID-Z I’ll destroy the existing pool and then create a new RAID-Z pool using all 4 drives (( It’s late and I’m back at work tomorrow! )).

zfs destroy -r test
zpool destroy test
zpool create test raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
zfs create test/volume1
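
I’m only testing single parity here; the double-parity variant mentioned at the start would be created with raidz2 instead, which with these four drives would leave two drives’ worth of space for data and survive two drive failures. Just for illustration – I haven’t run it on this box:

zpool create test raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde    # not run here – double-parity RAID-Z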

This is reported by zpool status as:

  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            /dev/sdb  ONLINE       0     0     0
            /dev/sdc  ONLINE       0     0     0
            /dev/sdd  ONLINE       0     0     0
            /dev/sde  ONLINE       0     0     0

errors: No known data errors

OK – now let’s see how that performs:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
netstrada      496M            3148   2  1475   2            5353   2  19.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   323   4  1064   6   319   3   322   4  1407   6   316   3
netstrada,496M,,,3148,2,1475,2,,,5353,2,19.0,0,16,323,4,1064,6,319,3,322,4,1407,6,316,3

real    21m9.018s
user    0m2.250s
sys     0m23.130s

So slower again, but that’s going to be because it’s got to do parity calculations on top of its usual processing load.

Summary

Being able to make a RAID array or a mirror and extend it from the command line is very nice, but it is a little odd that adding drives to a striped pool doesn’t seem to change the performance at all. Now it could be that my test box is underpowered and I was hitting its hardware limits, but XFS being much faster on a single drive seems to contradict that.

Looking at the results from zpool iostat -v whilst the RAID-Z pool was doing its rewrite test, the I/O load seemed to be nicely balanced between the drives (see below), but it never seemed to exceed a little over 2MB/s. I did peek occasionally at zpool iostat (without the -v) and even with all four drives striped it didn’t exceed that barrier. It may be that there is a bottleneck further up in the code that needs to be fixed first, and that once it is the drives will hit their proper performance.

                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
test           663M  15.0G     12     26  1.58M  1.85M
  raidz1       663M  15.0G     12     26  1.58M  1.85M
    /dev/sdb      -      -      9     25   403K   640K
    /dev/sdc      -      -      9     25   404K   640K
    /dev/sdd      -      -      9     26   403K   640K
    /dev/sde      -      -      9     22   405K   637K
------------  -----  -----  -----  -----  -----  -----
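
If you want to watch that continuously whilst a benchmark is running, zpool iostat accepts an interval in seconds (and an optional pool name), for example:

zpool iostat -v test 5    # per-device statistics for the test pool, refreshed every 5 seconds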

Anyway, this is just an alpha release, so there’s much more to come!

Update: my guess is that it’s this item from the STATUS file’s list of what is not yet implemented:

Multi-threaded event loop (relatively easy to implement, should come in the next alpha or so).