ZFS on Linux Works! (Update 3)

Here’s my quick experience trying out the ZFS alpha release with write support. First I built and installed ZFS, then ran the run.sh script to start the FUSE process that the kernel will use to provide the ZFS file system. Then the fun really begins.

Test platform: Ubuntu Edgy Eft (6.10) with a pair of Seagate Barracuda 7200.9 300GB drives running through a Silicon Image, Inc. SiI 3112 SATA controller in a software RAID-1 mirror (MD driver) and LVM2 on top of that to provide logical volumes for file systems.

First we need a logical volume to play with. Since I use LVM over a software RAID-1 mirror (using the MD driver) that’s pretty easy; I’ll make it a 10GB volume so we’ve got space to play with:

root@inside:~# lvcreate -L 10G -n ZFS /dev/sata

Now that we’ve got some raw storage to play with, we need to create a ZFS pool, which will also appear as a top-level directory (something I didn’t realise initially). So we’ll create a pool called “test” and that will also create a /test directory.

root@inside:~# zpool create test /dev/sata/ZFS
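
As an aside, if you’d rather the pool didn’t appear at the top level, I believe you can move it by changing its mountpoint property with something like the following (the path here is just an example and I haven’t tried it in this setup):

zfs set mountpoint=/mnt/zfstest test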

OK – so what does zpool say about its status?

root@inside:~# zpool status

  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          /dev/sata/ZFS  ONLINE       0     0     0

errors: No known data errors

Well, that’s good: it’s told us it’s not spotted any errors yet. 🙂

So we’ve got a pool; now we need to allocate some of that pool to a file system. To make it easy we won’t specify a limit now as (I believe) a quota can be set later (more on that below). We’ll call this volume1 and it’ll appear as /test/volume1.

root@inside:~# zfs create test/volume1

That’s created the area, made the file system and mounted it for us. Not bad, eh? Here’s the proof – the next command I typed was:

root@inside:~# zfs list

NAME           USED  AVAIL  REFER  MOUNTPOINT
test           114K  9.78G  25.5K  /test
test/volume1  24.5K  9.78G  24.5K  /test/volume1
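
About that size limit I mentioned: I believe capping volume1 at, say, 5GB later would just be a matter of setting a quota along these lines (untried here, so treat it as a sketch):

zfs set quota=5G test/volume1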

Now we’ll give it some real work to do: we’ll use Russell Coker’s excellent bonnie++ disk benchmarking tool, which will test a heap of I/O characteristics (not all of which we’ll do here because it’ll take too long).

First of all we’ll go into the ZFS file system we just created.

root@inside:~# cd /test/volume1/

Now we’ll run bonnie++ and tell it to only run in “fast” mode, which will skip the per-character I/O tests (life’s too short). I also need to tell it to really run as root, but only because I was too lazy to change the directory owner to my real user. Ahem. 🙂

root@inside:/test/volume1# time bonnie++ -f -u root

This is the result!

Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           13455   1  6626   1           24296   1  58.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1832   4  5713   4  1394   2  1955   4  8804   6  1709   3
inside,2G,,,13455,1,6626,1,,,24296,1,58.7,0,16,1832,4,5713,4,1394,2,1955,4,8804,6,1709,3

real    12m27.073s
user    0m1.236s
sys     0m9.405s

Not too bad for an alpha release of a file system – it ran to completion with no errors or crashes!

Now we need an idea of how a comparable file system performs on the same hardware, so as a comparison I ran bonnie++ on an XFS partition which is also on an LVM logical volume. This is how it performed (( The original version of this test, with Beagle still running, took 8m 22s )):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           42738  11 20034   5           42242   5 261.6   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1614   4 +++++ +++  1550   3  1236   3 +++++ +++   207   0
inside,2G,,,42738,11,20034,5,,,42242,5,261.6,1,16,1614,4,+++++,+++,1550,3,1236,3,+++++,+++,207,0

real    5m53.601s
user    0m0.292s
sys     0m16.473s

So XFS is significantly faster for most operations, though interestingly ZFS was quicker for all the create and delete tests except the sequential delete.

Now, given a previous comment on the ZFS blog about the impact of compression on performance, I thought it would be interesting to try it out for myself. First you turn it on with:

root@inside:/test/volume1# zfs set compression=on test

(how easy was that?) and re-ran bonnie++. What I got really surprised me!

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           13471   1 11813   2           72091   4  1169   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1707   4  4501   3  1520   3  1590   4 10065   6  1758   3
inside,2G,,,13471,1,11813,2,,,72091,4,1169.1,2,16,1707,4,4501,3,1520,3,1590,4,10065,6,1758,3

real    6m59.717s
user    0m1.200s
sys     0m8.813s

So this is significantly faster than the run without compression (( Originally, before I realised about Beagle, it looked faster than XFS )). Now admittedly this is a synthetic test and I presume that Bonnie++ is writing files padded with zeros (( it is )) (or some other constant) rather than with random data, but I was still pretty amazed.
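
For the curious, ZFS will also report how well the data compressed via its compressratio property, so I believe something like this would show it (I haven’t included the output here):

zfs get compressratio test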

Copying a sample ISO image (Kubuntu 7.04 alpha) from /tmp was a little more realistic, with XFS taking about 33 seconds and ZFS with compression taking almost 2m 18s. Disabling compression decreased the time to around 1m 55s.

Next up was another old favourite, untar’ing a bzip2’d Linux kernel image (in this case 2.6.19.1). This was done using:

time tar xjf /tmp/linux-2.6.19.1.tar.bz2

XFS took just under 1m 22s (( originally 2m 21s with Beagle running )) whilst ZFS took 1m 30s without compression and 1m 27s with it. So a pretty even match there.

Removing the resulting kernel tree took just over 9s (( originally 24s with Beagle )) on XFS, 14s on ZFS without compression and just under 19s with compression.
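
(The removal was just a timed delete of the extracted tree, i.e. something along the lines of:)

time rm -rf linux-2.6.19.1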

I have to say I’m very impressed with what Ricardo has managed to do with this and I really look forward to future releases that he says will improve performance! I’m also quite impressed with the management tools for ZFS!

Update 1

Tested again, this time LD_PRELOAD’ing the zfs-fuse daemon against Google’s tcmalloc library to see what happens.
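
For anyone wanting to try the same thing, that just means starting the zfs-fuse daemon with the library preloaded, roughly like this (the exact path to libtcmalloc is an assumption and will depend on where your distro puts it):

LD_PRELOAD=/usr/lib/libtcmalloc.so ./run.sh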

With compression it was almost a full minute quicker!

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           16500   1 13219   2           82316   5 918.1   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2130   5  7636   5  1609   3  1994   3 13136   9  1821   4
inside,2G,,,16500,1,13219,2,,,82316,5,918.1,2,16,2130,5,7636,5,1609,3,1994,3,13136,9,1821,4

real    6m3.706s
user    0m1.108s
sys     0m8.677s

Now without compression it’s over 1m 30s quicker too:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           15158   1  7698   1           30611   2  74.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1436   3  5925   4  1741   3  1214   2  7217   6  1761   3
inside,2G,,,15158,1,7698,1,,,30611,2,74.0,0,16,1436,3,5925,4,1741,3,1214,2,7217,6,1761,3

real    10m44.081s
user    0m1.072s
sys     0m8.645s

The kernel untar tests are now 1m 25s without compression and 1m 12s with compression, and the kernel tree remove is just under 15s with compression and just over 14s without.

Update 2

Mea culpa – I’d completely forgotten I had Beagle running, and for the XFS tests (which I ran as myself in a sub-directory of my home directory) it was helpfully trying to index the 2GB test files that Bonnie was creating! This severely handicapped the original XFS tests, and disabling it sped the whole thing up by almost 1.5 minutes!

Now I’m going to try testing both XFS and ZFS again, this time with a version of Bonnie++ patched to use random data for its writes rather than just 0’s.

Update 3

Corrected the times for untar’ing and removing the kernel tree under XFS now Beagle isn’t running to confuse matters.

21 thoughts on “ZFS on Linux Works! (Update 3)”

  1. >First we need a logical volume to play with,

    actually, you don’t even need one – just create an empty file:

    dd if=/dev/zero of=zfs.img bs=1024k count=500

    and create a zpool/zfs on top of that:

    zpool create zfstest /absolute/path/to/zfs.img

  2. I’m also quite impressed by those results, especially taking into account that there’s still a few things to be done that should improve performance by a good margin.
    All in all, looks like zfs-fuse will be a real contender for our default filesystem 😉

  3. Roland – the reason I went for a logical volume was that I wanted to see how it behaved directly on a block device, without another filesystem underneath also getting involved in the I/O and confusing the issue. But for those who don’t have free, unallocated space it’s a very useful trick.

    Ricardo, thanks for that, I’m going to keep my eyes out for the new releases and see how they perform too. As for a default filesystem – hmm, interesting thought, but you’re still going to need some form of initrd/initramfs magic to be able to have your root filesystem on it, and to boot from it you’d need to teach grub how to read it, which would mean a complete rewrite as GPL licensed code. If you’re going to do that you might as well just put it directly into the kernel. 🙂

    It would be so nice if Sun would relicense (or dual license) OpenSolaris as GPLv2, but my guess is that if they do they’ll wait for GPLv3 and use that, so it still cannot be included in the Linux kernel. 🙁

  4. Yup, that’s right. Mind you, most distros are quite conservative so they may want to keep the root filesystem on an in-kernel filesystem (they probably wouldn’t want to risk the OOM killer accidentally shooting the zfs-fuse process, for instance!). However, as an option for other usage (/home would be a good example) I think it shows a lot of promise.

    Not to mention the fact that with FUSE you can upgrade your filesystem without needing to touch your kernel at all. 🙂

  5. Regarding compression, in the past I have considered making the data blocks for Bonnie slightly random but never got around to it.

    If I do this then I’ll calculate MD5 or SHA1 sums of the data too and check for consistency; currently a filesystem could return totally bogus data on the read operations and, as long as the metadata is OK, Bonnie won’t notice.

  6. /dev/random is not such a good idea; that would mean that running a test immediately after booting (something that I recommend quietly) would be likely to hang until you press some random keys or do other things to generate entropy.

    Using /dev/urandom would work. The main issue is the effect on the results of checking the data. Of course I could make it optional and allow two modes of operation, and it should be noted that some rare software checks its data (although the vast majority doesn’t).

    I’ve been delaying release 2.0 of Bonnie for a while, mainly because of threading slowing things down; I probably should just bite the bullet and make a new release with incompatible results.

  7. Pingback: node-0 » Blog Archiv » ZFS unter Ubuntu/Kubuntu 6.10 (»Edgy Eft«)

  8. OK, I’ve now posted separately a summary of what happens when I use my patched version of Bonnie that uses data from /dev/urandom rather than just 0’s for its write tests.

    I then tried to build the ZFS code within ZFS and, whilst the compile appeared to work just fine, the zfs program wouldn’t work. Copying in a known working executable also didn’t work, even though the MD5 checksums were identical. Turns out it is a known bug that Ricardo is working on (not fixed in trunk yet).

    So my cunning plan of building a machine to run off ZFS is stopped for now. 🙂

  9. Pingback: ZFS Disk Mirroring, Striping and RAID-Z at The Musings of Chris Samuel

  10. I would do that on a system I was only going to use ZFS on, but the machine I had in mind for that only has 256MB RAM and, whilst Linux is very happy, it appears that’s not enough to even boot the Solaris kernel. 🙁

    The system I was testing on here is my main box, so it’s going to stay running Linux!

    Of course if Sun had used the GPLv2 or a compatible license, rather than inventing a new GPL-incompatible license, we wouldn’t have this problem.

  11. I’m sure there’s a perfectly good reason why it’s not GPLv2. Such as, for instance, that Linux would then be a perfect alternative to Solaris storage servers. “Let’s not give the competition our only weapon.”

  12. Solaris 10 requires 256 MB RAM and no more. I don’t know how much Nexenta and Schillix and that sort of thing require, but they are not from Sun.

  13. It may be supposed to, but it doesn’t. Not even the official one.

    To be honest, I’ve lost interest in it on that system. If it’s not portable enough to work on that then it’s not ready for prime time.

  14. This looks really impressive. Reminds me of the NSS storage in NetWare, something I missed when we went all Microsoft 8)
    However we also still have some large Linux servers, and potentially something like this could mean a change for the file servers too, as our major headache is growth in the user file stores.

    Need more space for user docs and profiles? Add another LUN from the SAN and add it as pool storage.
    The per-folder quota support would be really useful too.

    I also hope they add native encryption as well as compression to ZFS, as then it would be a perfect all-round user data storage filesystem.
    (I think I’d still prefer safe old ext3 for / 8)

    If the write performance can be resolved then it would be perfect for backup staging as well.

  15. I think what John meant was that he hopes they add encryption in addition to the already existing compression capabilities.

    Be warned though, it’s still unclear to me whether the RAIDZ corruption bug has been tracked down and fixed yet.

  16. Hi, first time ZFS install.
    Benchmarks using ‘time bonnie++ -f -u root’ (version 1.03d):

    Slackware-current
    ZFS ????-??-?? – Release 0.5.1 (hg clone, and current at this post)
    Model Family: Hitachi Deskstar T7K500
    Device Model: Hitachi HDT725032VLA360
    User Capacity: 320,072,933,376 bytes

    ZFS – 20GB partition (end of disk) /dev/sdb3
    XFS – 280GB partition 96% used (11GB free) /dev/sdb2

    ZFS COMPRESSION=OFF
    kary,2G,,,29257,3,14414,3,,,39285,4,110.4,0,16,7140,7,14374,8,7674,5,5426,5,18404,11,8526,7
    real 6m1.486s
    user 0m0.467s
    sys 0m11.353s

    with compression on it totally

    XFS (yeah, there’s some problem here – Sequential Create and Random Create)
    kary,2G,,,52955,7,25552,6,,,58908,8,118.0,0,16,376,1,+++++,+++,331,1,379,1,+++++,+++,309,1
    real 7m1.703s
    user 0m0.277s
    sys 0m14.482s

    ZFS COMPRESSION=ON
    kary,2G,,,67932,7,45118,11,,,139270,16,239.0,0,16,7545,6,13144,9,5975,4,4125,3,5090,3,1941,1
    real 2m33.764s
    user 0m0.440s
    sys 0m10.953s

    I’ll buy a few kilos of that please
