Here’s my quick experience trying out the ZFS alpha release with write support. First I built and installed ZFS and then ran the
run.sh script to run the FUSE process that the kernel will use to provide the ZFS file system. Then the fun really begins.
Test platform: Ubuntu Edgy Eft (6.10) with a pair of Seagate Barracuda 7200.9 300GB drives running through a Silicon Image, Inc. SiI 3112 SATA controller in a software RAID-1 mirror (MD driver) and LVM2 on top of that to provide logical volumes for file systems.
First we need a logical volume to play with. I use LVM over a software RAID-1 mirror (using the MD driver), so it’s pretty easy. I’ll make it a 10GB volume so we’ve got space to play with:
root@inside:~# lvcreate -L 10G -n ZFS /dev/sata
Now that we’ve got some raw storage to play with, we need to create a ZFS pool, which will also be mounted as a top-level directory (which I didn’t realise initially). So we’ll create a pool called “test” and that will also create a /test directory.
root@inside:~# zpool create test /dev/sata/ZFS
OK – so what does zpool say about its status ?
root@inside:~# zpool status
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          /dev/sata/ZFS  ONLINE       0     0     0

errors: No known data errors
Well that’s good, it’s told us it’s not spotted any errors yet. 🙂
So we’ve got a pool; now we need to allocate some of that pool to a file system. To make it easy we won’t specify a limit now as (I believe) one can be allocated later. We’ll call this volume1 and it’ll appear as /test/volume1:
root@inside:~# zfs create test/volume1
That’s created the area, made the file system and mounted it for us. Not bad, eh ? Here’s the proof, the next command I typed was:
root@inside:~# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
test           114K  9.78G  25.5K  /test
test/volume1  24.5K  9.78G  24.5K  /test/volume1
Now we’ll give it some real work to do. We’ll use Russell Coker’s excellent bonnie++ disk benchmarking tool, which will test a heap of I/O characteristics (not all of which we’ll do here because it’ll take too long).
First of all we’ll go into the ZFS file system we just created.
root@inside:~# cd /test/volume1/
Now we’ll run bonnie++ and tell it to only run in “fast” mode, which will skip the per-character I/O tests (life’s too short). I also need to tell it to really run as root, but only because I was too lazy to change the directory owner to my real user. Ahem. 🙂
root@inside:/test/volume1# time bonnie++ -f -u root
This is the result!
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           13455   1  6626   1           24296   1  58.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1832   4  5713   4  1394   2  1955   4  8804   6  1709   3
inside,2G,,,13455,1,6626,1,,,24296,1,58.7,0,16,1832,4,5713,4,1394,2,1955,4,8804,6,1709,3

real 12m27.073s
user 0m1.236s
sys 0m9.405s
Not too bad for an alpha release of a file system, it ran to completion with no errors or crashes!
Now we need an idea of how a comparable file system performs on the same hardware, so I ran bonnie++ on an XFS partition which is also on an LVM logical volume. This is how it performed (( The original version of this test, with Beagle still running, took 8m 22s )):
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           42738  11 20034   5           42242   5 261.6   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1614   4 +++++ +++  1550   3  1236   3 +++++ +++   207   0
inside,2G,,,42738,11,20034,5,,,42242,5,261.6,1,16,1614,4,+++++,+++,1550,3,1236,3,+++++,+++,207,0

real 5m53.601s
user 0m0.292s
sys 0m16.473s
So XFS is significantly faster for most operations, though interestingly ZFS was quicker for all creates and deletes except the sequential delete.
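For anyone who wants to check that comparison without squinting at the tables, here is a minimal Python sketch that parses the comma-separated summary line bonnie++ prints at the end of each run and works out the ZFS/XFS ratio per metric. The field names are my own labels, chosen to mirror the column headers above; they are not something bonnie++ defines. Empty fields (tests skipped by -f) and "+++++" entries (too quick for bonnie++ to report a figure) are treated as missing.

```python
# Field labels are hypothetical names mirroring the bonnie++ 1.03 column headers.
FIELDS = [
    "machine", "size",
    "out_chr", "out_chr_cpu", "out_block", "out_block_cpu",
    "rewrite", "rewrite_cpu",
    "in_chr", "in_chr_cpu", "in_block", "in_block_cpu",
    "seeks", "seeks_cpu",
    "files",
    "seq_create", "seq_create_cpu", "seq_read", "seq_read_cpu",
    "seq_delete", "seq_delete_cpu",
    "ran_create", "ran_create_cpu", "ran_read", "ran_read_cpu",
    "ran_delete", "ran_delete_cpu",
]


def parse(line):
    """Turn one bonnie++ CSV summary line into a dict of floats (None if missing)."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = {}
    for name, raw in zip(FIELDS, values):
        if name in ("machine", "size"):
            row[name] = raw
        elif raw == "" or raw.startswith("+"):
            row[name] = None  # skipped test, or too fast to measure
        else:
            row[name] = float(raw)
    return row


def compare(a, b, metrics):
    """Return {metric: a/b} for every metric that both runs actually measured."""
    return {
        m: a[m] / b[m]
        for m in metrics
        if a[m] is not None and b[m] is not None
    }


# Summary lines copied from the two runs above.
ZFS = "inside,2G,,,13455,1,6626,1,,,24296,1,58.7,0,16,1832,4,5713,4,1394,2,1955,4,8804,6,1709,3"
XFS = "inside,2G,,,42738,11,20034,5,,,42242,5,261.6,1,16,1614,4,+++++,+++,1550,3,1236,3,+++++,+++,207,0"

METRICS = [
    "out_block", "rewrite", "in_block", "seeks",
    "seq_create", "seq_delete", "ran_create", "ran_delete",
]

ratios = compare(parse(ZFS), parse(XFS), METRICS)
zfs_wins = sorted(m for m, r in ratios.items() if r > 1.0)

for metric, ratio in ratios.items():
    print(f"{metric:12s} ZFS/XFS = {ratio:.2f}")
print("ZFS quicker on:", ", ".join(zfs_wins))
```

A ratio above 1.0 means ZFS managed more operations (or KB) per second than XFS for that metric. The read columns are left out of the comparison because XFS reported "+++++" for them.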
Now given a previous comment on the ZFS blog about the impact of compression on performance I thought it would be interesting to try it out for myself. First you turn it on with:
root@inside:/test/volume1# zfs set compression=on test
(how easy was that ?) and re-ran bonnie++. What I got really surprised me!
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           13471   1 11813   2           72091   4  1169   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1707   4  4501   3  1520   3  1590   4 10065   6  1758   3
inside,2G,,,13471,1,11813,2,,,72091,4,1169.1,2,16,1707,4,4501,3,1520,3,1590,4,10065,6,1758,3

real 6m59.717s
user 0m1.200s
sys 0m8.813s
So this is significantly faster than the run without compression (( Originally, before I realised about Beagle, it looked faster than XFS )) . Now admittedly this is a synthetic test and I presume that Bonnie++ is writing files padded with zeros (( it is )) (or some other constant) rather than with random data, but I was still pretty amazed.
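To see why constant data flatters a compressing file system, here is a small Python sketch comparing how well a block of zeros and a block of random bytes compress. It uses zlib purely as a stand-in; that is an assumption for illustration and not the compressor ZFS itself uses, so only the direction of the result carries over, not the exact numbers.

```python
import os
import zlib

BLOCK = 1024 * 1024  # 1 MiB sample, an arbitrary size picked for the demonstration


def ratio(data):
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(data)) / len(data)


zero_ratio = ratio(bytes(BLOCK))         # what a zero-filled test file looks like
random_ratio = ratio(os.urandom(BLOCK))  # effectively incompressible data

print(f"zeros:  {zero_ratio:.4f} of original size")
print(f"random: {random_ratio:.4f} of original size")
```

The zero-filled block shrinks to a tiny fraction of its size, so far less data has to reach the disk, while the random block does not shrink at all. That is the reason a benchmark writing constant data can make compression look like a pure win.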
Copying a sample ISO image (Kubuntu 7.04 alpha) from /tmp was a little more realistic, with XFS taking about 33 seconds and ZFS with compression taking almost 2m 18s. Disabling compression decreased the time to around 1m 55s. Next up was another old favourite, untar’ing a bzip2’d Linux kernel image (in this case 18.104.22.168). This was done using
time tar xjf /tmp/linux-22.214.171.124.tar.bz2
XFS took just under 1m 22s (( originally 2m 21s with Beagle running )) whilst ZFS without compression took 1m 30s and 1m 27s with compression. So a pretty even match there.
Removing the resulting kernel tree took just over 9s (( originally 24s with Beagle )) on XFS, 14s on ZFS without compression and just under 19s with compression.
I have to say I’m very impressed with what Ricardo has managed to do with this and I really look forward to future releases that he says will improve performance! I’m also quite impressed with the management tools for ZFS!
Tested again, this time LD_PRELOAD’ing it against Google’s tcmalloc library to see what happens then.
With compression it was almost a full minute quicker!
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           16500   1 13219   2           82316   5 918.1   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2130   5  7636   5  1609   3  1994   3 13136   9  1821   4
inside,2G,,,16500,1,13219,2,,,82316,5,918.1,2,16,2130,5,7636,5,1609,3,1994,3,13136,9,1821,4

real 6m3.706s
user 0m1.108s
sys 0m8.677s
Now without compression it’s over 1m 30s quicker too:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
inside           2G           15158   1  7698   1           30611   2  74.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1436   3  5925   4  1741   3  1214   2  7217   6  1761   3
inside,2G,,,15158,1,7698,1,,,30611,2,74.0,0,16,1436,3,5925,4,1741,3,1214,2,7217,6,1761,3

real 10m44.081s
user 0m1.072s
sys 0m8.645s
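The “almost a full minute” and “over 1m 30s” figures come straight from the real times reported by time. As a quick check, this Python sketch converts those timestamps to seconds and subtracts them. The dictionary keys are just labels I made up for the four ZFS runs; “default” means the run without tcmalloc preloaded.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '12m27.073s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" times copied from the four ZFS bonnie++ runs in this post.
REAL = {
    "plain/default": "12m27.073s",
    "compressed/default": "6m59.717s",
    "compressed/tcmalloc": "6m3.706s",
    "plain/tcmalloc": "10m44.081s",
}

saved_compressed = to_seconds(REAL["compressed/default"]) - to_seconds(REAL["compressed/tcmalloc"])
saved_plain = to_seconds(REAL["plain/default"]) - to_seconds(REAL["plain/tcmalloc"])

print(f"tcmalloc saved {saved_compressed:.1f}s with compression")
print(f"tcmalloc saved {saved_plain:.1f}s without compression")
```

That works out to roughly 56 seconds saved with compression and roughly 103 seconds saved without it.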
The kernel untar tests are now 1m 25s to with compression and 1m 12s with compression and the kernel tree remove is just under 15s with compression and just over 14s without.
Mea culpa, I’d completely forgotten I had Beagle running and for the XFS tests (which I ran as myself in a sub-directory of my home directory) it was helpfully trying to index the 2GB test files that Bonnie was creating! This severely handicapped the original XFS tests, and disabling it sped the whole thing up by almost 1.5 minutes!
Now I’m going to try testing both XFS and ZFS again, this time with a version of Bonnie++ patched to use random data for its writes rather than just 0’s.
Corrected the times for untar’ing and removing the kernel tree under XFS now Beagle isn’t running to confuse matters.
>First we need a logical volume to play with,
actually, you don’t even need one – just create some empty file:
dd if=/dev/zero of=zfs.img bs=1024k count=500
and create a zpool/zfs on top of that:
zpool create zfstest /absolute/path/to/zfs.img
I’m also quite impressed by those results, especially taking into account that there’s still a few things to be done that should improve performance by a good margin.
All in all, looks like zfs-fuse will be a real contender for our default filesystem 😉
Roland – the reason I went for a logical volume was because I wanted to see how it behaved directly on a block device and not confuse the issue with another filesystem below also getting involved in the I/O as well. But for those who don’t have free, unallocated, space it’s a very useful trick.
Ricardo, thanks for that, I’m going to keep my eyes out for the new releases and see how they perform too. As for a default filesystem, hmm, interesting thought, but you’re still going to need some form of initrd/initramfs magic to be able to have your root filesystem on it, and to boot from it you’d need to teach grub how to read it, and that would mean a complete rewrite as GPL licensed code. If you’re going to do that, it might be worth just putting it directly into the kernel. 🙂
It would be so nice if Sun would relicense (or dual license) OpenSolaris as GPLv2 but my guess is that if they do they’ll wait for GPLv3 and use that so it still cannot be included in the Linux kernel. 🙁
Exactly, the plan would be to have an initrd/initramfs image. But it could be stored on a small /boot ext2/3 filesystem (with 20 megs or so), so that grub would be able to read it 🙂
Yup, that’s right. Mind you most distros are quite conservative so they may want to keep the root filesystem on an in-kernel filesystem (they probably wouldn’t want to risk the OOM killer accidentally shooting the zfs-fuse process for instance!). However, as an option for other usage (/home would be a good example) then I think it shows a lot of possibility.
Not to mention the fact that with FUSE you can upgrade your filesystem without needing to touch your kernel at all. 🙂
Regarding compression, in the past I have considered making the data blocks for Bonnie slightly random but never got around to it.
If I do this then I’ll calculate MD5 or SHA1 sums of the data too and check for consistency, currently a filesystem could return totally bogus data on the read operations and as long as the metadata is OK then bonnie won’t notice.
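As a rough illustration of that idea (this is not how Bonnie++ works, just a minimal Python sketch under my own assumptions), the following writes blocks of random data to a file, remembers a SHA-1 digest per block, then reads the file back and reports any block whose digest no longer matches. The block size and count are arbitrary, and the test file lives in a temporary directory rather than on the file system under test.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 8192   # arbitrary block size for the sketch
BLOCK_COUNT = 64    # arbitrary number of blocks


def write_blocks(path, count, size):
    """Write `count` random blocks to `path`; return the SHA-1 digest of each."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(count):
            block = os.urandom(size)
            digests.append(hashlib.sha1(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digests


def verify_blocks(path, digests, size):
    """Re-read the file and return the indexes of blocks whose digest differs."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(digests):
            block = f.read(size)
            if hashlib.sha1(block).hexdigest() != expected:
                bad.append(index)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "testfile")
    digests = write_blocks(path, BLOCK_COUNT, BLOCK_SIZE)
    clean = verify_blocks(path, digests, BLOCK_SIZE)

    # Simulate a file system returning bogus data by flipping one byte in block 3.
    with open(path, "r+b") as f:
        f.seek(3 * BLOCK_SIZE)
        original = f.read(1)
        f.seek(3 * BLOCK_SIZE)
        f.write(bytes([original[0] ^ 0xFF]))
    corrupted = verify_blocks(path, digests, BLOCK_SIZE)

print("bad blocks after clean write:", clean)
print("bad blocks after corruption: ", corrupted)
```

Note that a read straight after a write will usually be served from the page cache, so a real benchmark would need to defeat caching before the verification pass means much.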
Interesting stuff Russell. I wonder how trivial it would be to make it read a pool of data from /dev/random at startup and then use it for its patterns ?
/dev/random is not such a good idea, that would mean that running a test immediately after booting (something that I recommend quietly) would be likely to hang until you press some random keys or do other things to generate entropy.
Using /dev/urandom would work. The main issue is the effect on the results of checking the data. Of course I could make it optional and allow two modes of operation, and it should be noted that some rare software checks its data (although the vast majority doesn’t).
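The “read a pool once at startup, then reuse it” approach discussed above might look something like this in Python. It is only a sketch: the class name, pool size and block size are all made up for illustration, and os.urandom stands in for reading /dev/urandom directly. Blocks are sliced from successive offsets in the pool, so consecutive writes differ, and the pool wraps around once it has been used up.

```python
import os

POOL_SIZE = 1024 * 1024  # 1 MiB of random data gathered once (arbitrary size)
BLOCK_SIZE = 8192        # size of each block handed to the write test (arbitrary)


class PatternPool:
    """Hypothetical helper: hands out write patterns from a pre-filled random pool."""

    def __init__(self, pool_size=POOL_SIZE):
        # Pay the cost of gathering random data once, before any timing starts.
        self.pool = os.urandom(pool_size)
        self.offset = 0

    def next_block(self, size=BLOCK_SIZE):
        """Return the next `size` bytes of the pool, wrapping round at the end."""
        if size > len(self.pool):
            raise ValueError("block size is larger than the pool")
        if self.offset + size > len(self.pool):
            self.offset = 0
        block = self.pool[self.offset:self.offset + size]
        self.offset += size
        return block


pool = PatternPool()
blocks = [pool.next_block() for _ in range(4)]
print(len(blocks), "blocks of", len(blocks[0]), "bytes each")
```

Because the pool repeats after POOL_SIZE bytes, a file system that deduplicates or compresses across large windows could still spot the repetition, so the pool size would need choosing with that in mind.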
I’ve been delaying release 2.0 of Bonnie for a while mainly because of threading slowing things down, I probably should just bite the bullet and make a new release with incompatible results.
[…] Of course others have already taken up this topic too, including Machine Check Exception and Chris Samuel. […]
OK, I’ve now posted separately a summary of what happens when I use my patched version of Bonnie that uses data from /dev/urandom rather than just 0’s for its write tests.
I then tried to build the ZFS code within ZFS and whilst the compile appeared to work just fine the zfs program wouldn’t work. Copying in a known working executable also didn’t work even though the MD5 checksums were identical. Turns out it is a known bug that Ricardo is working on (not fixed in trunk yet).
So my cunning plan of building a machine to run off ZFS is stopped for now. 🙂
[…] Now we’ll create a file system that we’re going to work in. This is all the same as before because we’re not doing anything special. Yet. […]
grab a copy of Nexenta and you’ll have ZFS up and running in no time 🙂
I would do that on a system I was only going to use ZFS on, but the machine I had in mind for that only has 256MB RAM and whilst Linux is very happy it appears it is not enough to even boot the Solaris kernel. 🙁
The system I was testing on here is my main box, so it’s going to stay running Linux!
Of course if Sun had used the GPLv2 or a compatible license rather than inventing a new GPL incompatible license we wouldn’t have this problem.
I’m sure there’s a perfectly good reason why it’s not GPLv2. Such as for instance Linux would be a perfect alternative to Solaris storage servers. “Let’s not give the competition our only weapon”
Solaris 10 requires 256 MB RAM and no more. I don’t know how much Nexenta and SchilliX and the like require, but they are not from Sun.
It may be supposed to, but it doesn’t. Not even the official one.
To be honest, I’ve lost interest in it on that system. If it’s not portable enough to work on that then it’s not ready for prime time.
I wonder how different the tests would be if you didn’t run on an LVM mirror, and instead used ZFS to mirror. It’s my understanding that ZFS outperforms software mirroring. By using a mirror as the basis for ZFS, you lose some of the built-in performance and reliability, making writes synchronous and forcing resyncs if power is lost.
This looks really impressive. Reminds me of the NSS storage in Netware, something I missed when we went all Microsoft 8)
However we also still have some large Linux servers and potentially something like this could mean a change to the file servers too, as our major headache is growth in the user file stores.
Need more space for user docs and profile, add another LUN from the SAN and add as pool storage.
The per-folder quotas would be really useful too.
I also hope they add native encryption as well as compression to ZFS as then it would be a perfect all round user data storage filesystem.
(I think I’d still prefer safe old ext3 for / 8)
If the write performance can be resolved then it would be perfect for backup staging as well.
>I also hope they add native encryption as well as
>compression to ZFS
compression already works with zfs-fuse
I think what John meant was that he hopes they add encryption in addition to the already existing compression capabilities.
Be warned though it’s still unclear to me as to whether the RAIDZ corruption bug has been tracked down and fixed yet..
Hi first time ZFS install
benchmarks using ‘time bonnie++ -f -u root’ (Version 1.03d)
ZFS ????-??-?? – Release 0.5.1 (hg clone, and current at this post)
Model Family: Hitachi Deskstar T7K500
Device Model: Hitachi HDT725032VLA360
User Capacity: 320,072,933,376 bytes
ZFS – 20GB partition (end of disk) /dev/sdb3
XFS – 280GB partition 96% used (11GB free) /dev/sdb2
with compression on it totally
XFS (yeah.. there’s some problem here – Sequential Create and Random Create)
ill buy a few kilos of that please