I’m completely new to ZFS, so to start with I thought I’d do some simple benchmarks on it to get a feel for how it behaves. I wanted to push the limits of its performance so I provisioned an Amazon EC2 i2.8xlarge instance (almost $7/hr, time really is money!). This instance has 8 800GB SSDs.

I did an fio test on the SSDs themselves, and got the following output (trimmed):

$ sudo fio --name randwrite --ioengine=libaio --iodepth=2 --rw=randwrite --bs=4k --size=400G --numjobs=8 --runtime=300 --group_reporting --direct=1 --filename=/dev/xvdb
[trimmed]
  write: io=67178MB, bw=229299KB/s, iops=57324, runt=300004msec
[trimmed]

57K IOPS for 4K random writes. Respectable.

I then created a ZFS volume spanning all 8. At first I had one raidz1 vdev with all 8 SSDs in it, but I read about the reasons this is bad for performance, so I ended up with four mirror vdevs, like so:

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi
$ sudo zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  2.91T   284K  2.91T         -     0%     0%  1.00x  ONLINE  -
  mirror   744G   112K   744G         -     0%     0%
    xvdb      -      -      -         -      -      -
    xvdc      -      -      -         -      -      -
  mirror   744G    60K   744G         -     0%     0%
    xvdd      -      -      -         -      -      -
    xvde      -      -      -         -      -      -
  mirror   744G      0   744G         -     0%     0%
    xvdf      -      -      -         -      -      -
    xvdg      -      -      -         -      -      -
  mirror   744G   112K   744G         -     0%     0%
    xvdh      -      -      -         -      -      -
    xvdi      -      -      -         -      -      -

I set the recordsize to 4K and ran my test:

$ sudo zfs set recordsize=4k testpool
$ sudo fio --name randwrite --ioengine=libaio --iodepth=2 --rw=randwrite --bs=4k --size=400G --numjobs=8 --runtime=300 --group_reporting --filename=/testpool/testfile --fallocate=none
[trimmed]
  write: io=61500MB, bw=209919KB/s, iops=52479, runt=300001msec
    slat (usec): min=13, max=155081, avg=145.24, stdev=901.21
    clat (usec): min=3, max=155089, avg=154.37, stdev=930.54
     lat (usec): min=35, max=155149, avg=300.91, stdev=1333.81
[trimmed]

I get only 52K IOPS on this ZFS pool. That’s actually slightly worse than one SSD itself.

I don’t understand what I’m doing wrong here. Have I configured ZFS incorrectly, or is this a poor test of ZFS performance?

Note I’m using the official 64-bit CentOS 7 HVM image, though I’ve upgraded to the 4.4.5 elrepo kernel:

$ uname -a
Linux ip-172-31-43-196.ec2.internal 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

I installed ZFS from the zfs repo listed here. I have version 0.6.5.5 of the zfs package.

UPDATE: Per @ewwhite’s suggestion I tried ashift=12 and ashift=13:

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi -o ashift=12 -f

and

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi -o ashift=13 -f

Neither of these made any difference. From what I understand the latest ZFS bits are smart enough identifying 4K SSDs and using reasonable defaults.

I did notice however that CPU usage is spiking. @Tim suggested this but I dismissed it however I think I wasn’t watching the CPU long enough to notice. There are something like 30 CPU cores on this instance, and CPU usage is spiking up as high as 80%. The hungry process? z_wr_iss, lots of instances of it.

I confirmed compression is off, so it’s not the compression engine.

I’m not using raidz, so it shouldn’t be the parity computation.

I did a perf top and it shows most of the kernel time spent in _raw_spin_unlock_irqrestore in z_wr_int_4 and osq_lock in z_wr_iss.

I now believe there is a CPU component to this performance bottleneck, though I’m no closer to figuring out what it might be.

UPDATE 2: Per @ewwhite and others’ suggestion that it’s the virtualized nature of this environment that creates performance uncertainty, I used fio to benchmark random 4K writes spread across four of the SSDs in the environment. Each SSD by itself gives ~55K IOPS, so I expected somewhere around 240K IOs across four of them. That’s more or less what I got:

$ sudo fio --name randwrite --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --size=398G --numjobs=8 --runtime=300 --group_reporting --filename=/dev/xvdb:/dev/xvdc:/dev/xvdd:/dev/xvde
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
fio-2.1.5
Starting 8 processes
[trimmed]
  write: io=288550MB, bw=984860KB/s, iops=246215, runt=300017msec
    slat (usec): min=1, max=24609, avg=30.27, stdev=566.55
    clat (usec): min=3, max=2443.8K, avg=227.05, stdev=1834.40
     lat (usec): min=27, max=2443.8K, avg=257.62, stdev=1917.54
[trimmed]

This clearly shows the environment, virtualized though it may be, can sustain the IOPS much higher than what I’m seeing. Something about the way ZFS is implemented is keeping it from hitting the top speed. I just can’t figure out what that is.

This setup may not be tuned well. There are parameters needed for both the /etc/modprobe/zfs.conf file and the ashift value when using SSDs

Try ashift=12 or 13 and test again.


Edit:

This is still a virtualized solution, so we don’t know too much about the underlying hardware or how everything is interconnected. I don’t know that you’ll get better performance out of this solution.


Edit:

I guess I don’t see the point of trying to optimize a cloud instance in this manner. Because if top performance were the aim, you’d be using hardware, right?

But remember that ZFS has a lot of tunable settings, and what you get by default isn’t anywhere close to your use case.

Try the following in your /etc/modprobe.d/zfs.conf and reboot. It’s what I use in my all-SSD data pools for application servers. Your ashift should be 12 or 13. Benchmark with compression=off, but use compression=lz4 in production. Set atime=off. I’d leave recordsize as default (128K).

options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
options zfs zfs_vdev_sync_write_min_active=64
options zfs zfs_vdev_sync_write_max_active=128
options zfs zfs_vdev_sync_read_min_active=64
options zfs zfs_vdev_sync_read_max_active=128
options zfs zfs_vdev_async_read_min_active=64
options zfs zfs_vdev_async_read_max_active=128
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=64
options zfs zfs_prefetch_disable=1

It seems likely that you’re waiting on a Linux kernel mutex lock that in turn may be waiting on a Xen ring buffer. I can’t be certain of this without access to a similar machine, but I’m not interested in paying Amazon $7/hour for that privilege.

Longer write-up is here: https://www.reddit.com/r/zfs/comments/4b4r1y/why_is_zfs_on_linux_unable_to_fully_utilize_8x/d1e91wo ; I’d rather it be in one place than two.

I’ve spent a decent amount of time trying to track this down. My specific challenge: a Postgres server and I want to use ZFS for its data volume. The baseline is XFS.

First and foremost, my trials tell me that ashift=12 is wrong. If there’s some magic ashift number it’s not 12. I’m using 0 and I’m getting very good results.

I’ve also experimented with a bunch of zfs options and the ones that give me the results below are:

atime=off – I don’t need access times

checksum=off – I’m striping, not mirroring

compression=lz4 – Performance is better with compression (cpu tradeoff?)

exec=off – This is for data, not executables

logbias=throughput – Read on the interwebs this is better for Postgres

recordsize=8k – PG specific 8k blocksize

sync=standard – tried to turn sync off; didn’t see much benefit

My tests below show better than XFS performance (please comment if you see errors in my tests!).

With this my next step is try Postgres running on a 2 x EBS ZFS filesystem.

My specific setup:

EC2: m4.xlarge instance

EBS: 250GB gp2 volumes

kernel: Linux […] 3.13.0-105-generic #152-Ubuntu SMP […] x86_64 x86_64 x86_64 GNU/Linux *

First, I wanted to test the raw EBS performance. Using a variation of the fio command above, I came up with the incantation below. Note: I’m using 8k blocks because that’s what I’ve read PostgreSQL writes are:

ubuntu@ip-172-31-30-233:~$ device=/dev/xvdbd; sudo dd if=/dev/zero of=${device} bs=1M count=100 && sudo fio --name randwrite --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --size=400G --numjobs=4 --runtime=60 --group_reporting --fallocate=none --filename=${device}
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.250631 s, 418 MB/s
randwrite: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=4
...
randwrite: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=4
fio-2.1.3
Starting 4 processes
Jobs: 4 (f=4): [wwww] [100.0% done] [0KB/13552KB/0KB /s] [0/1694/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=18109: Tue Feb 14 19:13:53 2017
  write: io=3192.2MB, bw=54184KB/s, iops=6773, runt= 60327msec
    slat (usec): min=2, max=805209, avg=585.73, stdev=6238.19
    clat (usec): min=4, max=805236, avg=1763.29, stdev=10716.41
     lat (usec): min=15, max=805241, avg=2349.30, stdev=12321.43
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   16], 10.00th=[   17], 20.00th=[   19],
     | 30.00th=[   23], 40.00th=[   24], 50.00th=[   25], 60.00th=[   26],
     | 70.00th=[   27], 80.00th=[   29], 90.00th=[   36], 95.00th=[15808],
     | 99.00th=[31872], 99.50th=[35584], 99.90th=[99840], 99.95th=[199680],
     | 99.99th=[399360]
    bw (KB  /s): min=  156, max=1025440, per=26.00%, avg=14088.05, stdev=67584.25
    lat (usec) : 10=0.01%, 20=20.53%, 50=72.20%, 100=0.86%, 250=0.17%
    lat (usec) : 500=0.13%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.59%, 20=2.01%, 50=3.29%
    lat (msec) : 100=0.11%, 250=0.05%, 500=0.02%, 750=0.01%, 1000=0.01%
  cpu          : usr=0.22%, sys=1.34%, ctx=9832, majf=0, minf=114
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=408595/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=3192.2MB, aggrb=54184KB/s, minb=54184KB/s, maxb=54184KB/s, mint=60327msec, maxt=60327msec

Disk stats (read/write):
  xvdbd: ios=170/187241, merge=0/190688, ticks=180/8586692, in_queue=8590296, util=99.51%

Raw performance for the EBS volume is WRITE: io=3192.2MB.

Now, testing XFS with the same fio command:

Jobs: 4 (f=4): [wwww] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=17441: Tue Feb 14 19:10:27 2017
  write: io=3181.9MB, bw=54282KB/s, iops=6785, runt= 60024msec
    slat (usec): min=3, max=21077K, avg=587.19, stdev=76081.88
    clat (usec): min=4, max=21077K, avg=1768.72, stdev=131857.04
     lat (usec): min=23, max=21077K, avg=2356.23, stdev=152444.62
    clat percentiles (usec):
     |  1.00th=[   29],  5.00th=[   40], 10.00th=[   46], 20.00th=[   52],
     | 30.00th=[   56], 40.00th=[   59], 50.00th=[   63], 60.00th=[   69],
     | 70.00th=[   79], 80.00th=[   99], 90.00th=[  137], 95.00th=[  274],
     | 99.00th=[17024], 99.50th=[25472], 99.90th=[70144], 99.95th=[120320],
     | 99.99th=[1564672]
    bw (KB  /s): min=    2, max=239872, per=66.72%, avg=36217.04, stdev=51480.84
    lat (usec) : 10=0.01%, 20=0.03%, 50=15.58%, 100=64.51%, 250=14.55%
    lat (usec) : 500=1.36%, 750=0.33%, 1000=0.25%
    lat (msec) : 2=0.68%, 4=0.67%, 10=0.71%, 20=0.58%, 50=0.59%
    lat (msec) : 100=0.10%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%, >=2000=0.01%
  cpu          : usr=0.43%, sys=4.81%, ctx=269518, majf=0, minf=110
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=407278/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=3181.9MB, aggrb=54282KB/s, minb=54282KB/s, maxb=54282KB/s, mint=60024msec, maxt=60024msec

Disk stats (read/write):
  xvdbd: ios=4/50983, merge=0/319694, ticks=0/2067760, in_queue=2069888, util=26.21%

Our baseline is WRITE: io=3181.9MB; really close to raw disk speed.

Now, onto ZFS with WRITE: io=3181.9MB as the reference:

ubuntu@ip-172-31-30-233:~$ sudo zpool create testpool xvdbd -f && (for option in atime=off checksum=off compression=lz4 exec=off logbias=throughput recordsize=8k sync=standard; do sudo zfs set $option testpool; done;) && sudo fio --name randwrite --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --size=400G --numjobs=4 --runtime=60 --group_reporting --fallocate=none --filename=/testpool/testfile; sudo zpool destroy testpool
randwrite: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=4
...
randwrite: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=4
fio-2.1.3
Starting 4 processes
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
Jobs: 4 (f=4): [wwww] [100.0% done] [0KB/41328KB/0KB /s] [0/5166/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=18923: Tue Feb 14 19:17:18 2017
  write: io=4191.7MB, bw=71536KB/s, iops=8941, runt= 60001msec
    slat (usec): min=10, max=1399.9K, avg=442.26, stdev=4482.85
    clat (usec): min=2, max=1400.4K, avg=1343.38, stdev=7805.37
     lat (usec): min=56, max=1400.4K, avg=1786.61, stdev=9044.27
    clat percentiles (usec):
     |  1.00th=[   62],  5.00th=[   75], 10.00th=[   87], 20.00th=[  108],
     | 30.00th=[  122], 40.00th=[  167], 50.00th=[  620], 60.00th=[ 1176],
     | 70.00th=[ 1496], 80.00th=[ 2320], 90.00th=[ 2992], 95.00th=[ 4128],
     | 99.00th=[ 6816], 99.50th=[ 9536], 99.90th=[30592], 99.95th=[66048],
     | 99.99th=[185344]
    bw (KB  /s): min= 2332, max=82848, per=25.46%, avg=18211.64, stdev=15010.61
    lat (usec) : 4=0.01%, 50=0.09%, 100=14.60%, 250=26.77%, 500=5.96%
    lat (usec) : 750=5.27%, 1000=4.24%
    lat (msec) : 2=20.96%, 4=16.74%, 10=4.93%, 20=0.30%, 50=0.08%
    lat (msec) : 100=0.04%, 250=0.03%, 500=0.01%, 2000=0.01%
  cpu          : usr=0.61%, sys=9.48%, ctx=177901, majf=0, minf=107
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=536527/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=4191.7MB, aggrb=71535KB/s, minb=71535KB/s, maxb=71535KB/s, mint=60001msec, maxt=60001msec

Notice, this performed better than XFS WRITE: io=4191.7MB. I’m pretty sure this is due to compression.

For kicks, I’m going to add a second volume:

ubuntu@ip-172-31-30-233:~$ sudo zpool create testpool xvdb{c,d} -f && (for option in atime=off checksum=off compression=lz4 exec=off logbias=throughput recordsize=8k sync=standard; do sudo zfs set $option testpool; done;) && sudo fio --name randwrite --ioengine=libaio --iodepth=4 --rw=randwrite --bs=8k --size=400G --numjobs=4 --runtime=60 --group_reporting --fallocate=none --filename=/testpool/testfile; sudo zpool destroy testpool
randwrite: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=4
...
randwrite: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=4
fio-2.1.3
Starting 4 processes
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
randwrite: Laying out IO file(s) (1 file(s) / 409600MB)
Jobs: 4 (f=4): [wwww] [100.0% done] [0KB/71936KB/0KB /s] [0/8992/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=4): err= 0: pid=20901: Tue Feb 14 19:23:30 2017
  write: io=5975.9MB, bw=101983KB/s, iops=12747, runt= 60003msec
    slat (usec): min=10, max=1831.2K, avg=308.61, stdev=4419.95
    clat (usec): min=3, max=1831.6K, avg=942.64, stdev=7696.18
     lat (usec): min=58, max=1831.8K, avg=1252.25, stdev=8896.67
    clat percentiles (usec):
     |  1.00th=[   70],  5.00th=[   92], 10.00th=[  106], 20.00th=[  129],
     | 30.00th=[  386], 40.00th=[  490], 50.00th=[  692], 60.00th=[  796],
     | 70.00th=[  932], 80.00th=[ 1160], 90.00th=[ 1624], 95.00th=[ 2256],
     | 99.00th=[ 5344], 99.50th=[ 8512], 99.90th=[30592], 99.95th=[60672],
     | 99.99th=[117248]
    bw (KB  /s): min=   52, max=112576, per=25.61%, avg=26116.98, stdev=15313.32
    lat (usec) : 4=0.01%, 10=0.01%, 50=0.04%, 100=7.17%, 250=19.04%
    lat (usec) : 500=14.36%, 750=15.36%, 1000=17.41%
    lat (msec) : 2=20.28%, 4=4.82%, 10=1.13%, 20=0.25%, 50=0.08%
    lat (msec) : 100=0.04%, 250=0.02%, 2000=0.01%
  cpu          : usr=1.05%, sys=15.14%, ctx=396649, majf=0, minf=103
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=764909/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=5975.9MB, aggrb=101982KB/s, minb=101982KB/s, maxb=101982KB/s, mint=60003msec, maxt=60003msec

With a second volume, WRITE: io=5975.9MB, so ~1.8X the writes.

A third volume gives us WRITE: io=6667.5MB, so ~2.1X the writes.

And a fourth volume gives us WRITE: io=6552.9MB. For this instance type, it looks like I almost cap the EBS network with two volumes, definitely with three and it’s no better with 4 (750 * 3 = 2250 IOPS).

* From this video make sure you are using 3.8+ linux kernel to get all the EBS goodness.