In 2021, Fastmail began moving to a new storage architecture built on top of OpenZFS RAIDz2 pools on top of banks of NVMe drives. During shakedown I discovered a performance bottleneck, and soon after entering production, a machine panicked. Both required some study to understand their causes, and as a result, both issues are now resolved upstream.

IO is bottlenecked through a single lock

See openzfs/zfs#12121 for the original issue on the issue tracker.

Describe the problem you're observing

tl;dr: IO is bottlenecked through the kstat lock in vdev_queue.c.

I have a large pool of 24x 8T NVMe devices (Intel DC P4510), configured as a single pool of 2x12 RAIDZ2.

 NAME           STATE     READ WRITE CKSUM
    i44cyrus       ONLINE       0     0     0
      raidz2-0     ONLINE       0     0     0
        i44disk1   ONLINE       0     0     0
...
        i44disk11  ONLINE       0     0     0
        i44disk12  ONLINE       0     0     0
      raidz2-1     ONLINE       0     0     0
        i44disk13  ONLINE       0     0     0
...
        i44disk24  ONLINE       0     0     0

Yesterday I started bringing in some 40T of data from a different machine using a normal send | recv over a direct 10G connection. It appears we're getting a sustained throughput of ~250MB/s. We haven't done a lot of tuning yet, so I wasn't entirely sure whether this was "good" or not, but it felt low, so I went digging.

(note: there is nothing else of any consequence running on this machine; just basic OS services)

On the receiving side, perf top reported that 25% of CPU time was being spent in osq_lock, the kernel's optimistic spin queue, used when spinning on a contended mutex. So I fired up bpftrace and got to work.

Bucketing by the lock pointer shows that there's a single lock receiving a lot more attention than the others:

[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock { @[arg0] = count(); }'
Attaching 1 probe...
^C
@[-117676299645548]: 1
@[-117669704412716]: 1
@[-117485265829068]: 1
@[-117625234665004]: 1
@[-117458348531436]: 1
...
@[-117723405763756]: 3502
@[-117723442693292]: 3509
@[-117716388578476]: 3540
@[-117723405747372]: 3573
@[-117719621901892]: 11432
@[-117711093901308]: 144076

Inspecting this single lock:

[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }'
stdin:1:23-25: WARNING: comparison of integers of different signs: 'unsigned int64' and 'int64' can lead to undefined behavior
kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }
                      ~~
Attaching 1 probe...
^C

@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_to_issue+2778
]: 93753
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_done+235
]: 120459
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io+273
]: 205863
@[
    osq_lock+1
    __mutex_lock.isra.9+137
    vdev_queue_io_remove+124
]: 208130

I didn't know very much about the internals at this point, and I was assuming there was a single lock around all devices or something like that. After studying vdev_queue_io and the headers, I saw that struct vdev embeds its vdev_queue (and with it the vq_lock) directly, so the vdev pointer is a reasonable proxy for its queue lock for tracing purposes, and so:

[fastmail root(robn)@imap44 ~]# cat zio.bt
#include <sys/zio.h>

kprobe:vdev_queue_io
{
  @[((struct zio *)arg0)->io_vd] = count()
}
[fastmail root(robn)@imap44 ~]# bpftrace -I/usr/include/libspl -I/usr/include/libzfs zio.bt
Attaching 1 probe...
^C

@[0xffff94ee60054000]: 14228
@[0xffff94edce118000]: 14288
@[0xffff94f06863c000]: 14316
@[0xffff94f0034a8000]: 14342
@[0xffff94edfb11c000]: 14491
@[0xffff94edce9f4000]: 14560
@[0xffff94f004790000]: 14703
@[0xffff94ee62378000]: 14785
@[0xffff94ee6d3a8000]: 14799
@[0xffff94edce11c000]: 14855
@[0xffff94edfc550000]: 14867
@[0xffff94f0034ac000]: 14954
@[0xffff94edce358000]: 14996
@[0xffff94ee60040000]: 15017
@[0xffff94ee60050000]: 15065
@[0xffff94f0034a0000]: 15066
@[0xffff94f004794000]: 15110
@[0xffff94ee6d3ac000]: 15184
@[0xffff94edfb118000]: 15191
@[0xffff94edfc554000]: 15199
@[0xffff94f0034a4000]: 15232
@[0xffff94ee6237c000]: 15255
@[0xffff94edce9f0000]: 15255
@[0xffff94edce35c000]: 15273

24 separate vdevs, each receiving roughly the same amount of traffic, and none of them with an outlier. So vdev locks are not responsible!

Reading the code, vdev_queue_io (and friends) only ever take the per-vdev queue lock, so I couldn't see how a single busy lock could match the traces I had. But then I noticed that vdev_queue_io_add is static and only called from one place, so it has been inlined. And lo, inside it there's another lock, and a global one at that.
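
For reference, this is roughly what vdev_queue_io_add looked like at the time (paraphrased from memory of the 2.0-era vdev_queue.c, so treat the exact names as approximate): every queue insert, on every vdev, funnels through the same per-pool kstat mutex, which with one pool per machine is effectively global.

static void
vdev_queue_io_add(vdev_queue_t *vq, zio_t *zio)
{
        /* per-pool IO history kstat, shared by every vdev in the pool */
        spa_t *spa = zio->io_spa;
        spa_history_kstat_t *shk = &spa->spa_stats.io_history;

        avl_add(vdev_queue_class_tree(vq, zio->io_priority), zio);
        avl_add(vdev_queue_type_tree(vq, zio->io_type), zio);

        if (shk->kstat != NULL) {
                mutex_enter(&shk->lock);        /* <-- the busy lock */
                kstat_waitq_enter(shk->kstat->ks_data);
                mutex_exit(&shk->lock);
        }
}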

We can't probe an inlined function, but we can probe the thing it calls:

[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:kstat_waitq_enter { @[kstack(3)] = count() }'
Attaching 1 probe...
^C

@[
    kstat_waitq_enter+1
    vdev_queue_io+304
    zio_vdev_io_start+156
]: 270092

And there it is: all the vdev ops are waiting on a single lock, for a stats counter no less! Oh no!

I don't have a great theory but I suspect this perhaps isn't as noticeable on higher-latency devices (spinners) because they're slower than the lock wait, and perhaps isn't as noticeable on NVMe mirrors because probably most of the time there's only two drives in the mirror, so it doesn't get amplified quite so much.

Describe how to reproduce the problem

It should be enough to have a wide RAIDZ vdev on top of low-latency devices, and then just pour data into it with zfs recv.

(I can reproduce this reliably; however, I'm on tight deadlines, so I don't immediately have much time for experimenting. But I would like to help, and will have a lot more hardware available to work with in the next week or two.)


ARC panic analysis

See openzfs/zfs#11531 (comment) for a summary of this on the issue tracker.

What is an ARC block header?

The crash involves the ARC, which is organized entirely around block headers (arc_buf_hdr_t), so we need to know about them!
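
I won't reproduce the whole structure here, but the part that matters for this crash is how the header records sizes. A rough, abridged paraphrase of arc_impl.h (from memory; field names and layout are approximate):

typedef struct arc_buf_hdr {
        /* ... block identity (DVA, birth txg), flags, state, refcounts ... */
        uint16_t        b_psize;        /* physical (on-disk) size, in 512-byte units */
        uint16_t        b_lsize;        /* logical size, in 512-byte units */
        /* ... L1/L2 sub-headers, b_crypt_hdr for encrypted blocks ... */
} arc_buf_hdr_t;

/* sizes are stored shifted down by SPA_MINBLOCKSHIFT (9) */
#define HDR_GET_PSIZE(hdr)      ((uint64_t)(hdr)->b_psize << SPA_MINBLOCKSHIFT)
#define HDR_SET_PSIZE(hdr, x)   ((hdr)->b_psize = ((x) >> SPA_MINBLOCKSHIFT))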

Why did we crash?

[Wed Jun  9 18:42:10 2021] VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
[Wed Jun  9 18:42:10 2021] PANIC at zio.c:314:zio_data_buf_alloc()
[Wed Jun  9 18:42:10 2021]  zio_data_buf_alloc+0x52/0x60 [zfs]
[Wed Jun  9 18:42:10 2021]  abd_alloc_linear+0x7f/0xc0 [zfs]
[Wed Jun  9 18:42:10 2021]  arc_hdr_alloc_abd+0x56/0xb0 [zfs]
[Wed Jun  9 18:42:10 2021]  arc_hdr_alloc+0xed/0x160 [zfs]
[Wed Jun  9 18:42:10 2021]  arc_alloc_buf+0x45/0xd0 [zfs]
[Wed Jun  9 18:42:10 2021]  dbuf_alloc_arcbuf_from_arcbuf+0xf7/0x190 [zfs]
[Wed Jun  9 18:42:10 2021]  ? _cond_resched+0x15/0x30
[Wed Jun  9 18:42:10 2021]  ? mutex_lock+0xe/0x30
[Wed Jun  9 18:42:10 2021]  dbuf_hold_copy.isra.21+0x31/0xa0 [zfs]
[Wed Jun  9 18:42:10 2021]  dbuf_hold_impl+0x43e/0x600 [zfs]
[Wed Jun  9 18:42:10 2021]  dbuf_hold+0x2c/0x60 [zfs]
[Wed Jun  9 18:42:10 2021]  dmu_buf_hold_array_by_dnode+0xda/0x4a0 [zfs]
[Wed Jun  9 18:42:10 2021]  dmu_read_uio_dnode+0x47/0xf0 [zfs]
[Wed Jun  9 18:42:10 2021]  ? zfs_rangelock_enter_impl+0x25b/0x560 [zfs]
[Wed Jun  9 18:42:10 2021]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[Wed Jun  9 18:42:10 2021]  zfs_read+0x135/0x3b0 [zfs]
[Wed Jun  9 18:42:10 2021]  zpl_iter_read+0xe3/0x180 [zfs]
[Wed Jun  9 18:42:10 2021]  new_sync_read+0x118/0x1a0
[Wed Jun  9 18:42:10 2021]  vfs_read+0xf1/0x180
[Wed Jun  9 18:42:10 2021]  ksys_read+0x59/0xd0
[Wed Jun  9 18:42:10 2021]  do_syscall_64+0x33/0x80

Call stack reconstruction (filling in as we trace code, see below):

zio_data_buf_alloc(size=0)
abd_alloc_linear(size=0, is_metadata=false)
abd_alloc(size=0, is_metadata=false) [tailcall]
arc_get_data_abd(hdr, size=0, [tag=hdr], do_adapt=true) [inlined]
arc_hdr_alloc_abd(hdr, alloc_flags=DO_ADAPT)
arc_hdr_alloc(spa, psize=0, lsize=0, protected=false, compression_type=off, complevel=0, type=ARC_BUFC_DATA, alloc_rdata=false)
arc_alloc_buf(spa, tag, type=ARC_BUFC_DATA, size=0)
dbuf_alloc_arcbuf_from_arcbuf(db, arc_buf_t *data)

dbuf_alloc_arcbuf_from_arcbuf has:

psize = arc_buf_size(data)
if (arc_is_encrypted(data)) {
    ...
} else if (arc_get_compression(data) != ZIO_COMPRESS_OFF) {
    ...
} else {
    data = arc_alloc_buf(spa, db, type, psize)
}

The only direct call to arc_alloc_buf is in the non-encrypted, non-compressed case.

Ok this is weird. Yesterday we swore up and down that PSIZE=0 & ALLOC_RDATA caused the crash, but it seems like ALLOC_RDATA can't possibly be set? I will go back over it.

Yeah, pretty sure that's impossible, because this is an unencrypted block.

Aha, the R in RDATA is for raw, which is the name for hdr->b_crypt_hdr.b_rabd per the arc.c comments.

Earlier versions (before DO_ADAPT was added) actually called it encrypted_read, so that confirms it.
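
For context, that field lives in the encryption part of the header; here's an abridged paraphrase of arc_impl.h (most fields omitted, names from memory):

typedef struct arc_buf_hdr_crypt {
        abd_t   *b_rabd;        /* "raw" data: the still-encrypted on-disk bytes */
        /* ... object type, salt, IV, MAC, etc ... */
} arc_buf_hdr_crypt_t;

/* arc_buf_hdr_t embeds this as b_crypt_hdr; only meaningful for protected blocks */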

Ok, arc_hdr_alloc_abd.

Whoa. ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); We assert that LSIZE > 0. BUT, arc_alloc_buf calls arc_hdr_alloc with the same size var for psize and lsize. Therefore PSIZE must be > 0 ?!
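
Spelling that out, an abridged paraphrase of arc_alloc_buf (argument list as reconstructed in the call stack above; treat it as a sketch rather than exact source):

arc_buf_t *
arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size)
{
        /* the single caller-supplied size becomes both psize and lsize */
        arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
            B_FALSE, ZIO_COMPRESS_OFF, 0, type, B_FALSE);
        /* ... */
}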


ALL STOP

We've been assuming (ish) since the start that the assert was tripped because size is 0, but let's go back and look at it, because that's not exactly what it said.

[Wed Jun  9 18:42:10 2021] VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)

Big number! Is there some chance that we're over/underflowing in all this?

void *
zio_data_buf_alloc(size_t size)
{
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

        VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

According to the assert, c is:

-> 36028797018963967
 = 0x007f ffff ffff ffff

#define  SPA_MINBLOCKSHIFT   9

  (size_t) 0 - 1 = 0xffff ffff ffff ffff
             >>9 = 0x007f ffff ffff ffff

Fine, so the underflow accounts for the big number.

#define  SPA_MAXBLOCKSHIFT   24
#define SPA_MAXBLOCKSIZE    (1ULL << SPA_MAXBLOCKSHIFT)

SPA_MAXBLOCKSIZE = 1 << 24 = 0x0100 0000

SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT = 0x0100 0000 >> 9 = 0x8000 = 32768

Ok, this all checks out. I feel like zero is a bit of a special case that might warrant its own check, but this is ok.
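
As a standalone sanity check of that arithmetic (plain C, not ZFS code), on a 64-bit machine a zero-byte request produces exactly the 36028797018963967 from the panic message:

#include <stdio.h>
#include <stddef.h>

#define SPA_MINBLOCKSHIFT  9
#define SPA_MAXBLOCKSHIFT  24
#define SPA_MAXBLOCKSIZE   (1ULL << SPA_MAXBLOCKSHIFT)

int
main(void)
{
        size_t size = 0;                               /* the suspect input */
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;    /* underflows to 0x007fffffffffffff */

        printf("c     = %zu\n", c);                    /* 36028797018963967 */
        printf("limit = %llu\n", SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);  /* 32768 */
        printf("VERIFY3U(c, <, limit) %s\n",
            c < (SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) ? "passes" : "fails");
        return (0);
}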


Ahh no, it's an ASSERT, not a VERIFY, and ASSERTs are compiled out of production builds. Dammit. Ok, so a zero can sneak through.

IMPLY(alloc_rdata, HDR_PROTECTED(hdr)); again we see that alloc_rdata is all about encrypted blocks, which this is not.

And then we end up in the non-rdata path, and fall into the allocator, and crash.

Claim: You cannot ask the ARC to allocate a zero-sized block

This intuitively makes sense. What would it even mean for a zero-sized block to be cached?

Can a zero-sized block even exist?

Assuming no memory corruption, what could make a zero-sized block?

zfs/arc.c
1745:   HDR_SET_LSIZE(hdr, size);
1746:   HDR_SET_PSIZE(hdr, psize);
3306:   HDR_SET_PSIZE(hdr, psize);
3307:   HDR_SET_LSIZE(hdr, lsize);
6780:   HDR_SET_PSIZE(hdr, psize);

arc_write_ready appears to be something that populates the ARC after a write. It has handling for a zero-sized block (a "hole"). So to some extent a zero-sized ARC block header can exist, though it's not clear to me if it's a sort of semi-detached thing wrapped in an arc_buf_hdr_t, or actually "in" the ARC.

I don't know enough yet to say whether it makes any sense for the dbuf stuff to hold a reference to one of these things - I don't yet know what that subsystem is and I would like to delay finding out. Let's try another approach.

Who tried to allocate a zero-sized block anyway?

For us, it was dbuf_alloc_arcbuf_from_arcbuf, which takes an existing arc_buf_t and allocates a new one of the same kind... but not with the same data? Just an empty buffer? What does that even mean?

dbuf_alloc_arcbuf_from_arcbuf was introduced in 13fac098, to consolidate three very similar bits of arcbuf-copying code. In all three cases, they seek to clone the arcbuf inside the dbuf (db->db_buf).

In the first and third cases (dbuf_fix_old_data and dbuf_sync_leaf), the size calculation for the new (non-encrypted, non-compressed) arcbuf appears to be unchanged (taken from arc_buf_size(...)).

In the second case in dbuf_hold_copy, the old version used db->db.db_size for the new arcbuf, not the arc_buf_size of the old one. Is this significant?

Possibly! It's not easy to tell, but whatever this db_size business is, there are a lot of places checking that it's non-zero (sometimes explicitly, sometimes as some multiple of the block size).

It seems clear from the surrounding code (including the caller, dbuf_hold_copy) that this path expects to be copying data.

Best guess: 13fac098 introduced a bug; it should be creating the new arcbuf with the size of the dbuf, not the size of the old arcbuf (which might be zero).
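
If that guess is right, the change would look something like this in the consolidated helper (a hypothetical sketch following the reasoning above, not the actual upstream patch):

} else {
        /*
         * Hypothetical: size the replacement arcbuf from the dbuf, as the
         * pre-13fac098 dbuf_hold_copy code did, rather than from the source
         * arcbuf, whose size can apparently be zero here.
         */
        data = arc_alloc_buf(spa, db, type, db->db.db_size);
}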