In 2021, Fastmail began moving to a new storage architecture built on OpenZFS RAIDZ2 pools on top of banks of NVMe drives. During shakedown I discovered a performance bottleneck, and soon after entering production, a machine panicked. Both of these required some study to understand the causes, and as a result, both issues are now resolved upstream.
IO is bottlenecked through a single lock
See openzfs/zfs#12121 for the original issue on the issue tracker.
Describe the problem you're observing
tl;dr: IO is bottlenecked through the kstat lock in vdev_queue.c.
I have a large pool of 24x 8T NVMe devices (Intel DC P4510), configured as a single pool of 2x12 RAIDZ2.
NAME            STATE     READ WRITE CKSUM
i44cyrus        ONLINE       0     0     0
  raidz2-0      ONLINE       0     0     0
    i44disk1    ONLINE       0     0     0
    ...
    i44disk11   ONLINE       0     0     0
    i44disk12   ONLINE       0     0     0
  raidz2-1      ONLINE       0     0     0
    i44disk13   ONLINE       0     0     0
    ...
    i44disk24   ONLINE       0     0     0
Yesterday I started bringing in some 40T of data from a different machine using a normal send | recv over a direct 10G connection. It appears that we're getting a sustained throughput of ~250M/s. We haven't done a lot of tuning yet so I wasn't entirely sure if this was "good" or not, but it felt low, so I went digging.
(note: there is nothing else of any consequence running on this machine; just basic OS services)
On the receiving side, perf top reported that 25% of its time was spent in osq_lock, the kernel's optimistic spin lock used while waiting on contended mutexes. So I fired up bpftrace and got to work.
Bucketing by the lock pointer shows that there's a single lock receiving a lot more attention than the others:
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock { @[arg0] = count(); }'
Attaching 1 probe...
^C
@[-117676299645548]: 1
@[-117669704412716]: 1
@[-117485265829068]: 1
@[-117625234665004]: 1
@[-117458348531436]: 1
...
@[-117723405763756]: 3502
@[-117723442693292]: 3509
@[-117716388578476]: 3540
@[-117723405747372]: 3573
@[-117719621901892]: 11432
@[-117711093901308]: 144076
Inspecting this single lock:
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }'
stdin:1:23-25: WARNING: comparison of integers of different signs: 'unsigned int64' and 'int64' can lead to undefined behavior
kprobe:osq_lock /arg0 == -117711093901308/ { @[kstack(3)] = count() }
~~
Attaching 1 probe...
^C
@[
osq_lock+1
__mutex_lock.isra.9+137
vdev_queue_io_to_issue+2778
]: 93753
@[
osq_lock+1
__mutex_lock.isra.9+137
vdev_queue_io_done+235
]: 120459
@[
osq_lock+1
__mutex_lock.isra.9+137
vdev_queue_io+273
]: 205863
@[
osq_lock+1
__mutex_lock.isra.9+137
vdev_queue_io_remove+124
]: 208130
I didn't know very much about the internals at this point, and I was assuming there was a single lock around all devices or something like that. After studying vdev_queue_io and the headers I saw that a struct vdev has the vdev_queue (and its vq_lock) embedded in it, so the vdev pointer is a reasonable proxy for the lock for tracing purposes, and so:
[fastmail root(robn)@imap44 ~]# cat zio.bt
#include <sys/zio.h>

kprobe:vdev_queue_io
{
	@[((struct zio *)arg0)->io_vd] = count()
}
[fastmail root(robn)@imap44 ~]# bpftrace -I/usr/include/libspl -I/usr/include/libzfs zio.bt
Attaching 1 probe...
^C
@[0xffff94ee60054000]: 14228
@[0xffff94edce118000]: 14288
@[0xffff94f06863c000]: 14316
@[0xffff94f0034a8000]: 14342
@[0xffff94edfb11c000]: 14491
@[0xffff94edce9f4000]: 14560
@[0xffff94f004790000]: 14703
@[0xffff94ee62378000]: 14785
@[0xffff94ee6d3a8000]: 14799
@[0xffff94edce11c000]: 14855
@[0xffff94edfc550000]: 14867
@[0xffff94f0034ac000]: 14954
@[0xffff94edce358000]: 14996
@[0xffff94ee60040000]: 15017
@[0xffff94ee60050000]: 15065
@[0xffff94f0034a0000]: 15066
@[0xffff94f004794000]: 15110
@[0xffff94ee6d3ac000]: 15184
@[0xffff94edfb118000]: 15191
@[0xffff94edfc554000]: 15199
@[0xffff94f0034a4000]: 15232
@[0xffff94ee6237c000]: 15255
@[0xffff94edce9f0000]: 15255
@[0xffff94edce35c000]: 15273
24 separate vdevs, each receiving roughly the same amount of traffic, and none of them with an outlier. So vdev locks are not responsible!
Reading the code, vdev_queue_io (and friends) only work on the per-vdev lock, so I couldn't understand how the single busy lock was matching the traces that I had. But then I saw that vdev_queue_io_add is only called from one place and is static, so it has been inlined. But lo, there's another lock in there, and a global lock at that.
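To make the shape of that concrete, here's a heavily simplified sketch. The types and helpers below are stand-ins, not the real OpenZFS definitions, but the pattern matches what the traces show: the per-vdev queue work takes a per-vdev lock, and then every IO also takes a single shared stats lock.

/*
 * Illustrative sketch only -- the types and helpers here are stand-ins,
 * not the real OpenZFS code.
 */
typedef struct { int locked; } kmutex_t;
static void mutex_enter(kmutex_t *m) { m->locked = 1; }
static void mutex_exit(kmutex_t *m)  { m->locked = 0; }

typedef struct vdev_queue {
	kmutex_t vq_lock;            /* per-vdev lock; 24 of these in this pool */
	/* per-vdev pending/active queues ... */
} vdev_queue_t;

typedef struct vdev {
	vdev_queue_t vdev_queue;     /* embedded, so the vdev pointer stands in for its lock */
} vdev_t;

/* One global stats structure shared by every vdev. */
static struct {
	kmutex_t lock;               /* the single hot lock in the osq_lock traces */
	unsigned long waitq_count;   /* stand-in for the kstat wait-queue counters */
} io_stats;

static void
vdev_queue_io_sketch(vdev_t *vd)
{
	/* This part scales: each vdev has its own lock. */
	mutex_enter(&vd->vdev_queue.vq_lock);
	/* ... add the zio to this vdev's pending queue ... */
	mutex_exit(&vd->vdev_queue.vq_lock);

	/* This part does not: every IO on every vdev takes the same lock. */
	mutex_enter(&io_stats.lock);
	io_stats.waitq_count++;      /* stand-in for kstat_waitq_enter() */
	mutex_exit(&io_stats.lock);
}

With 24 busy NVMe leaf vdevs, the per-vdev locks spread the load, but the shared stats lock is taken for every single IO on every one of them.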
We can't probe an inlined function, but we can probe the thing it calls:
[fastmail root(robn)@imap44 ~]# bpftrace -e 'kprobe:kstat_waitq_enter { @[kstack(3)] = count() }'
Attaching 1 probe...
^C
@[
kstat_waitq_enter+1
vdev_queue_io+304
zio_vdev_io_start+156
]: 270092
And there it is: all the vdev ops are waiting on a single lock, for a stats counter no less! Oh no!
I don't have a great theory, but I suspect this isn't as noticeable on higher-latency devices (spinners) because the devices themselves are slower than the lock wait, and isn't as noticeable on NVMe mirrors because a mirror usually only has two drives, so the contention isn't amplified nearly as much.
Describe how to reproduce the problem
It should be enough to have a wide RAIDZ vdev on top of low-latency devices, and then just pour data into it with zfs recv.
(I can reproduce this reliably; however, I'm on tight deadlines so I don't immediately have much time for experimenting. But I would like to help, and will have a lot more hardware available to work with in the next week or two.)
ARC Panic analysis
See openzfs/zfs#11531 (comment) for a summary of this on the issue tracker.
What is an ARC block header?
The crash involves the ARC, which is built out of block headers, so we need to know about them!
- the ARC is a read cache
- the blocks in it are variable size (512-128K) (from the description at the top of arc.c)
- the block header arc_buf_hdr_t has either a l1arc_buf_hdr_t for an in-memory block, or a l2arc_buf_hdr_t for an on-disk L2ARC block (see the simplified struct sketch after this list)
- l1arc_buf_hdr_t has a copy of the on-disk data in its b_pabd
- b_pabd can be compressed or uncompressed (it's a copy of the on-disk data)
- there's also arc_buf_t, which is handed out to consumers and has its own decompressed buffer (unless the consumer wanted a compressed buffer, eg zfs send -c)
- writes are (naturally) a little complicated, but seem to mostly be doing the work to make sure the above holds (esp wrt compressed data on disk; b_pabd is recreated after the write to disk with the compressed form)
- encrypted data is stored in the L1ARC with a layer of indirection: the encrypted form is in hdr->b_crypt_hdr.b_rabd, the decrypted form goes into b_pabd, and if a caller wants the encrypted form, the arc_buf_t gets its own copy of it
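For orientation, here's a trimmed sketch of how those pieces fit together. The field and type names follow the list above (and arc.h), but these definitions are illustrative stand-ins, not the real ones:

/* Simplified sketch of the relationships described above; not the real arc.h. */
typedef struct abd abd_t;              /* buffer holding a copy of the on-disk data */

typedef struct l1arc_buf_hdr {
	abd_t *b_pabd;                 /* copy of the on-disk data, maybe compressed */
	/* ... state, refcounts, list of arc_buf_t consumers ... */
} l1arc_buf_hdr_t;

typedef struct l2arc_buf_hdr {
	long b_placeholder;            /* ... location of the block on the L2ARC device ... */
} l2arc_buf_hdr_t;

typedef struct arc_buf_hdr_crypt {
	abd_t *b_rabd;                 /* "raw" (encrypted) copy, only when encryption is on */
} arc_buf_hdr_crypt_t;

typedef struct arc_buf_hdr {
	/* lsize/psize, compression type, flags ... */
	arc_buf_hdr_crypt_t b_crypt_hdr;
	l1arc_buf_hdr_t     b_l1hdr;   /* used for an in-memory block ... */
	l2arc_buf_hdr_t     b_l2hdr;   /* ... or this, for an on-disk L2ARC block */
} arc_buf_hdr_t;

/* What consumers actually get handed. */
typedef struct arc_buf {
	arc_buf_hdr_t *b_hdr;
	void          *b_data;         /* decompressed copy (or compressed, eg zfs send -c) */
} arc_buf_t;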
Why did we crash?
[Wed Jun 9 18:42:10 2021] VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
[Wed Jun 9 18:42:10 2021] PANIC at zio.c:314:zio_data_buf_alloc()
[Wed Jun 9 18:42:10 2021] zio_data_buf_alloc+0x52/0x60 [zfs]
[Wed Jun 9 18:42:10 2021] abd_alloc_linear+0x7f/0xc0 [zfs]
[Wed Jun 9 18:42:10 2021] arc_hdr_alloc_abd+0x56/0xb0 [zfs]
[Wed Jun 9 18:42:10 2021] arc_hdr_alloc+0xed/0x160 [zfs]
[Wed Jun 9 18:42:10 2021] arc_alloc_buf+0x45/0xd0 [zfs]
[Wed Jun 9 18:42:10 2021] dbuf_alloc_arcbuf_from_arcbuf+0xf7/0x190 [zfs]
[Wed Jun 9 18:42:10 2021] ? _cond_resched+0x15/0x30
[Wed Jun 9 18:42:10 2021] ? mutex_lock+0xe/0x30
[Wed Jun 9 18:42:10 2021] dbuf_hold_copy.isra.21+0x31/0xa0 [zfs]
[Wed Jun 9 18:42:10 2021] dbuf_hold_impl+0x43e/0x600 [zfs]
[Wed Jun 9 18:42:10 2021] dbuf_hold+0x2c/0x60 [zfs]
[Wed Jun 9 18:42:10 2021] dmu_buf_hold_array_by_dnode+0xda/0x4a0 [zfs]
[Wed Jun 9 18:42:10 2021] dmu_read_uio_dnode+0x47/0xf0 [zfs]
[Wed Jun 9 18:42:10 2021] ? zfs_rangelock_enter_impl+0x25b/0x560 [zfs]
[Wed Jun 9 18:42:10 2021] dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[Wed Jun 9 18:42:10 2021] zfs_read+0x135/0x3b0 [zfs]
[Wed Jun 9 18:42:10 2021] zpl_iter_read+0xe3/0x180 [zfs]
[Wed Jun 9 18:42:10 2021] new_sync_read+0x118/0x1a0
[Wed Jun 9 18:42:10 2021] vfs_read+0xf1/0x180
[Wed Jun 9 18:42:10 2021] ksys_read+0x59/0xd0
[Wed Jun 9 18:42:10 2021] do_syscall_64+0x33/0x80
Call stack reconstruction (filling in as we trace code, see below):
zio_data_buf_alloc(size=0)
abd_alloc_linear(size=0, is_metadata=false)
abd_alloc(size=0, is_metadata=false) [tailcall]
arc_get_data_abd(hdr, size=0, [tag=hdr], do_adapt=true) [inlined]
arc_hdr_alloc_abd(hdr, alloc_flags=DO_ADAPT)
arc_hdr_alloc(spa, psize=0, lsize=0, protected=false, compression_type=off, complevel=0, type=ARC_BUFC_DATA, alloc_rdata=false)
arc_alloc_buf(spa, tag, type=ARC_BUFC_DATA, size=0)
dbuf_alloc_arcbuf_from_arcbuf(db, arc_buf_t *data)
dbuf_alloc_arcbuf_from_arcbuf has:

psize = arc_buf_size(data)
if (arc_is_encrypted(data)) {
	...
} else if (arc_get_compression(data) != ZIO_COMPRESS_OFF) {
	...
} else {
	data = arc_alloc_buf(spa, db, type, psize)
}
The only direct call to arc_alloc_buf is in the non-encrypted, non-compressed case.
- arc_alloc_buf calls arc_hdr_alloc wanting no trimmings, no allocations, PSIZE=LSIZE. Hmm.
- flags = DO_ADAPT, but definitely not ALLOC_RDATA. Ok...
- Created header gets psize & lsize set the same.
- What's the hdr type? This seems important.
  - We know that the later allocations are is_metadata=false, and arc_hdr_alloc asserts that it's either ARC_BUFC_DATA or ARC_BUFC_METADATA...
  - yeah, see arc_hdr_alloc_abd. It has to be ARC_BUFC_DATA.
- The created hdr is an empty L1ARC header, no encryption, no compression. Size currently unknown except by inference (the zero alloc).
Ok this is weird. Yesterday we swore up and down that PSIZE=0 & ALLOC_RDATA caused the crash, but it seems like ALLOC_RDATA can't possibly be set? I will go back over it.
Yeah, pretty sure it's impossible, because this is an unencrypted block.
Aha, the R in RDATA is for raw, which is the name for hdr->b_crypt_hdr.b_rabd per the arc.c comments. Earlier versions (before DO_ADAPT was added) actually called it encrypted_read, so that confirms it.
Ok, arc_hdr_alloc_abd.
Whoa: ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); we assert that LSIZE > 0.
BUT, arc_alloc_buf calls arc_hdr_alloc with the same size var for psize and lsize. Therefore PSIZE must be > 0 ?!
ALL STOP
We've been assuming (ish) since the start that the assert was tripped because size is 0, but let's go back and look at it, because that's not exactly what it said.
[Wed Jun 9 18:42:10 2021] VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
Big number! Is there some chance that we're over/underflowing in all this?
void *
zio_data_buf_alloc(size_t size)
{
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

	VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
According to the assert, c is:

-> 36028797018963967 ~= 0x7fffffffffffff

#define SPA_MINBLOCKSHIFT 9

(size_t) 0 - 1 = 0xffff ffff ffff ffff
          >> 9 = 0x007f ffff ffff ffff

Fine, that conversion is ok: size is a size_t, so a zero underflows to the largest 64-bit value, and shifting that right by 9 gives exactly the value in the assert.
#define SPA_MAXBLOCKSHIFT 24
#define SPA_MAXBLOCKSIZE (1ULL << SPA_MAXBLOCKSHIFT)
SPA_MAXBLOCKSIZE = 1 << 24 = 0x0100 0000
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT = 0x0100 0000 >> 9 = 0x8000 = 32768
Ok, this all checks out. I feel like zero is a bit of a special case that might warrant its own check, but this is ok.
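As a quick sanity check on the numbers, here's a tiny standalone program using the constants quoted above; it reproduces both values in the VERIFY message:

/* Standalone check of the arithmetic above; constants copied from the
 * spa.h values quoted in this section. */
#include <stdio.h>
#include <stddef.h>

#define SPA_MINBLOCKSHIFT 9
#define SPA_MAXBLOCKSHIFT 24
#define SPA_MAXBLOCKSIZE  (1ULL << SPA_MAXBLOCKSHIFT)

int
main(void)
{
	size_t size = 0;                               /* the bad input */
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;    /* underflows to 2^64-1, then >> 9 */

	printf("c     = %zu\n", c);                    /* 36028797018963967 */
	printf("limit = %llu\n",
	    (unsigned long long)(SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT));  /* 32768 */
	return 0;
}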
Ahh no, we assert, we do not verify. Dammit. Ok, so a zero can sneak through.
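For context (my summary of the convention, not a quote from the ZFS headers): ASSERT-family checks only exist in debug builds, while VERIFY-family checks are always compiled in. A rough illustration; the real macros in the ZFS source are more elaborate:

/* Rough illustration of the ASSERT vs VERIFY split; not the real ZFS macros. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define VERIFY3U(l, op, r) \
	do { \
		if (!((uint64_t)(l) op (uint64_t)(r))) { \
			fprintf(stderr, "VERIFY3(" #l " " #op " " #r ") failed\n"); \
			abort();   /* stands in for the kernel panic */ \
		} \
	} while (0)

#ifdef ZFS_DEBUG
#define ASSERT3U(l, op, r)  VERIFY3U(l, op, r)   /* checked in debug builds */
#else
#define ASSERT3U(l, op, r)  ((void)0)            /* compiled out in production */
#endif

int
main(void)
{
	uint64_t lsize = 0;

	ASSERT3U(lsize, >, 0);                   /* debug builds stop here */
	VERIFY3U((lsize - 1) >> 9, <, 32768);    /* production builds die here instead */
	return 0;
}

With a production build, the ASSERT3U on LSIZE is a no-op, so the zero only gets caught much later, by the VERIFY in zio_data_buf_alloc.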
IMPLY(alloc_rdata, HDR_PROTECTED(hdr)); again we see that alloc_rdata is all about encrypted blocks, which this is not. And then we end up in the non-rdata path, fall into the allocator, and crash.
Claim: You cannot ask the ARC to allocate a zero-sized block
This intuitively makes sense. What would it even mean for a zero-sized block to be cached?
Can a zero-sized block even exist?
Assuming no memory corruption, what could make a zero-sized block?
zfs/arc.c
1745: HDR_SET_LSIZE(hdr, size);
1746: HDR_SET_PSIZE(hdr, psize);
3306: HDR_SET_PSIZE(hdr, psize);
3307: HDR_SET_LSIZE(hdr, lsize);
6780: HDR_SET_PSIZE(hdr, psize);
- arc_buf_alloc_l2only: separate lsize/psize, lsize under ASSERT
- arc_hdr_alloc: separate lsize/psize
- arc_write_ready: ooh interesting
arc_write_ready appears to be something that populates the ARC after a write. It has handling for a zero-sized block (a "hole"). So to some extent a zero-sized ARC block header can exist, though it's not clear to me if it's a sort of semi-detached thing wrapped in an arc_buf_hdr_t, or actually "in" the ARC.
I don't know enough yet to know if it makes any sense for the dbuf stuff to hold a reference to one of these things - I don't yet know what that subsystem is, and I would like to delay finding out. Let's try another approach.
Who tried to allocate a zero-sized block anyway?
For us, it was dbuf_alloc_arcbuf_from_arcbuf, which takes an existing arc_buf_t and gets a new one of the same kind... but not with the same data? Just an empty buffer? What does that even mean?
dbuf_alloc_arcbuf_from_arcbuf was introduced in 13fac098, to consolidate three very similar bits of arcbuf-copying code. In all three cases, they seek to clone the arcbuf inside the dbuf, db->db_buf.
In the first and third cases (dbuf_fix_old_data and dbuf_sync_leaf), the new (non-encrypted, non-compressed) arcbuf size calc appears to be the same (from arc_buf_size(...)).
In the second case, in dbuf_hold_copy, the old version used db->db.db_size for the new arcbuf, not the arc_buf_size of the old one. Is this significant?
Possibly! It's not easy to tell, but whatever this db_size business is, there are a lot of places checking that it's non-zero (sometimes explicitly, sometimes as some multiple of the block size).
It also seems clear from the surrounding code (including the caller, dbuf_hold_copy) that it expects to be copying data.
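To make the suspected difference concrete, here's a toy comparison. Everything in it is made up for illustration; only the shape of the change (db->db.db_size versus arc_buf_size(data)) comes from the analysis above.

/* Toy demonstration only: the types, stub helper, and sizes here are made up. */
#include <stdio.h>
#include <stdint.h>

struct toy_dbuf   { uint64_t db_size; };   /* stands in for db->db.db_size */
struct toy_arcbuf { uint64_t b_size;  };   /* stands in for the source arcbuf's own size */

static uint64_t arc_buf_size_stub(const struct toy_arcbuf *b) { return b->b_size; }

int
main(void)
{
	struct toy_dbuf   db  = { .db_size = 4096 };  /* dbuf sizes are checked non-zero */
	struct toy_arcbuf buf = { .b_size  = 0    };  /* suppose the source arcbuf reports 0 */

	/* Old dbuf_hold_copy: size the new arcbuf from the dbuf. */
	printf("old size calc: %llu\n", (unsigned long long)db.db_size);

	/* New dbuf_alloc_arcbuf_from_arcbuf: size it from the existing arcbuf. */
	printf("new size calc: %llu\n",
	    (unsigned long long)arc_buf_size_stub(&buf));  /* 0 -> boom in zio_data_buf_alloc */
	return 0;
}

If arc_buf_size(data) can come out as zero where db->db.db_size could not, that would explain the zero-sized allocation that tripped the VERIFY.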