In 2024, a Klara customer was having a problem where their pools would sometimes suspend while receiving a replication stream. No hardware errors were ever seen, clearing the pool error state would get them going again, and resending the stream would usually work correctly. There was no obvious reason.
We gave the customer a trace script that would report the contents of the IO object that triggered the suspension. Armed with its output and the source code, I got to work following the breadcrumbs to understand what went wrong.
Typical trace output:
2024 Oct 6 08:37:48 zio_done, kernel (0):
zio addr: 0xfffff81e6dfb59a0
io_error: 28
io_stage: 0x1000000
io_pipeline: 0x1080000
io_pipeline_trace: 0x10980f5
io_flags: 0x1804000
io_type: 2
io_child_type: 3
io_child_error[0]: 0
io_child_error[1]: 0
io_child_error[2]: 0
io_child_error[3]: 0
io_size: 0
io_lsize: 16384
io_txg: 40374717
io_priority: 3
io_offset: 0x0
io_state[0]: 1
io_state[1]: 0
io_state[2]: 0
io_state[3]: 189
io_orig_size: 0
io_orig_flags: 0x1800000
io_orig_stage: 0x1
io_orig_pipeline: 0x17980f4
zb_objset: 371010
zb_object: 0
zb_level: 0
zb_blkid: 18075
abd_size: 16384
abd_flags: 0x201
io_executor: kernel
bp prop: 0xa00a0702001f001f
Annotated:
2024 Oct 6 08:37:48 zio_done, kernel (0):
zio addr: 0xfffff81e6dfb59a0
io_error: 28 ENOSPC
io_stage: 0x1000000 DONE
io_pipeline: 0x1080000 READY DONE
io_pipeline_trace: 0x10980f5 OPEN WRITE_BP_INIT ISSUE_ASYNC WRITE_COMPRESS
ENCRYPT CHECKSUM_GENERATE DVA_THROTTLE
DVA_ALLOCATE READY DONE
io_flags: 0x1804000 IO_ALLOCATING RAW_COMPRESS RAW_ENCRYPT
io_type: 2 WRITE
io_child_type: 3 LOGICAL
io_child_error[0]: 0
io_child_error[1]: 0
io_child_error[2]: 0
io_child_error[3]: 0
io_size: 0 0
io_lsize: 16384 16384
io_txg: 40374717
io_priority: 3 ASYNC_WRITE
io_offset: 0x0
io_state[0]: 1
io_state[1]: 0
io_state[2]: 0
io_state[3]: 189
io_orig_size: 0 0
io_orig_flags: 0x1800000 RAW_COMPRESS RAW_ENCRYPT
io_orig_stage: 0x1 OPEN
io_orig_pipeline: 0x17980f4 WRITE_BP_INIT ISSUE_ASYNC WRITE_COMPRESS
ENCRYPT CHECKSUM_GENERATE DVA_THROTTLE
DVA_ALLOCATE READY VDEV_IO_START VDEV_IO_DONE
VDEV_IO_ASSESS DONE
zb_objset: 371010
zb_object: 0
zb_level: 0
zb_blkid: 18075
abd_size: 16384 16K
abd_flags: 0x201 LINEAR ALLOCD
io_executor: kernel
bp prop: 0xa00a0702001f001f
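Those stage annotations can be reproduced mechanically. Here is a rough standalone decoder for the io_stage/io_pipeline bitfields; the stage bit positions are copied by hand from my reading of OpenZFS 2.1's zio_impl.h, so treat the table as an assumption to verify against your own source tree:
#include <stdio.h>
#include <stdint.h>
/* ZIO stage names, in bit order (assumed from zio_impl.h, OpenZFS 2.1) */
static const char *stages[] = {
	"OPEN", "READ_BP_INIT", "WRITE_BP_INIT", "FREE_BP_INIT",
	"ISSUE_ASYNC", "WRITE_COMPRESS", "ENCRYPT", "CHECKSUM_GENERATE",
	"NOP_WRITE", "DDT_READ_START", "DDT_READ_DONE", "DDT_WRITE",
	"DDT_FREE", "GANG_ASSEMBLE", "GANG_ISSUE", "DVA_THROTTLE",
	"DVA_ALLOCATE", "DVA_FREE", "DVA_CLAIM", "READY",
	"VDEV_IO_START", "VDEV_IO_DONE", "VDEV_IO_ASSESS",
	"CHECKSUM_VERIFY", "DONE",
};
static void
decode(const char *name, uint32_t v)
{
	printf("%s: 0x%x =", name, (unsigned)v);
	for (int b = 0; b < 25; b++)
		if (v & (1u << b))
			printf(" %s", stages[b]);
	printf("\n");
}
int
main(void)
{
	decode("io_stage", 0x1000000);
	decode("io_pipeline", 0x1080000);
	decode("io_pipeline_trace", 0x10980f5);
	decode("io_orig_pipeline", 0x17980f4);
	return (0);
}
Run against the values in the dump, this reproduces the stage lists above.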
Interesting facts:
- This is coming from an incoming stream
- size 0
- raw compress & raw encrypt
- FreeBSD 13 (OpenZFS 2.1)
- Pool is data1, 7x 12-wide raidz3, slog 2-wide mirror, single cache
Breadcrumbs
Pool suspension
In zio_done(), the above state sets ZIO_REEXECUTE_SUSPEND:
if (zio->io_error && zio == zio->io_logical) {
if (IO_IS_ALLOCATING(zio) &&
!(zio->io_flags & ZIO_FLAG_CANFAIL)) {
if (zio->io_error != ENOSPC)
zio->io_reexecute |= ZIO_REEXECUTE_NOW;
else
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
}
Crucially, the error is ENOSPC. What sets this?
We never got to VDEV_IO_START, even though it is in the original pipeline, so the last pre-DONE stage must have changed the pipeline to INTERLOCK. zio_ready():
if (zio->io_error != 0) {
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
So the error is set before that. Presumably by the immediate next stage, DVA_ALLOCATE. zio_dva_allocate() sets io_error to any of its accumulated errors. io_size is zero, so it can't have come from ganging (minimum 512b), so it must have come from one of two metaslab_alloc() calls. These differ only in their choice of class, which for now I think is irrelevant.
error = metaslab_alloc(spa, mc, zio->io_size, bp,
zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
&zio->io_alloc_list, zio, zio->io_allocator);
Allocations are done in a loop, up to the number of wanted DVAs. At the first sign of trouble, what we've done so far is unwound and the error is returned.
error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
txg, flags, zal, allocator);
This aborts early with ENOSPC for 3% of calls if metaslab_force_ganging is lower than psize. psize is 0 here, so we move on.
A metaslab group is selected. We check to see if it's allocatable; this hardly considers psize at all. For our purposes, I'm assuming that whatever it chose can be allocated from, that is, it exists, is not full, etc.
We get an allocation size appropriate for the vdev:
uint64_t asize = vdev_psize_to_asize(vd, psize);
ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
That bounces through to the vdev's vdev_op_asize(). On this pool, that will be vdev_raidz_asize().
static uint64_t
vdev_raidz_asize(vdev_t *vd, uint64_t psize)
{
vdev_raidz_t *vdrz = vd->vdev_tsd;
uint64_t asize;
uint64_t ashift = vd->vdev_top->vdev_ashift;
uint64_t cols = vdrz->vd_logical_width;
uint64_t nparity = vdrz->vd_nparity;
asize = ((psize - 1) >> ashift) + 1;
asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
asize = roundup(asize, nparity + 1) << ashift;
return (asize);
}
And that's kind of DOA. With psize == 0, it immediately underflows, and there's no way back. If I got the math right, on these vdevs that returns an asize of about 5461 PiB. (Yes: over 5 EXABYTES.)
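The arithmetic is easy to check outside the kernel. Here is a minimal user-space sketch of the same computation (my reconstruction, not ZFS code; it assumes ashift=12, though the final shift cancels the first, so it barely matters):
#include <stdio.h>
#include <stdint.h>
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))
/* the vdev_raidz_asize() math for a 12-wide raidz3, as on this pool */
static uint64_t
raidz_asize(uint64_t psize)
{
	uint64_t ashift = 12, cols = 12, nparity = 3;
	/* with psize == 0, (psize - 1) wraps to UINT64_MAX */
	uint64_t asize = ((psize - 1) >> ashift) + 1;
	asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
	return (roundup(asize, nparity + 1) << ashift);
}
int
main(void)
{
	/* sane input: a 16K block costs 8 sectors (32K) after parity and padding */
	printf("asize(16K) = %llu\n", (unsigned long long)raidz_asize(16384));
	/* bad input: 6148914179236528128 bytes, ~5461.33 PiB */
	printf("asize(0)   = %llu\n", (unsigned long long)raidz_asize(0));
	return (0);
}
The unsigned wrap in (psize - 1) is the whole story: a zero-byte request becomes a request for roughly a third of the 64-bit address space.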
There's no point reading on. As far as metaslab_alloc() is concerned, ENOSPC is the right answer here.
Perhaps if we'd caught the zero earlier, we could have returned something else (EINVAL)? That would have got us some dbgmsg output from zio_dva_allocate(), but we would still suspend in zio_done(), because REEXECUTE_SUSPEND is the default for any unknown write error.
TODO
- Assert psize > 0 and/or return EINVAL in:
  - vdev_raidz_asize()
  - metaslab_alloc()
  - zio_dva_allocate() (curious: this already asserts io_size == BP_GET_PSIZE(bp))
Zero psize
io_size is 0. This is not illegal in all cases, but it is certainly unusual. There aren't many places that set io_size directly, and even fewer that set io_orig_size directly.
The only places outside of the IO pipeline that do this are in the L2ARC read paths (l2arc_read_done(), to move a finished L2 read buffer into L1, and l2arc_untransform(), which is called by l2arc_read_done() to handle decrypt and decompress on an ARC buffer). This pool does have a cache device, so we can't totally ignore this, but because it's all read paths it seems highly unlikely to be involved.
zio_create() is the most obvious place. It takes separate lsize and psize parameters, and is the only place in the system that creates ZIOs for execution.
It has this interesting and relevant assertion:
IMPLY(lsize != psize, (flags & ZIO_FLAG_RAW_COMPRESS) != 0);
Granted, this covers our case, but it also shows that it's a rare case.
Now, zio_create() cannot be called externally, so we need to look at what can call it and pass different lsize & psize through. Unsurprisingly, that's only zio_write(). It does no extra relevant sanitising, which is good for our looking around.
Where are the relevant calls to zio_write()? We're looking for any that can plausibly have different lsize and psize.
dbuf_sync_lightweight() is the only obvious one: lsize comes from dn->dn_datablksz, psize from abd_get_size(dr->dt.dll.dr_abd). But we need a psize of 0, which I don't think is possible, because dr_abd (from the dirty record, ie, what we're trying to write) is 16K (at least, I am assuming it is, from abd_size in the trace, though that comes from our own dtrace work).
noinline static void
dbuf_sync_lightweight(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
{
dnode_t *dn = dr->dr_dnode;
...
dr->dr_zio = zio_write(pio, dmu_objset_spa(dn->dn_objset),
dmu_tx_get_txg(tx), &dr->dr_bp_copy, dr->dt.dll.dr_abd,
dn->dn_datablksz, abd_get_size(dr->dt.dll.dr_abd),
&dr->dt.dll.dr_props, dbuf_lightweight_ready, NULL,
dbuf_lightweight_physdone, dbuf_lightweight_done, dr,
ZIO_PRIORITY_ASYNC_WRITE,
ZIO_FLAG_MUSTSUCCEED | dr->dt.dll.dr_flags, &zb);
I am however a little uncertain about the size args here. The arg order to zio_write() is data, lsize, psize. dn_datablksz is usually set through dnode_setdblksz(), which asserts that it must be at least SPA_MINBLOCKSIZE. That's fine, but there is a path through dnode_create() where it can be zero, and I'm not sure if it's possible for that to remain that way.
This is interesting to me from two angles:
- I don't understand why lsize takes the final on-disk size, while psize is given the size of the data proper. Is that not the wrong way around? Like, dr_abd has the pre-compression payload, ie, it's the logical size?
- In any case, from our trace, lsize clearly starts larger. Maybe lsize will be used differently once compression transforms are added?
I'll look into those more later.
The only other place where lsize and psize may not match is in arc_write():
zio = zio_write(pio, spa, txg, bp,
abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
(children_ready != NULL) ? arc_write_children_ready : NULL,
arc_write_physdone, arc_write_done, callback,
priority, zio_flags, zb);
arc_write() doesn't take separate sizes, but rather a single arc_buf_t *buf, and works out what to do from there. So the lsize here is HDR_GET_LSIZE, fine. The psize, however, is either HDR_GET_PSIZE or HDR_GET_LSIZE, using PSIZE if the buf is compressed.
So I suppose the question for that one is: can the data compress down to zero?
I’ll consider both parts separately. Really, we need to figure out which ones might get used from a stream receive, which hopefully narrows the path somewhat.
dbuf_sync_lightweight()?
What even is this codepath though? Is this a noun ("lightweight dbuf") or a verb ("lightweight sync")?
Right, so when syncing a dirty record, dbuf_sync_lightweight() is used when the dirty record has no attached dr_dbuf. The commentary for dmu_lightweight_write_by_dnode() is enlightening:
/*
* A "lightweight" write is faster than a regular write (e.g.
* dmu_write_by_dnode() or dmu_assign_arcbuf_by_dnode()), because it avoids the
* CPU cost of creating a dmu_buf_impl_t and arc_buf_[hdr_]_t. However, the
* data can not be read or overwritten until the transaction's txg has been
* synced. This makes it appropriate for workloads that are known to be
* (temporarily) write-only, like "zfs receive".
*
* A single block is written, starting at the specified offset in bytes. If
* the call is successful, it returns 0 and the provided abd has been
* consumed (the caller should not free it).
*/
int
dmu_lightweight_write_by_dnode(dnode_t *dn, uint64_t offset, abd_t *abd,
const zio_prop_t *zp, enum zio_flag flags, dmu_tx_t *tx)
{
dbuf_dirty_record_t *dr =
dbuf_dirty_lightweight(dn, dbuf_whichblock(dn, 0, offset), tx);
if (dr == NULL)
return (SET_ERROR(EIO));
dr->dt.dll.dr_abd = abd;
dr->dt.dll.dr_props = *zp;
dr->dt.dll.dr_flags = flags;
return (0);
}
Alright, this looks pretty plausibly like our entry point! And that would mean the passed-in abd would need to have abd_get_size(…) == 0.
Incoming stream
The only caller of dmu_lightweight_write_by_dnode() is flush_write_batch_impl() in dmu_recv.c. Alright, getting closer!
The relevant parts appear to be:
static int
flush_write_batch_impl(struct receive_writer_arg *rwa)
{
dnode_t *dn;
int err;
if (dnode_hold(rwa->os, rwa->last_object, FTAG, &dn) != 0)
return (SET_ERROR(EINVAL));
...
struct receive_record_arg *rrd;
while ((rrd = list_head(&rwa->write_batch)) != NULL) {
struct drr_write *drrw = &rrd->header.drr_u.drr_write;
abd_t *abd = rrd->abd;
ASSERT3U(drrw->drr_object, ==, rwa->last_object);
if (drrw->drr_logical_size != dn->dn_datablksz) {
/*
* The WRITE record is larger than the object's block
* size. We must be receiving an incremental
* large-block stream into a dataset that previously did
* a non-large-block receive. Lightweight writes must
* be exactly one block, so we need to decompress the
* data (if compressed) and do a normal dmu_write().
*/
...
} else {
...
/*
* Since this data can't be read until the receive
* completes, we can do a "lightweight" write for
* improved performance.
*/
err = dmu_lightweight_write_by_dnode(dn,
drrw->drr_offset, abd, &zp, zio_flags, tx);
}
(Incidentally, this explains why it's ok to use the ABD size as the psize in dbuf_sync_lightweight() -> zio_write(): this is the raw block direct off the stream, and is already compressed etc. It is the physical data.)
To get to dmu_lightweight_write_by_dnode():
- The dnode must already exist (or dnode_hold() would fail).
- drr_logical_size == dn_datablksz, otherwise it would enter the "split the block" path.
However, drr_logical_size is never passed to dmu_lightweight_write_by_dnode(); instead, dbuf_sync_lightweight() uses dn_datablksz as the logical size. They must be equal, though, or we wouldn't end up in this path.
But! Recall earlier I noticed a path through dnode_create() where dn_datablksz can be zero? What if this is a freshly-created dnode that took that path, so dn_datablksz is zero, and drr_logical_size is zero here too? It would get to dmu_lightweight_write_by_dnode() intact, and a dirty record would be entered. Sometime later, dn_datablksz is expanded to its proper 16K. Then the txg sync goes, and we issue zio_write() with lsize == 16K (from dn_datablksz) and psize == 0 (from the ABD), and all hell breaks loose.
I'll come back to that. First I want to find out where drr_logical_size and the ABD come from. Pretty much straight off the stream, but let's confirm.
receive_read_record() is where records are unpacked from the incoming stream. For a write:
case DRR_WRITE:
{
struct drr_write *drrw = &drc->drc_rrd->header.drr_u.drr_write;
int size = DRR_WRITE_PAYLOAD_SIZE(drrw);
abd_t *abd = abd_alloc_linear(size, B_FALSE);
DRR_WRITE_PAYLOAD_SIZE is:
#define DRR_WRITE_PAYLOAD_SIZE(drrw) \
(DRR_WRITE_COMPRESSED(drrw) ? (drrw)->drr_compressed_size : \
(drrw)->drr_logical_size)
So that tracks. If the payload size is zero, then the ABD will also be zero.
Hmm, but what does abd_alloc_linear() do when zero is requested? It does no minimum size checks, and for non-metadata allocations (second arg false), ends up in zio_data_buf_alloc(0).
void *
zio_data_buf_alloc(size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
}
On the surface, that looks like an ordinary overflow. Is it? (size_t)-1 is often a uint64_t, so let's say it wraps to UINT64_MAX. Shifting down 9 gets us to ~32PiB. Obviously, that is higher than SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT at a mere 32K, which means we should have tripped the VERIFY.
Which means either we haven't tried to allocate zero here, or the compiler has done something other than plain unsigned wraparound for 0-1, since it knows size_t is unsigned.
(Short pause while a FreeBSD 13 VM is spun up and a disassembler is pointed at zfs.ko.)
No, not that. Ultimately that -1 is just a dec, which won't be weird. So I feel fairly confident that had we called zio_data_buf_alloc() with zero size, it would have panicked.
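For what it's worth, the index arithmetic checks out the same way in user space. A standalone sketch, assuming a 64-bit size_t and the usual SPA constants (SPA_MINBLOCKSHIFT=9, SPA_MAXBLOCKSIZE=16M):
#include <stdio.h>
#include <stddef.h>
#define SPA_MINBLOCKSHIFT	9
#define SPA_MAXBLOCKSIZE	(16ULL << 20)
int
main(void)
{
	size_t size = 0;
	/* (0 - 1) wraps, so c becomes ~2^55 instead of a small cache index */
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
	size_t bound = SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT;	/* 32768 */
	printf("c = %zu, bound = %zu: VERIFY(c < bound) %s\n",
	    c, bound, c < bound ? "passes" : "fails");
	return (0);
}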
So, we can't have called abd_alloc_linear() with zero size. Which makes things interesting indeed.
TODO
- dmu_lightweight_write_by_dnode(): return EINVAL if abd_get_size() == 0?
- Explicit zero-size assertions:
  - zio_data_buf_alloc()
  - abd_alloc_linear()
Dnode origin stories
Ok, so let's say the ABD is real, and the initial psize is not zero. What if we were passing down a zero lsize instead? Yes, they're apparently the wrong way around, but I know that IO transforms can mess with them in flight, and our trace is from once the IO is actually done, not what got passed in.
We know drr_logical_size == dn_datablksz when the record arrives. We know dn_datablksz is used as the lsize in the call to zio_write(). And we know there's a path where it looks like it can be zero after create, and we know flush_write_batch_impl() expects the dnode to already be created. Is it possible that it exists, but whatever inflates its size hasn't happened yet? ie, a race in syncing context or something, nothing directly to do with the incoming stream?
static dnode_t *
dnode_create(objset_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db,
uint64_t object, dnode_handle_t *dnh)
{
dnode_t *dn;
dn = kmem_cache_alloc(dnode_cache, KM_SLEEP);
...
if (dnp->dn_datablkszsec) {
dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
} else {
dn->dn_datablksz = 0;
dn->dn_datablkszsec = 0;
dn->dn_datablkshift = 0;
}
So, if dnp->dn_datablkszsec is not set, then the dnode sizes get zeroed. Where does dnp come from?
[and then I went to dinner]
Vacation time 🏝️
Late on Thursday, here's what I know:
- the failure occurs because by the time we get to zio_dva_allocate, io_size is zero, and attempting a zero-byte allocation on a raidz vdev causes an underflow that is interpreted as a request for a ~5.3EiB allocation, which fails with ENOSPC.
- The trace reports io_size = 0 and io_lsize = 16K. It's possible for these to change during the pipeline (transforms). However, io_orig_size = 0, which can only mean that psize = 0 was passed to zio_create(), and so zio_write() (no other zio creation methods can supply separate psize and lsize).
- Further, io_lsize is never modified again after being set (except in a rare ZIL-related codepath, zio_shrink()). So we can be certain that these are the original values passed in to zio_create().
- By assertion in zio_create(), lsize and psize must be equal unless ZIO_FLAG_RAW_COMPRESS is set, which means either ZIO_FLAG_RAW_COMPRESS or ZIO_FLAG_RAW in the incoming flags.
- Only two call sites can possibly call zio_write() with different lsize and psize: dbuf_sync_lightweight() and arc_write(). "lightweight sync" is expressly designed for a kind of pass-through write and is only used by the stream receiver.
- dbuf_sync_lightweight() writes out from dirty records, setting lsize to dn_datablksz and psize to abd_get_size(abd). The abd is held on the dirty record.
  - The abd and ZIO flags for the dirty record are filled by dmu_lightweight_write_by_dnode(), which is only called from one place, flush_write_batch_impl() in dmu_recv.c.
  - This loops over the "write batch", a list of DRR_WRITE records from the stream. dmu_lightweight_write_by_dnode() can only be reached when drr_logical_size == dn_datablksz. Note that dn_datablksz is known to be 16K by the time we get to dbuf_sync_lightweight(), because it is the lsize passed down to zio_write().
- The abd passed to dmu_lightweight_write_by_dnode() must have abd_size = 0, because nothing else can change it. However, the abd_size in the trace still says 16K. That means that io_abd must have been changed along the way?
  - Nothing can modify dr->dt.dll.dr_abd after it's set in dmu_lightweight_write_by_dnode().
  - But this might be ok, as the zio pipeline can swap abds in and out via transforms.
- The abd passed down comes from rrd->abd, in the receive record arg. These are allocated as the records arrive. They are sized by either drr_logical_size or drr_compressed_size, as appropriate. In our case, these are compressed, so they will be drr_compressed_size, but that's by the by, as abd_alloc_linear(0) will overflow in zio_data_buf_alloc(), tripping a panic.
- The only possibility left is something that internally changes the abd to have abd_size = 0.
Unexplored directions
- It's very hard to make an abd with abd_size = 0. abd_get_from_buf() can, and I think abd_get_offset_struct() and abd_get_offset_size() both can, by requesting a zero-sized segment of an existing abd. But I don't see where else it can be asked for directly.
- But, I can't see what could be doing that. There's nothing manipulating the abd before it goes to dmu_lightweight_write_by_dnode().
- Could there be something issuing new IO from inside the pipeline that runs afoul of this? Ganging? Aggregation? I cannot see how.
Hello again
It's been a few weeks now. Tracing on dmu_lightweight_write_by_dnode() seems to indicate that it isn't involved: when the ENOSPC write is triggered, it does not return EINVAL. So let's go back to this.
arc_write()
Let's have a look at the other possible path identified above, through arc_write().
zio = zio_write(pio, spa, txg, bp,
abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
(children_ready != NULL) ? arc_write_children_ready : NULL,
arc_write_physdone, arc_write_done, callback,
priority, zio_flags, zb);
Recall we're looking for psize == 0. The three interesting arguments here are:
- abd_t *data: abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr))
- uint64_t size: HDR_GET_LSIZE(hdr)
- uint64_t psize: arc_buf_size(buf)
So here, arc_buf_size(buf) must have returned 0. It is:
uint64_t
arc_buf_size(arc_buf_t *buf)
{
return (ARC_BUF_COMPRESSED(buf) ?
HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
}
The HDR_GET_XSIZE macros return hdr->b_Xsize (for logical/physical). arc_hdr_realloc_crypt() is the only thing that sets these directly, when copying to a new hdr. While they can be set to zero there, it would require hitting a microscopic window and be a rookie-level locking bug, so I doubt it's implicated. Assuming not for now.
All other setting is via HDR_SET_XSIZE, and only from inside arc.c. Meanwhile, ARC_BUF_COMPRESSED checks buf->b_flags & ARC_BUF_FLAG_COMPRESSED, and ARC_BUF_FLAG_COMPRESSED is only modified in arc.c. This is good; it narrows the scope of our search.
Back to arc_write(). It takes an arc_buf_t as its input, which is then exploded into an abd, lsize and psize as above. The sanity checks are minimal, and it doesn't look like this case will hit them.
We know from the IO dump that ZIO_FLAG_RAW_COMPRESS and ZIO_FLAG_RAW_ENCRYPT are set. arc_write() takes a zio_flags, but there are only three call sites: dmu_sync() sets ZIO_FLAG_CANFAIL (0x8), while dmu_objset_sync() and dbuf_write() set ZIO_FLAG_MUSTSUCCEED (0x0). Inside arc_write(), zio_flags is only added to, and since the exemplar doesn't have CANFAIL, it must have come from one of the latter two. It is almost certainly dbuf_write() (dmu_objset_sync() is for updating the objset head structure), but at least there aren't any other paths through here.
Inside arc_write(), only two flag sets are added (with corresponding write policy):
if (ARC_BUF_ENCRYPTED(buf)) {
ASSERT(ARC_BUF_COMPRESSED(buf));
localprop.zp_encrypt = B_TRUE;
localprop.zp_compress = HDR_GET_COMPRESS(hdr);
localprop.zp_complevel = hdr->b_complevel;
...
zio_flags |= ZIO_FLAG_RAW;
} else if (ARC_BUF_COMPRESSED(buf)) {
ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
localprop.zp_compress = HDR_GET_COMPRESS(hdr);
localprop.zp_complevel = hdr->b_complevel;
zio_flags |= ZIO_FLAG_RAW_COMPRESS;
}
For both RAW_COMPRESS and RAW_ENCRYPT to be set, it must have gone through the top path. ARC_BUF_ENCRYPTED tests ARC_BUF_FLAG_ENCRYPTED, so whatever created this buf must have set both that and ARC_BUF_FLAG_COMPRESSED.
(Technically, ARC_BUF_FLAG_COMPRESSED may not be set here, because the assertion is compiled out; that would indicate a bug. We know it is set though, because we know that lsize and psize can only be different if arc_buf_size() returned the psize.)
Only two places set ENCRYPTED: arc_buf_alloc_impl() and arc_convert_to_raw(). arc_convert_to_raw() is promising, because it sets COMPRESSED | ENCRYPTED unconditionally.
I'll be honest, the meanings of these flags aren't totally clear to me, but it seems they're more an indicator of what this buffer could be used for than of what's actually in there. At least for ENCRYPTED, it looks like the header uses a larger allocation that can hold the crypto params etc.
That's one of many things that make arc_convert_to_raw() interesting. Also interesting are its comment, the fact that its args may be NULL, and that curious object type assertion at the top:
/*
* This function is used by the send / receive code to convert a newly
* allocated arc_buf_t to one that is suitable for a raw encrypted write. It
* is also used to allow the root objset block to be updated without altering
* its embedded MACs. Both block types will always be uncompressed so we do not
* have to worry about compression type or psize.
*/
void
arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,
dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,
const uint8_t *mac)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);
ASSERT(HDR_HAS_L1HDR(hdr));
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);
if (!HDR_PROTECTED(hdr))
hdr = arc_hdr_realloc_crypt(hdr, B_TRUE);
hdr->b_crypt_hdr.b_dsobj = dsobj;
hdr->b_crypt_hdr.b_ot = ot;
hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
if (!arc_hdr_has_uncompressed_buf(hdr))
arc_cksum_free(hdr);
if (salt != NULL)
bcopy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
if (iv != NULL)
bcopy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
if (mac != NULL)
bcopy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
}
Might as well put a big siren on this, because:
- it's "used by the send / receive code"
- it sets COMPRESSED and ENCRYPTED
- crypto params are clearly optional
- by assertion, object type can only be DNODE or OBJSET
Why is that last one interesting? Well, because of another little nugget of info…
Object 0
Something I missed initially is that this is a metadnode (object 0) write. That explains why we never saw it on the stream: metadnode blocks don’t come from the stream, they’re freshly created!
zb_objset: 371010
zb_object: 0
zb_level: 0
zb_blkid: 18075
A cursory glance and the 16K size makes sense: that’s the standard data block size for object 0.
L0 block 18075 contains objects 578400-578431 inclusive. Replication guarantees the object IDs will remain the same, so a write on that block would indicate a modification in one of those objects. Alas, this is not immediately interesting, because we would need to look inside a stream from the same dataset at around the time it failed, which we don’t currently have (but could probably get!)
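That range falls straight out of the block geometry. A quick sketch of the arithmetic, assuming the standard 512-byte dnode:
#include <stdio.h>
#include <stdint.h>
int
main(void)
{
	uint64_t blksz = 16384, dnode_size = 512;	/* standard sizes */
	uint64_t per_block = blksz / dnode_size;	/* 32 dnodes per L0 block */
	uint64_t blkid = 18075;				/* from the trace */
	printf("objects %llu-%llu\n",
	    (unsigned long long)(blkid * per_block),
	    (unsigned long long)((blkid + 1) * per_block - 1));
	/* prints: objects 578400-578431 */
	return (0);
}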
Still, we can guess that for psize == 0, the most likely source is deleting the last object in the block. That would render the entire block empty, and so collapse it to a hole.
A hole here would of course mean lsize = 16K, psize = 0. And a hole does not exist on disk; it's just a particular configuration of the block pointer. Holes should never create real IO.
No holes
Two questions at this point:
- what's the path to arc_convert_to_raw(), and what should it be doing with holes?
- how is converting a metadnode block to a hole handled in the non-receive case?
The interesting call to arc_convert_to_raw() is in dbuf_prepare_encrypted_dnode_leaf():
/*
* When syncing out a blocks of dnodes, adjust the block to deal with
* encryption. Normally, we make sure the block is decrypted before writing
* it. If we have crypt params, then we are writing a raw (encrypted) block,
* from a raw receive. In this case, set the ARC buf's crypt params so
* that the BP will be filled with the correct byteorder, salt, iv, and mac.
*/
static void
dbuf_prepare_encrypted_dnode_leaf(dbuf_dirty_record_t *dr)
{
int err;
dmu_buf_impl_t *db = dr->dr_dbuf;
ASSERT(MUTEX_HELD(&db->db_mtx));
ASSERT3U(db->db.db_object, ==, DMU_META_DNODE_OBJECT);
ASSERT3U(db->db_level, ==, 0);
if (!db->db_objset->os_raw_receive && arc_is_encrypted(db->db_buf)) {
...
err = arc_untransform(db->db_buf, db->db_objset->os_spa,
&zb, B_TRUE);
...
} else if (dr->dt.dl.dr_has_raw_params) {
(void) arc_release(dr->dt.dl.dr_data, db);
arc_convert_to_raw(dr->dt.dl.dr_data,
dmu_objset_id(db->db_objset),
dr->dt.dl.dr_byteorder, DMU_OT_DNODE,
dr->dt.dl.dr_salt, dr->dt.dl.dr_iv, dr->dt.dl.dr_mac);
}
}
I note this only because of the interesting conditional setup. The effective conditional for the second block is
dr->dt.dl.dr_has_raw_params &&
(db->db_objset->os_raw_receive || !arc_is_encrypted(db->db_buf))
Which seems like it's saying: if we have raw params and the buffer is not encrypted, apply the params, in the process upgrading the block to encrypted?
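To double-check that rearrangement, a throwaway exhaustive test that the simplified condition matches the original if/else-if structure over all eight input combinations (variable names are mine):
#include <stdio.h>
int
main(void)
{
	for (int raw_recv = 0; raw_recv <= 1; raw_recv++)
	for (int encrypted = 0; encrypted <= 1; encrypted++)
	for (int has_params = 0; has_params <= 1; has_params++) {
		/* first branch: !os_raw_receive && arc_is_encrypted() */
		int first = !raw_recv && encrypted;
		/* second branch as written: else-if on dr_has_raw_params */
		int orig = !first && has_params;
		/* the rearranged form above */
		int simplified = has_params && (raw_recv || !encrypted);
		if (orig != simplified)
			printf("mismatch: %d %d %d\n",
			    raw_recv, encrypted, has_params);
	}
	printf("all 8 cases agree\n");
	return (0);
}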
Feels like we're closing in anyway! What happens for metadnode hole blocks normally?
Nothing special! They get zeroed in dnode_sync_free():
bzero(dn->dn_phys, sizeof (dnode_phys_t) * dn->dn_num_slots);
dnode_free_interior_slots(dn);
And I think that’s all there is to it!
So all this means is that this is a regular zero-sized block that gets converted to a hole, with the sole difference that it gets encryption magic applied to it unconditionally. So somewhere, something must be detecting psize going to zero and unsetting some flags.
At this point I stopped writing and looked for recent changes in this part of the code, and found that GH#14358 had seen and solved this bug. We helped the customer with a patch, and it hasn't been seen since.
It was a little frustrating that I'd spent all this time chasing something that had already been fixed. On the other hand, the symptom was miles away from the cause, there are a lot of places before getting this far where we could perhaps have caught the problem earlier, and I've come to think that the "fix" here is really more of a workaround. Lots learned, but more importantly, the customer has what they need.