In 2024, a Klara customer was having a problem where their pools would sometimes suspend while receiving a replication stream. No hardware errors were ever seen, clearing the pool error state would get them going again, and resending the stream would usually work correctly. There was no obvious reason.
We gave the customer a trace script that would report the contents of the IO object that triggered the suspension. Armed with its output and the source code, I got to work following the breadcrumbs to understand what went wrong.
Typical trace output:
2024 Oct 6 08:37:48 zio_done, kernel (0):
zio addr: 0xfffff81e6dfb59a0
io_error: 28
io_stage: 0x1000000
io_pipeline: 0x1080000
io_pipeline_trace: 0x10980f5
io_flags: 0x1804000
io_type: 2
io_child_type: 3
io_child_error[0]: 0
io_child_error[1]: 0
io_child_error[2]: 0
io_child_error[3]: 0
io_size: 0
io_lsize: 16384
io_txg: 40374717
io_priority: 3
io_offset: 0x0
io_state[0]: 1
io_state[1]: 0
io_state[2]: 0
io_state[3]: 189
io_orig_size: 0
io_orig_flags: 0x1800000
io_orig_stage: 0x1
io_orig_pipeline: 0x17980f4
zb_objset: 371010
zb_object: 0
zb_level: 0
zb_blkid: 18075
abd_size: 16384
abd_flags: 0x201
io_executor: kernel
bp prop: 0xa00a0702001f001f
Annotated:
2024 Oct 6 08:37:48 zio_done, kernel (0):
zio addr: 0xfffff81e6dfb59a0
io_error: 28 ENOSPC
io_stage: 0x1000000 DONE
io_pipeline: 0x1080000 READY DONE
io_pipeline_trace: 0x10980f5 OPEN WRITE_BP_INIT ISSUE_ASYNC WRITE_COMPRESS
ENCRYPT CHECKSUM_GENERATE DVA_THROTTLE
DVA_ALLOCATE READY DONE
io_flags: 0x1804000 IO_ALLOCATING RAW_COMPRESS RAW_ENCRYPT
io_type: 2 WRITE
io_child_type: 3 LOGICAL
io_child_error[0]: 0
io_child_error[1]: 0
io_child_error[2]: 0
io_child_error[3]: 0
io_size: 0 0
io_lsize: 16384 16384
io_txg: 40374717
io_priority: 3 ASYNC_WRITE
io_offset: 0x0
io_state[0]: 1
io_state[1]: 0
io_state[2]: 0
io_state[3]: 189
io_orig_size: 0 0
io_orig_flags: 0x1800000 RAW_COMPRESS RAW_ENCRYPT
io_orig_stage: 0x1 OPEN
io_orig_pipeline: 0x17980f4 WRITE_BP_INIT ISSUE_ASYNC WRITE_COMPRESS
ENCRYPT CHECKSUM_GENERATE DVA_THROTTLE
DVA_ALLOCATE READY VDEV_IO_START VDEV_IO_DONE
VDEV_IO_ASSESS DONE
zb_objset: 371010
zb_object: 0
zb_level: 0
zb_blkid: 18075
abd_size: 16384 16K
abd_flags: 0x201 LINEAR ALLOCD
io_executor: kernel
bp prop: 0xa00a0702001f001f
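Those stage annotations can be reproduced mechanically. Here is a rough standalone decoder for the io_stage/io_pipeline bitfields; the stage bit positions are copied by hand from my reading of OpenZFS 2.1's zio_impl.h, so treat the table as an assumption to verify against your own source tree:
#include <stdio.h>
#include <stdint.h>
/* ZIO stage names, in bit order (assumed from zio_impl.h, OpenZFS 2.1) */
static const char *stages[] = {
	"OPEN", "READ_BP_INIT", "WRITE_BP_INIT", "FREE_BP_INIT",
	"ISSUE_ASYNC", "WRITE_COMPRESS", "ENCRYPT", "CHECKSUM_GENERATE",
	"NOP_WRITE", "DDT_READ_START", "DDT_READ_DONE", "DDT_WRITE",
	"DDT_FREE", "GANG_ASSEMBLE", "GANG_ISSUE", "DVA_THROTTLE",
	"DVA_ALLOCATE", "DVA_FREE", "DVA_CLAIM", "READY",
	"VDEV_IO_START", "VDEV_IO_DONE", "VDEV_IO_ASSESS",
	"CHECKSUM_VERIFY", "DONE",
};
static void
decode(const char *name, uint32_t v)
{
	printf("%s: 0x%x =", name, (unsigned)v);
	for (int b = 0; b < 25; b++)
		if (v & (1u << b))
			printf(" %s", stages[b]);
	printf("\n");
}
int
main(void)
{
	decode("io_stage", 0x1000000);
	decode("io_pipeline", 0x1080000);
	decode("io_pipeline_trace", 0x10980f5);
	decode("io_orig_pipeline", 0x17980f4);
	return (0);
}
Run against the values in the dump, this reproduces the stage lists above.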
Interesting facts:
- This is coming from an incoming stream
- size 0
- raw compress & raw encrypt
- FreeBSD 13 (OpenZFS 2.1)
- Pool is data1, 7x 12-wide raidz3, slog 2-wide mirror, single cache
Breadcrumbs
Pool suspension
In zio_done(), the above state sets ZIO_REEXECUTE_SUSPEND:
if (zio->io_error && zio == zio->io_logical) {
if (IO_IS_ALLOCATING(zio) &&
!(zio->io_flags & ZIO_FLAG_CANFAIL)) {
if (zio->io_error != ENOSPC)
zio->io_reexecute |= ZIO_REEXECUTE_NOW;
else
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
}
Crucially, the error is ENOSPC. What sets this?
We never got to VDEV_IO_START, even though it is in the original pipeline, so the last pre-DONE stage must have changed the pipeline to INTERLOCK. zio_ready():
if (zio->io_error != 0) {
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
So the error is set before that. Presumably by the immediate next stage, DVA_ALLOCATE. zio_dva_allocate() sets io_error to any of its accumulated errors. io_size is zero, so it can't have come from ganging (minimum 512b), so it must have come from one of two metaslab_alloc() calls. These differ only in their choice of class, which for now I think is irrelevant.
error = metaslab_alloc(spa, mc, zio->io_size, bp,
zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
&zio->io_alloc_list, zio, zio->io_allocator);
Allocations are done in a loop, up to the number of wanted DVAs. At the first sign of trouble, what we've done so far is unwound and the error is returned.
error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
txg, flags, zal, allocator);
This aborts early with ENOSPC for 3% of calls if metaslab_force_ganging is lower than psize. psize is 0 here, so we move on.
A metaslab group is selected. We check to see if it's allocatable; this hardly considers psize at all. For our purposes, I'm assuming that whatever it chose can be allocated from, that is, it exists, is not full, etc.
We get an allocation size appropriate for the vdev:
uint64_t asize = vdev_psize_to_asize(vd, psize);
ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
That bounces through to the vdev's vdev_op_asize(). On this pool, that will be vdev_raidz_asize().
static uint64_t
vdev_raidz_asize(vdev_t *vd, uint64_t psize)
{
vdev_raidz_t *vdrz = vd->vdev_tsd;
uint64_t asize;
uint64_t ashift = vd->vdev_top->vdev_ashift;
uint64_t cols = vdrz->vd_logical_width;
uint64_t nparity = vdrz->vd_nparity;
asize = ((psize - 1) >> ashift) + 1;
asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
asize = roundup(asize, nparity + 1) << ashift;
return (asize);
}
And that's kind of DOA. With psize == 0, it immediately underflows, and there's no way back. If I got the math right, on these vdevs that returns an asize of about 5461 PiB. (Yes: over 5 EXABYTES.)
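The arithmetic is easy to check outside the kernel. Here is a minimal user-space sketch of the same computation (my reconstruction, not ZFS code; it assumes ashift=12, though the final shift cancels the first, so it barely matters):
#include <stdio.h>
#include <stdint.h>
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))
/* the vdev_raidz_asize() math for a 12-wide raidz3, as on this pool */
static uint64_t
raidz_asize(uint64_t psize)
{
	uint64_t ashift = 12, cols = 12, nparity = 3;
	/* with psize == 0, (psize - 1) wraps to UINT64_MAX */
	uint64_t asize = ((psize - 1) >> ashift) + 1;
	asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
	return (roundup(asize, nparity + 1) << ashift);
}
int
main(void)
{
	/* sane input: a 16K block costs 8 sectors (32K) after parity and padding */
	printf("asize(16K) = %llu\n", (unsigned long long)raidz_asize(16384));
	/* bad input: 6148914179236528128 bytes, ~5461.33 PiB */
	printf("asize(0)   = %llu\n", (unsigned long long)raidz_asize(0));
	return (0);
}
The unsigned wrap in (psize - 1) is the whole story: a zero-byte request becomes a request for roughly a third of the 64-bit address space.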
There's no point reading on. As far as metaslab_alloc() is concerned, ENOSPC is the right answer here.
Perhaps if we'd caught the zero earlier, we could have returned something else (EINVAL)? That would have got us some dbgmsg output from zio_dva_allocate(), but we would still suspend in zio_done(), because REEXECUTE_SUSPEND is the default for any unknown write error.
TODO
- Assert psize > 0 and/or return EINVAL in:
  - vdev_raidz_asize()
  - metaslab_alloc()
  - zio_dva_allocate() (curious: this already asserts io_size == BP_GET_PSIZE(bp))
Zero psize
io_size is 0. This is not illegal in all cases, but it is certainly unusual. There aren't many places that set io_size directly, and even fewer that set io_orig_size directly.
The only places outside of the IO pipeline that do this are in the L2ARC read paths (l2arc_read_done(), to move a finished L2 read buffer into L1, and l2arc_untransform(), which is called by l2arc_read_done() to handle decrypt and decompress on an ARC buffer). This pool does have a cache device, so we can't totally ignore this, but because it's all read paths it seems highly unlikely to be involved.
zio_create() is the most obvious place. It takes separate lsize and psize parameters, and is the only place in the system that creates ZIOs for execution.
It has this interesting and relevant assertion:
IMPLY(lsize != psize, (flags & ZIO_FLAG_RAW_COMPRESS) != 0);
Granted, this covers our case, but it also shows that it's a rare case.
Now, zio_create() cannot be called externally, so we need to look at what can call it and pass different lsize & psize through. Unsurprisingly, that's only zio_write(). It does no extra relevant sanitising, which is good for our looking around.
Where are the relevant calls to zio_write()? We're looking for any that can plausibly have different lsize and psize.
dbuf_sync_lightweight() is the only obvious one: lsize comes from dn->dn_datablksz, psize from abd_get_size(dr->dt.dll.dr_abd). But we need a psize of 0, which I don't think is possible, because dr_abd (from the dirty record, ie, what we're trying to write) is 16K (at least, I am assuming it is, from abd_size in the trace, though that comes from our own dtrace work).
noinline static void
dbuf_sync_lightweight(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
{
dnode_t *dn = dr->dr_dnode;
...
dr->dr_zio = zio_write(pio, dmu_objset_spa(dn->dn_objset),
dmu_tx_get_txg(tx), &dr->dr_bp_copy, dr->dt.dll.dr_abd,
dn->dn_datablksz, abd_get_size(dr->dt.dll.dr_abd),
&dr->dt.dll.dr_props, dbuf_lightweight_ready, NULL,
dbuf_lightweight_physdone, dbuf_lightweight_done, dr,
ZIO_PRIORITY_ASYNC_WRITE,
ZIO_FLAG_MUSTSUCCEED | dr->dt.dll.dr_flags, &zb);
I am however a little uncertain about the size args here. The arg order to zio_write() is data, lsize, psize. dn_datablksz is usually set through dnode_setdblksz(), which asserts that it must be at least SPA_MINBLOCKSIZE. That's fine, but there is a path through dnode_create() where it can be zero, and I'm not sure if it's possible for that to remain that way.
This is interesting to me from two angles:
- I don't understand why lsize takes the final on-disk size, while psize is given the size of the data proper. Is that not the wrong way around? Like, dr_abd has the pre-compression payload, ie, it's the logical size?
- In any case, from our trace, lsize clearly starts larger. Maybe lsize will be used differently once compression transforms are added?
I'll look into those more later.
The only other place where lsize and psize may not match is in arc_write():
zio = zio_write(pio, spa, txg, bp,
abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
(children_ready != NULL) ? arc_write_children_ready : NULL,
arc_write_physdone, arc_write_done, callback,
priority, zio_flags, zb);
arc_write() doesn't take separate sizes, but rather a single arc_buf_t *buf, and works out what to do from there. So the lsize here is HDR_GET_LSIZE, fine. The psize, however, is either HDR_GET_PSIZE or HDR_GET_LSIZE, using PSIZE if the buf is compressed.
So I suppose the question for that one is: can the data compress down to zero?
I’ll consider both parts separately. Really, we need to figure out which ones might get used from a stream receive, which hopefully narrows the path somewhat.
dbuf_sync_lightweight()?
What even is this codepath though? Is this a noun ("lightweight dbuf") or a verb ("lightweight sync")?
Right, so when syncing a dirty record, dbuf_sync_lightweight() is used when the dirty record has no attached dr_dbuf. The commentary for dmu_lightweight_write_by_dnode() is enlightening:
/*
* A "lightweight" write is faster than a regular write (e.g.
* dmu_write_by_dnode() or dmu_assign_arcbuf_by_dnode()), because it avoids the
* CPU cost of creating a dmu_buf_impl_t and arc_buf_[hdr_]_t. However, the
* data can not be read or overwritten until the transaction's txg has been
* synced. This makes it appropriate for workloads that are known to be
* (temporarily) write-only, like "zfs receive".
*
* A single block is written, starting at the specified offset in bytes. If
* the call is successful, it returns 0 and the provided abd has been
* consumed (the caller should not free it).
*/
int
dmu_lightweight_write_by_dnode(dnode_t *dn, uint64_t offset, abd_t *abd,
const zio_prop_t *zp, enum zio_flag flags, dmu_tx_t *tx)
{
dbuf_dirty_record_t *dr =
dbuf_dirty_lightweight(dn, dbuf_whichblock(dn, 0, offset), tx);
if (dr == NULL)
return (SET_ERROR(EIO));
dr->dt.dll.dr_abd = abd;
dr->dt.dll.dr_props = *zp;
dr->dt.dll.dr_flags = flags;
return (0);
}
Alright, this looks pretty plausibly like our entry point! And that would mean the passed-in abd would need to have abd_get_size(…) == 0.
Incoming stream
The only caller of dmu_lightweight_write_by_dnode() is flush_write_batch_impl() in dmu_recv.c. Alright, getting closer!
The relevant parts appear to be:
static int
flush_write_batch_impl(struct receive_writer_arg *rwa)
{
dnode_t *dn;
int err;
if (dnode_hold(rwa->os, rwa->last_object, FTAG, &dn) != 0)
return (SET_ERROR(EINVAL));
...
struct receive_record_arg *rrd;
while ((rrd = list_head(&rwa->write_batch)) != NULL) {
struct drr_write *drrw = &rrd->header.drr_u.drr_write;
abd_t *abd = rrd->abd;
ASSERT3U(drrw->drr_object, ==, rwa->last_object);
if (drrw->drr_logical_size != dn->dn_datablksz) {
/*
* The WRITE record is larger than the object's block
* size. We must be receiving an incremental
* large-block stream into a dataset that previously did
* a non-large-block receive. Lightweight writes must
* be exactly one block, so we need to decompress the
* data (if compressed) and do a normal dmu_write().
*/
...
} else {
...
/*
* Since this data can't be read until the receive
* completes, we can do a "lightweight" write for
* improved performance.
*/
err = dmu_lightweight_write_by_dnode(dn,
drrw->drr_offset, abd, &zp, zio_flags, tx);
}
(Incidentally, this explains why it's ok to use the ABD size as the psize in dbuf_sync_lightweight() -> zio_write(): this is the raw block direct off the stream, and is already compressed etc. It is the physical data.)
To get to dmu_lightweight_write_by_dnode():
- The dnode must already exist (or dnode_hold() would fail).
- drr_logical_size == dn_datablksz, otherwise it would enter the "split the block" path.
However, drr_logical_size is never passed to dmu_lightweight_write_by_dnode(); instead, dbuf_sync_lightweight() uses dn_datablksz as the logical size. They must be equal, though, or we wouldn't end up in this path.
But! Recall earlier I noticed a path through dnode_create() where dn_datablksz can be zero? What if this is a freshly-created dnode that took that path, so dn_datablksz is zero, and drr_logical_size is zero here too? It would get to dmu_lightweight_write_by_dnode() intact, and a dirty record would be entered. Sometime later, dn_datablksz is expanded to its proper 16K. Then the txg sync goes, and we issue zio_write() with lsize == 16K (from dn_datablksz) and psize == 0 (from the ABD), and all hell breaks loose.
I'll come back to that. First I want to find out where drr_logical_size and the ABD come from. Pretty much straight off the stream, but let's confirm.
receive_read_record() is where records are unpacked from the incoming stream. For a write:
case DRR_WRITE:
{
struct drr_write *drrw = &drc->drc_rrd->header.drr_u.drr_write;
int size = DRR_WRITE_PAYLOAD_SIZE(drrw);
abd_t *abd = abd_alloc_linear(size, B_FALSE);
DRR_WRITE_PAYLOAD_SIZE is:
#define DRR_WRITE_PAYLOAD_SIZE(drrw) \
(DRR_WRITE_COMPRESSED(drrw) ? (drrw)->drr_compressed_size : \
(drrw)->drr_logical_size)
So that tracks. If the payload size is zero, then the ABD will also be zero.
Hmm, but what does abd_alloc_linear() do when zero is requested? It does no minimum size checks, and for non-metadata allocations (second arg false), ends up in zio_data_buf_alloc(0).
void *
zio_data_buf_alloc(size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
}
On the surface, that looks like an ordinary overflow. Is it? (size_t)-1 is often a uint64_t, so let's say it wraps to UINT64_MAX. Shifting down 9 gets us to ~32PiB. Obviously, that is higher than SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT at a mere 32K, which means we should have tripped the VERIFY.
Which means either we haven't tried to allocate zero here, or the compiler has done something other than plain unsigned wraparound for 0-1, since it knows size_t is unsigned.
(Short pause while a FreeBSD 13 VM is spun up and a disassembler is pointed at zfs.ko.)
No, not that. Ultimately that -1 is just a dec, which won't be weird. So I feel fairly confident that had we called zio_data_buf_alloc() with zero size, it would have panicked.
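For what it's worth, the index arithmetic checks out the same way in user space. A standalone sketch, assuming a 64-bit size_t and the usual SPA constants (SPA_MINBLOCKSHIFT=9, SPA_MAXBLOCKSIZE=16M):
#include <stdio.h>
#include <stddef.h>
#define SPA_MINBLOCKSHIFT	9
#define SPA_MAXBLOCKSIZE	(16ULL << 20)
int
main(void)
{
	size_t size = 0;
	/* (0 - 1) wraps, so c becomes ~2^55 instead of a small cache index */
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
	size_t bound = SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT;	/* 32768 */
	printf("c = %zu, bound = %zu: VERIFY(c < bound) %s\n",
	    c, bound, c < bound ? "passes" : "fails");
	return (0);
}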
So, we can't have called abd_alloc_linear() with zero size. Which makes things interesting indeed.
TODO
- dmu_lightweight_write_by_dnode(): return EINVAL if abd_get_size() == 0?
- Explicit zero-size assertions:
  - zio_data_buf_alloc()
  - abd_alloc_linear()
Dnode origin stories
Ok, so let's say the ABD is real, and the initial psize is not zero. What if we were passing down a zero lsize instead? Yes, they're apparently the wrong way around, but I know that IO transforms can mess with them in flight, and our trace is from once the IO is actually done, not what got passed in.
We know drr_logical_size == dn_datablksz when the record arrives. We know dn_datablksz is used as the lsize in the call to zio_write(). And we know there's a path where it looks like it can be zero after create, and we know flush_write_batch_impl() expects the dnode to already be created. Is it possible that it exists, but whatever inflates its size hasn't happened yet? ie, a race in syncing context or something, nothing directly to do with the incoming stream?
static dnode_t *
dnode_create(objset_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db,
uint64_t object, dnode_handle_t *dnh)
{
dnode_t *dn;
dn = kmem_cache_alloc(dnode_cache, KM_SLEEP);
...
if (dnp->dn_datablkszsec) {
dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
} else {
dn->dn_datablksz = 0;
dn->dn_datablkszsec = 0;
dn->dn_datablkshift = 0;
}
So, if dnp->dn_datablkszsec is not set, then the dnode sizes get zeroed. Where does dnp come from?
[and then I went to dinner]
Vacation time 🏝️
Late on Thursday, here's what I know:
- the failure occurs because by the time we get to zio_dva_allocate, io_size is zero, and attempting a zero-byte allocation on a raidz vdev causes an underflow that is interpreted as a request for a ~5.3EiB allocation, which fails with ENOSPC.
- The trace reports io_size = 0 and io_lsize = 16K. It's possible for these to change during the pipeline (transforms). However, io_orig_size = 0, which can only mean that psize = 0 was passed to zio_create(), and so zio_write() (no other zio creation methods can supply separate psize and lsize).
- Further, io_lsize is never modified again after being set (except in a rare ZIL-related codepath, zio_shrink()). So we can be certain that these are the original values passed in to zio_create().
- By assertion in zio_create(), lsize and psize must be equal unless ZIO_FLAG_RAW_COMPRESS is set, which means either ZIO_FLAG_RAW_COMPRESS or ZIO_FLAG_RAW in the incoming flags.
- Only two call sites can possibly call zio_write() with different lsize and psize: dbuf_sync_lightweight() and arc_write(). "lightweight sync" is expressly designed for a kind of pass-through write and is only used by the stream receiver.
- dbuf_sync_lightweight() writes out from dirty records, setting lsize to dn_datablksz and psize to abd_get_size(abd). The abd is held on the dirty record.
  - The abd and ZIO flags for the dirty record are filled by dmu_lightweight_write_by_dnode(), which is only called from one place, flush_write_batch_impl() in dmu_recv.c.
  - This loops over the "write batch", a list of DRR_WRITE records from the stream. dmu_lightweight_write_by_dnode() can only be reached when drr_logical_size == dn_datablksz. Note that dn_datablksz is known to be 16K by the time we get to dbuf_sync_lightweight(), because it is the lsize passed down to zio_write().
- The abd passed to dmu_lightweight_write_by_dnode() must have abd_size = 0, because nothing else can change it. However, the abd_size in the trace still says 16K. That means that io_abd must have been changed along the way?
  - Nothing can modify dr->dt.dll.dr_abd after it's set in dmu_lightweight_write_by_dnode().
  - But this might be ok, as the zio pipeline can swap abds in and out via transforms.
- The abd passed down comes from rrd->abd, in the receive record arg. These are allocated as the records arrive. They are sized by either drr_logical_size or drr_compressed_size, as appropriate. In our case, these are compressed, so they will be drr_compressed_size, but that's by the by, as abd_alloc_linear(0) will overflow in zio_data_buf_alloc(), tripping a panic.
- The only possibility left is something that internally changes the abd to have abd_size = 0.
Unexplored directions
- It's very hard to make an abd with abd_size = 0. abd_get_from_buf() can, and I think abd_get_offset_struct() and abd_get_offset_size() both can, by requesting a zero-sized segment of an existing abd. But I don't see where else it can be asked for directly.
- But, I can't see what could be doing that. There's nothing manipulating the abd before it goes to dmu_lightweight_write_by_dnode().
- Could there be something issuing new IO from inside the pipeline that runs afoul of this? Ganging? Aggregation? I cannot see how.
Hello again
It's been a few weeks now. Tracing on dmu_lightweight_write_by_dnode() seems to indicate that it isn't involved: when the ENOSPC write is triggered, it does not return EINVAL. So let's go back to this.
arc_write()
Let's have a look at the other possible path identified above, through arc_write().
zio = zio_write(pio, spa, txg, bp,
abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
(children_ready != NULL) ? arc_write_children_ready : NULL,
arc_write_physdone, arc_write_done, callback,
priority, zio_flags, zb);
Recall we're looking for psize == 0. The three interesting arguments here are:
- abd_t *data: abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr))
- uint64_t size: HDR_GET_LSIZE(hdr)
- uint64_t psize: arc_buf_size(buf)
So here, arc_buf_size(buf) must have returned 0. It is:
uint64_t
arc_buf_size(arc_buf_t *buf)
{
return (ARC_BUF_COMPRESSED(buf) ?
HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
}
The HDR_GET_XSIZE macros return hdr->b_Xsize (for logical/physical). arc_hdr_realloc_crypt() is the only thing that sets these directly, when copying to a new hdr. While they can be set to zero there, it would require hitting a microscopic window and be a rookie-level locking bug, so I doubt it's implicated. Assuming not for now.
All other setting is via HDR_SET_XSIZE, and only from inside arc.c. Meanwhile, ARC_BUF_COMPRESSED checks buf->b_flags & ARC_BUF_FLAG_COMPRESSED, and ARC_BUF_FLAG_COMPRESSED is only modified in arc.c. This is good; it narrows the scope of our search.
Back to arc_write(). It takes an arc_buf_t as its input, which is then exploded into an abd, lsize and psize as above. The sanity checks are minimal, and it doesn't look like this case will hit them.
We know from the IO dump that ZIO_FLAG_RAW_COMPRESS and ZIO_FLAG_RAW_ENCRYPT are set. arc_write() takes a zio_flags, but there are only three call sites: dmu_sync() sets ZIO_FLAG_CANFAIL (0x8), while dmu_objset_sync() and dbuf_write() set ZIO_FLAG_MUSTSUCCEED (0x0). Inside arc_write(), zio_flags is only added to, and since the exemplar doesn't have CANFAIL, it must have come from one of the latter two. It is almost certainly dbuf_write() (dmu_objset_sync() is for updating the objset head structure), but at least there aren't any other paths through here.
Inside arc_write(), only two flag sets are added (with corresponding write policy):
if (ARC_BUF_ENCRYPTED(buf)) {
ASSERT(ARC_BUF_COMPRESSED(buf));
localprop.zp_encrypt = B_TRUE;
localprop.zp_compress = HDR_GET_COMPRESS(hdr);
localprop.zp_complevel = hdr->b_complevel;
...
zio_flags |= ZIO_FLAG_RAW;
} else if (ARC_BUF_COMPRESSED(buf)) {
ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
localprop.zp_compress = HDR_GET_COMPRESS(hdr);
localprop.zp_complevel = hdr->b_complevel;
zio_flags |= ZIO_FLAG_RAW_COMPRESS;
}
For both RAW_COMPRESS and RAW_ENCRYPT to be set, it must have gone through the top path. ARC_BUF_ENCRYPTED tests ARC_BUF_FLAG_ENCRYPTED, so whatever created this buf must have set both that and ARC_BUF_FLAG_COMPRESSED.
(Technically, ARC_BUF_FLAG_COMPRESSED may not be set here, because the assertion is compiled out; that would indicate a bug. We know it is set though, because we know that lsize and psize can only be different if arc_buf_size() returned the psize.)
Only two places set ENCRYPTED: arc_buf_alloc_impl() and arc_convert_to_raw(). arc_convert_to_raw() is promising, because it sets COMPRESSED | ENCRYPTED unconditionally.
I'll be honest, the meanings of these flags aren't totally clear to me, but it seems they're more an indicator of what this buffer could be used for than of what's actually in there. At least for ENCRYPTED, it looks like the header uses a larger allocation that can hold the crypto params etc.
That's one of many things that make arc_convert_to_raw() interesting. Also interesting are its comment, the fact that its args may be NULL, and that curious object type assertion at the top:
/*
* This function is used by the send / receive code to convert a newly
* allocated arc_buf_t to one that is suitable for a raw encrypted write. It
* is also used to allow the root objset block to be updated without altering
* its embedded MACs. Both block types will always be uncompressed so we do not
* have to worry about compression type or psize.
*/
void
arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,
dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,
const uint8_t *mac)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);
ASSERT(HDR_HAS_L1HDR(hdr));
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);
if (!HDR_PROTECTED(hdr))
hdr = arc_hdr_realloc_crypt(hdr, B_TRUE);
hdr->b_crypt_hdr.b_dsobj = dsobj;
hdr->b_crypt_hdr.b_ot = ot;
hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
if (!arc_hdr_has_uncompressed_buf(hdr))
arc_cksum_free(hdr);
if (salt != NULL)
bcopy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
if (iv != NULL)
bcopy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
if (mac != NULL)
bcopy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
}
Might as well put a big siren on this, because:
- it's "used by the send / receive code"
- it sets COMPRESSED and ENCRYPTED
- crypto params are clearly optional
- by assertion, object type can only be DNODE or OBJSET
Why is that last one interesting? Well, because of another little nugget of info…
Object 0
Something I missed initially is that this is a metadnode (object 0) write. That explains why we never saw it on the stream: metadnode blocks don’t come from the stream, they’re freshly created!
zb_objset: 371010
zb_object: 0
zb_level: 0
zb_blkid: 18075
A cursory glance and the 16K size makes sense: that’s the standard data block size for object 0.
L0 block 18075 contains objects 578400-578431 inclusive. Replication guarantees the object IDs will remain the same, so a write on that block would indicate a modification in one of those objects. Alas, this is not immediately interesting, because we would need to look inside a stream from the same dataset at around the time it failed, which we don’t currently have (but could probably get!)
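That range falls straight out of the block geometry. A quick sketch of the arithmetic, assuming the standard 512-byte dnode:
#include <stdio.h>
#include <stdint.h>
int
main(void)
{
	uint64_t blksz = 16384, dnode_size = 512;	/* standard sizes */
	uint64_t per_block = blksz / dnode_size;	/* 32 dnodes per L0 block */
	uint64_t blkid = 18075;				/* from the trace */
	printf("objects %llu-%llu\n",
	    (unsigned long long)(blkid * per_block),
	    (unsigned long long)((blkid + 1) * per_block - 1));
	/* prints: objects 578400-578431 */
	return (0);
}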
Still, we can guess that for psize == 0, the most likely source is deleting the last object in the block. That would render the entire block empty, and so collapse it to a hole.
A hole here would of course mean lsize = 16K, psize = 0. And a hole does not exist on disk; it's just a particular configuration of the block pointer. Holes should never create real IO.
No holes
Two questions at this point:
- what's the path to arc_convert_to_raw(), and what should it be doing with holes?
- how is converting a metadnode block to a hole handled in the non-receive case?
The interesting call to arc_convert_to_raw() is in dbuf_prepare_encrypted_dnode_leaf():
/*
* When syncing out a blocks of dnodes, adjust the block to deal with
* encryption. Normally, we make sure the block is decrypted before writing
* it. If we have crypt params, then we are writing a raw (encrypted) block,
* from a raw receive. In this case, set the ARC buf's crypt params so
* that the BP will be filled with the correct byteorder, salt, iv, and mac.
*/
static void
dbuf_prepare_encrypted_dnode_leaf(dbuf_dirty_record_t *dr)
{
int err;
dmu_buf_impl_t *db = dr->dr_dbuf;
ASSERT(MUTEX_HELD(&db->db_mtx));
ASSERT3U(db->db.db_object, ==, DMU_META_DNODE_OBJECT);
ASSERT3U(db->db_level, ==, 0);
if (!db->db_objset->os_raw_receive && arc_is_encrypted(db->db_buf)) {
...
err = arc_untransform(db->db_buf, db->db_objset->os_spa,
&zb, B_TRUE);
...
} else if (dr->dt.dl.dr_has_raw_params) {
(void) arc_release(dr->dt.dl.dr_data, db);
arc_convert_to_raw(dr->dt.dl.dr_data,
dmu_objset_id(db->db_objset),
dr->dt.dl.dr_byteorder, DMU_OT_DNODE,
dr->dt.dl.dr_salt, dr->dt.dl.dr_iv, dr->dt.dl.dr_mac);
}
}
I note this only because of the interesting conditional setup. The effective conditional for the second block is
dr->dt.dl.dr_has_raw_params &&
(db->db_objset->os_raw_receive || !arc_is_encrypted(db->db_buf))
Which seems like it's saying: if we have raw params and the buffer is not encrypted, apply the params, in the process upgrading the block to encrypted?
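To double-check that rearrangement, a throwaway exhaustive test that the simplified condition matches the original if/else-if structure over all eight input combinations (variable names are mine):
#include <stdio.h>
int
main(void)
{
	for (int raw_recv = 0; raw_recv <= 1; raw_recv++)
	for (int encrypted = 0; encrypted <= 1; encrypted++)
	for (int has_params = 0; has_params <= 1; has_params++) {
		/* first branch: !os_raw_receive && arc_is_encrypted() */
		int first = !raw_recv && encrypted;
		/* second branch as written: else-if on dr_has_raw_params */
		int orig = !first && has_params;
		/* the rearranged form above */
		int simplified = has_params && (raw_recv || !encrypted);
		if (orig != simplified)
			printf("mismatch: %d %d %d\n",
			    raw_recv, encrypted, has_params);
	}
	printf("all 8 cases agree\n");
	return (0);
}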
Feels like we're closing in anyway! What happens for metadnode hole blocks normally?
Nothing special! They get zeroed in dnode_sync_free():
bzero(dn->dn_phys, sizeof (dnode_phys_t) * dn->dn_num_slots);
dnode_free_interior_slots(dn);
And I think that’s all there is to it!
So all this means is that this is a regular zero-sized block that gets converted to a hole, with the sole difference that it gets encryption magic applied to it unconditionally. So somewhere, something must be detecting psize going to zero and unsetting some flags.
At this point I stopped writing and looked for recent changes in this part of the code, and found that GH#14358 had seen and solved this bug. We helped the customer with a patch, and it hasn't been seen since.
It was a little frustrating that I'd spent all this time chasing something that had already been fixed. On the other hand, the symptom was miles away from the cause, there are a lot of places before getting this far where we could perhaps have caught the problem earlier, and I've come to think that the "fix" here is really more of a workaround. Lots learned, but more importantly, the customer has what they need.