This is the text of an email sent to one of Fastmail's internal engineering lists, and has been sanitised slightly to remove private information and internal links.
Fastmail has an internal component that moves users and groups of users between storage shards ("stores"). After the data is copied, a sanity checker called "CheckReplication" runs to ensure the copy matches the original before the copy is made the original and the old copy is destroyed. Since 2020 the check occasionally fails for no obvious reason. One Monday I decided I'd had enough and set out to understand it and fix it, which led to some surprising discoveries.
CheckReplication: why “quota mismatch” happens
After I wrote about failing user moves, I resolved to myself that I would some time soon spent a good deal of time going deep on CheckReplication in particular to understand what its actually saying, and why.
I started on quota mismatches. I’ve spent literally every day this week reading code, testing and probing, asking questions to myself and answering them, and it turns out, there’s a bug in Cyrus.
My entire analysis is here: [internal link]. It’s kind of a stream of thought, so I will write it a bit more of a explanation way here. It might not be strictly necessary to understand it all; I’ve created tasks and MRs for all the actionable things. Ask me if anything isn’t clear.
Background
There’s a library in hm called ME::CheckReplication. Its purpose is to connect to two Cyrus instances, consider a user on both, and point out any differences. We user this in conjunction with user moves; after we’ve finished copying them to another store, we compare to make sure we got it right. If its all good, we point the system at the new store and delete them from the old store, otherwise we delete all the stuff we just copied into the new store, and throw an error.
A relatively common error is something like:
Quota mismatch: 557940/29297664 vs 446362/29297664
This is used/max on the original vs used/max on the target. Since they don’t match, it stops. Its this difference that I’ve been working to understand.
Quota
CheckReplication calls the IMAP GETQUOTAROOT command to get the quota details.
Cyrus appears to track quota usage in two ways. There’s the traditional
“quotalegacy” database that is pretty much just the raw count of
messages (objects?) and bytes, and a newer conversation-aware count of
same which is rather more complicated (at least, it doesn’t count things
with the IMAP \Deleted
flags, and possibly not even non-email objects
at all).
Before March 2020, Cyrus used the “raw” quota in its response to GETQUOTAROOT (and GETQUOTA). Then, we landed the “convquota” work, which as far as I can tell was part of a bundle of changes to give Cyrus everything it needed to support JMAP (I have no memory of this, which I put down to March 2020 being right on the cusp of some major distraction). Those threads mention GETQUOTAROOT and friends as clients of the new quota system, and indeed, they were changed to use it in e6b251f0 but I’m unsure if they were just incidentally changed as a “change everything” or explicitly done.
In any case, that means that since then the quota counts used by CheckReplication have been the conversation quota. I suggest that probably it would be better to use the raw quota for this particular case, if not for all IMAP (IMAP doesn’t do conversations; would it necessarily expect to receive a conversationally-filtered version of its quota data). Alas, there’s no way to get the raw values via IMAP. I’ve created a task, CYR-269, seeking clarification on all this and, perhaps, an alternate quota key or root to get the raw values.
This is all somewhat unrelated to the real problem though. Regardless of where the quota counts come from, they should be the same after making a copy of a user, and they’re not. So we have to dig deeper to understand that.
Metadata storage
Cyrus stores the user-global metadata like the conversations database, the DAV database, subscriptions, etc under its config dir, in a directory named for the UUID of the user’s INBOX, eg for my test user:
# ls /mnt/i43n03/sloti43n03/store23/conf/user/uuid/b/2/b23610d5-9e7d-407a-b2c9-5c5d5a61b1a6
conversations.db counters.db dav.db seen.db sub.db xapianactive.db
The quota is recorded in the conversations.db as discussed, in the Q
key:
# cyr sloti43n03 cyr_dbtool /mnt/i43n03/sloti43n03/store23/conf/user/uuid/8/e/8e26043b-7b5a-4254-bb64-5f07cdcdc035/conversations.db twoskip show Q
Q 1 %(E 2422 S 244177031)
E
is the number emails, while S
is the byte count.
Part of my testing was manually replicate this user to another slot that had never seen it before, to see what came out:
# cyr sloti43n03 sync_client -v -S sloti36d1t19 -u [email protected]
ENABLE 0x800
USER [email protected]
MAILBOX fastmail.fm!user.robntest
MAILBOX fastmail.fm!user.robntest.#addressbooks
MAILBOX fastmail.fm!user.robntest.#addressbooks.Default
...
Checking on the target, we see:
# cyr sloti36d1t19 cyr_dbtool /mnt/ssd36d1/sloti36d1t19/store254/conf/user/uuid/8/e/8e26043b-7b5a-4254-bb64-5f07cdcdc035/conversations.db twoskip show Q
Q 1 %(E 2413 S 237456390)
That is, nine fewer messages. Now that meant something to me, because I happen to know this account has nine messages in its inbox. After a bit of digging, I found another conversations database:
# cyr sloti36d1t19 cyr_dbtool /mnt/ssd36d1/sloti36d1t19/store254/conf/domain/f/fastmail.fm/user/r/robntest.conversations twoskip show Q
Q 1 %(E 9 S 6720641)
This is the old pre-UUID path for these meta files. So somehow my conversations database is split across two files, which is a problem, because Cyrus internally believes it only has one, so when it reads the quota back, it gets the wrong values back, and CheckReplication complains!
The problem comes down to how Cyrus chooses the location for its user metadata files. If the INBOX exists, and has a UUID, it will use the user/uuid path, otherwise, it will use the domain/user path.
When doesn’t the INBOX exists? When the user hasn’t been created yet.
Creating a user
A “user” in the Cyrus world is kind of a lie; its mostly about the presence of an inbox. Its easy to forget if you’re look at sync_client output a lot, which for a user replication always starts:
USER [email protected]
MAILBOX fastmail.fm!user.robntest
Turns out USER
is a an indicator of the client asking the server for
the current state of that user. It doesn’t actually do anything to
“create” that user (if that was a thing). All the lifting happens in
MAILBOX. There’s a few commands on the replication stream under that,
but the interesting one for us is APPLY MAILBOX:
APPLY MAILBOX %(UNIQUEID 8e26043b-7b5a-4254-bb64-5f07cdcdc035 MBOXNAME fastmail.fm!user.robntest SYNC_CRC 335266584 SYNC_CRC_ANNOT 1655153741 LAST_UID 2965 HIGHESTMODSEQ 17365878007025886115 RECENTUID 2965 RECENTTIME 1655790721 LAST_APPENDDATE 1655614386 POP3_LAST_LOGIN 1538029761 POP3_SHOW_AFTER 0 XCONVMODSEQ 17365878007025886115 RACLMODSEQ 17365878007025877997 UIDVALIDITY 1511918898 PARTITION default ACL "admin lrswipkxtecdan anyone p [email protected] lrswipkxtecdn " OPTIONS P QUOTAROOT fastmail.fm!user.robntest FOLDERMODSEQ 17365878007025876273 RECORD (%(UID 2951 MODSEQ 17365878007025877904 LAST_UPDATED 1620908632 FLAGS (\Seen $X-ME-Annot-2) INTERNALDATE 1617070803 SIZE 55898 GUID 2918bc98e1fdda22a4df9493a1a2d12fc90d6bf5 ANNOTATIONS (%(ENTRY /vendor/messagingengine.com/preview USERID "" MODSEQ 17365878007025877558 VALUE {382+}
...
OK success
Which is, create a mailbox with all this metadata, and then link all the
messages listed in RECORD
into it (these have been previously
replicated and are sitting in the staging dir).
When this happens, the following call sequence occurs, most of which are obvious from their name:
sync_apply
-> sync_apply_mailbox
-> mboxlist_createmailbox
-> mailbox_create
| -> mailbox_commit
| -> mailbox_update_xconvmodseq
| -> mailbox_get_cstate_full
| -> conversations_open_mbox
| -> conversations_get_mboxpath
| -> conversations_path
| -> mboxname_conf_getpath
| -> mboxlist_lookup_allow_all
| -> mboxid_conf_getpath
| [or]
| -> mboxname_conf_getpath_legacy
-> mboxlist_update_entry
The INBOX mailbox is created (mailbox_create
), and as part of that, a
conversations database is opened to update some metadata
(mailbox_update_xconvmodseq
). To get the database path, the mboxlist
is consulted to get the UUID of the INBOX (mboxlist_lookup_allow_all
).
But the INBOX isn’t in the mboxlist yet (mboxlist_update_entry
hasn’t
been called yet), so it goes in the legacy path.
For the next mailbox though, the first after the INBOX, the INBOX is in the mboxlist, so another new conversations database is opened in the right place, and the original one is lost forever. And so the counts are off!
I’ve put this into CYR-270, as regardless of what’s happening with quotas and CheckReplication, this is clearly a bug.
Why don’t we we notice lost inbox metadata?
As it turns out, ctl_conversationsdb -R [email protected]
is all
you need to “fix” this. It notices the missing inbox stuff and shoves it
all into the correct conversations database (the INBOX exists now!) and
updates the quota.
# cyr sloti36d1t19 ctl_conversationsdb -R [email protected]
# cyr sloti36d1t19 cyr_dbtool /mnt/ssd36d1/sloti36d1t19/store254/conf/user/uuid/8/e/8e26043b-7b5a-4254-bb64-5f07cdcdc035/conversations.db twoskip show Q
Q 1 %(E 2422 S 244177031)
And, after every move, we run fixaccount, which runs the conversations recount as part of its work. So it ends up right in the end.
Why doesn’t every move fail if the inbox is never counted properly?
This one I’m less sure of. Obviously most of the time moves do work, and I doubt that most of our users have empty inboxes. I have seen multiple runs of a full move run into the same store (with appropriate cleanup in between) fail the first time, then succeed, or succeed then fail the next time. I confess, by this time I got very tired - there is a lot of code here that’s unfamiliar to me. My guess is that there’s plenty of other ways and times a conversation database could be created after the INBOX exists if something races slightly, and we do do multiple concurrent replications when moving a user, so maybe’s there’s something. Either way though, the above described bug is real, and fixing it sanely should fix any related ordering problems, so I’m content to leave this for the moment.
Can we work around it?
Yes. Ultimately the conversations database is in good shape by the end of the move because of the fixaccount. All we need to do is recount it just before we run CheckReplication, which I’ve done in hm!11090. If there’s other underlying issues, like messages missing outright, then this won’t hide that; it’ll just make sure that everything we do have is counted properly.
I would remove this once the above is fixed and/or we switch to using raw quotas.
Conclusion
Well, there is none yet, I just had to round this off somewhere. Tomorrow I will proceed to understanding the means by which subscriptions don’t always line up during CheckReplication.
Cheers,
Rob N.