The last MySQL upgrade
Some of the best work of my professional career is a recent database upgrade at Fastmail. This was not because it was especially interesting or exciting, but because success required careful planning and flawless execution, and I managed to pull it off.
Background
Since 1999, Fastmail has used MySQL as its internal database. In those days it was famous for its lack of robustness, and for many years it was difficult to scale out. But it was a safe and obvious choice at the time, and so it got the nod.
As time went on, significant amounts of middleware code were written to work around various bugs and shortcomings. Along the way, Fastmail wrote a custom replication engine to support multi-master replication and work around numerous bugs in native MySQL replication.
Over time, it has become the central store for almost everything except the user mail content itself, so nearly every product feature and system management component relies on it in some way.
I joined Fastmail in 2012, by which time database work at the company had largely stopped. MySQL was stuck at version 5.1, and a test upgrade to 5.6 had not gone well, so the database infrastructure had been largely set aside, only receiving enough work to keep it running through hardware failures and OS upgrades. This state of affairs continued until 2021, mostly because the damn thing kept working well enough and so wasn’t in anyone’s way.
By 2021, it had really started to show its age. In the preceding decade, Fastmail had grown from six to almost 50 staff, taken on countless new customers, and made significant investments in application development, standards work, support, marketing and other areas that are closer to the core business. More was being asked of the database system, both in terms of load (more customers means more database work) and query complexity (the establishment of a data analysis group really brought things to the fore). Our ancient MySQL, with its poor support for modern hardware (especially NUMA), poor subquery performance and other quirks, was now holding us back.
In late 2021 I was tasked with upgrading to MySQL 8.x.
Managing the unknown
The database infrastructure is more than just MySQL; it includes its configuration, support services like replication and backups, service management hooks, the database access middleware, and various operational tools. All of them had accumulated cruft and entanglement over the years, so the initial task was simply to evaluate everything that existed and figure out the minimum amount of change necessary to complete the upgrade, pushing the rest into the future.
The key observations I made from this analysis:
- the system as a whole (applications and service management) had to remain up throughout; a user-facing outage was not acceptable.
- it was not going to be possible to be 100% confident in its operation before introducing it into production; given the age of some of the code and data, and the lack of test coverage, testing alone wasn’t going to reveal everything beforehand.
- some elements, like replication and backups, simply could not be tested in a test environment without significant development work.
Due to the uncertainty and the need to develop and test “in production”, the method I settled on for the entire project was to introduce small changes to the existing production environment frequently and see what fell out, and to enlist as many subject-matter experts as possible to help close the gaps in test coverage.
The initial work was just to get it running in development VMs. That was mostly a matter of installing an upgraded MySQL, adjusting the config until it stopped complaining, then starting the system up, running the test suite, and experimenting with different product features, making the necessary tweaks until it mostly seemed to work. Naturally, 90% of that work was up front, just getting MySQL to start and getting programs to authenticate properly. The important deliverable here was a script to upgrade a development VM to use the new version (sketched roughly below), so the engineering team could start experimenting with it and submitting code changes and fixes that could support both old and new databases. Most problems were straightforward, though there were a couple, like this Unicode round-tripping quirk, that required some deep digging.
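For a sense of what that script did, here is a minimal sketch in Python. The package names, paths and test command are hypothetical; the real tool is internal. It assumes MySQL 8.0.16 or later, where the server upgrades its own system tables on first start, and a config that forces mysql_native_password so older client code keeps authenticating.

    #!/usr/bin/env python3
    """Hypothetical sketch of the dev-VM upgrade helper: swap in MySQL 8,
    drop in a new config, restart, and run the test suite."""

    import subprocess
    import sys

    STEPS = [
        # Stop the old server before touching packages or the datadir.
        ["systemctl", "stop", "mysql"],
        # Install the new server; from 8.0.16 the server upgrades its own
        # system tables on startup, so no separate mysql_upgrade run.
        ["apt-get", "install", "-y", "mysql-server-8.0"],
        # Hypothetical config file: among other things it would set
        # default_authentication_plugin=mysql_native_password so older
        # client libraries can still authenticate.
        ["cp", "configs/my-8.0.cnf", "/etc/mysql/my.cnf"],
        ["systemctl", "start", "mysql"],
        # Run the application test suite against the upgraded database
        # (hypothetical test command).
        ["prove", "-r", "t/"],
    ]

    def main() -> int:
        for step in STEPS:
            print("running:", " ".join(step))
            if subprocess.run(step).returncode != 0:
                print("step failed, aborting", file=sys.stderr)
                return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())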
With the obvious issues out of the way, I set about getting a database into production. I did not dare upgrade a live database yet, so instead I added a MySQL 8 instance on a fresh server, removed from service discovery but receiving replicated changes. This gave me a deploy target for the replication and backup systems. Backups are mostly just scripts; they were rewritten and, in the process, improved significantly (the sketch below gives the general shape of such a job).
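As a rough illustration only: a nightly backup job along these lines. Percona XtraBackup and the backup location are assumptions made for the example, not necessarily what Fastmail’s scripts actually use.

    #!/usr/bin/env python3
    """Generic sketch of a nightly physical-backup job; tool choice
    (Percona XtraBackup) and paths are illustrative assumptions."""

    import datetime
    import pathlib
    import subprocess

    BACKUP_ROOT = pathlib.Path("/var/backups/mysql")  # hypothetical location

    def run(cmd):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def nightly_backup():
        target = BACKUP_ROOT / datetime.date.today().isoformat()
        target.mkdir(parents=True, exist_ok=True)
        # Take a consistent physical copy of the running server...
        run(["xtrabackup", "--backup", f"--target-dir={target}"])
        # ...then apply the redo log so the copy is ready to restore.
        run(["xtrabackup", "--prepare", f"--target-dir={target}"])
        return target

    if __name__ == "__main__":
        print("backup written to", nightly_backup())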
Replication was rather more involved. The “default” replication system had changed a lot in newer MySQL versions, but could mostly be configured back to its “classic” form. A new element was the addition of checksums to messages on the replication stream. While they can be turned off, that’s not the default, so I decided to see if I could add checksum support to the replication engine rather than disable a safety feature. I did so, brought it to production, and promptly corrupted the data stream. Investigating further, it turned out there are fundamental and unfixable problems in the way the replication engine manages its IO buffers, problems that had managed to avoid detection for over a decade. This was one of those cases where I was sorely tempted to just rewrite it all, but a significant programming project in the middle of an already risky upgrade was more than I felt comfortable with, so instead I disabled checksums and put that on the “future work” list.
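For a sense of what “checksum support” means at the format level: with binlog_checksum=CRC32 (the MySQL default), every event on the stream carries a trailing four-byte CRC-32 computed over the event header and body. A minimal verification sketch in Python (not the custom engine’s actual code) looks like this:

    import struct
    import zlib

    def strip_and_verify_checksum(event: bytes) -> bytes:
        """Verify the trailing CRC-32 of one binlog event and return the
        event with the checksum removed, ready for the existing parser."""
        body, expected = event[:-4], struct.unpack("<I", event[-4:])[0]
        actual = zlib.crc32(body) & 0xFFFFFFFF
        if actual != expected:
            raise ValueError(
                f"binlog checksum mismatch: {actual:#010x} != {expected:#010x}"
            )
        return body

Disabling the feature instead is a one-line change on the source (binlog_checksum=NONE), which is what ended up in production once the IO buffer problems surfaced.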
And then, I let this off-to-the-side database run for a few weeks, as a fully-production database replica, just not available for service. I supposed that this would give every code path a chance to run at least once, so over that time the new database would “see” every possible query (most notably those from billing and account management tasks, which run monthly and are complex).
To my great satisfaction, nothing interesting happened: it just ticked along happily. Since the test suite was by now passing for both databases, and with the whole project being heavily socialised and considered within the engineering team, it was easy to feel like we’d done as much as we could to be confident about moving the upgrade to production.
And so, in March 2022, I finally deployed the upgrade to all production database nodes, one per day.
Conclusion
The entire process was entirely boring and went off without a hitch, which is an unsatisfying conclusion to a story but a very satisfying conclusion to a high-risk change.
Since then, we’ve really started considering putting MySQL in the logic path for some new projects. For example, we have an experiment happening that involves triggers, and another that wants a database plugin. These are not notable by themselves, but wouldn’t have even been considered previously, as MySQL was very much just a data store and it was best not to try anything clever with it lest it misbehave.
For me, the entire project is something I’m proud of because I managed to take some difficult constraints and “unknown unknowns” and find a path to successful delivery by moving slowly, taking small and deliberate steps, and getting others involved in the process early.