This is the text of an email sent to one of Fastmail's internal engineering lists, and has been sanitised slightly to remove private information and internal links.
Since 1999, Fastmail has used MySQL as its internal database. In those days it was famous for its lack of robustness, and over the years Fastmail has built up a layer of workarounds to correct for this. In 2021 I began a project to upgrade from MySQL 5.1 (2008!) to MySQL 8. MySQL these days is good enough that most of the workarounds were now in the way, and so had to be carefully removed or worked around. This is the story of just one of them.
MySQL character sets and storing “high” characters
Summary: the strange way we store UTF-8 data in the database is extremely sensitive to character encoding settings in the MySQL client, connection, server and schema. If they’re off, then non-ASCII values won’t round-trip correctly. We don’t really have test coverage for our storage mechanism, and I think we should, and also, I will need to work harder to vet the MySQL 8 config before it goes to production.
The rest of this email is about the debugging session I spent Friday and today on. You don’t have to read it, but you might be interested!
I’ve been working on the MySQL 8 deployment plan and have decided (for various reasons I won’t go into here) to upgrade the database servers to bullseye, and then upgrade MySQL to 8.x.
So on Friday, I got 8.x running on a bullseye inabox, and did a test suite run. Everything passed except t/helpdesk.t, which I thought was odd since it doesn’t really doing anything interesting with the database. So, late Friday and all of today (maybe eight hours all up) I’ve been slowly peeling back the layers to figure out what’s up.
I won’t go into the details of every layer, but the upshot is this.
helpdesk.t tests the helpdesk “service” that sits behind the ticket self-service UI (I only vaguely understand its construction). It opens by creating a user with a “high” character in its name:
my $user = $self->new_user({ full_name => "Bob ☃" });
later, it gets a ticket from the API, and makes sure that among other thing, the right author name was seen:
jcmp_deeply(
$check_res->sentence_named('Ticket/get')->arguments->{list}[0],
superhashof({
comments => [
superhashof({
authorName => 'Bob ☃',
}),
],
}),
'displayName passed through properly from auth session'
);
This was failing:
[ FAIL ] job 116 | +~single user + uparam - displayName is correct
[ FAIL ] job 116 | | + displayName passed through properly from auth session
( DIAG ) job 116 | | | Failed test 'displayName passed through properly from auth session'
( DIAG ) job 116 | | | at t/helpdesk.t line 90.
( DIAG ) job 116 | | | Comparing $data->{"comments"}[0]{"authorName"} as a string
( DIAG ) job 116 | | | got : 'Bob ?'
Wide character in print at /usr/share/perl5/Test2/Formatter/Test2.pm line 305, <__ANONIO__> line 14590.
( DIAG ) job 116 | | | expect : 'Bob ☃'
job 116 | | ^
( DIAG ) job 116 | | Failed test 'single user + uparam - displayName is correct'
( DIAG ) job 116 | | at t/helpdesk.t line 100.
My initial suspicion was that this was another back-compat thing that I hadn’t configured properly, and indeed, that was it in the end, though it took a very long time to understand all the pieces in play.
A digression on how we store “unicode” data): Historically, we’ve just
used the old MySQL defaults of latin1 for character sets. Of course
MySQL has had support for UTF-8 encoding for a very long time, but there
was a longstanding issue with Perl’s MySQL database driver where it
didn’t know if it was working with Unicode strings or not, and you
couldn’t tell what you were getting back (this was not fixed in
DBD::mysql until 2016). So FM’s workaround is to have special “Unicode”
versions of methods (eg DHashUpdateUnicode
, prepare_unicode
, etc)
that will handle UTF-8 encoding/decoding. The actual storage goes
straight into latin1 columns in the database, and that works because as
long as all the character set settings from client to storage are set
the same, MySQL will just pass it through, never trying to convert
anything. As a result, we have a lot of UTF-8 data stored in latin1
columns.
So! After some work, I reduced it down to a single, simple UPDATE:
UPDATE Users SET UserFullName = "☃" WHERE UserId = 6505;
Now, when executed from the MySQL shell against an 8.0 server, it would give the expected results:
[fminabox root(robn)@fastmail1 ~]# echo 'SELECT UserFullName,HEX(UserFullName) FROM Users where UserId = 6505\G' | sql
*************************** 1. row ***************************
UserFullName: ☃
HEX(UserFullName): E29883
But, when executed from Perl, it would be wrong:
[fminabox root(robn)@fastmail1 ~]# perl -MME::DB -E 'ME::DB->MainDB->do(qq{UPDATE Users SET UserFullName = "\N{SNOWMAN}" WHERE UserId = 6505})'
[fminabox root(robn)@fastmail1 ~]# echo 'SELECT UserFullName,HEX(UserFullName) FROM Users where UserId = 6505\G' | sql
*************************** 1. row ***************************
UserFullName: ?
HEX(UserFullName): 3F
But, when using the same Perl client on bullseye against the old 5.1 server, it actually worked fine! That suggested that something in the client was doing some sort of server version detection and changing modes or something, but I couldn’t find it in DBD::mysql or in libmariadb (I read a lot of code today).
So I went and had a late lunch and a cup of tea, and when I got back I
remembered that the normal sql
alias uses the 5.1 mysql
binary, but
we also have a 8.x client. And that showed something interesting:
mysql ([email protected]) [fastmail]> UPDATE Users SET UserFullName = "☃" WHERE UserId = 6505;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1
mysql ([email protected]) [fastmail]> show warnings;
+---------+------+---------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xE2\x98\x83' for column 'UserFullName' at row 1 |
+---------+------+---------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql ([email protected]) [fastmail]> select UserFullName,HEX(UserFullName) from Users where UserId = 6505\G
*************************** 1. row ***************************
UserFullName: ?
HEX(UserFullName): 3F
1 row in set (0.01 sec)
And that’s where I came to learn about MySQL’s pipeline of character sets: client, connection, database and server, and how it all connects together.
There’s various character set settings:
- client: the client’s preferred character set
- connection: the current character set in use for this session
- server: the server’s preferred character set
- database: the stored character set for a given database
Consider this, using an 8.x client against a 8.x server:
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+---------+
| Variable_name | Value |
+--------------------------+---------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
+--------------------------+---------+
7 rows in set (0.00 sec)
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'collation\_%';
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| collation_connection | utf8mb4_0900_ai_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8mb4_0900_ai_ci |
+----------------------+--------------------+
3 rows in set (0.01 sec)
Whereas using the 5.1 client against a 8.x server yields:
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+---------+
| Variable_name | Value |
+--------------------------+---------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
+--------------------------+---------+
7 rows in set (0.00 sec)
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'collation\_%';
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8mb4_0900_ai_ci |
+----------------------+--------------------+
3 rows in set (0.01 sec)
And in production, 5.1 client to 5.1 server:
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
7 rows in set (0.00 sec)
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'collation\_%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
So there’s the problem. An 8.x client connection to a 8.x server will run in utf8mb4 by default. Since the database format is latin1 (defined in the schema), the server will try to convert what the client sends it, but since what we’re passing down isn’t actually utf8mb4, the conversion fails.
In bullseye, DBD::mysql is compiled against libmariadb3, which is a 8.x client. So it also gets this behaviour.
What about the 5.1 client to the 8.x server? In that case, the client’s preferred (default) character set is latin1 (ancient MySQL here), and it tells the server this, and the server switches into latin1 itself, and everything works fine.
And then there’s the other direction, 8.x client to 5.1 server (as found on stock bullseye inabox and production bullseye machines). Why does that work? The answer there is that the futuristic client sees that its talking to an ancient server, and reverts back to old-school mode, which is, do whatever the server says and don't argue. So it goes for latin1 there too.
Ok, so! All we have to do is set the server to latin1, and we should be fine, right?
character-set-server = latin1
Not quite. 8.x servers support a “character set negotiation” feature for clients that support it (that is, 8.x clients). In this case, the client sees the server’s suggestion of latin1, and counters with an offer of utf8mb4, which they both agree is better. So we need one last server config option to disable that:
skip-character-set-client-handshake
And then, finally, we get what we want:
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+---------+
| Variable_name | Value |
+--------------------------+---------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8mb3 |
+--------------------------+---------+
7 rows in set (0.01 sec)
mysql ([email protected]) [fastmail]> SHOW SESSION VARIABLES LIKE 'collation\_%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
and
t/helpdesk.t .. ok
So that’s now baked into the upcoming MySQL 8 config, but also, there’s a bullet dodged there. If it wasn’t for that not-really-database-related test, I almost certainly would not have seen this before it hit production and things started blowing up on junk data coming back.
I don’t really know how to validate this config. Maybe this is the last one, but I do keep finding compat options that I need to flip the further I push on this. I thought about something like doing a test suite run against both database versions, dumping them both and comparing some elements, but that only checks stuff adjacent to tests anyway, which I doubt will show much.
I am about to make a task requesting a DB round-tripping test. Just something that exercises our “unicode” methods and makes sure what comes out is what went in. This will be useful in future MySQL upgrades, and if we ever want to try and switch to storing utf8mb4, now that the whole client stack will support it properly.
I think that’s everything. It was fun, even if its left me a little more horrified than when I started!
Rob N.