This is the text of an email sent to one of Fastmail's internal engineering lists, and has been sanitised slightly to remove private information and internal links.
Fastmail has run on its own hardware in colocated datacentres since its beginning in 1999. Since 2006, FAI has been used to install the OS and base system. By 2020 it had accumulated enough cruft to warrant rewriting the FAI config and provisioning scripts in preparation for upgrading to Debian 10, and I took the opportunity to document it as well.
How server installation works
For as long as you and I have worked at Fastmail, you’ve no doubt heard me grumble about our server installer. That’s because it’s been accumulating cruft over many, many years while continuing to faithfully provide service, so it’s proven difficult to change.
But of course, we’ve an OS upgrade coming down the pipe, and that means we need our installer to support it. As I have many times previously, I attempted to add one more card to the stack but alas, this time it came tumbling down. It was finally time to remake the entire thing from the ground up.
I have spent the last week doing so, and it’s looking pretty good so far. The fruits are in hm!6139 which is now basically ready to go to review. I have a couple more machine installs to test (unfortunately it’s very difficult to test off the network, because you need machines to install!), a bit more commentary to add, and a reviewer to nominate (volunteers welcome), but I hope to land the whole thing early next week. That will get us Jessie installs but from a stable base. Adding Buster and beyond should be a very small task from there.
This email describes how the whole thing is put together, and will be useful background reading for the reviewer, and future generations looking to do the next OS upgrade. Note that I’m describing the world as it will exist once hm!6139 lands, which has some key differences to how we put this together in the past (mostly: we don’t generate the FAI NFS root anymore). These differences are not important and likely of no interest to anyone except me and maybe [author of original config].
Network installation
A typical network installation works as follows.
- Operator tells the server to (re)boot, and requests a network boot by some method (IPMI, bang F12, etc)
- After POST, the firmware starts some kind of network boot agent (PXE, either legacy or EFI-flavoured)
- Agent sends a DHCP discovery broadcast. The DHCP server replies with IP/mask/gateway, DNS servers, hostname and the name of a bootstrap program to load from the network (for Linux, that’s pxelinux.0 or syslinux.efi)
- Agent configures its TCP/IP stack with the returned info.
- Agent makes a TFTP request to the DHCP server, requesting the named bootstrap program. TFTP server sends it, agent stores it in memory.
- Agent hands control to the bootstrap program.
- Bootstrap requests additional config and code from the TFTP server (for Linux, that’s a kernel and install environment) and gets them ready to run.
- Bootstrap transfers control to the installer.
- Installer goes to work, formatting disks, getting files into place, running programs and so on. Eventually it’s done and the server reboots.
- Server boots from disk into the installed OS
So we have to provide a few things to make this work:
- DHCP server
- TFTP server
- Bootstrap program
- Kernel and install environment
- NFS server (needed by the install environment, read on)
DHCP server
We run isc-dhcp-server on the root roles. The config is in conf/dhcpd and is fairly straightforward: some options and access controls, the name of the bootstrap program depending on the EFI flag of the incoming request (pxelinux.0 for legacy, syslinux.efi for EFI), and then a MAC/IP/host triple for all the hosts listed in fmvars.
Note that all servers are installed with a static network config, so the DHCP server is only used during installation.
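For a rough idea of its shape, a heavily trimmed, hypothetical fragment might look like this. The subnet, MAC and hostname are invented, and detecting EFI via the DHCP architecture option (93) is just the usual technique, not a quote from our actual conf/dhcpd:

option arch code 93 = unsigned integer 16;

subnet 10.202.5.0 netmask 255.255.255.0 {
  option routers 10.202.5.1;
  option domain-name-servers 10.202.5.15;
  next-server 10.202.5.15;

  # hand out the right bootstrap for the client firmware type
  if option arch = 00:07 {
    filename "syslinux.efi";
  } else {
    filename "pxelinux.0";
  }
}

# one of these per host, generated from fmvars
host example1 {
  hardware ethernet 00:25:90:aa:bb:cc;
  fixed-address 10.202.5.50;
  option host-name "example1";
}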
TFTP server
We use atftpd, which is launched from inetd. Again, this runs on the root roles, and is only used during install. It serves files out of /local/tftp, which are populated from conf/fai.
The set of files here is minimal: pxelinux.0 and syslinux.efi and the libraries they depend on, the Linux kernel vmlinuz and the Linux liveboot initrd.img, and the generated host configs for the bootstrap. Apart from the configs, all of these are symlinked into the TFTP dir from the FAI root dir (see below).
The configs are generated from conf/fai/make_pxelinux_cfg.pl and its pxelinux.cfg.tt2, which right now looks like this:
default [% HOST %]
label [% HOST %]
kernel vmlinuz
append nomodeset initrd=initrd.img root=nfs:10.202.5.15:/opt/fai/nfsroot:vers=3 rootovl FAI_FLAGS=initial,verbose,sshd,reboot FAI_ACTION=install FAI_CONFIG_SRC=nfs://10.202.5.15/local/fai/config
There’s almost nothing to it: the name of the kernel file, the name of the liveboot (initrd=), and the options needed to tell the installer where to find the FAI system root and config on the NFS server and what to do once it gets there.
Each config file receives a special name, which is a hex representation of the IP address configured by DHCP. This is how the bootstrap finds the right config for the system: it just requests one from the TFTP server using the IP as the filename.
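For example, a host that was handed 10.202.5.20 would ask for a file named for that address in uppercase hex (the address here is made up):

printf '%02X%02X%02X%02X\n' 10 202 5 20
0ACA0514

(pxelinux will also fall back through progressively shorter prefixes of that name, and finally a file called default, if the exact one isn’t there; we just generate the full-length one per host.)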
FAI packaging
The installer is built around a system called FAI, the Fully Automatic Installer. This is a collection of shell and Perl in the finest tradition: gluing lots of other programs and scripts together. The idea is that you provide a bunch of configuration files and small programs/scripts that describe the details of a finished server in your environment, and it will produce an installer for it. It can output CD or USB images, VMs, package mirrors and/or network boot and install environments. We use it for the last of these.
We generate a package called fai-fastmail (from a normally-shaped build_fai.sh script). It includes the necessary pieces from upstream FAI:
- the NFS root, which is a fairly typical (minimal) Debian 10 environment including all the normal trimmings as well as the FAI “client” (the thing that runs the configs and sets up the target system). The root includes the kernel, initrd and bootstrap programs which we copy into the TFTP space at setup.
- A set of “basefiles”, which are tarballs of the base systems we might want to install. We bundle two: JESSIE64 and BUSTER64, which are exactly what you’d expect, and are little more than an initial chroot base system generated by debootstrap.
We make one change to the upstream as we build the package: the NFS root doesn’t ship with firmware for Broadcom network adapters, because they’re in Debian’s “non-free” archive. Since the root is just a Debian base system, we chroot into it, apt install firmware-bnx2, and then run dracut to splice the firmware into the initrd, which we then package.
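The shape of it is roughly this (a sketch only, not lines lifted from build_fai.sh; the paths are illustrative):

# find the kernel version inside the NFS root, then splice in the firmware
KVER=$(chroot nfsroot ls /lib/modules)
chroot nfsroot apt-get install -y firmware-bnx2
chroot nfsroot dracut --force /boot/initrd.img-$KVER $KVER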
NFS server
On the root servers, we run nfs-kernel-server. This serves two paths: /opt/fai/nfsroot for the system base, and /local/fai/config, which contains the FAI “config space”, the aforementioned set of config files and programs that describe a finished server.
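The exports themselves are nothing exciting; something along these lines (the network range and options here are a guess, not copied from the real config):

/opt/fai/nfsroot   10.202.0.0/16(ro,async,no_subtree_check,no_root_squash)
/local/fai/config  10.202.0.0/16(ro,async,no_subtree_check,no_root_squash)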
FAI
With the above settings, we open up with the NFS root mounted at / in read-only mode, with an “overlay” ramdisk, such that the entire filesystem is writeable, but writes only go to memory and are lost on reboot. That’s great, because it means we can do real work, track state, download things and otherwise live a fairly unrestricted life without having to worry about tidying up after ourselves or exposing writeable holes in our install image and so on. And then we get into FAI proper.
FAI starts by mounting the config space from the FAI_CONFIG_SRC variable from the bootstrap config, which is just another NFS path. This is stored in /local/fai/config on the root roles, sourced from hm/conf/fai/config. This is everything needed to drive FAI.
I’m not really going to talk much about how FAI works in general, because we don’t use most of its features. The FAI Guide is quite good, though a lot to chew on. The main things to understand are:
- Every host has a number of classes (see below)
- FAI runs through a fixed set of “tasks”, in order. Each of these does a different thing, like partitioning disks, installing packages or running scripts to modify the host in some way. Each task selects config to use or scripts to run based on the class list
- Each task can have “hooks”, which are extra scripts that run before the task starts that can modify it in some way (most useful for tasks that don’t normally run scripts, like package installation). Again, hook selection is driven by class.
This is vague, and probably only useful if/when you’re poking around in the config dirs trying to understand how it hangs together. Some things are done in hooks, others in tasks proper, depending on what we need to achieve.
Instead, I’ll talk about a more logical flow of the pieces we use, hopefully without going off into the weeds too much.
Host classes
The first thing FAI does is determine which “classes” to include when building this host. These are something like host roles; you add more and more classes to a host to get more and more stuff. It does this by running all the scripts in the class directory, and then collecting their output. The classes are listed one-per-line in the output.
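A class script is just a program that prints class names to stdout, one per line. A purely hypothetical one might look like:

#!/bin/sh
# class/50-site (hypothetical): emit extra classes for this host
echo FMBASE
case "$HOSTNAME" in
  *nyi*)  echo DCNYI ;;
  *west*) echo DCWEST ;;
esac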
We actually mostly map fmvars roles to FAI classes by generating a script called class/hostclass from hm/conf/fai/config/hostclass.tt2.
A typical set of classes for a typical FM machine would be:
DEFAULT
JESSIE64
DCNYI
FMBASE
MBR
LAST
There are some small variations depending on host type, usually around hardware (eg, EFI hosts don’t get MBR because booting doesn’t work that way, and IMAP hosts get different disk layouts) and sometimes environment (Seattle hosts will get DCWEST instead of DCNYI, and eventually, Buster hosts will get BUSTER64), but for the most part, this is it.
FAI uses the class list to decide which config fragments to use and which scripts to run as it does its work. The order of classes is important. In cases where multiple things are done (eg installing packages, running scripts), they will be run in class order. For things where only a single choice can be made (eg disk layout), the last matching class name wins.
Class variables
If a class/<CLASS>.var file exists, it gets loaded as shell variables. This is one of the ways a class can modify the FAI run. We don’t use this much; we only have a DEFAULT.var that everything gets, which sets things like the root password, the time zone, and some special FAI config.
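For reference, a .var file is just a fragment of shell that gets sourced; ours is in the same spirit as this made-up example (the variable names are standard FAI ones, the values are not ours):

# class/DEFAULT.var (illustrative values only)
ROOTPW='$1$xxxxxxxx$yyyyyyyyyyyyyyyyyyyyyy'   # crypted root password hash
TIMEZONE=America/New_York
FAI_ALLOW_UNSIGNED=1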
This is a big thing to understand about FAI: for the most part, all the stuff runs as one single shell program that your add-ons contribute to and modify. Like all such things, it’s equal parts powerful and dangerous.
Partitioning
The next thing FAI does is to partition the disks. It selects a disk config from the disk_config/ directory using the “last” matching class name. Most of the time, this is going to match DEFAULT, which looks like this:
disk_config disk1 bootable:1
primary / 16000 ext4 relatime createopts="-L ROOT -O ^metadata_csum"
primary swap 2000 swap sw createopts="-L SWAP"
logical /local 1- ext4 noatime,data=ordered createopts="-L LOCAL -O ^metadata_csum"
There’s a lot going on here, but it’s reasonably straightforward to follow. There are options to set up the disklabel and flags, partition type, size, mountpoint, mount options (for setup; this doesn’t feed into fstab) and options to pass to mkfs.
A more complicated one is HWMICROCLOUD, which needs an extra partition for the EFI system partition, a GPT disklabel, and extra setup to do software RAID over the root filesystems:
disk_config nvme0n1 disklabel:gpt bootable:1
primary - 200 - -
primary - 16000 - -
primary swap 2000 swap sw createopts="-L SWAP"
primary - 1- - -
disk_config nvme1n1 disklabel:gpt bootable:1 sameas:nvme0n1
disk_config raid fstabkey:uuid
raid1 /boot/fs nvme0n1p1,nvme1n1p1 vfat rw mdcreateopts="--bitmap=none --metadata=1.0" createopts="-n BOOT"
raid1 / nvme0n1p2,nvme1n1p2 ext4 relatime mdcreateopts="--bitmap=none" createopts="-L ROOT -O ^metadata_csum"
raid1 /local nvme0n1p4,nvme1n1p4 ext4 noatime,data=ordered mdcreateopts="--bitmap=none" createopts="-L LOCAL -O ^metadata_csum"
It’s not necessarily easy to write, but once it’s working it’s pretty readable. Some of the options aren’t documented anywhere except the FAI code (preserve_lazy), so it’s not totally readable, but it’s not too bad. Fortunately we don’t have to do this very often.
An important point to note: this is not responsible for setting up Cyrus slot partitions - that’s done after the system starts, and is outside the scope of the installer. The only time that FAI gets involved with Cyrus (and DRBD) partitioning is when the layout shares the system disks, in which case our FAI disk configs will create empty partitions for those as necessary. This is most evident in the HWSUPERMICROIMAP3 disk config, which is used on the newer 2U IMAP servers. On those, the OS and the SSD mail storage are on the same set of disks, so we need to partition the whole lot in FAI. Actually creating those filesystems, though, is entirely a post-boot affair.
One small spanner is how the physical disks are selected. For HWMICROCLOUD, there are explicit names like nvme0n1p1, so it’s easy. For DEFAULT and others, it just says disk1. That’s an alias for “the first disk”, which FAI has its own ideas about, and they don’t match our (frankly bananas) world, where an IMAP machine might have eight or more disks available from the start. And you can’t bet on the first disk being called sda, because of Linux’s famously unstable device enumeration order.
To get around this, we use a hook before partitioning occurs to run scripts/find_drive.pl. This awful hack of a script uses host classes, hardware detection and some rubbish heuristics to determine which disk(s) are the root disk and feed them back into FAI. The method is fine even if the script is awful, and it only needs updating every couple of years when a new host class arrives, so it’s been mostly fine.
Once the partitions are created and the filesystems formatted, FAI mounts them all under /target. It’s here that we start to fill the new root filesystem with things.
Base system
Next up FAI has to choose a base system. For this it looks for basefiles/<class>.tar.xz, using the “last” class again. For us, that’s always JESSIE64.tar.xz but soon, BUSTER64.tar.xz as well (incidentally, you can put whatever you want in these files, which is how you’d use FAI to install a non-Debian or maybe even non-Linux system).
The file is unpacked straight into /target, the root of our new system, and we’re on our way!
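Creating one of these basefiles is roughly the following (a sketch, not a quote from build_fai.sh; the mirror and paths are examples):

# build a minimal chroot and tar it up as a basefile
debootstrap --arch=amd64 buster /tmp/buster-base http://deb.debian.org/debian
tar -C /tmp/buster-base -cJf basefiles/BUSTER64.tar.xz .

And the unpack step FAI does is morally just tar -C /target -xJf basefiles/BUSTER64.tar.xz.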
Base files
If there are any matching classes under files/, they’re now copied into the target. This is where we set up apt config and repository details for the particular system (the JESSIE64 and soon BUSTER64 classes), and the right resolv.conf for the local datacentre (the DCNYI and DCWEST classes).
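The layout of files/ mirrors the target filesystem, with one file per class at each path (this is FAI’s standard convention; the exact entries here are just examples):

files/etc/resolv.conf/DCNYI
files/etc/resolv.conf/DCWEST
files/etc/apt/sources.list/JESSIE64
files/etc/apt/sources.list/BUSTER64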
Packages
package_config/ contains a number of files, again named by class, that describe what packages to install into the target. Any matching <class>.asc files are added to the apt keystore, which is how we teach the new host about our internal package repository.
In here we arrange for the right kernel and support packages (firmware, microcode) to arrive, for systemd to be removed, and finally for fastmail-server to be installed, which of course brings in the hundreds of packages that make an FM server.
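The files themselves are very plain lists; a made-up fragment in the same style:

# package_config/FMBASE (hypothetical)
PACKAGES install
linux-image-amd64
firmware-bnx2
intel-microcode
fastmail-server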
Scripts
At this point we have all our packages installed, so it’s time to turn this from a plain old Debian computer into an FM clown car.
As with everything in FAI, we work through the scripts in the scripts/ directory in class order, and within that, in filename sort order (ie everything is prefixed with a number). They’re just programs; they run and do stuff. There’s a wide variety of variables available; the most commonly seen are $target, which is the path to the system we’re installing (usually /target), and $ROOTCMD, which is chroot $target, that is, “run a command inside the new system”.
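So a typical fragment of one of these scripts might look like this (a made-up example, not one of ours):

# $target and $ROOTCMD are provided by the FAI environment
$ROOTCMD apt-get install -y rsync
echo 'some_setting = 1' > $target/etc/example.conf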
There’s not actually a lot to talk about here, it’s mostly the moral equivalent of:
git clone git@[gitserver]:fastmail/hm /home/mod_perl/hm
env REALLY=1 make -C /home/mod_perl/hm/conf install
It’s also very similar to what the fminabox provisioning scripts do (in fact, those were written in part based on the old FAI scripts).
Reboot
And that’s actually all of it. Once this all completes, FAI unmounts everything and reboots the system. The firmware finds an OS and boots it, a flag is posted in #alerts, and the system is ready for service. Most of the time, srv all start is all that’s necessary.
I have nothing else to add at this time, I think! Look for a code review request very soon!
Rob N.