This is the text of an email sent to one of Fastmail's internal engineering lists, and has been sanitised slightly to remove private information and internal links.
Fastmail has run on its own hardware in colocated datacentres since its beginning in 1999. Since 2006, FAI has been used to install the OS and base system. By 2020 it had accumulated enough cruft to warrant rewriting the FAI config and provisioning scripts in preparation for upgrading to Debian 10, and I took the opportunity to document it as well.
How server installation works
For as long as you and I have worked at Fastmail, you’ve no doubt heard me grumble about our server installer. That’s because it’s been accumulating cruft over many, many years while continuing to faithfully provide service, so it’s proven difficult to change.
But of course, we’ve an OS upgrade coming down the pipe, and that means we need our installer to support it. As I have many times previously, I attempted to add one more card to the stack but alas, this time it came tumbling down. It was finally time to remake the entire thing from the ground up.
I have spent the last week doing so, and it’s looking pretty good so far. The fruits are in hm!6139 which is now basically ready to go to review. I have a couple more machine installs to test (unfortunately it’s very difficult to test off the network, because you need machines to install!), a bit more commentary to add, and a reviewer to nominate (volunteers welcome), but I hope to land the whole thing early next week. That will get us Jessie installs but from a stable base. Adding Buster and beyond should be a very small task from there.
This email describes how the whole thing is put together, and will be useful background reading for the reviewer, and future generations looking to do the next OS upgrade. Note that I’m describing the world as it will exist once hm!6139 lands, which has some key differences to how we put this together in the past (mostly: we don’t generate the FAI NFS root anymore). These differences are not important and likely of no interest to anyone except me and maybe [author of original config].
Network installation
A typical network installation works as follows.
- Operator tells the server to (re)boot, and requests a network boot by some method (IPMI, bang F12, etc)
- After POST, the firmware starts some kind of network boot agent (PXE, either legacy or EFI-flavoured)
- Agent sends a DHCP discovery broadcast. The DHCP server replies with IP/mask/gateway, DNS servers, hostname and the name of a bootstrap program to load from the network (for Linux, that’s pxelinux.0 or syslinux.efi)
- Agent configures its TCP/IP stack with the returned info.
- Agent makes a TFTP request to the DHCP server, requesting the named bootstrap program. TFTP server sends it, agent stores it in memory.
- Agent hands control to the bootstrap program.
- Bootstrap requests additional config and code from the TFTP server (for Linux, that’s a kernel and install environment) and gets them ready to run.
- Bootstrap transfers control to the installer.
- Installer goes to work, formatting disks, getting files into place, running programs and so on. Eventually it’s done and the server reboots.
- Server boots from disk into the installed OS
So we have to provide a few things to make this work:
- DHCP server
- TFTP server
- Bootstrap program
- Kernel and install environment
- NFS server (needed by the install environment, read on)
DHCP server
We run isc-dhcp-server on the root roles. The config is in conf/dhcpd and is fairly straightforward: some options and access controls, the name of the bootstrap program depending on the EFI flag of the incoming request (pxelinux.0 for legacy, syslinux.efi for EFI), and then a MAC/IP/host triple for all the hosts listed in fmvars.
Note that all servers are installed with a static network config, so the DHCP server is only used during installation.
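For a rough idea of its shape, a heavily trimmed, hypothetical fragment might look like this. The subnet, MAC and hostname are invented, and detecting EFI via the DHCP architecture option (93) is just the usual technique, not a quote from our actual conf/dhcpd:

option arch code 93 = unsigned integer 16;

subnet 10.202.5.0 netmask 255.255.255.0 {
  option routers 10.202.5.1;
  option domain-name-servers 10.202.5.15;
  next-server 10.202.5.15;

  # hand out the right bootstrap for the client firmware type
  if option arch = 00:07 {
    filename "syslinux.efi";
  } else {
    filename "pxelinux.0";
  }
}

# one of these per host, generated from fmvars
host example1 {
  hardware ethernet 00:25:90:aa:bb:cc;
  fixed-address 10.202.5.50;
  option host-name "example1";
}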
TFTP server
We use atftpd, which is launched from inetd. Again, this runs on the root roles, and is only used during install. It serves files out of /local/tftp, which are populated from conf/fai.
The set of files here is minimal: pxelinux.0 and syslinux.efi and the libraries they depend on, the Linux kernel vmlinuz and the Linux liveboot initrd.img, and the generated host configs for the bootstrap. Apart from the configs, all of these are symlinked into the TFTP dir from the FAI root dir (see below).
The configs are generated from conf/fai/make_pxelinux_cfg.pl and its pxelinux.cfg.tt2, which right now looks like this:
default [% HOST %]
label [% HOST %]
kernel vmlinuz
append nomodeset initrd=initrd.img root=nfs:10.202.5.15:/opt/fai/nfsroot:vers=3 rootovl FAI_FLAGS=initial,verbose,sshd,reboot FAI_ACTION=install FAI_CONFIG_SRC=nfs://10.202.5.15/local/fai/config
There’s almost nothing to it: the name of the kernel file, the name of the liveboot (initrd=), and the options needed to tell the installer where to find the FAI system root and config on the NFS server and what to do once it gets there.
Each config file receives a special name, which is a hex representation of the IP address configured by DHCP. This is how the bootstrap finds the right config for the system: it just requests one from the TFTP server using the IP as the filename.
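For example, a host that was handed 10.202.5.20 would ask for a file named for that address in uppercase hex (the address here is made up):

printf '%02X%02X%02X%02X\n' 10 202 5 20
0ACA0514

(pxelinux will also fall back through progressively shorter prefixes of that name, and finally a file called default, if the exact one isn’t there; we just generate the full-length one per host.)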
FAI packaging
The installer is built around a system called FAI, the Fully Automatic Installer. This is a collection of shell and Perl in the finest tradition: gluing lots of other programs and scripts together. The idea is that you provide a bunch of configuration files and small programs/scripts that describe the details of a finished server in your environment, and it will produce an installer for it. It can output CD or USB images, VMs, package mirrors and/or network boot and install environments. We use it for the last of these.
We generate a package called fai-fastmail (from a normally-shaped build_fai.sh script). It includes the necessary pieces from upstream FAI:
- the NFS root, which is a fairly typical (minimal) Debian 10 environment including all the normal trimmings as well as the FAI “client” (the thing that runs the configs and sets up the target system). The root includes the kernel, initrd and bootstrap programs which we copy into the TFTP space at setup.
- A set of “basefiles”, which are tarballs of the base systems we might want to install. We bundle two: JESSIE64 and BUSTER64, which are exactly what you’d expect, and are little more than an initial chroot base system generated by debootstrap.
We make one change to the upstream as we build the package: the NFS root doesn’t ship with firmware for Broadcom network adapters, because they’re in Debian’s “non-free” archive. Since the root is just a Debian base system, we chroot into it, apt install firmware-bnx2, and then run dracut to splice the firmware into the initrd, which we then package.
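The shape of it is roughly this (a sketch only, not lines lifted from build_fai.sh; the paths are illustrative):

# find the kernel version inside the NFS root, then splice in the firmware
KVER=$(chroot nfsroot ls /lib/modules)
chroot nfsroot apt-get install -y firmware-bnx2
chroot nfsroot dracut --force /boot/initrd.img-$KVER $KVER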
NFS server
On the root servers, we run nfs-kernel-server. This serves two paths: /opt/fai/nfsroot for the system base, and /local/fai/config, which contains the FAI “config space”, the aforementioned set of config files and programs that describe a finished server.
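The exports themselves are nothing exciting; something along these lines (the network range and options here are a guess, not copied from the real config):

/opt/fai/nfsroot   10.202.0.0/16(ro,async,no_subtree_check,no_root_squash)
/local/fai/config  10.202.0.0/16(ro,async,no_subtree_check,no_root_squash)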
FAI
With the above settings, we open up with the NFS root mounted at / in read-only mode, with an “overlay” ramdisk, such that the entire filesystem is writeable, but writes only go to memory and are lost on reboot. That’s great, because it means we can do real work, track state, download things and otherwise live a fairly unrestricted life without having to worry about tidying up after ourselves or exposing writeable holes in our install image and so on. And then we get into FAI proper.
FAI starts by mounting the config space from the FAI_CONFIG_SRC variable from the bootstrap config, which is just another NFS path. This is stored in /local/fai/config on the root roles, sourced from hm/conf/fai/config. This is everything needed to drive FAI.
I’m not really going to talk much about how FAI works in general, because we don’t use most of its features. The FAI Guide is quite good, though a lot to chew on. The main things to understand are:
- Every host has a number of classes (see below)
- FAI runs through a fixed set of “tasks”, in order. Each of these does a different thing, like partitioning disks, installing packages or running scripts to modify the host in some way. Each task selects config to use or scripts to run based on the class list
- Each task can have “hooks”, which are extra scripts that run before the task starts that can modify it in some way (most useful for tasks that don’t normally run scripts, like package installation). Again, hook selection is driven by class.
This is vague, and probably only useful if/when you’re poking around in the config dirs trying to understand how it hangs together. Some things are done in hooks, others in tasks proper, depending on what we need to achieve.
Instead, I’ll talk about a more logical flow of the pieces we use, hopefully without going off into the weeds too much.
Host classes
The first thing FAI does is determine which “classes” to include when building this host. These are something like host roles; you add more and more classes to a host to get more and more stuff. It does this by running all the scripts in the class directory, and then collecting their output. The classes are listed one-per-line in the output.
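A class script is just a program that prints class names to stdout, one per line. A purely hypothetical one might look like:

#!/bin/sh
# class/50-site (hypothetical): emit extra classes for this host
echo FMBASE
case "$HOSTNAME" in
  *nyi*)  echo DCNYI ;;
  *west*) echo DCWEST ;;
esac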
We actually mostly map fmvars roles to FAI classes by generating a script called class/hostclass from hm/conf/fai/config/hostclass.tt2.
A typical set of classes for a typical FM machine would be:
DEFAULT
JESSIE64
DCNYI
FMBASE
MBR
LAST
There are some small variations depending on host type, usually around hardware (eg, EFI hosts don’t get MBR because booting doesn’t work that way, and IMAP hosts get different disk layouts) and sometimes environment (Seattle hosts will get DCWEST instead of DCNYI, and eventually, Buster hosts will get BUSTER64), but for the most part, this is it.
FAI uses the class list to decide which config fragments to use and which scripts to run as it does its work. The order of classes is important. In cases where multiple things are done (eg installing packages, running scripts), they will be run in class order. For things where only a single choice can be made (eg disk layout), the last matching class name wins.
Class variables
If a class/<CLASS>.var file exists, it gets loaded as shell variables. This is one of the ways a class can modify the FAI run. We don’t use this much; we only have a DEFAULT.var that everything gets, which sets things like the root password, the time zone, and some special FAI config.
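For reference, a .var file is just a fragment of shell that gets sourced; ours is in the same spirit as this made-up example (the variable names are standard FAI ones, the values are not ours):

# class/DEFAULT.var (illustrative values only)
ROOTPW='$1$xxxxxxxx$yyyyyyyyyyyyyyyyyyyyyy'   # crypted root password hash
TIMEZONE=America/New_York
FAI_ALLOW_UNSIGNED=1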
This is a big thing to understand about FAI: for the most part, all the stuff runs as one single shell program that your add-ons contribute to and modify. Like all such things, it’s equal parts powerful and dangerous.
Partitioning
The next thing FAI does is to partition the disks. It selects a disk config from the disk_config/ directory using the “last” matching class name. Most of the time, this is going to match DEFAULT, which looks like this:
disk_config disk1 bootable:1
primary / 16000 ext4 relatime createopts="-L ROOT -O ^metadata_csum"
primary swap 2000 swap sw createopts="-L SWAP"
logical /local 1- ext4 noatime,data=ordered createopts="-L LOCAL -O ^metadata_csum"
There’s a lot going on here, but it’s reasonably straightforward to follow. There are options to set up the disklabel and flags, partition type, size, mountpoint, mount options (for setup; this doesn’t feed into fstab) and options to pass to mkfs.
A more complicated one is HWMICROCLOUD, which needs an extra partition for the EFI system partition, a GPT disklabel, and extra setup to do software RAID over the root filesystems:
disk_config nvme0n1 disklabel:gpt bootable:1
primary - 200 - -
primary - 16000 - -
primary swap 2000 swap sw createopts="-L SWAP"
primary - 1- - -
disk_config nvme1n1 disklabel:gpt bootable:1 sameas:nvme0n1
disk_config raid fstabkey:uuid
raid1 /boot/fs nvme0n1p1,nvme1n1p1 vfat rw mdcreateopts="--bitmap=none --metadata=1.0" createopts="-n BOOT"
raid1 / nvme0n1p2,nvme1n1p2 ext4 relatime mdcreateopts="--bitmap=none" createopts="-L ROOT -O ^metadata_csum"
raid1 /local nvme0n1p4,nvme1n1p4 ext4 noatime,data=ordered mdcreateopts="--bitmap=none" createopts="-L LOCAL -O ^metadata_csum"
It’s not necessarily easy to write, but once it’s working it’s pretty readable. Some of the options aren’t documented anywhere except the FAI code (preserve_lazy), so it’s not totally readable, but it’s not too bad. Fortunately we don’t have to do this very often.
An important point to note: this is not responsible for setting up Cyrus slot partitions - that’s done after the system starts, and is outside the scope of the installer. The only time that FAI gets involved with Cyrus (and DRBD) partitioning is when the layout shares the system disks, in which case our FAI disk configs will create empty partitions for those as necessary. This is most evident in the HWSUPERMICROIMAP3 disk config, which is used on the newer 2U IMAP servers. On those, the OS and the SSD mail storage are on the same set of disks, so we need to partition the whole lot in FAI. Actually creating those filesystems, though, is entirely a post-boot affair.
One small spanner is how the physical disks are selected. For HWMICROCLOUD, there are explicit names like nvme0n1p1, so it’s easy. For DEFAULT and others, it just says disk1. That’s an alias for “the first disk”, which FAI has its own ideas about, and they don’t match our (frankly bananas) world, where an IMAP machine might have eight or more disks available from the start. And you can’t bet on the first disk being called sda, because of Linux’s famously unstable device enumeration order.
To get around this, we use a hook before partitioning occurs to run scripts/find_drive.pl. This awful hack of a script uses host classes, hardware detection and some rubbish heuristics to determine which disk(s) are the root disk and feed them back into FAI. The method is fine even if the script is awful, and it only needs updating every couple of years when a new host class arrives, so it’s been mostly fine.
Once the partitions are created and the filesystems formatted, FAI mounts them all under /target. It’s here that we start to fill the new root filesystem with things.
Base system
Next up FAI has to choose a base system. For this it looks for basefiles/<class>.tar.xz, using the “last” class again. For us, that’s always JESSIE64.tar.xz but soon, BUSTER64.tar.xz as well (incidentally, you can put whatever you want in these files, which is how you’d use FAI to install a non-Debian or maybe even non-Linux system).
The file is unpacked straight into /target, the root of our new system, and we’re on our way!
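Creating one of these basefiles is roughly the following (a sketch, not a quote from build_fai.sh; the mirror and paths are examples):

# build a minimal chroot and tar it up as a basefile
debootstrap --arch=amd64 buster /tmp/buster-base http://deb.debian.org/debian
tar -C /tmp/buster-base -cJf basefiles/BUSTER64.tar.xz .

And the unpack step FAI does is morally just tar -C /target -xJf basefiles/BUSTER64.tar.xz.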
Base files
If there are any matching classes under files/, they’re now copied into the target. This is where we set up apt config and repository details for the particular system (the JESSIE64 and soon BUSTER64 classes), and the right resolv.conf for the local datacentre (the DCNYI and DCWEST classes).
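The layout of files/ mirrors the target filesystem, with one file per class at each path (this is FAI’s standard convention; the exact entries here are just examples):

files/etc/resolv.conf/DCNYI
files/etc/resolv.conf/DCWEST
files/etc/apt/sources.list/JESSIE64
files/etc/apt/sources.list/BUSTER64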
Packages
package_config/ contains a number of files, again named by class, that describe what packages to install into the target. Any matching <class>.asc files are added to the apt keystore, which is how we teach the new host about our internal package repository.
In here we arrange for the right kernel and support packages (firmware, microcode) to arrive, for systemd to be removed, and finally for fastmail-server to be installed, which of course brings in the hundreds of packages that make an FM server.
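The files themselves are very plain lists; a made-up fragment in the same style:

# package_config/FMBASE (hypothetical)
PACKAGES install
linux-image-amd64
firmware-bnx2
intel-microcode
fastmail-server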
Scripts
At this point we have all our packages installed, so it’s time to turn this from a plain old Debian computer into an FM clown car.
As with everything in FAI, we work through the scripts in the scripts/ directory in class order, and within that, in filename sort order (ie everything is prefixed with a number). They’re just programs; they run and do stuff. There’s a wide variety of variables available; the most commonly seen are $target, which is the path to the system we’re installing (usually /target), and $ROOTCMD, which is chroot $target, that is, “run a command inside the new system”.
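So a typical fragment of one of these scripts might look like this (a made-up example, not one of ours):

# $target and $ROOTCMD are provided by the FAI environment
$ROOTCMD apt-get install -y rsync
echo 'some_setting = 1' > $target/etc/example.conf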
There’s not actually a lot to talk about here, it’s mostly the moral equivalent of:
git clone git@[gitserver]:fastmail/hm /home/mod_perl/hm
env REALLY=1 make -C /home/mod_perl/hm/conf install
It’s also very similar to what the fminabox provisioning scripts do (in fact, those were written in part based on the old FAI scripts).
Reboot
And that’s actually all of it. Once this all completes, FAI unmounts everything and reboots the system. The firmware finds an OS and boots it, a flag is posted in #alerts, and the system is ready for service. Most of the time, srv all start is all that’s necessary.
I have nothing else to add at this time, I think! Look for a code review request very soon!
Rob N.