
systemd-nspawn and Btrfs

2022-05-14

A few neat tricks to better leverage the capabilities of systemd-nspawn for blazing fast start times in containers launched on-demand.

ephemeral snapshots

For my particular use case I want to launch a container that is isolated from the host and retains nothing across executions. The idea is to allow modifications, so I won't use a read-only file system. Instead there is the option for "ephemeral" snapshots:

If specified, the container is run with a temporary snapshot of its file system that is removed immediately when the container terminates. ... taking the temporary snapshot is more efficient on file systems that support subvolume snapshots or 'reflinks' natively ("btrfs" or new "xfs") than on more traditional file systems that do not ("ext4").

Of note is the bit about efficiency on different file systems. The difference in speed is painfully obvious when comparing something like ext4 and btrfs. For my case it is not workable to use these ephemeral snapshot containers if each execution incurs the cost of copying the full file system first. Many cheap VPS providers offer no option to select a file system when the server is created; it has been my experience across half a dozen different low-end providers that you are instead given a single ext4 mount, and repartitioning is fraught enough that I don't do it.
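
A quick aside: findmnt is an easy way to confirm what file system a given directory actually lives on (the path here is just the machine directory used in the examples below):

# findmnt -T /var/ext4-machines -o TARGET,SOURCE,FSTYPE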

Here is how long it takes to launch the command ls inside an ephemeral container and exit, without so much as booting the machine via the init system:

# time systemd-nspawn --ephemeral -D /var/ext4-machines/f35 ls
Spawning container f35-e759d90ef620352d on /var/ext4-machines/.#machine.f35024c8c9bf4d24eec.
Press ^] three times within 1s to kill container.
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var  venv
Container f35-e759d90ef620352d exited successfully.

real    0m11.125s
user    0m0.345s
sys     0m6.589s

Brutal, right? Such is the cost of first copying 900M of data before beginning execution. I should note this is not a particularly fast disk and RAM is limited to just 2G; these are real world constraints! I had thought that because the VPS is an ext4 file system I would be out of luck, but sometimes it just takes me a while to think my way around a problem. The trick, it turns out, is not to repartition but to mount a new file system image. In the same way you can create a big blob of data that is used by something like VirtualBox as a machine image, it is possible to create a blob of data on the ext4 system, format it as a btrfs file system, and then mount it as if it were a separate disk. It sounds circuitous but seems to work well.

In my case I have ample disk space available so I made a 10G blob that will be my new file system:

# fallocate -l 10G /data/btrfs.raw

Format the 10G blob into a file system:

# mkfs.btrfs --csum=crc32c /data/btrfs.raw

Mount this image file to a directory and suddenly I have a btrfs file system where taking those ephemeral snapshots is nearly free thanks to copy-on-write semantics:

# mount /data/btrfs.raw /var/btrfs-machines -o loop
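
A loop mount like this won't survive a reboot on its own; if that matters, an /etc/fstab entry along these lines (a sketch using the same paths) should bring it back automatically:

/data/btrfs.raw  /var/btrfs-machines  btrfs  loop  0 0

With the new file system mounted, copy the existing container tree over and repeat the timing test: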

# cp -r /var/ext4-machines/f35 /var/btrfs-machines/f35

# time systemd-nspawn --ephemeral -D /var/btrfs-machines/f35 ls
Spawning container f35-a6e825741227dee1 on /var/btrfs-machines/.#machine.f3518c41ca5a563ae4c.
Press ^] three times within 1s to kill container.
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var  venv
Container f35-a6e825741227dee1 exited successfully.

real    0m0.167s
user    0m0.058s
sys     0m0.038s

Done in 1.5% of the time! I think it is super neat that you can get this kind of improvement simply by configuring the machine to better leverage those capabilities it provides out of the box.
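
One refinement I'll mention for completeness: as I understand it, --ephemeral takes a true subvolume snapshot when the container directory is itself a btrfs subvolume, and otherwise falls back to a reflinked copy. Both are fast on btrfs, but creating the target as a subvolume up front (in place of the plain cp -r above) is cheap to do:

# btrfs subvolume create /var/btrfs-machines/f35
# cp -a /var/ext4-machines/f35/. /var/btrfs-machines/f35/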

socket activation

With container start times improved so drastically there is no need to keep things running all the time. Instead it would be nice to only launch a container on-demand. This can be accomplished with socket activation, where systemd maintains a socket independent of your actual process in order to free up resources until there is demand for the socketed service.

In my case I have a networked thing running inside the container, say on port 8000. The idea is to define a systemd socket on the host machine, call it proxy.socket:

[Socket]
ListenStream=/run/proxy.sock

[Install]
WantedBy=sockets.target

There is a corresponding service for this socket (proxy.service):

[Unit]
Requires=my-container.service
After=my-container.service
Requires=proxy.socket
After=proxy.socket

[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=30s 127.0.0.1:8000

To me this is the one bit of interesting configuration. While socket activation is designed around systemd establishing sockets and passing file descriptors to a process, the systemd developers understand that lots of programs don't work this way and instead expect to listen on a port directly. To bridge that gap, systemd-socket-proxyd acts as a shim layer: it receives the already-opened descriptors from systemd and proxies the traffic on to a socket-activation-unaware process listening on an ordinary address and port.

The Requires and After lines establish the dependency management: activity on the socket launches proxy.service, which in turn launches my-container.service, the unit holding the actual systemd-nspawn invocation:

[Unit]
Description=my container
StopWhenUnneeded=yes

[Service]
Type=notify
ExecStart=systemd-nspawn --ephemeral --boot -D /var/btrfs-machines/f35

The combination of --exit-idle-time=30s and StopWhenUnneeded=yes means that once the connection that activated the service is closed, the proxy exits after 30 idle seconds and the container, no longer needed by anything, is stopped along with it.
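
With the three units in place, only the socket needs to be enabled on the host; everything else comes and goes on demand. Something roughly like this should exercise the whole chain, assuming the thing inside the container answers HTTP:

# systemctl enable --now proxy.socket
# curl --unix-socket /run/proxy.sock http://localhost/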

HAProxy routing and isolation

Of course, I don't run that exact configuration for my container. Not when I have so much exposure to systemd's isolation capabilities. I get to leverage private network namespaces, CPU limits, private mounts, and user namespacing. One particularly cool thing about all of the above is that I have found I don't end up sacrificing security and isolation in order to make on-demand start-up work. I have explained previously how I use HAProxy to route into network namespaces; it is also possible to configure these sockets as backends, so that a web request received by HAProxy is routed to a systemd-socket-proxyd socket, which triggers service start and launches a container to service the request without that container ever being exposed to the network directly. The necessary configuration looks something like this (haproxy.cfg):

frontend http-in
    bind *:80
    acl is_my_container path_beg /my-container
    use_backend my-container-backend if is_my_container

backend my-container-backend
    server my_container     unix@/run/proxy.sock
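
As for the isolation features mentioned above, they are ordinary unit and nspawn settings layered onto my-container.service. Here is a sketch with illustrative values, not my exact configuration (private networking is omitted since it depends on the namespace routing described in that earlier post):

[Service]
Type=notify
# --keep-unit keeps the container inside this service's cgroup so the limits
# below apply to it; -U runs it in a private user namespace with a picked UID range
ExecStart=systemd-nspawn --ephemeral --boot --keep-unit -U -D /var/btrfs-machines/f35
CPUQuota=50%
MemoryMax=512M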

trimming start times

The final piece I've been toying with as part of these on-demand container starts is the boot time for the container init process. This is also the most specific to my particular uses, and the exact changes are probably not widely applicable. I want the container to boot in order to take advantage of the full process management of systemd: the containers have services, sockets, etc. intended to be launched on start, more in the style of a full virtual machine than a single-process container. The time taken to start is available via systemd-analyze time, run inside the container:

# systemd-analyze time
Startup finished in 408ms (userspace) 
graphical.target reached after 368ms in userspace

More granular information is available via systemd-analyze blame, which gives the time taken by each service during start-up:

# systemd-analyze blame
118ms systemd-oomd.service
 69ms systemd-journal-flush.service
 35ms dbus-broker.service
 30ms systemd-user-sessions.service
 28ms systemd-journald.service
 21ms systemd-tmpfiles-setup.service
 21ms dracut-shutdown.service
 17ms systemd-remount-fs.service
 16ms systemd-tmpfiles-setup-dev.service
 12ms systemd-update-utmp-runlevel.service
 12ms systemd-update-utmp.service
 12ms systemd-network-generator.service

The most useful view, in my case, has been systemd-analyze critical-chain, which prints the "time critical chain of units". It has highlighted several services that were both slow to start and not required for my uses (systemd-hwdb, unbound-anchor, etc.):

# systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @368ms
└─multi-user.target @368ms
  └─systemd-oomd.service @249ms +118ms
    └─systemd-tmpfiles-setup.service @225ms +21ms
      └─systemd-journal-flush.service @153ms +69ms
        └─systemd-journald.service @122ms +28ms
          └─systemd-journald.socket @120ms
            └─system.slice @112ms
              └─-.slice @112ms

The above has obviously been trimmed down pretty far; I do this by simply disabling unused services with systemctl disable, applied to the source tree since anything changed inside an --ephemeral run is thrown away.
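
A sketch of what that looks like; systemctl's --root= operates on the image offline, the unit names are only examples from this Fedora image, and mask is used where a unit is enabled statically by the distribution rather than via systemctl enable:

# systemctl --root=/var/btrfs-machines/f35 disable unbound-anchor.timer
# systemctl --root=/var/btrfs-machines/f35 mask systemd-hwdb-update.service

The end result in my case is sub-second launch times on-demand, routed directly from a web request into a containerized process. How cool is that?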