A few neat tricks to better leverage the capabilities of systemd-nspawn for blazing fast start times in containers launched on-demand.
For my particular use case I want to launch a container that is isolated from the host and retains nothing across executions. The idea is to allow modifications, so a read-only file system won't do. Instead there is the option of "ephemeral" snapshots:
If specified, the container is run with a temporary snapshot of its file system that is removed immediately when the container terminates. ... taking the temporary snapshot is more efficient on file systems that support subvolume snapshots or 'reflinks' natively ("btrfs" or new "xfs") than on more traditional file systems that do not ("ext4").
Of note is the bit about efficiency on different file systems. The difference in speed is painfully obvious when comparing something like ext4 and btrfs. For my case it is not workable to use these ephemeral snapshot containers if each execution incurs the cost of copying the full file system first. In the case of many cheap VPS providers there is no option to select your file system when the server is created. It has been my experience across half a dozen different low end providers that you are instead given a single ext4 mount and repartitioning is fraught enough that I don't do it.
Here is how long it takes to launch the command ls inside an ephemeral container and exit, without so much as booting the machine via the init system:
# time systemd-nspawn --ephemeral -D /var/ext4-machines/f35 ls
Spawning container f35-e759d90ef620352d on /var/ext4-machines/.#machine.f35024c8c9bf4d24eec.
Press ^] three times within 1s to kill container.
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var venv
Container f35-e759d90ef620352d exited successfully.
real 0m11.125s
user 0m0.345s
sys 0m6.589s
Brutal, right? Such is the cost of first copying 900M of data before beginning execution. I should note this is not a particularly fast disk and RAM is limited to just 2G; these are real world constraints! I had thought that because the VPS uses an ext4 file system I would be out of luck, but sometimes it just takes me a while to think my way around a problem. The trick, it turns out, is not to repartition but to mount a new file system image. In the same way you can create a big blob of data that something like VirtualBox uses as a machine image, it is possible to create a blob of data on the ext4 system, format it as a btrfs file system, and then mount it as if it were a separate disk. It sounds circuitous but seems to work well.
In my case I have ample disk space available so I made a 10G blob that will be my new file system:
# fallocate -l 10G /data/btrfs.raw
Format the 10G blob into a file system:
# mkfs.btrfs --csum=crc32c /data/btrfs.raw
Mount this image file to a directory and suddenly I have a btrfs file system that is immediately faster due to the copy-on-write semantics:
# mount /data/btrfs.raw /var/btrfs-machines -o loop
# cp -r /var/ext4-machines/f35 /var/btrfs-machines/f35
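To make the loop mount survive reboots, one option (my addition, not part of the original setup) is an fstab entry; the loop option tells mount to attach the image file to a loop device automatically:

```
# /etc/fstab
/data/btrfs.raw  /var/btrfs-machines  btrfs  loop  0  0
```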
# time systemd-nspawn --ephemeral -D /var/btrfs-machines/f35 ls
Spawning container f35-a6e825741227dee1 on /var/btrfs-machines/.#machine.f3518c41ca5a563ae4c.
Press ^] three times within 1s to kill container.
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var venv
Container f35-a6e825741227dee1 exited successfully.
real 0m0.167s
user 0m0.058s
sys 0m0.038s
Done in 1.5% of the time! I think it is super neat that you can get this kind of improvement simply by configuring the machine to better leverage those capabilities it provides out of the box.
With container start times improved so drastically there is no need to keep things running all the time. Instead it would be nice to only launch a container on-demand. This can be accomplished with socket activation, where systemd maintains a socket independent of your actual process in order to free up resources until there is demand for the socketed service.
In my case I have a networked thing running inside the container, say on port 8000. The idea is to define a systemd socket on the host machine, call it proxy.socket:
[Socket]
ListenStream=/run/proxy.sock
[Install]
WantedBy=sockets.target
There is a corresponding service for this socket (proxy.service):
[Unit]
Requires=my-container.service
After=my-container.service
Requires=proxy.socket
After=proxy.socket
[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=30s 127.0.0.1:8000
To me this is the one bit of interesting configuration. While socket activation is designed to establish sockets and pass file descriptors to a process, the systemd developers understand that lots of programs don't work this way and instead expect to bind a port directly. To bridge that gap, systemd-socket-proxyd acts as a shim layer: it receives the opened descriptors from systemd and proxies the traffic to a socket-unaware process.
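For a sense of what a natively socket-activated program has to do (and what the shim spares a legacy daemon from doing), here is a minimal sketch of the fd-passing protocol, mirroring the logic of sd_listen_fds(3); the helper name is mine:

```python
import os

SD_LISTEN_FDS_START = 3  # first fd systemd passes to an activated service


def listen_fds():
    """Return the file descriptors passed via systemd socket activation.

    The fds are only valid if LISTEN_PID names our own process; they are
    numbered consecutively starting at fd 3.
    """
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []
    count = int(os.environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))
```

A socket-activated server would wrap each returned fd with socket.socket(fileno=fd) and accept() on it; a daemon that only knows how to bind a port has no such code path, which is exactly the gap systemd-socket-proxyd fills.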
The Requires and After lines establish the dependency management, so activity on the socket will launch proxy.service, which will in turn launch my-container.service, where that is the actual systemd-nspawn invocation:
[Unit]
Description=my container
StopWhenUnneeded=yes
[Service]
Type=notify
ExecStart=systemd-nspawn --ephemeral --boot -D /var/btrfs-machines/f35
The combination of --exit-idle-time=30s and StopWhenUnneeded=yes means that when the connection that initiated the service is closed, the container will stop 30 seconds later.
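Wiring this up on the host is the usual unit installation dance (assuming the unit files above were placed in /etc/systemd/system):

```shell
systemctl daemon-reload
# Only the socket is enabled; the proxy service and the
# container itself start on demand when traffic arrives.
systemctl enable --now proxy.socket
```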
Of course, I don't run that exact configuration for my container. Not when I have so much exposure to systemd's isolation capabilities. I get to leverage private network namespaces, CPU limits, private mounts, and user namespacing. One particularly cool thing about all of the above is that I have found I don't end up sacrificing security and isolation in order to make on-demand start-up work. I have explained previously how I use HAProxy to route into network namespaces; it is also possible to configure these sockets as backends, so that a web request received by HAProxy is routed to a systemd-socket-proxyd socket, which triggers service start and launches a container that services the request without ever being exposed to the network directly. The necessary configuration looks something like this (haproxy.cfg):
frontend http-in
    bind *:80
    acl is_my_container path_beg /my-container
    use_backend my-container-backend if is_my_container

backend my-container-backend
    server my_container unix@/run/proxy.sock
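A quick way to exercise the whole chain (an illustrative check, not part of the original setup) is to hit HAProxy with the matching path prefix and watch the units come up:

```shell
# First request arrives while nothing is running: systemd accepts it on
# proxy.socket, starts the proxy and the container, then serves it.
curl -s http://localhost/my-container/

# Both units should now show as active.
systemctl status proxy.service my-container.service
```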
The final piece I've been toying with as part of these on-demand container starts is the boot time for the container init process. This is also the most specific to my particular uses and the exact changes are probably not widely applicable. I want the container to boot in order to take advantage of the full process management of systemd. The containers have services, sockets, etc. intended to be launched on start, more in the style of a full virtual machine than a single-process container. The time taken to start is available via the command systemd-analyze time:
# systemd-analyze time
Startup finished in 408ms (userspace)
graphical.target reached after 368ms in userspace
More granular information is available via systemd-analyze blame, which lists the time taken by each start-up service:
# systemd-analyze blame
118ms systemd-oomd.service
69ms systemd-journal-flush.service
35ms dbus-broker.service
30ms systemd-user-sessions.service
28ms systemd-journald.service
21ms systemd-tmpfiles-setup.service
21ms dracut-shutdown.service
17ms systemd-remount-fs.service
16ms systemd-tmpfiles-setup-dev.service
12ms systemd-update-utmp-runlevel.service
12ms systemd-update-utmp.service
12ms systemd-network-generator.service
The most useful view, in my case, has been systemd-analyze critical-chain, which provides the "time critical chain of units". It has highlighted several services that were both slow to start and not required for my uses (systemd-hwdb, unbound-anchor, etc.):
# systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
graphical.target @368ms
└─multi-user.target @368ms
  └─systemd-oomd.service @249ms +118ms
    └─systemd-tmpfiles-setup.service @225ms +21ms
      └─systemd-journal-flush.service @153ms +69ms
        └─systemd-journald.service @122ms +28ms
          └─systemd-journald.socket @120ms
            └─system.slice @112ms
              └─-.slice @112ms
The above has obviously been trimmed down pretty far; I do this by simply disabling unused services with systemctl disable. The end result in my case is sub-second launch times on-demand, routed directly from a web request into a containerized process. How cool is that?
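As a sketch of that trimming step, services can be disabled offline in the container image using systemctl's --root flag; the unit names below are examples of the kind of thing critical-chain surfaced for me, so adjust to whatever it flags on your image:

```shell
# Disable units inside the image without booting the container.
systemctl --root=/var/btrfs-machines/f35 disable systemd-hwdb-update.service
systemctl --root=/var/btrfs-machines/f35 disable unbound-anchor.timer
```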