Zero Downtime Upgrades

2022-09-03

I upgraded one of my machines to the development version of Fedora Linux and thought it might prove fruitful to try out some of the experimental features. While I haven't entirely grasped how to use sysupdate, I did learn a few neat tricks.

sysupdate is a new feature (as of May at least) that is considered experimental. The intent seems to be partially automated updates driven by systemd timers, pulling from remote HTTP servers, local directories and so on. It is a fairly straightforward pull-based mechanism that allows for maintaining multiple versions on the same machine (while only running a single version at a time). As far as I can tell, the idea is to "pull" some software package/image/container as package_1; later, when an update is available, sysupdate pulls package_2 and both copies are kept locally (the exact number is configurable), with the name package symlinked to the newest version based on the trailing numeric value.

Personally, I was interested in the overlap between this and portable services, which I looked at for a dumb April Fool's Day joke. My idea (and I think the intent of their design) is to package all sorts of typically standalone software in something container-like (and sometimes an actual container). The integration with systemd means service files, sockets and timers can all be included, and when the portable service is "attached" they are all automatically plugged in. My limited experience has been that things just work, and it is pretty easy to get a reasonably secure plug-and-play setup for a few different kinds of server software.

There is a minor annoyance with the portable services setup that left me a little flummoxed: socket units are unloaded when the service is detached, which defeats some of the benefit of socket activation. I mostly understand why this is the case, but it means a socket-activated server packaged together with its binary, service file and socket file will suffer a small amount of downtime when it is detached and reattached. There is, however, this quote in the portable services documentation:

The portablectl reattach command combines a detach with an attach. It is useful in case an image gets upgraded, as it allows performing a restart operation on the units instead of stop plus start, thus providing lower downtime and avoiding losing runtime state associated with the unit such as the file descriptor store.

It at least sounded like a better software upgrade path was available.

One Solution

While it breaks some of the encapsulation built into portable services, I found that moving the socket file associated with my server out of the portable service gave me better availability during software upgrades. The server in question is taken nearly directly from this introductory post from Lennart Poettering. It is a simple little HTTP server that uses a local file to count visitors: it accepts an HTTP request, opens a file and increments a counter before responding with "Hello! You are visitor #%d!". The server isn't the interesting part of this! The necessary service file included with the portable service looks like this:

[Unit]
Description=Portable Walkthrough Go Edition Service
Wants=portable-walkthrough-go.socket
After=portable-walkthrough-go.socket

[Service]
ExecStart=/usr/bin/portable-walkthrough-go
Type=exec
StateDirectory=walkthrough-go

Key to my finding though is that portable-walkthrough-go.socket is not included in the portable service package. I wrote it to the host server directly and it looks like this:

[Unit]
Description=Portable Walkthrough Go Edition Socket

[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target

With this one modification, systemd keeps listening on the designated port and incoming connections queue on the socket while the backing server is restarted. For my own testing I created a second version of the server that responds with "Hola! You are visitor #%d!" so that it is obvious if and when the new software is working.

Because I am a dinosaur I still use ApacheBench (ab) for this sort of testing. Specifically, I set ab to run a basic load test for a fixed amount of time, pointed at the port my systemd socket listens on. While the load test ran I performed the sort of upgrade described above, symlinking my newer software and then running portablectl reattach --now portable-walkthrough-go.

The results were pretty gratifying. Mid-test, the logs show the following:

LOG: header received:
HTTP/1.0 200 OK
Date: Sat, 03 Sep 2022 23:06:32 GMT
Content-Length: 32
Content-Type: text/plain; charset=utf-8

Hello! You are visitor #120221!

LOG: header received:
HTTP/1.0 200 OK
Date: Sat, 03 Sep 2022 23:06:32 GMT
Content-Length: 31
Content-Type: text/plain; charset=utf-8

Hola! You are visitor #120222!

With no service interruption the new software version rolls out and quietly starts working. The test report reiterates that no failures occurred as a result of the deployment:

Concurrency Level:      10
Time taken for tests:   30.002 seconds
Complete requests:      28988
Failed requests:        0
Total transferred:      4296068 bytes
HTML transferred:       904472 bytes
Requests per second:    966.20 [#/sec] (mean)

Caveats

As with all things, there are a few gotchas. Firstly, the above test does not make use of HTTP keep-alive; enabling it caused ~16 failed requests. The good news is that this is fixable by sticking it all behind a load balancer like HAProxy. With the load balancer fronting the requests, keep-alive connections are held by the load balancer while the server restarts, and the socket can buffer requests so none are dropped. This is super cool to me.

In the same vein, long-lived connections like a websocket would likely suffer a similar failure, without such an easy fix. This isn't a Go/HAProxy/systemd problem though; it's just how long-lived connections tend to work in my experience. The only workaround I'm familiar with is to leave the old worker processes running for as long as their connections exist while new requests are served by the upgraded software, which has always been fraught with operational snags for me.

While I am happy with my findings so far, I was stumped by how this workflow of reattaching a portable service is intended to integrate with sysupdate. The image/package is pulled, but then what? Truth be told, I haven't figured it out. It seems I'm not the only one with this question: there is an unanswered email to the systemd-devel mailing list asking as much. It seems there isn't a totally automatic way to roll out updates to portable services, but this is rapidly approaching good enough for me. Perfect is the enemy of good, after all.

I will keep an eye out for automated techniques for triggering restarts of portable services, but in the meantime I can take my networking Swiss army knife and a stock Linux machine and get zero-downtime, reasonably secure internet services with a simple deployment story.