Zero Downtime Upgrades
2022-09-03
I upgraded one of my machines to the development version of Fedora
Linux and thought it may prove fruitful to try out some of the
experimental features. While I haven't entirely grasped how to use
sysupdate, I did learn a few neat tricks.
sysupdate is a new feature (as of May at least) that is considered
experimental. The intent seems to be partially automated updates via
systemd timers; it can pull from remote HTTP servers, local
directories etc. It seems to be a pretty obvious pull-based
mechanism that allows for maintaining multiple versions on the same
machine (but only allows running a single version at a time). It
looks like the intent is to "pull" some software
package/image/container as package_1 and later when an
update is available sysupdate will pull package_2 and you
would maintain both copies locally (the exact number is
configurable) with the name package being symlinked to the
newest version, based on the trailing numeric value.
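To make the "newest by trailing numeric value" idea concrete, here is a minimal Go sketch. It is only an illustration of picking the highest-numbered entry; it is not how sysupdate itself is implemented, and the helper names (trailingVersion, newest) and the sample names are made up for the example. The point it shows is that the comparison has to be numeric, otherwise package_10 would sort before package_2.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// trailingVersion splits a name like "package_2" into its base ("package")
// and trailing numeric version (2). ok is false when there is no such suffix.
func trailingVersion(name string) (base string, version int, ok bool) {
	i := strings.LastIndex(name, "_")
	if i < 0 {
		return "", 0, false
	}
	v, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return "", 0, false
	}
	return name[:i], v, true
}

// newest returns the entry for base that carries the highest trailing number.
func newest(base string, names []string) (string, bool) {
	best, bestV, found := "", 0, false
	for _, n := range names {
		b, v, ok := trailingVersion(n)
		if !ok || b != base {
			continue
		}
		if !found || v > bestV {
			best, bestV, found = n, v, true
		}
	}
	return best, found
}

func main() {
	// Hypothetical directory listing; "other_99" belongs to a different package.
	names := []string{"package_1", "package_10", "package_2", "other_99"}
	if n, ok := newest("package", names); ok {
		fmt.Println("package ->", n) // prints "package -> package_10"
	}
}
```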
Personally, I was interested in the overlap between this and portable services, which I took a look at for a dumb April Fool's Day joke. My idea (and I think the intent in their design) is to package all sorts of typically standalone software in something container-like (and sometimes an actual container). The integration with systemd means things like service files, sockets and timers can all be included, and when the portable service is "attached" they are all automatically plugged in. My limited experience has been that things just work, and it is pretty easy to get a reasonably secure plug-and-play setup for a few different kinds of server software.
There is a minor annoyance with the portable services setup, though, that left me a little flummoxed: the socket files are unloaded when the service is detached, which defeats some of the benefits of socket activation. I mostly understand why this is the case, but it means a socket-activated server packaged with the binary, service file and socket file will suffer some small amount of downtime when things are detached and reattached. There is, however, this quote in the portable services documentation:
The portablectl reattach command combines a detach with an attach. It is useful in case an image gets upgraded, as it allows performing a restart operation on the units instead of stop plus start, thus providing lower downtime and avoiding losing runtime state associated with the unit such as the file descriptor store.
It at least sounded like a better software upgrade path was available.
While it breaks some of the encapsulation built into portable
services I found that by moving the socket files associated with my
server out of the portable service I was able to guarantee better
availability during software upgrades. The server in question is
taken nearly directly from the
introductory post by Lennart Poettering; it is a simple little
HTTP server that uses a local file to count visitors. All it does is
accept an HTTP request, open a file and increment a counter before
responding with "Hello! You are visitor #%d!". The
server isn't the interesting part of this! The necessary service
file included with the portable service looks like this:
    [Unit]
    Description=Portable Walkthrough Go Edition Service
    Wants=portable-walkthrough-go.socket
    After=portable-walkthrough-go.socket
    [Service]
    ExecStart=/usr/bin/portable-walkthrough-go
    Type=exec
    StateDirectory=walkthrough-go
Key to my finding, though, is that
portable-walkthrough-go.socket is not
included in the portable service package. I wrote it to the host
server directly and it looks like this:
    [Unit]
    Description=Portable Walkthrough Go Edition Socket
    [Socket]
    ListenStream=8080
    [Install]
    WantedBy=sockets.target
With this one modification it is possible for systemd to continue
listening on the designated port, so incoming connections queue on
the socket while the backing server
is restarted. For my own testing I created a second version of the
server that responds with
"Hola! You are visitor #%d!"
so that it is obvious if and when the new software is working.
Because I am a dinosaur, I
use apache bench for this sort of testing. Specifically I set
apache bench to run a basic load test for a fixed amount of time,
pointed at my systemd socket. While the load test executed I
performed the sort of upgrade described above, symlinking my newer
software and then running
portablectl reattach --now
The results were pretty gratifying. Mid-test, the logs show the following:
    LOG: header received:
    HTTP/1.0 200 OK
    Date: Sat, 03 Sep 2022 23:06:32 GMT
    Content-Length: 32
    Content-Type: text/plain; charset=utf-8
    Hello! You are visitor #120221!
    LOG: header received:
    HTTP/1.0 200 OK
    Date: Sat, 03 Sep 2022 23:06:32 GMT
    Content-Length: 31
    Content-Type: text/plain; charset=utf-8
    Hola! You are visitor #120222!
With no service interruption the new software version rolls out and quietly starts working. The test report reiterates that no failures occurred as a result of the deployment:
    Concurrency Level:      10
    Time taken for tests:   30.002 seconds
    Complete requests:      28988
    Failed requests:        0
    Total transferred:      4296068 bytes
    HTML transferred:       904472 bytes
    Requests per second:    966.20 [#/sec] (mean)
As with all things, there are a few gotchas. Firstly, the above test does not make use of HTTP keep-alive (persistent connections); turning it on caused ~16 failed requests. The good news, though, is that this is fixable by sticking it all behind a load balancer like HAProxy. With the load balancer fronting the requests, the clients' keep-alive connections are held by the load balancer while the server restarts, and the socket can buffer requests so none are dropped. This is super cool to me.
In the same vein, long-lived connections like a websocket would likely suffer a similar failure without such an easy fix. This isn't a Go/HAProxy/systemd problem, though; it's just how long-lived connections tend to work in my experience. The only workaround I'm familiar with is to leave the old worker processes running for as long as their connections exist while new requests are served by the upgraded software, which has always been fraught with operational snags for me.
While I am happy with my findings so far, I was stumped by how such a workflow of reattaching a portable service was intended to be integrated with sysupdate. The image/package is pulled, but then what? Truth be told, I haven't figured it out. It seems I'm not the only one with this question, though: there is an unanswered email to the systemd-devel mailing list asking as much. It seems there isn't a totally automatic way to roll out updates to portable services, but this is rapidly approaching good enough for me. Perfect is the enemy of good, after all.
I will keep an eye out for automated techniques for triggering restarts for portable services, but in the meantime I can take my networking Swiss army knife and a stock Linux machine and get zero-downtime, reasonably secure internet services with a simple deployment story.