Visualizing Monitoring Metrics

2018-01-28
I recently experimented with system monitoring but stopped short of creating any reporting on the metrics gathered. I took some time to draw up a quick interface to surface some of that information in a further exploration of managing my own server.
As mentioned when I built the metric collection application, all data is stored in a SQLite database in a single table with the following schema:
```sql
CREATE TABLE metrics (
    key TEXT,
    value FLOAT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL
);
```
This makes writes very easy, and allows new "interesting" things to be written to the database without needing migrations first; new information can be inserted directly under a new key.
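To illustrate (a minimal sketch using Python's built-in sqlite3 module against an in-memory database, rather than the real on-disk one; the `swap` key is made up), recording a brand-new metric is just an INSERT:

```python
import sqlite3

# in-memory stand-in for the real metrics database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics ( key TEXT, value FLOAT, "
            "timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL );")

# a brand-new "interesting" thing needs no migration, just a new key
con.execute("INSERT INTO metrics (key, value) VALUES ('swap', 0.12)")
con.commit()

rows = con.execute("SELECT key, value FROM metrics").fetchall()
print(rows)
```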
My previous standard practice for monitoring this server was to occasionally SSH in and run `top`, manually verifying things looked "fine" (without ever really bothering to outline what "fine" looked like). Keeping with my standard of "the best is the enemy of the good," I've opted to simply ease my own access to system state rather than try to automate any checks. Influenced heavily by the original project impetus, I set out to create a chart capturing the things I was regularly watching.
A single SQL query is sufficient to extract all the necessary data; it is necessary, however, to do a bit of JOINing of the table against itself to de-dupe entries into a single tuple (row) per timestamp. Here I limit the query to only the last 7 days of data:
```sql
SELECT strftime('%s', memory.timestamp),
       memory.value * 100,
       cpu.value,
       disk.value
FROM metrics AS memory, metrics AS cpu, metrics AS disk
WHERE memory.timestamp = cpu.timestamp
  AND disk.timestamp = cpu.timestamp
  AND memory.key = 'memory'
  AND cpu.key = 'cpu'
  AND disk.key = 'disk_/'
  AND memory.timestamp > date('now', '-7 day')
ORDER BY memory.timestamp ASC;
```
Annoyingly, it occurred to me only now, as I try to use the data, that my metric for available memory is a bit inconsistent with disk usage: disk is reported as a percentage, while memory is reported as the quotient of available/total, which means I end up multiplying by 100 in the query itself to normalize the two numbers to percentages.
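To sanity-check the self-join and the ×100 normalization, here is a small sketch against synthetic rows (again via Python's sqlite3; the timestamp and values are invented, and the 7-day filter is dropped so the fixed sample timestamp matches):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics ( key TEXT, value FLOAT, "
            "timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL );")

# synthetic sample: one collection run, all three keys share a timestamp
ts = "2018-01-28 12:00:00"
con.executemany("INSERT INTO metrics (key, value, timestamp) VALUES (?, ?, ?)",
                [("memory", 0.5, ts),    # fraction available/total
                 ("cpu", 0.05, ts),      # 5-minute load average
                 ("disk_/", 31.0, ts)])  # already a percentage

rows = con.execute("""
    SELECT strftime('%s', memory.timestamp), memory.value*100, cpu.value, disk.value
    FROM metrics AS memory, metrics AS cpu, metrics AS disk
    WHERE memory.timestamp = cpu.timestamp AND disk.timestamp = cpu.timestamp
      AND memory.key = 'memory' AND cpu.key = 'cpu' AND disk.key = 'disk_/'
    ORDER BY memory.timestamp ASC""").fetchall()
print(rows)  # one tuple per timestamp, memory now on a 0-100 scale
```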
I chose gnuplot for the graphing, for no real reason other than it integrates nicely with org-mode, where I do a lot of this kind of exploratory work. The final script ends up like the following:
```gnuplot
reset
set palette defined (2 '#1485d4', 4 '#38b99e', 6 '#d9ba56')
set style line 101 lc rgb '#606060' lt 1 lw 1
set border 3 front ls 101
set tics nomirror out scale 0.75
set format '%g'
set style line 102 lc rgb '#808080' lt 0 lw 1
set grid back ls 102
set terminal svg size 900,350
set style fill transparent solid 0.4 noborder
set xdata time
set timefmt "%s"
set yrange [0:100]
set xtics nomirror
set ylabel 'Percentage, Memory and Disk'
set y2tics nomirror
set y2label 'Load Average'
plot data using 1:2 with filledcurves x1 lc 4 notitle,\
     '' using 1:2 with lines lw 2 lc 4 title 'Free Memory',\
     '' using 1:4 with filledcurves x1 lc 2 notitle,\
     '' using 1:4 with lines lw 2 lc 2 title 'Disk Usage',\
     '' using 1:3 with filledcurves x1 lc 6 axes x1y2 notitle,\
     '' using 1:3 with lines lw 2 lc 6 axes x1y2 title '5 Minute Load Average'
```
The script is only notable for the fact that it actually charts each value twice: first for the semi-transparent filled area, and second for the solid line of the same color. I had some trouble with the formatting of the xtics as dates; with too much customization they would inevitably creep upward into the plot itself. I gave up after a while and settled for the standard time format.
It may seem a little odd to combine two reports of percentage usage with the load-average metric, but I decided the dual-y-axis plot didn't over-complicate the end result. If my server saw more traffic or higher utilization, I might reconsider, but as it is, load average is routinely within epsilon of zero.
I first wrapped the above into a shell script, which I won't bother repeating here, saving it as `plot.sh`. The last step in the script is to move the generated SVG into a path accessible to my webserver, so that I can check the state of the server without needing to SSH in or transfer files. The "live" view is visible here.
From there, the only thing left to do is set up a scheduled run of the script at some sensible interval. For reasons that are unclear, I decided I would use a systemd timer rather than the more obvious cronjob. I suppose I thought "it's 2018 and I should get with it" (silly idea). The gist of it, though, is a two-parter: first there is a timer unit that defines when something should happen (call it `monitoring-plot.timer`):
```ini
[Unit]
Description=Recreate system monitoring plot every 15 minutes

[Timer]
OnCalendar=*:0/15

# without an [Install] section, `systemctl enable` has nothing to hook into
[Install]
WantedBy=timers.target
```
and second is the service itself (call it `monitoring-plot.service`):
```ini
[Unit]
Description=Recreate system monitoring plot

[Service]
Type=oneshot
ExecStart=/bin/sh /path/to/plot.sh
```
It took an embarrassingly long while to figure out the above setup, and the worst part is I'm not sure it is actually correct. It is unclear whether I should be invoking wrapper scripts inside of service files or if there is a different pattern I don't know of. Equally painful, I think I have the various pieces enabled correctly, but that was mostly through a series of failed attempts. I believe the winning combination ended up being:
```shell
$ sudo systemctl enable monitoring-plot.service
$ sudo systemctl enable monitoring-plot.timer
$ sudo systemctl start monitoring-plot.timer
```
After all is said and done, I'm happy to have made use of the monitoring service I created previously. I have a few minor complaints now that I've put things through their paces, but nothing I didn't suspect already. It turns out my ad-hoc schema is a bit painful for dealing with the key-value time-series data I'm tracking; I mentioned this possibility when I first defined the schema, though I had actually forgotten about it. If I were developing more on top of this, I would certainly want to clean some things up, specifically:
- Abstract writes to the database better through a client.
- Normalize the "key" portion of the database schema through a secondary table.
- A single chart is probably "good enough", but scale is difficult; perhaps break things out into "recent" and "less recent" views for a coarser picture. I think this might require tinkering with the plotting to avoid overwhelming the chart with information.
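For what it's worth, the key-normalization item might look something like the following sketch (hypothetical table names, again via Python's sqlite3 against an in-memory database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- hypothetical normalized layout: keys live in their own table
CREATE TABLE metric_keys (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE metrics (
    key_id    INTEGER NOT NULL REFERENCES metric_keys(id),
    value     FLOAT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL
);
""")

# writes register the key once and reference it by id thereafter
key_id = con.execute(
    "INSERT INTO metric_keys (name) VALUES ('memory')").lastrowid
con.execute("INSERT INTO metrics (key_id, value) VALUES (?, ?)", (key_id, 0.5))
con.commit()

rows = con.execute("""
    SELECT metric_keys.name, metrics.value
    FROM metrics JOIN metric_keys ON metrics.key_id = metric_keys.id
""").fetchall()
print(rows)
```

This keeps the no-migration flexibility for values while giving the keys a single authoritative home (and a place to hang metadata like units, which would have avoided the percentage-versus-fraction mismatch above).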