Visualizing Monitoring Metrics

2018-01-28
I recently experimented with system monitoring but stopped short of creating any reporting on the metrics gathered. I took some time to draw up a quick interface to surface some of that information in a further exploration of managing my own server.
As mentioned when I built the metric collection application, all data is stored in a SQLite database in a single table with the following schema:
```sql
CREATE TABLE metrics (
    key TEXT,
    value FLOAT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL
);
```
This makes writes very easy, and allows new "interesting" things to be written to the database without needing migrations first; new information can be inserted directly under a new key.
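To illustrate (a minimal sketch using Python's built-in sqlite3 module against an in-memory database, rather than the real on-disk one; the `swap` key is made up), recording a brand-new metric is just an INSERT:

```python
import sqlite3

# in-memory stand-in for the real metrics database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics ( key TEXT, value FLOAT, "
            "timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL );")

# a brand-new "interesting" thing needs no migration, just a new key
con.execute("INSERT INTO metrics (key, value) VALUES ('swap', 0.12)")
con.commit()

rows = con.execute("SELECT key, value FROM metrics").fetchall()
print(rows)
```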
My previous standard practice for monitoring this server was to occasionally SSH in and run `top`, manually verifying things looked "fine" (without ever really bothering to outline what "fine" looked like). Keeping with my standard of "the best is the enemy of the good," I've opted to simply ease my own access to system state rather than try to automate any checks. Influenced heavily by the original project impetus, I set out to create a chart capturing the things I was regularly watching.
A single SQL query is sufficient to extract all the necessary data; it is necessary, however, to do a bit of JOINing of the table against itself to de-dupe entries into a single tuple (row) per timestamp. Here I limit the query to only the last 7 days of data:
```sql
SELECT strftime('%s', memory.timestamp),
       memory.value * 100,
       cpu.value,
       disk.value
FROM metrics AS memory, metrics AS cpu, metrics AS disk
WHERE memory.timestamp = cpu.timestamp
  AND disk.timestamp = cpu.timestamp
  AND memory.key = 'memory'
  AND cpu.key = 'cpu'
  AND disk.key = 'disk_/'
  AND memory.timestamp > date('now', '-7 day')
ORDER BY memory.timestamp ASC;
```
Annoyingly, it occurred to me only now, as I try to use the data, that my metric for available memory is a bit inconsistent with disk usage: disk is reported as a percentage, while memory is reported as the quotient of available/total, which means I end up multiplying by 100 in the query itself to normalize the two numbers to percentages.
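To sanity-check the self-join and the ×100 normalization, here is a small sketch against synthetic rows (again via Python's sqlite3; the timestamp and values are invented, and the 7-day filter is dropped so the fixed sample timestamp matches):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics ( key TEXT, value FLOAT, "
            "timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL );")

# synthetic sample: one collection run, all three keys share a timestamp
ts = "2018-01-28 12:00:00"
con.executemany("INSERT INTO metrics (key, value, timestamp) VALUES (?, ?, ?)",
                [("memory", 0.5, ts),    # fraction available/total
                 ("cpu", 0.05, ts),      # 5-minute load average
                 ("disk_/", 31.0, ts)])  # already a percentage

rows = con.execute("""
    SELECT strftime('%s', memory.timestamp), memory.value*100, cpu.value, disk.value
    FROM metrics AS memory, metrics AS cpu, metrics AS disk
    WHERE memory.timestamp = cpu.timestamp AND disk.timestamp = cpu.timestamp
      AND memory.key = 'memory' AND cpu.key = 'cpu' AND disk.key = 'disk_/'
    ORDER BY memory.timestamp ASC""").fetchall()
print(rows)  # one tuple per timestamp, memory now on a 0-100 scale
```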
I chose gnuplot for the graphing, for no real reason other than it integrates nicely with org-mode, where I do a lot of this kind of exploratory work. The final script ends up like the following:
```gnuplot
reset
set palette defined (2 '#1485d4', 4 '#38b99e', 6 '#d9ba56')
set style line 101 lc rgb '#606060' lt 1 lw 1
set border 3 front ls 101
set tics nomirror out scale 0.75
set format '%g'
set style line 102 lc rgb '#808080' lt 0 lw 1
set grid back ls 102
set terminal svg size 900,350
set style fill transparent solid 0.4 noborder
set xdata time
set timefmt "%s"
set yrange [0:100]
set xtics nomirror
set ylabel 'Percentage, Memory and Disk'
set y2tics nomirror
set y2label 'Load Average'
plot data using 1:2 with filledcurves x1 lc 4 notitle,\
     '' using 1:2 with lines lw 2 lc 4 title 'Free Memory',\
     '' using 1:4 with filledcurves x1 lc 2 notitle,\
     '' using 1:4 with lines lw 2 lc 2 title 'Disk Usage',\
     '' using 1:3 with filledcurves x1 lc 6 axes x1y2 notitle,\
     '' using 1:3 with lines lw 2 lc 6 axes x1y2 title '5 Minute Load Average'
```
The script is only notable for the fact that it actually charts each value twice: first for the semi-transparent filled area, and second for the solid line of the same color. I had some trouble with the formatting of the xtics as dates; with too much customization they would inevitably creep upward into the plot itself. I gave up after a while and settled for the standard time format.
It may seem a little odd to combine two reports of percentage usage with the load-average metric, but I decided the dual-y-axis plot didn't over-complicate the end result. If my server saw more traffic or higher utilization, I might reconsider, but as it is, load average is routinely within epsilon of zero.
I first wrapped the above into a shell script, which I won't bother repeating here, saving it as `plot.sh`. The last step in the script is to move the generated SVG into a path accessible to my webserver, so that I can check the state of the server without needing to SSH in or transfer files. The "live" view is visible here.
From there, the only thing left to do is set up a scheduled run of the script at some sensible interval. For reasons that are unclear, I decided I would use a systemd timer rather than the more obvious cronjob. I suppose I thought "it's 2018 and I should get with it" (silly idea). The gist of it, though, is a two-parter: first there is a timer unit that defines when something should happen (call it `monitoring-plot.timer`):
```ini
[Unit]
Description=Recreate system monitoring plot every 15 minutes

[Timer]
OnCalendar=*:0/15

# without an [Install] section, `systemctl enable` has nothing to hook into
[Install]
WantedBy=timers.target
```
and second is the service itself (call it `monitoring-plot.service`):
```ini
[Unit]
Description=Recreate system monitoring plot

[Service]
Type=oneshot
ExecStart=/bin/sh /path/to/plot.sh
```
It took an embarrassingly long while to figure out the above setup, and the worst part is I'm not sure it is actually correct. It is unclear whether I should be invoking wrapper scripts inside of service files or if there is a different pattern I don't know of. Equally painful, I think I have the various pieces enabled correctly, but that was mostly through a series of failed attempts. I believe the winning combination ended up being:
```shell
$ sudo systemctl enable monitoring-plot.service
$ sudo systemctl enable monitoring-plot.timer
$ sudo systemctl start monitoring-plot.timer
```
After all is said and done, I'm happy to have made use of the monitoring service I created previously. I have a few minor complaints now that I've put things through their paces, but nothing I didn't suspect already. It turns out my ad-hoc schema is a bit painful for dealing with the key-value time-series data I'm tracking; I mentioned this possibility when I first defined the schema, though I had actually forgotten about it. If I were developing more on top of this, I would certainly want to clean some things up, specifically:
- Abstract writes to the database better through a client.
- Normalize the "key" portion of the database schema through a secondary table.
- A single chart is probably "good enough", but scale is difficult; perhaps break things out into "recent" and "less recent" views for a coarser picture. I think this might require tinkering with the plotting to avoid overwhelming the chart with information.
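For what it's worth, the key-normalization item might look something like the following sketch (hypothetical table names, again via Python's sqlite3 against an in-memory database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- hypothetical normalized layout: keys live in their own table
CREATE TABLE metric_keys (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE metrics (
    key_id    INTEGER NOT NULL REFERENCES metric_keys(id),
    value     FLOAT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL
);
""")

# writes register the key once and reference it by id thereafter
key_id = con.execute(
    "INSERT INTO metric_keys (name) VALUES ('memory')").lastrowid
con.execute("INSERT INTO metrics (key_id, value) VALUES (?, ?)", (key_id, 0.5))
con.commit()

rows = con.execute("""
    SELECT metric_keys.name, metrics.value
    FROM metrics JOIN metric_keys ON metrics.key_id = metric_keys.id
""").fetchall()
print(rows)
```

This keeps the no-migration flexibility for values while giving the keys a single authoritative home (and a place to hang metadata like units, which would have avoided the percentage-versus-fraction mismatch above).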