Lately I've been thinking about database design and the implications for query performance. I have found myself repeatedly arguing against architectural complexity in favor of "just let the database do that" and I thought it was probably a good idea to further develop an intuition for performance against some hard numbers.
Here is a lightly edited example of a problem I've been discussing recently. There are hardware devices; they have names; their firmware can be updated, but there is no guarantee that a notification indicates an update (duplicates are possible); and other systems reference the devices by a separate scheme of mutable IDs that map back to a device. A device communicates version information via a loosely constrained JSON object (more fields are possible), such as:
{
"make": "foo",
"model": "bar",
"version": "V0.1.2"
}
The goal is to track a history of firmware updates per-device as well as support answering "what is the current firmware?".
I think the proposed schema is decidedly boring (in a good way) and the only thing that may require explanation is the internal_id on the device_id_internal_id table — it is very much an artifact of the system I'm usually working in but presents an interesting point for interfacing legacy systems with new designs. The old way of identifying things is through a composite identifier I've mentioned in passing before; it isn't a good identifier, being mutable and oftentimes treated as a prefix because of the "parts" that compose it. Most of that complexity can be ignored here as I'm only keeping a 1:1 table with this new system's own identifying information (device names and internal primary keys). The indirection allows for much greater flexibility in building better abstractions and interfaces than are otherwise possible when carrying around janky legacy identifiers as the only unique identifier. Also fun to demonstrate is the possibility of generated columns, which seem underutilized in lots of systems.
create table if not exists device
(
id integer primary key,
device_name,
unique (device_name)
);
create table if not exists firmware_notification
(
data, -- playing fast and loose with JSON here
update_time,
device_id,
foreign key (device_id) references device (id),
unique (device_id, data)
);
create table if not exists device_id_internal_id
(
device_id,
part_a_id,
part_b_id,
part_c_id,
internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
foreign key (device_id) references device (id),
unique (device_id)
);
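The generated column is easy to sanity-check from Python's bundled sqlite3 module. Here is a sketch against a trimmed copy of the schema above (generated columns need SQLite 3.31 or later, and note the full spelling is "generated always as"):

```python
import sqlite3

# A trimmed copy of the schema above; internal_id is derived entirely
# from the part columns, so callers never write it directly.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table device_id_internal_id
    (
        device_id,
        part_a_id,
        part_b_id,
        part_c_id,
        internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
        unique (device_id)
    );
""")
con.execute(
    "insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id) "
    "values (1, '0001', '02', '03')"
)
print(con.execute("select internal_id from device_id_internal_id").fetchone()[0])
# -> 0001-02-03
```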
With some of those constraints in place it is possible to lean on the database for supporting otherwise tedious update scenarios on-insert. Specifically I'm thinking of the ability to follow changes to internal IDs while maintaining a canonical reference to the device. An insert such as this will smoothly update records to track the latest internal ID while keeping history for the device itself:
insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
select d.id,
change_notification.column1, -- "change_notification" is messaged from the device
change_notification.column2,
change_notification.column3
from (values ('0001', '02', '03'),
('0001', '03', '04')) as change_notification,
(select id from device where device_name = 'some_name') as d
where true
on conflict(device_id) do update set part_a_id = excluded.part_a_id,
part_b_id = excluded.part_b_id,
part_c_id = excluded.part_c_id;
The uniqueness constraint on the device ID means the internal ID is easily updated because it is purely referential for the external systems. Should those external systems one day change their notion of uniqueness, it doesn't have to impact this system.
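That upsert behavior can be sketched end-to-end with Python's bundled sqlite3 module, with the schema trimmed to what the insert touches; two change notifications for the same device collapse into one row holding the latest parts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table device (id integer primary key, device_name, unique (device_name));
    create table device_id_internal_id
    (
        device_id,
        part_a_id,
        part_b_id,
        part_c_id,
        internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
        unique (device_id)
    );
""")
con.execute("insert into device (device_name) values ('some_name')")

# Two change notifications for the same device; the second should win.
for a, b, c in [("0001", "02", "03"), ("0001", "03", "04")]:
    con.execute("""
        insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
        select id, ?, ?, ? from device where device_name = 'some_name'
        on conflict(device_id) do update set part_a_id = excluded.part_a_id,
                                             part_b_id = excluded.part_b_id,
                                             part_c_id = excluded.part_c_id
    """, (a, b, c))

# One row per device, holding the latest internal ID.
print(con.execute("select count(*), internal_id from device_id_internal_id").fetchall())
# -> [(1, '0001-03-04')]
```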
Another nice thing the database can do is deduplicate data. For my problem space there are frequent messages from devices signaling their firmware version but it is possible, usual even, for the same information to be re-sent — not every message indicates a change. In the series of times below, the notifications that do not carry a different version will replace each other, while the actual version changes will be tracked individually.
insert into firmware_notification (data, update_time, device_id)
select notifications.column1, notifications.column2, d.id
from (values ('{"make":"foo","model":"bar","version":"V0.1.2"}', '2025-01-01T02:00:00'),
('{"make":"foo","model":"bar","version":"V0.1.2"}', '2025-01-01T02:05:00'),
('{"make":"foo","model":"bar","version":"V0.1.2"}', '2025-01-01T02:10:00'),
('{"make":"foo","model":"bar","version":"V0.1.3"}', '2025-01-01T03:00:00'),
('{"make":"foo","model":"bar","version":"V0.1.4"}', '2025-01-01T04:00:00')) as notifications,
(select id from device where device_name = 'some_name') as d
where true
on conflict(device_id, data) do update set update_time = excluded.update_time;
Using an upsert against the contents of the message allows the table to track a single record per device and simply update with the last time we positively identified the firmware version in use.
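The same deduplication is easy to demonstrate from Python's bundled sqlite3 module; the payloads below mirror the example insert above, with three identical notifications collapsing into one row stamped with the latest time:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table device (id integer primary key, device_name, unique (device_name));
    create table firmware_notification
    (
        data,
        update_time,
        device_id,
        unique (device_id, data)
    );
""")
con.execute("insert into device (device_name) values ('some_name')")

notifications = [
    ('{"make":"foo","model":"bar","version":"V0.1.2"}', "2025-01-01T02:00:00"),
    ('{"make":"foo","model":"bar","version":"V0.1.2"}', "2025-01-01T02:05:00"),
    ('{"make":"foo","model":"bar","version":"V0.1.2"}', "2025-01-01T02:10:00"),
    ('{"make":"foo","model":"bar","version":"V0.1.3"}', "2025-01-01T03:00:00"),
]
for data, at in notifications:
    con.execute("""
        insert into firmware_notification (data, update_time, device_id)
        select ?, ?, id from device where device_name = 'some_name'
        on conflict(device_id, data) do update set update_time = excluded.update_time
    """, (data, at))

# The duplicates collapsed; only the latest sighting of each version remains.
for row in con.execute("select data, update_time from firmware_notification order by update_time"):
    print(row)
```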
I'm interested in trying to characterize performance for different queries and scope out necessary indexes. To accomplish that I need enough data to be representative of the problem. There's no shortcut to understanding the domain, so I won't belabor how I landed on specific values here; they're just a guess at what might be in the ballpark of a representative sample size. More interesting is perhaps the approach to bulk data generation using SQL, which I've looked into some before. Perhaps worth explaining, given the previous post's use of generate_series: I'm using recursive CTEs here because I am also trying out a new IDE rather than using the SQLite shell directly and I can't immediately tell how to load an extension.
with recursive numbers as (
select 1 as n
union all
select n + 1
from numbers
where n < 100000
)
insert into device (device_name)
select printf('dev%07d', n)
from numbers;
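For a quick check that the counter behaves, here is the same recursive CTE scaled down to 1,000 rows and driven from Python's bundled sqlite3 module; the names come out zero-padded and dense:

```python
import sqlite3

# The same recursive counter as above, scaled down so it runs instantly.
con = sqlite3.connect(":memory:")
con.execute("create table device (id integer primary key, device_name, unique (device_name))")
con.execute("""
    with recursive numbers as (
        select 1 as n
        union all
        select n + 1 from numbers where n < 1000
    )
    insert into device (device_name)
    select printf('dev%07d', n) from numbers
""")
print(con.execute("select count(*), min(device_name), max(device_name) from device").fetchone())
# -> (1000, 'dev0000001', 'dev0001000')
```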
This next insert is probably needlessly complex but does achieve a fairly uniform distribution of randomized IDs to devices. I'll admit I mostly used it to test the on-conflict logic.
with recursive
device_changes as (select d.id as device_id,
abs(random() % 10) + 1 as num_changes -- 1 to 10 changes per device
from device d),
change_sequence as (select device_id,
1 as change_num,
num_changes
from device_changes
union all
select device_id,
change_num + 1,
num_changes
from change_sequence
where change_num < num_changes)
insert
into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
select device_id,
printf('%04d', abs(random() % 10000)) as part_a_id,
printf('%02d', abs(random() % 100)) as part_b_id,
printf('%02d', abs(random() % 100)) as part_c_id
from change_sequence
where true
on conflict(device_id) do update set part_a_id = excluded.part_a_id,
part_b_id = excluded.part_b_id,
part_c_id = excluded.part_c_id;
At the cost of some complexity to this insert I opted to generate randomized JSON data rather than repeat the same piece of data over and over. The intent is to give at least a more realistic picture without requiring too complex a data generator while conceding that chronological inserts aren't terribly necessary (so time offsets are random instead of linear).
with recursive
device_notifications as (select d.id as device_id,
abs(random() % 12) + 1 as num_notifications -- 1 to 12 notifications per device
from device d),
notification_sequence as (select device_id,
1 as notification_num,
num_notifications
from device_notifications
union all
select device_id,
notification_num + 1,
num_notifications
from notification_sequence
where notification_num < num_notifications)
insert
into firmware_notification (data, update_time, device_id)
select json_object(
'make',
case abs(random() % 3)
when 0 then 'vendorA'
when 1 then 'vendorB'
else 'vendorC'
end,
'model',
case abs(random() % 4)
when 0 then 'modelX'
when 1 then 'modelY'
when 2 then 'modelZ'
else 'modelW'
end,
'version',
printf('V%d.%d.%d',
abs(random() % 5),
abs(random() % 10),
abs(random() % 10)
)
) as data,
datetime(
'2024-01-01 00:00:00',
printf('+%d seconds', abs(random() % (365 * 24 * 60 * 60)))
) as update_time,
device_id
from notification_sequence
where true
on conflict(device_id, data) do update set update_time = excluded.update_time;
The results of these bulk data inserts are as follows:
> select count(*) from device;
100000
> select count(*) from device_id_internal_id ;
100000
> select count(*) from firmware_notification ;
650795
In this example I'll cover two query patterns whose performance I'm interested in exploring: querying firmware change history per-device and querying the latest firmware (as an indication of what is in use). The simpler of the two is the per-device history, which only has to join the different tables together:
select device_name,
internal_id,
firmware_notification.data ->> 'version' as version,
firmware_notification.update_time
from device
join firmware_notification on device.id = firmware_notification.device_id
join device_id_internal_id on device.id = device_id_internal_id.device_id
where device_name = 'some_name'
order by update_time desc;
Picking a device name at random, this one is already pretty good:
QUERY PLAN
|--SCAN firmware_notification
|--BLOOM FILTER ON device (id=?)
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN device_id_internal_id
`--USE TEMP B-TREE FOR ORDER BY
DEV0020663|4000-97-72|V1.7.4|2024-12-26 17:03:56
DEV0020663|4000-97-72|V0.5.1|2024-11-11 21:19:43
DEV0020663|4000-97-72|V3.2.3|2024-10-09 15:21:00
DEV0020663|4000-97-72|V3.8.8|2024-08-28 22:00:37
Run Time: real 0.071 user 0.044757 sys 0.024575
More complicated is querying the latest version of every device:
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
device_id_internal_id.internal_id
from device
join
(select device_id,
data,
update_time,
row_number() over (partition by device_id order by update_time desc) as rn
from firmware_notification) fw
on device.id = fw.device_id and fw.rn = 1
join device_id_internal_id
on device.id = device_id_internal_id.device_id;
This one was awful to run; my laptop fan was pushing so hard I had time to wonder whether I was going to trip some thermal protection on the CPU.
QUERY PLAN
|--CO-ROUTINE fw
| |--CO-ROUTINE (subquery-3)
| | |--SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1
| | `--USE TEMP B-TREE FOR LAST TERM OF ORDER BY
| `--SCAN (subquery-3)
|--SCAN device_id_internal_id
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
|--BLOOM FILTER ON fw (rn=?)
`--SEARCH fw USING AUTOMATIC PARTIAL COVERING INDEX (rn=?)
...
Run Time: real 723.393 user 495.007924 sys 224.768183
The above is the first thing that occurred to me: ranking each firmware update by time per device and then joining to the other tables using only the first in time (most recent). I'm especially interested in trying to eke out performance on this query because I expect it'll be the most frequent. Thinking about it further I am curious to compare with a slightly different approach using max, thinking it may prove cheaper to collect only the one value instead of collecting all of them and then discarding most.
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from device
join
(select max(firmware_notification.update_time) update_time, data, device_id
from firmware_notification
group by device_id) fw on device.id = fw.device_id
join device_id_internal_id on device.id = device_id_internal_id.device_id;
Not as slow as the window query, but the performance is still abysmal out of the gate.
QUERY PLAN
|--CO-ROUTINE fw
| `--SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1
|--SCAN fw
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
`--SCAN device_id_internal_id
...
Run Time: real 398.004 user 344.102231 sys 51.005167
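Worth calling out before digging into the plan: the max variant leans on a SQLite-specific guarantee about "bare columns" in aggregate queries. When a query contains a single min() or max() aggregate, non-aggregated columns like data are taken from the row that produced the extreme value, so the version and time stay paired. A small sketch of that behavior (documented SQLite behavior, not portable SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table firmware_notification (data, update_time, device_id)")
con.executemany(
    "insert into firmware_notification values (?, ?, ?)",
    [
        ('{"version":"V0.1.2"}', "2025-01-01T02:00:00", 1),
        ('{"version":"V0.1.4"}', "2025-01-01T04:00:00", 1),
        ('{"version":"V0.1.3"}', "2025-01-01T03:00:00", 1),
    ],
)
# "data" is a bare column in this aggregate: SQLite documents that it is
# taken from the same row that supplied max(update_time).
print(con.execute("""
    select max(update_time), data
    from firmware_notification
    group by device_id
""").fetchone())
# -> ('2025-01-01T04:00:00', '{"version":"V0.1.4"}')
```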
I have to admit I spent quite a while fumbling with different indexing strategies based on the query plans above for "latest version" reporting. I should know better but I didn't really spend long enough thinking about what the performance is saying. I'll try to break it down for the query that uses max because it is simpler and ended up being fastest.
SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1: SQLite executes this subquery in a coroutine. The documentation explains:
The outer query blocks whenever it needs another row of input from the subquery. Control switches to the co-routine which produces the desired output row, then control switches back to the main routine which continues processing. This is an alternative to using a temporary table and may be preferred because the query results are only visited once.
SCAN fw: the coroutine necessarily has to be scanned because I'm collecting every result from the "latest" values, all 100,000 of them.
SEARCH device USING INTEGER PRIMARY KEY (rowid=?): we necessarily also have to visit every one of the device rows; using the primary key seems about as good as it'll get.
SCAN device_id_internal_id: every row of the device_id_internal_id table will end up collected to build the final result, but the use of SCAN here is anomalous.
So why is clever indexing probably not the correct first move here? Important to note is the unfortunate-looking SCAN operation in the query plan when using max, but I think equally important to consider is that it shows up in the window function variant of the query too. Why isn't the uniqueness constraint or foreign key being used to search instead of scan? To give things away: it is a result of the join condition I supplied. join device_id_internal_id on device.id = device_id_internal_id.device_id joins against the device table instead of the subquery "fw" table.
Compare the difference between these two:
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from device
join
(select max(firmware_notification.update_time) update_time, data, device_id
from firmware_notification
group by device_id) fw on fw.device_id = device.id
join device_id_internal_id on device.id = device_id_internal_id.device_id;
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from device
join
(select max(firmware_notification.update_time) update_time, data, device_id
from firmware_notification
group by device_id) fw on fw.device_id = device.id
join device_id_internal_id on fw.device_id = device_id_internal_id.device_id;
And now the query plan and performance for the "latest firmware for every device" query:
QUERY PLAN
|--CO-ROUTINE fw
| `--SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1
|--SCAN fw
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
`--SEARCH device_id_internal_id USING INDEX sqlite_autoindex_device_id_internal_id_1 (device_id=?)
...
Run Time: real 0.367 user 0.331444 sys 0.033939
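To make the comparison reproducible, here is a scaled-down harness (Python's sqlite3, a made-up two-device dataset) that runs both join spellings and prints their plans. The plan text varies with SQLite version and table sizes, so treat the printed plans as illustrative; what is guaranteed is that the returned rows are identical either way:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table device (id integer primary key, device_name, unique (device_name));
    create table firmware_notification (data, update_time, device_id, unique (device_id, data));
    create table device_id_internal_id
    (
        device_id, part_a_id, part_b_id, part_c_id,
        internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
        unique (device_id)
    );
    insert into device (device_name) values ('dev0000001'), ('dev0000002');
    insert into firmware_notification values
        ('{"version":"V0.1.2"}', '2025-01-01T02:00:00', 1),
        ('{"version":"V0.1.3"}', '2025-01-01T03:00:00', 1),
        ('{"version":"V0.1.2"}', '2025-01-01T02:30:00', 2);
    insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
        values (1, '0001', '02', '03'), (2, '0001', '03', '04');
""")

query = """
    select device.device_name, fw.update_time, internal_id
    from device
    join (select max(update_time) update_time, device_id
          from firmware_notification group by device_id) fw
        on fw.device_id = device.id
    join device_id_internal_id on {join_condition}
"""
# Print the plan for each spelling of the final join condition.
for cond in ("device.id = device_id_internal_id.device_id",
             "fw.device_id = device_id_internal_id.device_id"):
    for row in con.execute("explain query plan " + query.format(join_condition=cond)):
        print(row[3])  # the human-readable plan detail column
    print("---")
```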
So, yeah. Important to remember that despite carrying the "same" data, the way in which a table is joined carries a lot of weight with the query planner.
Despite both access patterns now being very fast, even in the "full table scan" scenario, it is too tempting not to try improving query performance where possible. Beginning with the per-device history: indexing the device ID and update time gets a bit more than a 40% improvement in performance:
> create index idx_firmware_device_update
on firmware_notification(device_id, update_time desc);
QUERY PLAN
|--SCAN device_id_internal_id
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN firmware_notification USING INDEX idx_firmware_device_update
`--USE TEMP B-TREE FOR ORDER BY
DEV0020663|4000-97-72|V1.7.4|2024-12-26 17:03:56
DEV0020663|4000-97-72|V0.5.1|2024-11-11 21:19:43
DEV0020663|4000-97-72|V3.2.3|2024-10-09 15:21:00
DEV0020663|4000-97-72|V3.8.8|2024-08-28 22:00:37
Run Time: real 0.042 user 0.038475 sys 0.003671
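A quick check that the new index actually gets picked for the per-device pattern can be done against an empty table (plan text varies across SQLite versions, but the index name should show up for a device lookup ordered by update_time desc):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table firmware_notification (data, update_time, device_id, unique (device_id, data));
    create index idx_firmware_device_update
        on firmware_notification (device_id, update_time desc);
""")
# The index covers both the equality lookup and the sort direction, so the
# planner should prefer it over the (device_id, data) autoindex here.
plan = [row[3] for row in con.execute("""
    explain query plan
    select data, update_time
    from firmware_notification
    where device_id = 1
    order by update_time desc
""")]
for line in plan:
    print(line)
```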
And delightfully the same index benefits the latest-firmware-for-all-devices query too, to the tune of about 30%:
QUERY PLAN
|--CO-ROUTINE fw
| `--SCAN firmware_notification USING INDEX idx_firmware_device_update
|--SCAN fw
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
`--SEARCH device_id_internal_id USING INDEX sqlite_autoindex_device_id_internal_id_1 (device_id=?)
...
Run Time: real 0.237 user 0.208258 sys 0.027774
In an earlier iteration on this idea I mistakenly introduced some duplication into the query results when I applied a cross join between two tables by using the form FROM table_a, table_b WHERE ... instead of using the explicit JOIN ... ON. Considering the insidious nature of the issue I figured making the stylistic choice to avoid comma-style joins entirely might be warranted. Even for intentional uses of CROSS JOIN, typing the words out probably better conveys intent. In the case of this performance gaffe caused by join conditions I'm forced to consider if there's something similar to be done. I am inclined to say USING can help with readability, but it turns out that even renaming device.id to device.device_id so that all the joins can instead use JOIN <table> USING(device_id) isn't sufficient. Instead the key seems to be hoisting the subquery out to the first table selection and then joining from there. I might be biased with the knowledge that this new form is also key to good performance but I think it reads better too:
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from (select max(firmware_notification.update_time) update_time,
data,
device_id
from firmware_notification
group by device_id) fw
join device using(device_id)
join device_id_internal_id using(device_id);
There is something deeply funny in realizing I've managed to re-solve (and re-document!) a variation on a problem I explored here years ago. Maybe things will stick better for having repeated my mistakes again.