Lately I've been thinking about database design and the implications for query performance. I have found myself repeatedly arguing against architectural complexity in favor of "just let the database do that" and I thought it was probably a good idea to further develop an intuition for performance against some hard numbers.
Here is a lightly edited example of a problem I've been discussing recently. There are hardware devices; they have names; their firmware can be updated, but there is no guarantee that a notification indicates an update (duplicates are possible); and other systems reference the devices by a separate scheme of mutable IDs that map back to a device. A device communicates version information via a loosely constrained JSON object (more fields are possible), such as:
{
"make": "foo",
"model": "bar",
"version": "V0.1.2"
}
The goal is to track a history of firmware updates per-device as well as support answering "what is the current firmware?".
I think the proposed schema is decidedly boring (in a good way) and the only thing that may require explanation is the internal_id on the device_id_internal_id table — it is very much an artifact of the system I'm usually working in but presents an interesting point for interfacing legacy systems with new designs. The old way of identifying things is through a composite identifier I've mentioned in passing before; it isn't a good identifier, being mutable and oftentimes treated as a prefix because of the "parts" that compose it. Most of that complexity can be ignored here as I'm only keeping a 1:1 table with this new system's own identifying information (device names and internal primary keys). The indirection allows for much greater flexibility in building better abstractions and interfaces than are otherwise possible when carrying around janky legacy identifiers as the only unique identifier. Also fun to demonstrate is the possibility of generated columns, which seem underutilized in lots of systems.
create table if not exists device
(
id integer primary key,
device_name,
unique (device_name)
);
create table if not exists firmware_notification
(
data, -- playing fast and loose with JSON here
update_time,
device_id,
foreign key (device_id) references device (id),
unique (device_id, data)
);
create table if not exists device_id_internal_id
(
device_id,
part_a_id,
part_b_id,
part_c_id,
internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
foreign key (device_id) references device (id),
unique (device_id)
);
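The generated column is easy to sanity-check from Python's bundled sqlite3 module. Here is a sketch against a trimmed copy of the schema above (generated columns need SQLite 3.31 or later, and note the full spelling is "generated always as"):

```python
import sqlite3

# A trimmed copy of the schema above; internal_id is derived entirely
# from the part columns, so callers never write it directly.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table device_id_internal_id
    (
        device_id,
        part_a_id,
        part_b_id,
        part_c_id,
        internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
        unique (device_id)
    );
""")
con.execute(
    "insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id) "
    "values (1, '0001', '02', '03')"
)
print(con.execute("select internal_id from device_id_internal_id").fetchone()[0])
# -> 0001-02-03
```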
With some of those constraints in place it is possible to lean on the database for supporting otherwise tedious update scenarios on-insert. Specifically I'm thinking of the ability to follow changes to internal IDs while maintaining a canonical reference to the device. An insert such as this will smoothly update records to track the latest internal ID while keeping history for the device itself:
insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
select d.id,
change_notification.column1, -- "change_notification" is messaged from the device
change_notification.column2,
change_notification.column3
from (values ('0001', '02', '03'),
('0001', '03', '04')) as change_notification,
(select id from device where device_name = 'some_name') as d
where true
on conflict(device_id) do update set part_a_id = excluded.part_a_id,
part_b_id = excluded.part_b_id,
part_c_id = excluded.part_c_id;
The uniqueness constraint on the device ID means the internal ID is easily updated because it is purely referential for the external systems. Should those external systems one day change their notion of uniqueness, it doesn't have to impact this system.
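That upsert behavior can be sketched end-to-end with Python's bundled sqlite3 module, with the schema trimmed to what the insert touches; two change notifications for the same device collapse into one row holding the latest parts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table device (id integer primary key, device_name, unique (device_name));
    create table device_id_internal_id
    (
        device_id,
        part_a_id,
        part_b_id,
        part_c_id,
        internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
        unique (device_id)
    );
""")
con.execute("insert into device (device_name) values ('some_name')")

# Two change notifications for the same device; the second should win.
for a, b, c in [("0001", "02", "03"), ("0001", "03", "04")]:
    con.execute("""
        insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
        select id, ?, ?, ? from device where device_name = 'some_name'
        on conflict(device_id) do update set part_a_id = excluded.part_a_id,
                                             part_b_id = excluded.part_b_id,
                                             part_c_id = excluded.part_c_id
    """, (a, b, c))

# One row per device, holding the latest internal ID.
print(con.execute("select count(*), internal_id from device_id_internal_id").fetchall())
# -> [(1, '0001-03-04')]
```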
Another nice thing the database can do is deduplicate data. For my problem space there are frequent messages from devices signaling their firmware version but it is possible, usual even, for the same information to be re-sent — not every message indicates a change. In the series of times below, the notifications that do not carry a different version will replace each other, while the actual version changes will be tracked individually.
insert into firmware_notification (data, update_time, device_id)
select notifications.column1, notifications.column2, d.id
from (values ('{"make":"foo","model":"bar","version":"V0.1.2"}', '2025-01-01T02:00:00'),
('{"make":"foo","model":"bar","version":"V0.1.2"}', '2025-01-01T02:05:00'),
('{"make":"foo","model":"bar","version":"V0.1.2"}', '2025-01-01T02:10:00'),
('{"make":"foo","model":"bar","version":"V0.1.3"}', '2025-01-01T03:00:00'),
('{"make":"foo","model":"bar","version":"V0.1.4"}', '2025-01-01T04:00:00')) as notifications,
(select id from device where device_name = 'some_name') as d
where true
on conflict(device_id, data) do update set update_time = excluded.update_time;
Using an upsert against the contents of the message allows the table to track a single record per device and simply update with the last time we positively identified the firmware version in use.
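The same deduplication is easy to demonstrate from Python's bundled sqlite3 module; the payloads below mirror the example insert above, with three identical notifications collapsing into one row stamped with the latest time:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table device (id integer primary key, device_name, unique (device_name));
    create table firmware_notification
    (
        data,
        update_time,
        device_id,
        unique (device_id, data)
    );
""")
con.execute("insert into device (device_name) values ('some_name')")

notifications = [
    ('{"make":"foo","model":"bar","version":"V0.1.2"}', "2025-01-01T02:00:00"),
    ('{"make":"foo","model":"bar","version":"V0.1.2"}', "2025-01-01T02:05:00"),
    ('{"make":"foo","model":"bar","version":"V0.1.2"}', "2025-01-01T02:10:00"),
    ('{"make":"foo","model":"bar","version":"V0.1.3"}', "2025-01-01T03:00:00"),
]
for data, at in notifications:
    con.execute("""
        insert into firmware_notification (data, update_time, device_id)
        select ?, ?, id from device where device_name = 'some_name'
        on conflict(device_id, data) do update set update_time = excluded.update_time
    """, (data, at))

# The duplicates collapsed; only the latest sighting of each version remains.
for row in con.execute("select data, update_time from firmware_notification order by update_time"):
    print(row)
```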
I'm interested in trying to characterize performance for different queries and scope out necessary indexes. To accomplish that I need enough data to be representative of the problem. There's no shortcut to understanding the domain, so I won't belabor how I landed on specific values here; they're just a guess at what might be in the ballpark of a representative sample size. More interesting is perhaps the approach to bulk data generation using SQL, which I've looked into some before. Perhaps worth explaining, given the previous post's use of generate_series: I'm using recursive CTEs here because I am also trying out a new IDE rather than using the SQLite shell directly and I can't immediately tell how to load an extension.
with recursive numbers as (
select 1 as n
union all
select n + 1
from numbers
where n < 100000
)
insert into device (device_name)
select printf('dev%07d', n)
from numbers;
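For a quick check that the counter behaves, here is the same recursive CTE scaled down to 1,000 rows and driven from Python's bundled sqlite3 module; the names come out zero-padded and dense:

```python
import sqlite3

# The same recursive counter as above, scaled down so it runs instantly.
con = sqlite3.connect(":memory:")
con.execute("create table device (id integer primary key, device_name, unique (device_name))")
con.execute("""
    with recursive numbers as (
        select 1 as n
        union all
        select n + 1 from numbers where n < 1000
    )
    insert into device (device_name)
    select printf('dev%07d', n) from numbers
""")
print(con.execute("select count(*), min(device_name), max(device_name) from device").fetchone())
# -> (1000, 'dev0000001', 'dev0001000')
```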
This next insert is probably needlessly complex but does achieve a fairly uniform distribution of randomized IDs to devices. I'll admit I mostly used it to test the on-conflict logic.
with recursive
device_changes as (select d.id as device_id,
abs(random() % 10) + 1 as num_changes -- 1 to 10 changes per device
from device d),
change_sequence as (select device_id,
1 as change_num,
num_changes
from device_changes
union all
select device_id,
change_num + 1,
num_changes
from change_sequence
where change_num < num_changes)
insert
into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
select device_id,
printf('%04d', abs(random() % 10000)) as part_a_id,
printf('%02d', abs(random() % 100)) as part_b_id,
printf('%02d', abs(random() % 100)) as part_c_id
from change_sequence
where true
on conflict(device_id) do update set part_a_id = excluded.part_a_id,
part_b_id = excluded.part_b_id,
part_c_id = excluded.part_c_id;
At the cost of some complexity to this insert I opted to generate randomized JSON data rather than repeat the same piece of data over and over. The intent is to give at least a more realistic picture without requiring too complex a data generator while conceding that chronological inserts aren't terribly necessary (so time offsets are random instead of linear).
with recursive
device_notifications as (select d.id as device_id,
abs(random() % 12) + 1 as num_notifications -- 1 to 12 notifications per device
from device d),
notification_sequence as (select device_id,
1 as notification_num,
num_notifications
from device_notifications
union all
select device_id,
notification_num + 1,
num_notifications
from notification_sequence
where notification_num < num_notifications)
insert
into firmware_notification (data, update_time, device_id)
select json_object(
'make',
case abs(random() % 3)
when 0 then 'vendorA'
when 1 then 'vendorB'
else 'vendorC'
end,
'model',
case abs(random() % 4)
when 0 then 'modelX'
when 1 then 'modelY'
when 2 then 'modelZ'
else 'modelW'
end,
'version',
printf('V%d.%d.%d',
abs(random() % 5),
abs(random() % 10),
abs(random() % 10)
)
) as data,
datetime(
'2024-01-01 00:00:00',
printf('+%d seconds', abs(random() % (365 * 24 * 60 * 60)))
) as update_time,
device_id
from notification_sequence
where true
on conflict(device_id, data) do update set update_time = excluded.update_time;
The results of these bulk data inserts are as follows:
> select count(*) from device;
100000
> select count(*) from device_id_internal_id ;
100000
> select count(*) from firmware_notification ;
650795
In this example I'll cover two query patterns whose performance I'm interested in exploring: querying firmware change history per-device and querying the latest firmware (as an indication of what is in use). The simpler of the two is the per-device history, which only has to join the different tables together:
select device_name,
internal_id,
firmware_notification.data ->> 'version' as version,
firmware_notification.update_time
from device
join firmware_notification on device.id = firmware_notification.device_id
join device_id_internal_id on device.id = device_id_internal_id.device_id
where device_name = 'some_name'
order by update_time desc;
Picking a device name at random, this one is already pretty good:
QUERY PLAN
|--SCAN firmware_notification
|--BLOOM FILTER ON device (id=?)
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN device_id_internal_id
`--USE TEMP B-TREE FOR ORDER BY
DEV0020663|4000-97-72|V1.7.4|2024-12-26 17:03:56
DEV0020663|4000-97-72|V0.5.1|2024-11-11 21:19:43
DEV0020663|4000-97-72|V3.2.3|2024-10-09 15:21:00
DEV0020663|4000-97-72|V3.8.8|2024-08-28 22:00:37
Run Time: real 0.071 user 0.044757 sys 0.024575
More complicated is querying the latest version of every device:
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
device_id_internal_id.internal_id
from device
join
(select device_id,
data,
update_time,
row_number() over (partition by device_id order by update_time desc) as rn
from firmware_notification) fw
on device.id = fw.device_id and fw.rn = 1
join device_id_internal_id
on device.id = device_id_internal_id.device_id;
This one was awful to run; my laptop fan was pushing so hard I had time to wonder whether I was going to trip some thermal protection on the CPU.
QUERY PLAN
|--CO-ROUTINE fw
| |--CO-ROUTINE (subquery-3)
| | |--SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1
| | `--USE TEMP B-TREE FOR LAST TERM OF ORDER BY
| `--SCAN (subquery-3)
|--SCAN device_id_internal_id
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
|--BLOOM FILTER ON fw (rn=?)
`--SEARCH fw USING AUTOMATIC PARTIAL COVERING INDEX (rn=?)
...
Run Time: real 723.393 user 495.007924 sys 224.768183
The above is the first thing that occurred to me: ranking each firmware update by time per device and then joining to the other tables using only the first in time (most recent). I'm especially interested in trying to eke out performance on this query because I expect it'll be the most frequent. Thinking about it further I am curious to compare with a slightly different approach using max, thinking it may prove cheaper to collect only the one value instead of collecting all of them and then discarding most.
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from device
join
(select max(firmware_notification.update_time) update_time, data, device_id
from firmware_notification
group by device_id) fw on device.id = fw.device_id
join device_id_internal_id on device.id = device_id_internal_id.device_id;
Not as slow as the window query, but the performance is still abysmal out of the gate.
QUERY PLAN
|--CO-ROUTINE fw
| `--SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1
|--SCAN fw
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
`--SCAN device_id_internal_id
...
Run Time: real 398.004 user 344.102231 sys 51.005167
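Worth calling out before digging into the plan: the max variant leans on a SQLite-specific guarantee about "bare columns" in aggregate queries. When a query contains a single min() or max() aggregate, non-aggregated columns like data are taken from the row that produced the extreme value, so the version and time stay paired. A small sketch of that behavior (documented SQLite behavior, not portable SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table firmware_notification (data, update_time, device_id)")
con.executemany(
    "insert into firmware_notification values (?, ?, ?)",
    [
        ('{"version":"V0.1.2"}', "2025-01-01T02:00:00", 1),
        ('{"version":"V0.1.4"}', "2025-01-01T04:00:00", 1),
        ('{"version":"V0.1.3"}', "2025-01-01T03:00:00", 1),
    ],
)
# "data" is a bare column in this aggregate: SQLite documents that it is
# taken from the same row that supplied max(update_time).
print(con.execute("""
    select max(update_time), data
    from firmware_notification
    group by device_id
""").fetchone())
# -> ('2025-01-01T04:00:00', '{"version":"V0.1.4"}')
```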
I have to admit I spent quite a while fumbling with different indexing strategies based on the query plans above for "latest version" reporting. I should know better but I didn't really spend long enough thinking about what the performance is saying. I'll try to break it down for the query that uses max because it is simpler and ended up being fastest.
SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1: SQLite executes this subquery in a coroutine. The documentation explains:
The outer query blocks whenever it needs another row of input from the subquery. Control switches to the co-routine which produces the desired output row, then control switches back to the main routine which continues processing. This is an alternative to using a temporary table and may be preferred because the query results are only visited once.
SCAN fw: the coroutine necessarily has to be scanned because I'm collecting every result from the "latest" values, all 100,000 of them.
SEARCH device USING INTEGER PRIMARY KEY (rowid=?): we necessarily also have to visit every one of the device rows; using the primary key seems about as good as it'll get.
SCAN device_id_internal_id: every row of the device_id_internal_id table will end up collected to build the final result, but the use of SCAN here is anomalous.
So why is clever indexing probably not the correct first move here? Important to note is the unfortunate-looking SCAN operation in the query plan when using max, but I think equally important to consider is that it shows up in the window function variant of the query too. Why isn't the uniqueness constraint or foreign key being used to search instead of scan? To give things away: it is a result of the join condition I supplied. join device_id_internal_id on device.id = device_id_internal_id.device_id joins against the device table instead of the subquery "fw" table.
Compare the difference between these two:
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from device
join
(select max(firmware_notification.update_time) update_time, data, device_id
from firmware_notification
group by device_id) fw on fw.device_id = device.id
join device_id_internal_id on device.id = device_id_internal_id.device_id;
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from device
join
(select max(firmware_notification.update_time) update_time, data, device_id
from firmware_notification
group by device_id) fw on fw.device_id = device.id
join device_id_internal_id on fw.device_id = device_id_internal_id.device_id;
And now the query plan and performance for the "latest firmware for every device" query:
QUERY PLAN
|--CO-ROUTINE fw
| `--SCAN firmware_notification USING INDEX sqlite_autoindex_firmware_notification_1
|--SCAN fw
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
`--SEARCH device_id_internal_id USING INDEX sqlite_autoindex_device_id_internal_id_1 (device_id=?)
...
Run Time: real 0.367 user 0.331444 sys 0.033939
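To make the comparison reproducible, here is a scaled-down harness (Python's sqlite3, a made-up two-device dataset) that runs both join spellings and prints their plans. The plan text varies with SQLite version and table sizes, so treat the printed plans as illustrative; what is guaranteed is that the returned rows are identical either way:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table device (id integer primary key, device_name, unique (device_name));
    create table firmware_notification (data, update_time, device_id, unique (device_id, data));
    create table device_id_internal_id
    (
        device_id, part_a_id, part_b_id, part_c_id,
        internal_id generated always as (part_a_id || '-' || part_b_id || '-' || part_c_id),
        unique (device_id)
    );
    insert into device (device_name) values ('dev0000001'), ('dev0000002');
    insert into firmware_notification values
        ('{"version":"V0.1.2"}', '2025-01-01T02:00:00', 1),
        ('{"version":"V0.1.3"}', '2025-01-01T03:00:00', 1),
        ('{"version":"V0.1.2"}', '2025-01-01T02:30:00', 2);
    insert into device_id_internal_id (device_id, part_a_id, part_b_id, part_c_id)
        values (1, '0001', '02', '03'), (2, '0001', '03', '04');
""")

query = """
    select device.device_name, fw.update_time, internal_id
    from device
    join (select max(update_time) update_time, device_id
          from firmware_notification group by device_id) fw
        on fw.device_id = device.id
    join device_id_internal_id on {join_condition}
"""
# Print the plan for each spelling of the final join condition.
for cond in ("device.id = device_id_internal_id.device_id",
             "fw.device_id = device_id_internal_id.device_id"):
    for row in con.execute("explain query plan " + query.format(join_condition=cond)):
        print(row[3])  # the human-readable plan detail column
    print("---")
```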
So, yeah. Important to remember that despite carrying the "same" data, the way in which a table is joined carries a lot of weight with the query planner.
Despite both access patterns now being very fast, even in the "full table scan" scenario, it is too tempting not to try improving query performance where possible. Beginning with the per-device history: indexing the device ID and update time gets a bit more than a 40% improvement in performance:
> create index idx_firmware_device_update
on firmware_notification(device_id, update_time desc);
QUERY PLAN
|--SCAN device_id_internal_id
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN firmware_notification USING INDEX idx_firmware_device_update
`--USE TEMP B-TREE FOR ORDER BY
DEV0020663|4000-97-72|V1.7.4|2024-12-26 17:03:56
DEV0020663|4000-97-72|V0.5.1|2024-11-11 21:19:43
DEV0020663|4000-97-72|V3.2.3|2024-10-09 15:21:00
DEV0020663|4000-97-72|V3.8.8|2024-08-28 22:00:37
Run Time: real 0.042 user 0.038475 sys 0.003671
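A quick check that the new index actually gets picked for the per-device pattern can be done against an empty table (plan text varies across SQLite versions, but the index name should show up for a device lookup ordered by update_time desc):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table firmware_notification (data, update_time, device_id, unique (device_id, data));
    create index idx_firmware_device_update
        on firmware_notification (device_id, update_time desc);
""")
# The index covers both the equality lookup and the sort direction, so the
# planner should prefer it over the (device_id, data) autoindex here.
plan = [row[3] for row in con.execute("""
    explain query plan
    select data, update_time
    from firmware_notification
    where device_id = 1
    order by update_time desc
""")]
for line in plan:
    print(line)
```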
And delightfully the same index benefits the latest-firmware-for-all-devices query too, to the tune of about 30%:
QUERY PLAN
|--CO-ROUTINE fw
| `--SCAN firmware_notification USING INDEX idx_firmware_device_update
|--SCAN fw
|--SEARCH device USING INTEGER PRIMARY KEY (rowid=?)
`--SEARCH device_id_internal_id USING INDEX sqlite_autoindex_device_id_internal_id_1 (device_id=?)
...
Run Time: real 0.237 user 0.208258 sys 0.027774
In an earlier iteration on this idea I mistakenly introduced some duplication into the query results when I applied a cross join between two tables by using the form FROM table_a, table_b WHERE ... instead of using the explicit JOIN ... ON. Considering the insidious nature of the issue I figured making the stylistic choice to avoid comma-style joins entirely might be warranted. Even for intentional uses of CROSS JOIN, typing the words out probably better conveys intent. In the case of this performance gaffe caused by join conditions I'm forced to consider if there's something similar to be done. I am inclined to say USING can help with readability, but it turns out that even renaming device.id to device.device_id so that all the joins can instead use JOIN <table> USING(device_id) isn't sufficient. Instead the key seems to be hoisting the subquery out to the first table selection and then joining from there. I might be biased with the knowledge that this new form is also key to good performance but I think it reads better too:
select device.device_name,
fw.data ->> 'version' as version,
fw.update_time,
internal_id
from (select max(firmware_notification.update_time) update_time,
data,
device_id
from firmware_notification
group by device_id) fw
join device using(device_id)
join device_id_internal_id using(device_id);
There is something deeply funny in realizing I've managed to re-solve (and re-document!) a variation on a problem I explored here years ago. Maybe things will stick better for having repeated my mistakes again.