A few notes on how basic pagination should work. Never mind the fact that it rarely does!
I previously worked up an easy way to create random data in bulk. I have modified the approach slightly here but the idea is the same. The format of the table is:
```sql
create table samples
(
  id integer primary key,
  facility integer not null,
  category integer not null,
  data text not null,
  time datetime not null
);

insert into samples(facility, category, data, time)
select abs(random() % 1000),                 -- facility: 0-999
       abs(random()) % 3,                    -- category: 0-2
       hex(randomblob(random() % 128)),      -- 1-127 bytes of random noise, hex-encoded
       datetime(strftime('%s', '2000-01-01 00:00:00') +
                abs(random() % (strftime('%s', '2022-07-23 23:59:59') -
                                strftime('%s', '2000-01-01 00:00:00'))),
                'unixepoch')                 -- random time between 2000-01-01 and 2022-07-23
from generate_series(0)
limit 1000000;
```
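One portability note: `generate_series` comes from SQLite's series extension rather than the core library, so it may not be available in every environment. A recursive CTE makes an equivalent million-row driver; a sketch of the swap (same insert as above, different row source):

```sql
with recursive series(value) as (
  select 0
  union all
  select value + 1 from series where value < 999999  -- 1,000,000 rows: 0..999999
)
insert into samples(facility, category, data, time)
select abs(random() % 1000),
       abs(random()) % 3,
       hex(randomblob(random() % 128)),
       datetime(strftime('%s', '2000-01-01 00:00:00') +
                abs(random() % (strftime('%s', '2022-07-23 23:59:59') -
                                strftime('%s', '2000-01-01 00:00:00'))),
                'unixepoch')
from series;
```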
This results in about 400 megabytes of data for me, which should be sufficient to test out some ideas without making it too easy on myself.
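To check the size without leaving the shell, the page-count and page-size pragmas (available as table-valued functions in reasonably recent builds) give a rough figure; a sketch:

```sql
-- Approximate database size in megabytes
select page_count * page_size / (1024 * 1024) as approx_mb
from pragma_page_count(), pragma_page_size();
```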
My first idea is of the kind of time series dashboard that allows for drilling into a given facet of the data. In the example here there might be a table of "facilities" with an aggregation of recent data for each. In my experience the table is first built to return all of the data at once, whether or not things are paginated client-side; only later, when it becomes an obvious performance problem, is any consideration given to adding pagination to the queries themselves.
Imagine a simple table:
facility | last reading | readings |
---|---|---|
001 | 2022-07-23 23:36:11 | [19 13 58 84 64 35 24 9 45 27] |
002 | 2022-07-23 23:08:07 | [29 53 68 2 13 30 49 42 35 53] |
003 | 2022-07-23 22:49:27 | [96 92 52 75 88 53 46 41 21 61] |
004 | 2022-07-23 22:47:27 | [31 25 43 30 68 65 39 40 6 53] |
005 | 2022-07-23 22:36:15 | [10 62 14 50 4 99 36 8 14 85] |
Sometimes "readings" is instead a sparkline plot but the idea is pretty general. This works in development but falls apart when there are 100 facilities or 1000 readings.
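The query behind a table like that is usually some variation on "everything, grouped by facility". A sketch of the anti-pattern I have in mind (my guess at the usual shape, not code from any real dashboard):

```sql
select facility,
       max(time) [last reading],
       group_concat(data) [readings]  -- every reading for every facility, in one response
from samples
group by facility;
```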
For the generated example data there are 1000 facilities, each with between 907 and 1089 pieces of data. In my particular case I created random data rather than drumming up something like fake structured data; given my past experience, it seemed more realistic than a single integer. It also ensures that the tuples aren't all neatly aligned along the same field size boundary. I'm not sure this is entirely necessary but it can't hurt to give the database a little more work.
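Those per-facility counts are easy to confirm directly; a quick check along these lines is presumably how you would verify the spread:

```sql
-- Smallest and largest number of readings per facility, plus the facility count
select min(n), max(n), count(*) facilities
from (select count(*) n from samples group by facility);
```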
If I had a time series chart with time along the x-axis and each point representing a measurement or reading for a given facility, it may be easy to see why `select * from samples` doesn't scale.
Instead, the idea is to subdivide the full dataset in an intuitive way that could easily map to the table shown above. I imagine a frame, like you might use for photography or painting, that limits what you "see".
Some aspects of this are easier than others; I imagine it is obvious how you might do:
```sql
select facility, data
from samples
where facility in (0, 1, 2);
```
This achieves the "vertical" selection part of the above. The "horizontal" part could be achieved with a date range query:
```sql
select facility, data
from samples
where facility in (1, 2)
  and time between '2006-01-01 00:00:00' and '2008-01-01 00:00:00';
```
Of course things get marginally more complicated if the sort order is configurable, but it doesn't exactly ruin our day. With just these it is possible to (conceptually) slide the frame around the data while keeping responses small and fast. The pages of a table might drive the inputs to the where clauses, as in the sketch below.
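For example, a paginated UI might hand the query its current sort direction and time window; the parameter names here (`:page_start`, `:page_end`) are hypothetical placeholders for whatever the application binds:

```sql
select facility, data, time
from samples
where facility in (1, 2)
  and time between :page_start and :page_end  -- the current "frame" of the table
order by time desc                             -- or asc, if the user flips the sort
limit 100;
```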
The thing that got me thinking about this in the first place, though, was how best to achieve a query like "last 50 readings no older than a week per facility". This kind of question doesn't exactly lend itself to the visual given above.
This sort of thing requires the use of a window function, which allows you to limit an aggregation within a given partition. In my example the partition would be `facility` and the aggregation could be something like `group_concat`, `max`, or `json_group_array`. Let's start with a query that works but returns too much data:
```sql
select facility,
       json_group_array(time)
from samples
group by facility;
```
For the full data set I generated at the beginning, this returns over 20MB of data. I might trim it down to just 10 facilities, which brings the total size down to about 216KB — but only for now! How about 10 facilities and only in the last 3 years?
```sql
select facility,
       time
from samples
where facility < 10
  and time > date('now', '-3 years');
```
While that gets down to about 30KB, it is also a bit fragile from my point of view. What I would like instead is an upper bound on the number of readings, on top of which I might add further limits, rather than the other way around. Here is where window functions come in. I happen to know that there is a pretty even distribution of records per facility (with enough random data packed into a fixed number of buckets the distribution becomes roughly uniform), so I would expect that taking 1/10 of the readings per facility (100 of ~1000) should produce an equivalent fraction of the data:
```sql
select facility,
       time
from (
  select row_number() over (partition by facility order by time desc) seqnum,
         facility,
         time
  from samples
)
where seqnum <= 100;
```
As expected, this produces about 2.5MB of data. It works by partitioning the table by facility and numbering the records in each partition from most recent to oldest. A small sample of the inner query's output looks like this:
seqnum | facility | time |
---|---|---|
1 | 1 | 2022-06-23 21:36:05 |
1 | 2 | 2022-07-15 21:44:24 |
2 | 2 | 2022-07-04 12:55:30 |
3 | 2 | 2022-07-02 08:31:12 |
1 | 3 | 2022-07-21 11:15:27 |
2 | 3 | 2022-07-21 08:40:31 |
3 | 3 | 2022-07-11 06:09:41 |
4 | 3 | 2022-06-26 08:27:02 |
1 | 4 | 2022-07-09 19:51:14 |
2 | 4 | 2022-06-24 08:02:15 |
Circling back to the first example table, if I were going to design for more realistic data sizes I might do something like this:
```sql
select facility,
       max(time) [last reading],
       json_group_array(time) [reading dates]
from (
  select row_number() over (partition by facility order by time desc) seqnum,
         facility,
         time
  from samples
)
where seqnum <= 10
  and facility <= 10
group by facility;
```
facility | last reading | reading dates |
---|---|---|
0 | 2022-07-22 20:19:47 | ["2022-07-22 20:19:47","2022-07-06 06:04:39","2022-06-19 20:32:47","2022-06-19 02:29:53","2022-06-17 21:10:50","2022-06-17 09:54:25","2022-06-11 02:29:21","2022-06-09 22:26:04","2022-06-07 01:41:27","2022-05-30 03:29:29"] |
1 | 2022-07-13 10:36:28 | ["2022-07-13 10:36:28","2022-07-13 00:56:24","2022-06-28 03:21:48","2022-06-22 20:32:58","2022-06-18 08:40:32","2022-06-11 00:27:20","2022-06-09 07:24:07","2022-06-08 12:27:27","2022-05-21 18:57:03","2022-05-15 14:13:23"] |
2 | 2022-07-09 18:34:17 | ["2022-07-09 18:34:17","2022-06-27 16:10:39","2022-06-19 09:17:39","2022-06-18 09:18:16","2022-05-27 23:19:53","2022-05-07 03:52:07","2022-05-03 19:51:08","2022-04-26 18:48:43","2022-04-02 22:35:28","2022-03-28 04:13:32"] |
3 | 2022-07-11 13:31:13 | ["2022-07-11 13:31:13","2022-07-07 05:27:16","2022-06-19 19:13:27","2022-06-04 20:08:59","2022-06-02 08:58:41","2022-05-27 02:24:25","2022-05-17 15:16:31","2022-05-03 14:55:32","2022-04-28 10:20:33","2022-04-25 03:34:11"] |
4 | 2022-07-20 12:27:39 | ["2022-07-20 12:27:39","2022-07-15 21:05:38","2022-07-13 19:32:25","2022-07-11 13:14:56","2022-06-04 01:43:21","2022-05-28 22:57:35","2022-05-23 09:35:41","2022-04-22 03:42:55","2022-03-25 22:50:54","2022-03-15 20:54:42"] |
5 | 2022-07-17 22:20:05 | ["2022-07-17 22:20:05","2022-07-17 04:57:10","2022-07-12 04:55:16","2022-07-06 14:08:40","2022-06-29 13:35:53","2022-06-29 10:32:16","2022-06-17 18:57:50","2022-06-15 06:01:14","2022-06-11 00:34:06","2022-05-23 01:57:06"] |
6 | 2022-07-17 23:08:40 | ["2022-07-17 23:08:40","2022-07-14 15:37:25","2022-07-02 08:24:34","2022-06-17 10:01:40","2022-06-10 06:43:35","2022-06-02 03:53:53","2022-05-29 19:05:33","2022-05-29 12:51:47","2022-05-19 09:40:15","2022-05-16 00:38:39"] |
7 | 2022-07-23 04:45:42 | ["2022-07-23 04:45:42","2022-07-21 03:47:07","2022-07-17 02:29:07","2022-07-06 21:39:15","2022-07-03 10:18:24","2022-06-22 16:26:45","2022-06-01 01:13:53","2022-05-22 03:29:49","2022-05-06 00:29:07","2022-04-28 01:42:27"] |
8 | 2022-07-16 02:17:04 | ["2022-07-16 02:17:04","2022-07-08 09:42:01","2022-07-01 21:25:53","2022-06-19 19:53:08","2022-06-13 16:34:28","2022-06-07 08:55:54","2022-06-05 13:21:44","2022-06-04 23:45:41","2022-05-19 14:56:57","2022-05-11 11:45:34"] |
9 | 2022-07-11 13:21:13 | ["2022-07-11 13:21:13","2022-07-05 06:22:22","2022-06-21 19:02:50","2022-06-14 14:56:47","2022-06-13 03:14:32","2022-05-22 06:46:08","2022-05-20 20:31:27","2022-05-17 20:28:18","2022-05-03 00:11:13","2022-04-28 16:58:30"] |
10 | 2022-07-23 03:19:46 | ["2022-07-23 03:19:46","2022-07-21 07:10:09","2022-07-18 22:34:25","2022-07-14 07:22:15","2022-06-29 15:17:07","2022-06-27 11:19:42","2022-06-22 15:34:03","2022-06-13 01:27:43","2022-06-08 13:44:59","2022-05-31 05:31:14"] |
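Circling back once more to the question that started all of this, "last 50 readings no older than a week per facility" is the same shape with both limits applied at once. A sketch (untested against the timings below, but the same pattern):

```sql
select facility,
       max(time) [last reading],
       json_group_array(time) [reading dates]
from (
  select row_number() over (partition by facility order by time desc) seqnum,
         facility,
         time
  from samples
  where time > datetime('now', '-7 days')  -- no older than a week
)
where seqnum <= 50                         -- at most 50 readings per facility
group by facility;
```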
Considering the variety of ways I have seen these sorts of things done incorrectly, I didn't find any of this particularly novel. Admittedly, requesting large amounts of data is always going to be a little slow, but for the example above I can aggregate 100 readings for each of 1000 facilities (never mind the fact that one hundred thousand points of data probably don't have any business being on a single page in the first place) in 1.29 seconds without any optimization. With indexes you can then dive into any given reading by date in thousandths of a second. There's vanishingly little reason applications should be slow! There are a few more options in other databases that allow fetching along with `offset` without incurring the performance hit described in the SQLite docs.
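As for those indexes, something covering the partition and sort used above is the obvious candidate; a sketch of what I would reach for (an assumption on my part, since the timing above was taken without it):

```sql
-- Serves both the per-facility window (partition by facility order by time desc)
-- and date-range lookups within a facility
create index samples_facility_time on samples(facility, time desc);
```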
The SQLite documentation also gives an interesting suggestion for using a compound row-value comparison in order to implement pagination without the same limitations I face with my own approach here.
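For completeness, that row-value style looks roughly like the following sketch, paging through one facility's readings newest-first; `:last_time` and `:last_id` are hypothetical bind parameters holding the values from the final row of the previous page:

```sql
select id, facility, time, data
from samples
where facility = 1
  and (time, id) < (:last_time, :last_id)  -- everything strictly "after" the previous page
order by time desc, id desc
limit 50;
```

For the first page the row-value comparison is simply omitted (or seeded with a timestamp later than any reading); each subsequent request carries forward the last row it saw instead of an ever-growing offset.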