A few notes on how basic pagination should work. Never mind the fact that it rarely does!
I previously worked up an easy way to create random data in bulk. I have modified the approach slightly here but the idea is the same. The format of the table is:
```sql
create table samples
(
  id integer primary key,
  facility integer not null,
  category integer not null,
  data text not null,
  time datetime not null
);

insert into samples(facility, category, data, time)
select abs(random() % 1000),                 -- facility: 0-999
       abs(random()) % 3,                    -- category: 0-2
       hex(randomblob(random() % 128)),      -- 1-127 bytes of random noise, hex-encoded
       datetime(strftime('%s', '2000-01-01 00:00:00') +
                abs(random() % (strftime('%s', '2022-07-23 23:59:59') -
                                strftime('%s', '2000-01-01 00:00:00'))),
                'unixepoch')                 -- random time between 2000-01-01 and 2022-07-23
from generate_series(0)
limit 1000000;
```
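One portability note: `generate_series` comes from SQLite's series extension rather than the core library, so it may not be available in every environment. A recursive CTE makes an equivalent million-row driver; a sketch of the swap (same insert as above, different row source):

```sql
with recursive series(value) as (
  select 0
  union all
  select value + 1 from series where value < 999999  -- 1,000,000 rows: 0..999999
)
insert into samples(facility, category, data, time)
select abs(random() % 1000),
       abs(random()) % 3,
       hex(randomblob(random() % 128)),
       datetime(strftime('%s', '2000-01-01 00:00:00') +
                abs(random() % (strftime('%s', '2022-07-23 23:59:59') -
                                strftime('%s', '2000-01-01 00:00:00'))),
                'unixepoch')
from series;
```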
This results in about 400 megabytes of data for me, which should be sufficient to test out some ideas without making it too easy on myself.
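To check the size without leaving the shell, the page-count and page-size pragmas (available as table-valued functions in reasonably recent builds) give a rough figure; a sketch:

```sql
-- Approximate database size in megabytes
select page_count * page_size / (1024 * 1024) as approx_mb
from pragma_page_count(), pragma_page_size();
```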
My first idea is of the kind of time series dashboard that allows for drilling into a given facet of the data. In the example here there might be a table of "facilities" with an aggregation of recent data for each. In my experience the table is first built to return all of the data at once, whether or not things are paginated client-side; only later, when it becomes an obvious performance problem, is any consideration given to adding pagination to the queries themselves.
Imagine a simple table:
facility | last reading | readings |
---|---|---|
001 | 2022-07-23 23:36:11 | [19 13 58 84 64 35 24 9 45 27] |
002 | 2022-07-23 23:08:07 | [29 53 68 2 13 30 49 42 35 53] |
003 | 2022-07-23 22:49:27 | [96 92 52 75 88 53 46 41 21 61] |
004 | 2022-07-23 22:47:27 | [31 25 43 30 68 65 39 40 6 53] |
005 | 2022-07-23 22:36:15 | [10 62 14 50 4 99 36 8 14 85] |
Sometimes "readings" is instead a sparkline plot but the idea is pretty general. This works in development but falls apart when there are 100 facilities or 1000 readings.
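The query behind a table like that is usually some variation on "everything, grouped by facility". A sketch of the anti-pattern I have in mind (my guess at the usual shape, not code from any real dashboard):

```sql
select facility,
       max(time) [last reading],
       group_concat(data) [readings]  -- every reading for every facility, in one response
from samples
group by facility;
```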
For the generated example data there are 1000 facilities, each with between 907 and 1089 pieces of data. In my particular case I created random data rather than drumming up something like fake structured data; given my past experience, it seemed more realistic than a single integer. It also ensures that the tuples aren't all neatly aligned along the same field size boundary. I'm not sure this is entirely necessary but it can't hurt to give the database a little more work.
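Those per-facility counts are easy to confirm directly; a quick check along these lines is presumably how you would verify the spread:

```sql
-- Smallest and largest number of readings per facility, plus the facility count
select min(n), max(n), count(*) facilities
from (select count(*) n from samples group by facility);
```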
If I had a time series chart with time along the x-axis and each point representing a measurement or reading for a given facility, it may be easy to see why `select * from samples` doesn't scale.
Instead, the idea is to subdivide the full dataset in an intuitive way that could easily map to the table shown above. I imagine a frame, like you might use for photography or painting, that limits what you "see".
Some aspects of this are easier than others; I imagine it is obvious how you might do:
```sql
select facility, data
from samples
where facility in (0, 1, 2);
```
This achieves the "vertical" selection part of the above. The "horizontal" part could be achieved with a date range query:
```sql
select facility, data
from samples
where facility in (1, 2)
  and time between '2006-01-01 00:00:00' and '2008-01-01 00:00:00';
```
Of course things get marginally more complicated if the sort order is configurable, but it doesn't exactly ruin our day. With just these it is possible to (conceptually) slide the frame around the data while keeping responses small and fast. The pages of a table might drive the inputs to the where clauses, as in the sketch below.
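For example, a paginated UI might hand the query its current sort direction and time window; the parameter names here (`:page_start`, `:page_end`) are hypothetical placeholders for whatever the application binds:

```sql
select facility, data, time
from samples
where facility in (1, 2)
  and time between :page_start and :page_end  -- the current "frame" of the table
order by time desc                             -- or asc, if the user flips the sort
limit 100;
```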
The thing that got me thinking about this in the first place, though, was how best to achieve a query like "last 50 readings no older than a week per facility". This kind of question doesn't exactly lend itself to the visual given above.
This sort of thing requires the use of a window function, which allows you to limit an aggregation within a given partition. In my example the partition would be `facility` and the aggregation could be something like `group_concat`, `max`, or `json_group_array`. Let's start with a query that works but returns too much data:
```sql
select facility,
       json_group_array(time)
from samples
group by facility;
```
For the full data set I generated at the beginning, this returns over 20MB of data. I might trim it down to just 10 facilities, which brings the total size down to about 216KB — but only for now! How about 10 facilities and only in the last 3 years?
```sql
select facility,
       time
from samples
where facility < 10
  and time > date('now', '-3 years');
```
While that gets down to about 30KB, it is also a bit fragile from my point of view. What I would like instead is an upper bound on the number of readings, on top of which I might add further limits, rather than the other way around. Here is where window functions come in. I happen to know that there is a pretty even distribution of records per facility (with enough random data packed into a fixed number of buckets the distribution becomes roughly uniform), so I would expect that taking 1/10 of the readings per facility (100 of ~1000) should produce an equivalent fraction of the data:
```sql
select facility,
       time
from (
  select row_number() over (partition by facility order by time desc) seqnum,
         facility,
         time
  from samples
)
where seqnum <= 100;
```
As expected, this produces about 2.5MB of data. It works by partitioning the table by facility and numbering the records in each partition from most recent to oldest. A small sample of the inner query's output looks like this:
seqnum | facility | time |
---|---|---|
1 | 1 | 2022-06-23 21:36:05 |
1 | 2 | 2022-07-15 21:44:24 |
2 | 2 | 2022-07-04 12:55:30 |
3 | 2 | 2022-07-02 08:31:12 |
1 | 3 | 2022-07-21 11:15:27 |
2 | 3 | 2022-07-21 08:40:31 |
3 | 3 | 2022-07-11 06:09:41 |
4 | 3 | 2022-06-26 08:27:02 |
1 | 4 | 2022-07-09 19:51:14 |
2 | 4 | 2022-06-24 08:02:15 |
Circling back to the first example table, if I were going to design for more realistic data sizes I might do something like this:
```sql
select facility,
       max(time) [last reading],
       json_group_array(time) [reading dates]
from (
  select row_number() over (partition by facility order by time desc) seqnum,
         facility,
         time
  from samples
)
where seqnum <= 10
  and facility <= 10
group by facility;
```
facility | last reading | reading dates |
---|---|---|
0 | 2022-07-22 20:19:47 | ["2022-07-22 20:19:47","2022-07-06 06:04:39","2022-06-19 20:32:47","2022-06-19 02:29:53","2022-06-17 21:10:50","2022-06-17 09:54:25","2022-06-11 02:29:21","2022-06-09 22:26:04","2022-06-07 01:41:27","2022-05-30 03:29:29"] |
1 | 2022-07-13 10:36:28 | ["2022-07-13 10:36:28","2022-07-13 00:56:24","2022-06-28 03:21:48","2022-06-22 20:32:58","2022-06-18 08:40:32","2022-06-11 00:27:20","2022-06-09 07:24:07","2022-06-08 12:27:27","2022-05-21 18:57:03","2022-05-15 14:13:23"] |
2 | 2022-07-09 18:34:17 | ["2022-07-09 18:34:17","2022-06-27 16:10:39","2022-06-19 09:17:39","2022-06-18 09:18:16","2022-05-27 23:19:53","2022-05-07 03:52:07","2022-05-03 19:51:08","2022-04-26 18:48:43","2022-04-02 22:35:28","2022-03-28 04:13:32"] |
3 | 2022-07-11 13:31:13 | ["2022-07-11 13:31:13","2022-07-07 05:27:16","2022-06-19 19:13:27","2022-06-04 20:08:59","2022-06-02 08:58:41","2022-05-27 02:24:25","2022-05-17 15:16:31","2022-05-03 14:55:32","2022-04-28 10:20:33","2022-04-25 03:34:11"] |
4 | 2022-07-20 12:27:39 | ["2022-07-20 12:27:39","2022-07-15 21:05:38","2022-07-13 19:32:25","2022-07-11 13:14:56","2022-06-04 01:43:21","2022-05-28 22:57:35","2022-05-23 09:35:41","2022-04-22 03:42:55","2022-03-25 22:50:54","2022-03-15 20:54:42"] |
5 | 2022-07-17 22:20:05 | ["2022-07-17 22:20:05","2022-07-17 04:57:10","2022-07-12 04:55:16","2022-07-06 14:08:40","2022-06-29 13:35:53","2022-06-29 10:32:16","2022-06-17 18:57:50","2022-06-15 06:01:14","2022-06-11 00:34:06","2022-05-23 01:57:06"] |
6 | 2022-07-17 23:08:40 | ["2022-07-17 23:08:40","2022-07-14 15:37:25","2022-07-02 08:24:34","2022-06-17 10:01:40","2022-06-10 06:43:35","2022-06-02 03:53:53","2022-05-29 19:05:33","2022-05-29 12:51:47","2022-05-19 09:40:15","2022-05-16 00:38:39"] |
7 | 2022-07-23 04:45:42 | ["2022-07-23 04:45:42","2022-07-21 03:47:07","2022-07-17 02:29:07","2022-07-06 21:39:15","2022-07-03 10:18:24","2022-06-22 16:26:45","2022-06-01 01:13:53","2022-05-22 03:29:49","2022-05-06 00:29:07","2022-04-28 01:42:27"] |
8 | 2022-07-16 02:17:04 | ["2022-07-16 02:17:04","2022-07-08 09:42:01","2022-07-01 21:25:53","2022-06-19 19:53:08","2022-06-13 16:34:28","2022-06-07 08:55:54","2022-06-05 13:21:44","2022-06-04 23:45:41","2022-05-19 14:56:57","2022-05-11 11:45:34"] |
9 | 2022-07-11 13:21:13 | ["2022-07-11 13:21:13","2022-07-05 06:22:22","2022-06-21 19:02:50","2022-06-14 14:56:47","2022-06-13 03:14:32","2022-05-22 06:46:08","2022-05-20 20:31:27","2022-05-17 20:28:18","2022-05-03 00:11:13","2022-04-28 16:58:30"] |
10 | 2022-07-23 03:19:46 | ["2022-07-23 03:19:46","2022-07-21 07:10:09","2022-07-18 22:34:25","2022-07-14 07:22:15","2022-06-29 15:17:07","2022-06-27 11:19:42","2022-06-22 15:34:03","2022-06-13 01:27:43","2022-06-08 13:44:59","2022-05-31 05:31:14"] |
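Circling back once more to the question that started all of this, "last 50 readings no older than a week per facility" is the same shape with both limits applied at once. A sketch (untested against the timings below, but the same pattern):

```sql
select facility,
       max(time) [last reading],
       json_group_array(time) [reading dates]
from (
  select row_number() over (partition by facility order by time desc) seqnum,
         facility,
         time
  from samples
  where time > datetime('now', '-7 days')  -- no older than a week
)
where seqnum <= 50                         -- at most 50 readings per facility
group by facility;
```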
Considering the variety of ways I have seen these sorts of things done incorrectly, I didn't find any of this particularly novel. Admittedly, requesting large amounts of data is always going to be a little slow, but for the example above I can aggregate 100 readings for each of 1000 facilities (never mind the fact that one hundred thousand points of data probably don't have any business being on a single page in the first place) in 1.29 seconds without any optimization. With indexes you can then dive into any given reading by date in thousandths of a second. There's vanishingly little reason applications should be slow! There are a few more options in other databases that allow fetching along with `offset` without incurring the performance hit described in the SQLite docs.
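As for those indexes, something covering the partition and sort used above is the obvious candidate; a sketch of what I would reach for (an assumption on my part, since the timing above was taken without it):

```sql
-- Serves both the per-facility window (partition by facility order by time desc)
-- and date-range lookups within a facility
create index samples_facility_time on samples(facility, time desc);
```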
The SQLite documentation also gives an interesting suggestion for using a compound row-value comparison in order to implement pagination without the same limitations I face with my own approach here.
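For completeness, that row-value style looks roughly like the following sketch, paging through one facility's readings newest-first; `:last_time` and `:last_id` are hypothetical bind parameters holding the values from the final row of the previous page:

```sql
select id, facility, time, data
from samples
where facility = 1
  and (time, id) < (:last_time, :last_id)  -- everything strictly "after" the previous page
order by time desc, id desc
limit 50;
```

For the first page the row-value comparison is simply omitted (or seeded with a timestamp later than any reading); each subsequent request carries forward the last row it saw instead of an ever-growing offset.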