
Bulk Data Generation in SQLite

2021-07-18

Browsing the internet this afternoon, I saw an article about fast SQLite inserts. As a fun exercise I thought I would try things out myself.

Premise

The post's author needed a large dataset to benchmark indices and was nerd-sniped into optimizing first Python, then PyPy, then Rust. Somewhat confusingly, the article title mentions 1 billion records while the article itself benchmarks 100 million. For the sake of comparability (and time!) I aimed for 100 million.

The first problem I ran into was recreating all of the benchmarks provided in the example repository. While I installed PyPy without issue, I immediately ran into a problem with digit-grouped numbers: the interpreter would choke on literals formatted like 100_000_000 (underscore separators only arrived in Python 3.6, so an older interpreter rejects them). I tried reformatting those, but it didn't bode well for the actual execution; the real issue that stopped me from fully re-running the benchmark suite was a series of exceptions thrown while executing the Python scripts. I suspect the author is on a newer version of Python than I can readily obtain on my machine, so I have opted instead to recreate only the best-performing benchmarks.

Execution Time

It might be a little unfair, but I will include here the time it took to build the Rust program because I was surprised at how slow it was. I guess on the plus side it seems to use multiple CPUs.

$ time cargo build --release --quiet --bin basic_batched

real    1m23.458s
user    9m23.622s
sys     0m18.180s

$ time ./target/release/basic_batched

real    0m29.057s
user    0m27.605s
sys     0m1.452s

Although I didn't see an improvement in runtime from the threaded Rust approach, my execution times were, generally speaking, roughly in line with the original article for the best solutions.

$ time ./target/release/threaded_batched

real    0m29.480s
user    0m39.956s
sys     0m7.545s

My Own Approach

I have been having fun lately with SQL and my first thought when reading the article was to wonder how long a pure SQL solution would take. I have not yet had a chance to try out recursive common table expressions in SQLite so this seemed like a great time to learn something.

Using the requirements outlined in the original article:

The area column would hold a six-digit area code (any six digits would do, no validation). The age would be any of 5, 10, or 15. The active column is either 0 or 1.

begin transaction;

create table user
(
    id integer primary key,
    area char(6),
    age integer not null,
    active integer not null
);

insert into user(area, age, active)
    with recursive
      randomUser(area, age, active) as (

      values(printf('%.6d', substr(abs(random()), 1, 6)),
             case abs(random()) % 3
                 when 0 then 5
                 when 1 then 10
                 when 2 then 15
             end, 
             abs(random()) % 2) 

      union all

      select printf('%.6d', substr(abs(random()), 1, 6)),
             case abs(random()) % 3
                 when 0 then 5
                 when 1 then 10
                 when 2 then 15
             end,
             abs(random()) % 2 from randomUser)

    select * from randomUser limit 100000000;

commit;

I would characterize this as an interesting, if slightly surprising, use of a recursive CTE. First, because the insert names only area, age, and active, SQLite provides automatic ids for the primary key. The values portion of the recursive CTE defines the initial row of the "table", with the union all that follows defining every subsequent row. This is an interesting case because both parts are populated from the same expressions (basically calls to random() with some munging).
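
To make the shape of the recursive CTE clearer, here is a minimal sketch of the same pattern (my own illustration, not from the original article) that just counts upward, with the limit in the outer select exactly as above:

with recursive counter(n) as (
    values(1)                  -- the initial row, just like the values(...) above
    union all
    select n + 1 from counter  -- each subsequent row is derived from the previous one
)
select n from counter limit 5; -- 1, 2, 3, 4, 5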

Highlights of the data munging include:

- abs(random()) keeps SQLite's signed 64-bit random value non-negative before any further use
- substr(abs(random()), 1, 6) treats that number as text and takes the first six digits for the area code
- printf('%.6d', ...) zero-pads the result so the area code always comes out as six characters
- abs(random()) % 3, fed through a case expression, picks an age of 5, 10, or 15
- abs(random()) % 2 yields the 0-or-1 active flag
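
Each expression can be tried in isolation from the sqlite3 shell; a quick sketch (the column aliases are just for readability):

select printf('%.6d', substr(abs(random()), 1, 6)) as area,  -- zero-padded six-character code
       case abs(random()) % 3
           when 0 then 5
           when 1 then 10
           when 2 then 15
       end as age,                                            -- one of 5, 10, or 15
       abs(random()) % 2 as active;                           -- 0 or 1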

Timing

$ time sqlite3 benchmark.db < bulk-data.sql

real    1m54.957s
user    1m53.100s
sys     0m1.052s
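
If I wanted to convince myself the data actually landed as expected, a few sanity-check queries would do it (a sketch, assuming the schema above; each scans the whole table, so they take a little while on 100 million rows):

select count(*) from user;                          -- expect 100000000
select age, count(*) from user group by age;        -- rows spread across 5, 10, 15
select active, count(*) from user group by active;  -- rows spread across 0 and 1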

Just like that, a pretty naive pure-SQL approach would seem to beat out, or at least be competitive with, Python and PyPy. While it is still roughly four times slower than the best Rust run, I think there is room to improve via the generate_series extension.

generate_series

begin transaction;

create table user
(
    id integer primary key,
    area char(6),
    age integer not null,
    active integer not null
);

insert into user(area, age, active)
select printf('%.6d', substr(abs(random()), 1, 6)),
       case abs(random()) % 3
           when 0 then 5
           when 1 then 10
           when 2 then 15
       end,
       abs(random()) % 2
  from generate_series
 limit 100000000;

commit;

I am really impressed with how clean this reads. I know that is personal preference, but where the recursive CTE is interesting, this is almost pleasingly uninteresting. It is also nearly twice as fast:

$ time sqlite3 benchmark.db < generate-series-data.sql

real    1m0.977s
user    0m58.965s
sys     0m1.137s
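
For reference, the generate_series extension (the same one built into the sqlite3 shell) is more usually called with explicit bounds; the unconstrained from generate_series limit N form above simply relies on the limit to stop it. A minimal sketch of the parameterized form:

select value from generate_series(1, 10);       -- ten rows: 1 through 10
select value from generate_series(0, 100, 20);  -- 0, 20, 40, 60, 80, 100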

All told, I'm happy to find that ~9 lines of SQL comes in at roughly twice the runtime of optimized Rust; generate_series is very neat and a quality addition to the SQLite toolbox. Even ignoring the compilation time, I still think Rust seems like the wrong tool in this case. It looks like the Rust program requires 144 separate packages, which sounds like a nightmare for understanding the whole program or for long-term maintenance. I think my time will be better spent learning more about SQL and SQLite before I bother with Rust.