Bulk Data Generation in SQLite

2021-07-18
Browsing the internet this afternoon, I saw an article about fast SQLite inserts posted online. As a fun exercise, I thought I would try things out myself.
The post author needed a large dataset to benchmark indices and was nerd sniped into optimizing first Python, then PyPy, then Rust. Somewhat confusingly, the article title mentions 1 billion records but the article itself benchmarks 100 million. For the sake of compatibility (and time!) I aimed for 100 million.
The first problem I ran into was recreating all of the benchmarks provided in the example repository. While I installed PyPy without issue, I immediately hit a problem with digit-grouped numbers: the interpreter would choke on literals formatted like 100_000_000. I tried reformatting those, but it didn't bode well for the rest of the run. The real issue that stopped me from fully re-running the benchmark suite was the exceptions thrown while executing the Python scripts. I suspect the author is on a newer version of Python than I can readily obtain on my machine, so I have opted instead to recreate only the best-performing solutions.
It might be a little unfair, but I will include here the time it took to build the Rust program because I was surprised at how slow it was. I guess on the plus side, it seems to use multiple CPUs.
$ time cargo build --release --quiet --bin basic_batched

real    1m23.458s
user    9m23.622s
sys     0m18.180s

$ time ./target/release/basic_batched

real    0m29.057s
user    0m27.605s
sys     0m1.452s
Although I didn't see an improvement in runtime from the threaded Rust approach, my execution times were, generally speaking, roughly similar to the original article's for the best solutions.
$ time ./target/release/threaded_batched

real    0m29.480s
user    0m39.956s
sys     0m7.545s
My Own Approach
I have been having fun lately with SQL and my first thought when reading the article was to wonder how long a pure SQL solution would take. I have not yet had a chance to try out recursive common table expressions in SQLite so this seemed like a great time to learn something.
Using the requirements outlined in the original article:
- The area column would hold a six-digit area code (any six digits would do, no validation).
- The age would be any of 5, 10, or 15.
- The active column is either 0 or 1.
begin transaction;

create table user (
  id integer primary key,
  area char(6),
  age integer not null,
  active integer not null
);

insert into user(area, age, active)
  with recursive randomUser(area, age, active) as (
    values(
      printf('%.6d', substr(abs(random()), 1, 6)),
      case abs(random()) % 3 when 0 then 5 when 1 then 10 when 2 then 15 end,
      abs(random()) % 2
    )
    union all
    select
      printf('%.6d', substr(abs(random()), 1, 6)),
      case abs(random()) % 3 when 0 then 5 when 1 then 10 when 2 then 15 end,
      abs(random()) % 2
    from randomUser
  )
  select * from randomUser limit 100000000;

commit;
I would characterize this as an interesting, if slightly surprising, use of a recursive CTE. Firstly, by writing the insert into with just area, age, and active, SQLite will provide automatic IDs for the primary key column. The values portion of the recursive CTE defines the initial case for the "table", with the following union all defining the subsequent rows of the table. This is an interesting case because both are populated from the same expressions (basically calls to random with some data munging).
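To see the initial-case/recursive-step split in isolation, here is a toy recursive CTE run through Python's built-in sqlite3 module (my own minimal example, not from the article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# values(1) seeds the CTE with its initial row; the union all
# select n + 1 step keeps appending rows, and the outer limit
# is what stops the otherwise-infinite recursion.
rows = conn.execute("""
    with recursive counter(n) as (
        values(1)
        union all
        select n + 1 from counter
    )
    select n from counter limit 5
""").fetchall()
# rows is [(1,), (2,), (3,), (4,), (5,)]
```

The bulk-insert script above works the same way, except every step evaluates fresh calls to random() instead of incrementing a counter.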
Highlights of the data munging include:

- using printf to zero-pad an at-most 6-digit positive number
- using a case statement to cast a modulo operation into one of the three possible values for age
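Both tricks are easy to poke at interactively; this quick check through Python's sqlite3 module (my own illustration) shows what each expression produces:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# printf('%.6d', ...) left-pads a positive integer with zeros to six digits
padded = conn.execute("select printf('%.6d', 42)").fetchone()[0]
# padded is '000042'

# the case expression maps a modulo-3 result (0, 1, or 2)
# onto the three allowed age values
ages = [
    conn.execute(
        "select case ? % 3 when 0 then 5 when 1 then 10 when 2 then 15 end",
        (n,),
    ).fetchone()[0]
    for n in range(6)
]
# ages is [5, 10, 15, 5, 10, 15]
```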
$ time sqlite3 benchmark.db < bulk-data.sql

real    1m54.957s
user    1m53.100s
sys     0m1.052s
Just like that, a pretty naive pure-SQL approach would seem to beat out, or at least be competitive with, Python and PyPy. While it is still about 400% slower than the best Rust execution, I think there is room to improve via the generate_series table-valued function:
begin transaction;

create table user (
  id integer primary key,
  area char(6),
  age integer not null,
  active integer not null
);

insert into user(area, age, active)
  select
    printf('%.6d', substr(abs(random()), 1, 6)),
    case abs(random()) % 3 when 0 then 5 when 1 then 10 when 2 then 15 end,
    abs(random()) % 2
  from generate_series limit 100000000;

commit;
I am really impressed with how clean this reads. I know that is personal preference, but where the recursive CTE is interesting, this is almost pleasingly uninteresting. It is also nearly twice as fast:
$ time sqlite3 benchmark.db < generate-series-data.sql

real    1m0.977s
user    0m58.965s
sys     0m1.137s
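For comparison with the article's Python numbers, the same row-generation rules are easy to sketch as a batched insert through the sqlite3 module. This is my own sketch of the approach, not the article's actual benchmark script, and it uses a small row count so it finishes quickly:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table user (
        id integer primary key,
        area char(6),
        age integer not null,
        active integer not null
    )
""")

def rows(n):
    """Yield n (area, age, active) tuples following the article's rules."""
    for _ in range(n):
        yield (
            f"{random.randrange(1_000_000):06d}",  # six digits, zero-padded
            random.choice([5, 10, 15]),
            random.randrange(2),  # 0 or 1
        )

with conn:  # one transaction around the whole insert, as in the SQL scripts
    conn.executemany(
        "insert into user(area, age, active) values (?, ?, ?)",
        rows(10_000),
    )

count = conn.execute("select count(*) from user").fetchone()[0]
# count is 10000
```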
All told, I'm happy to find that ~9 lines of SQL is within 200% of the best Rust solution. generate_series is very neat and a quality addition to the SQLite toolbox. Even ignoring the compilation times, I still think Rust seems like the wrong tool in this case. It looks like the Rust program requires 144 separate packages, which sounds like a nightmare for understanding the whole program or for long-term maintenance. I think my time will be better spent learning more about SQL and SQLite before I bother with Rust.