Browsing the internet this afternoon I came across an article about fast SQLite inserts. As a fun exercise I thought I would try things out myself.
The post's author needed a large dataset to benchmark indices and was nerd sniped into optimizing first Python, then PyPy, then Rust. Somewhat confusingly, the article title mentions 1 billion records but the article itself benchmarks 100 million. For the sake of comparability (and time!) I aimed for 100 million.
The first problem I ran into was recreating all of the benchmarks provided in the example repository. While I installed PyPy without issue, I immediately hit a problem with digit-grouped numbers: the interpreter would choke on literals formatted like 100_000_000 (underscore digit grouping arrived in Python 3.6, which my PyPy build evidently predates). I tried reformatting those, but it didn't bode well for the actual execution; the real issue that stopped me from fully re-running the benchmark suite was the exceptions thrown while executing the Python scripts. I suspect the author is on a newer version of Python than I can readily obtain on my machine, so I have opted instead to try to recreate only the best performing benchmarks.
It might be a little unfair, but I will include here the time it took to build the Rust program because I was surprised at how slow it was. I guess on the plus side it seems to use multiple CPUs.
$ time cargo build --release --quiet --bin basic_batched
real 1m23.458s
user 9m23.622s
sys 0m18.180s
$ time ./target/release/basic_batched
real 0m29.057s
user 0m27.605s
sys 0m1.452s
Although I didn't see a runtime improvement from the threaded Rust approach, my execution times were, generally speaking, roughly similar to the original article's for the best solutions.
$ time ./target/release/threaded_batched
real 0m29.480s
user 0m39.956s
sys 0m7.545s
I have been having fun lately with SQL and my first thought when reading the article was to wonder how long a pure SQL solution would take. I have not yet had a chance to try out recursive common table expressions in SQLite so this seemed like a great time to learn something.
Using the requirements outlined in the original article:
The area column would hold a six-digit area code (any six digits would do, no validation). The age would be any of 5, 10, or 15. The active column is either 0 or 1.
begin transaction;
create table user
(
    id integer primary key,
    area char(6),
    age integer not null,
    active integer not null
);
insert into user(area, age, active)
with recursive
    randomUser(area, age, active) as (
        -- initial case: one randomly generated row
        values(printf('%.6d', substr(abs(random()), 1, 6)),
               case abs(random()) % 3
                   when 0 then 5
                   when 1 then 10
                   when 2 then 15
               end,
               abs(random()) % 2)
        union all
        -- recursive step: each produced row feeds the next random row
        select printf('%.6d', substr(abs(random()), 1, 6)),
               case abs(random()) % 3
                   when 0 then 5
                   when 1 then 10
                   when 2 then 15
               end,
               abs(random()) % 2
        from randomUser
    )
select * from randomUser limit 100000000;
commit;
I would characterize this as an interesting, if slightly surprising, use of a recursive CTE. Firstly, by defining the insert into using just area, age, and active, SQLite will provide automatic IDs for the primary key. The values portion of the recursive CTE defines the initial case for the "table", with the following union all defining the subsequent values of the table. This is an interesting case because both are populated from the same expressions (basically calls to random with some munging).
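To see that initial-case/recursive-step shape in isolation, here is a minimal counter built the same way; this is my own toy sketch, not something from the original article:

with recursive counter(x) as (
    values(1)  -- initial case: a single row containing 1
    union all
    select x + 1 from counter where x < 5  -- each row generates the next
)
select x from counter;  -- yields 1, 2, 3, 4, 5

The bulk-data script above has the same structure, except its recursive step ignores the previous row's values and just rolls fresh random numbers, and termination comes from the outer limit rather than a where clause.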
Highlights of the data munging include:
- printf to zero-pad an at-most six-digit positive number
- a case expression to map a modulo result onto one of the three possible values for age
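Both expressions are easy to poke at in isolation; here is a quick sketch you can paste into the sqlite3 shell (output will vary with each call to random()):

-- zero-pad the first (at most) six digits of a random positive integer
select printf('%.6d', substr(abs(random()), 1, 6));

-- map a modulo result onto one of the three allowed ages
select case abs(random()) % 3
           when 0 then 5
           when 1 then 10
           when 2 then 15
       end;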
$ time sqlite3 benchmark.db < bulk-data.sql
real 1m54.957s
user 1m53.100s
sys 0m1.052s
Just like that, a pretty naive pure-SQL approach would seem to beat out, or at least be competitive with, Python and PyPy. While it is still roughly 4x slower than the best Rust execution, I think there is room to improve via the generate_series extension.
generate_series
begin transaction;
create table user
(
    id integer primary key,
    area char(6),
    age integer not null,
    active integer not null
);
insert into user(area, age, active)
select printf('%.6d', substr(abs(random()), 1, 6)),
       case abs(random()) % 3
           when 0 then 5
           when 1 then 10
           when 2 then 15
       end,
       abs(random()) % 2
from generate_series
limit 100000000;
commit;
I am really impressed with how clean this reads. I know that is personal preference, but where the recursive CTE is interesting, this is almost pleasingly uninteresting. It is also nearly twice as fast:
$ time sqlite3 benchmark.db < generate-series-data.sql
real 1m0.977s
user 0m58.965s
sys 0m1.137s
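One caveat worth flagging, based on my reading of the SQLite docs rather than the original article: recent builds of the series extension require an explicit START argument, so the bare from generate_series form may error out. Passing the bounds directly is equivalent and makes the limit clause unnecessary:

-- explicit bounds; equivalent to 'from generate_series limit 100000000'
select count(*) from generate_series(1, 100000000);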
All told, I'm happy to find that ~9 lines of SQL lands within roughly 2x of optimized Rust; generate_series is very neat and a quality addition to the SQLite toolbox. Even ignoring the compilation times, I still think Rust seems like the wrong tool in this case. It looks like the Rust program requires 144 separate packages, which sounds like a nightmare for understanding the whole program or for long-term maintenance. I think my time will be better spent learning more about SQL and SQLite before I bother with Rust.