I saw a "100-million-row challenge in PHP" posted today and it reminded me of the last time I saw something like this. While I don't know PHP, I thought I'd treat it as an exercise in SQLite, to keep myself in practice with a few of its functions. It turns out there is definitely a point where SQLite may not be a good fit!
The challenge is pretty straightforward and is really aimed at optimizing PHP. There is a large CSV file that is to be aggregated into a specific format:
You'll be parsing millions of CSV lines into a JSON file, with the following rules in mind:
- Each entry in the generated JSON file should be a key-value pair with the page's URL path as the key and an array with the number of visits per day as the value.
- Visits should be sorted by date in ascending order.
- The output should be encoded as a pretty JSON string.
As an example, take the following input:
https://stitcher.io/blog/11-million-rows-in-seconds,2026-01-24T01:16:58+00:00
https://stitcher.io/blog/php-enums,2024-01-24T01:16:58+00:00
https://stitcher.io/blog/11-million-rows-in-seconds,2026-01-24T01:12:11+00:00
https://stitcher.io/blog/11-million-rows-in-seconds,2025-01-24T01:15:20+00:00

Your parser should store the following output in $outputPath as a JSON file:
{
    "\/blog\/11-million-rows-in-seconds": {
        "2025-01-24": 1,
        "2026-01-24": 2
    },
    "\/blog\/php-enums": {
        "2024-01-24": 1
    }
}
I don't care so much about the CSV parsing, so instead I thought I'd look at the data transformation using SQL. The CSV import is pretty easy using the sqlite3 command line:
$ sqlite3
SQLite version 3.50.2 2025-06-28 14:00:48
Connected to a transient in-memory database.
sqlite> create table hits(url, ts);
sqlite> .import -csv data.csv hits
In about six tenths of a second I'm able to load the 1 million line test file. From there I think it makes sense to reformat the data, stripping the protocol and hostname from the URLs:
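The `.import` helper is specific to the sqlite3 shell, but the same load is easy to script. Here is a minimal sketch using Python's stdlib `sqlite3` and `csv` modules; the two sample rows and the temp-file path are stand-ins for the real `data.csv`:

```python
import csv
import os
import sqlite3
import tempfile

# Write a tiny stand-in for data.csv so the sketch is self-contained.
rows = [
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2026-01-24T01:16:58+00:00"),
    ("https://stitcher.io/blog/php-enums", "2024-01-24T01:16:58+00:00"),
]
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# In-memory database, mirroring the transient shell session above.
con = sqlite3.connect(":memory:")
con.execute("create table hits(url, ts)")

# Stream the CSV in; executemany batches the inserts inside a single
# implicit transaction, which matters at the million-row scale.
with open(path, newline="") as f:
    con.executemany("insert into hits values (?, ?)", csv.reader(f))
con.commit()

print(con.execute("select count(*) from hits").fetchone()[0])  # 2
```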
sqlite> select substr(url, 20) from hits limit 5;
/blog/11-million-rows-in-seconds
/blog/php-enums
/blog/11-million-rows-in-seconds
/blog/11-million-rows-in-seconds
And transforming the timestamp into just a YYYY-MM-DD format:
sqlite> select date(ts) from hits limit 5;
2026-01-24
2024-01-24
2026-01-24
2025-01-24
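Both scalar transformations are easy to sanity-check outside the shell as well. A quick sketch with Python's stdlib `sqlite3` module (the sample row is taken from the README example):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# substr is 1-indexed; "https://stitcher.io" is 19 characters, so
# position 20 is the start of the path. date() understands the ISO
# timestamp, including the timezone offset.
row = con.execute(
    "select substr(?1, 20), date(?2)",
    ("https://stitcher.io/blog/php-enums", "2024-01-24T01:16:58+00:00"),
).fetchone()
print(row)  # ('/blog/php-enums', '2024-01-24')
```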
It seems logical to pull that into a CTE with some plausible name,
maybe path_days:
with path_days as (
select substr(url, 20) path, date(ts) day from hits
)
Counting visits per path and day is a plain GROUP BY aggregate. (My first instinct was a window function partitioned by path, but when a query has a GROUP BY, SQLite evaluates window functions over the already-grouped rows, so a running count(*) OVER would yield the index of each day within its path rather than the number of visits on it.) It is easiest to confirm using the 4 line example from the project README, so I loaded that into a separate table to prove the aggregation matches the requirements and example:
with path_days as (
select substr(url, 20) path, date(ts) day from hits
)
select path, day, count(*)
from path_days
group by path, day;
/blog/11-million-rows-in-seconds|2025-01-24|1
/blog/11-million-rows-in-seconds|2026-01-24|2
/blog/php-enums|2024-01-24|1
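The same check can be run as a self-contained script with Python's stdlib `sqlite3` module, with the README's 4-line sample inlined:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table hits(url, ts)")
con.executemany("insert into hits values (?, ?)", [
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2026-01-24T01:16:58+00:00"),
    ("https://stitcher.io/blog/php-enums", "2024-01-24T01:16:58+00:00"),
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2026-01-24T01:12:11+00:00"),
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2025-01-24T01:15:20+00:00"),
])

# Strip host, truncate to date, then count visits per path and day.
rows = con.execute("""
    with path_days as (
        select substr(url, 20) path, date(ts) day from hits
    )
    select path, day, count(*)
    from path_days
    group by path, day
    order by path, day
""").fetchall()
for r in rows:
    print(r)
```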
All that remains is formatting into JSON. In truth this is one of the things that got me wondering about solving this in SQL to begin with: it isn't uncommon for me to query JSON, but constructing it is unusual, so I took the chance to browse the available functions. Another CTE organizes the now-working aggregation into a named section:
with path_days as (
select substr(url, 20) path, date(ts) day from hits
),
path_day_counts as (
select path, day, count(*) c
from path_days
group by path, day
)
The desired result is a JSON object with a key for each path,
whose value is another object mapping each date to its count.
The nested object is itself an aggregation result, so the
trickiest part is really finding the correct function
(json_group_object) and the grouping criteria (the path):
with path_days as (
select substr(url, 20) path, date(ts) day from hits
),
path_day_counts as (
select path, day, count(*) c
from path_days
group by path, day
)
select json_pretty(
json_object(path, json_group_object(day, c))
)
from path_day_counts
group by path;
{
"/blog/11-million-rows-in-seconds": {
"2025-01-24": 1,
"2026-01-24": 2
}
}
{
"/blog/php-enums": {
"2024-01-24": 1
}
}
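The JSON step can be exercised from a script too. A sketch with Python's stdlib `sqlite3` module on the README sample, assuming the bundled SQLite includes the JSON functions; json_pretty needs a newer SQLite than many Python builds ship, so this version pretty-prints with the json module instead:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table hits(url, ts)")
con.executemany("insert into hits values (?, ?)", [
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2026-01-24T01:16:58+00:00"),
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2026-01-24T01:12:11+00:00"),
    ("https://stitcher.io/blog/11-million-rows-in-seconds", "2025-01-24T01:15:20+00:00"),
    ("https://stitcher.io/blog/php-enums", "2024-01-24T01:16:58+00:00"),
])

# One JSON document per path: json_group_object aggregates the
# day/count pairs, json_object wraps them under the path key.
docs = [r[0] for r in con.execute("""
    with path_days as (
        select substr(url, 20) path, date(ts) day from hits
    ),
    path_day_counts as (
        select path, day, count(*) c
        from path_days
        group by path, day
    )
    select json_object(path, json_group_object(day, c))
    from path_day_counts
    group by path
    order by path
""")]
for d in docs:
    print(json.dumps(json.loads(d), indent=4))
```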
And it is working! Now to time execution against the 1 million line sample data:
$ time sqlite3 > /tmp/out.txt <<EOF
create table hits(url, ts);
.import -csv data.csv hits
with path_days as (
select substr(url, 20) path, date(ts) day from hits
),
path_day_counts as (
select path, day, count(*) c
from path_days
group by path, day
)
select json_pretty(
json_object(path, json_group_object(day, c))
)
from path_day_counts
group by path;
EOF
real 0m2.651s
user 0m2.250s
sys 0m0.388s
A bit over two and a half seconds to ingest, process, and aggregate a million rows! That is pretty satisfying, but the real stress test is surely the 100 million line file:
$ time sqlite3 > /tmp/out.txt <<EOF
create table hits(url, ts);
.import -csv big-data.csv hits
with path_days as (
select substr(url, 20) path, date(ts) day from hits
),
path_day_counts as (
select path, day, count(*) c
from path_days
group by path, day
)
select json_pretty(
json_object(path, json_group_object(day, c))
)
from path_day_counts
group by path;
EOF
real 5m21.553s
user 3m26.326s
sys 1m20.567s
Ouch: over five minutes and twenty seconds to handle one hundred million rows. Admittedly, that is a lot of data, about 7GB on disk. I probably wouldn't suggest throwing 100 million rows at SQLite with total disregard for the time and work involved, but at the same time, it isn't that long given how much data there is; that may be workable in some circumstances.
I may yet dig through the optimized PHP answers to try gleaning something interesting about optimization work or profiling in an unfamiliar language but I'm otherwise satisfied with this little exercise in SQL.