I've recently rekindled my interest in relational databases and SQL, spurred on by a number of issues with query performance. As a tractable example of poor performance in an existing system I returned to a problem I've written about before, visualizing code review metrics.
Due to my own poor planning in the early stages of collecting data, I ended up with a lousy data model that duplicates much of the data for every entry in a single table. As a solution to part of that problem I ended up writing the following query to de-duplicate rows by selecting the most recent "date" to filter on.
SELECT t1."Review Title",
       t1."Review Creation Date",
       strftime("%Y-%m-%d", t1.date),
       t1."Author Full Name"
FROM reviews t1
WHERE t1.date = (SELECT t2.date
                 FROM reviews t2
                 WHERE t2."Review Title" = t1."Review Title"
                 ORDER BY t2.date DESC
                 LIMIT 1)
  AND t1."Author Full Name" IN
      ("Nolan Prescott", "Coworker A", "Coworker B", "Coworker C")
ORDER BY "date" DESC;
Now, with some time to stew on it and more data arriving every day, there is an obvious performance problem in the query: the nested select is a correlated subquery. This is a fancy term I've recently learned that describes a situation like the following:
...
FROM reviews t1
WHERE t1.date = (SELECT t2.date FROM reviews t2
                 WHERE t2."Review Title" = t1."Review Title"
...
By referencing the outer query's t1 inside the nested query, the database ends up re-evaluating the subquery for each row of the outer query, which makes things really slow. Using the sqlite3 command-line interface you can enable timing as follows:
sqlite> .timer on
sqlite> SELECT t1."Review Title",
...
Run Time: real 6.329 user 6.328000 sys 0.000000
Over 6 seconds to execute for a table with only 8,000 rows: atrocious! Trying to dig a bit deeper and validate my diagnosis, I used explain query plan to get the following, which makes a bit of sense if you squint hard enough at the docs:
selectid  order  from  detail
--------  -----  ----  ----------------------------------------
0         0      0     SCAN TABLE reviews AS t1
0         0      0     EXECUTE CORRELATED SCALAR SUBQUERY 1
1         0      0     SCAN TABLE reviews AS t2
1         0      0     USE TEMP B-TREE FOR ORDER BY
0         0      0     EXECUTE LIST SUBQUERY 2
0         0      0     USE TEMP B-TREE FOR ORDER BY
While it isn't quite as good as an explain plan from Postgres, it at least confirms the correlated subquery on line 2. With this new information in hand it becomes possible to optimize the query; the most obvious goal is simply to remove the need for the correlated subquery. In the above case the entire subquery exists only to de-dupe reviews by selecting the most recent entry, so let's try a group by instead:
sqlite> .timer on
sqlite> select "ID",
               "Review Title",
               "Review Creation Date",
               strftime("%Y-%m-%d", date),
               "Author Full Name",
               "Idle For"
        from reviews
        where "Author Full Name" in
              ("Nolan Prescott",
               "Coworker A",
               "Coworker B",
               "Coworker C")
        group by "ID"
        order by "date" desc;
...
Run Time: real 0.010 user 0.008000 sys 0.004000
And just like that the query is over 600x faster. The query plan for the revised query is as follows; the most notable part is of course the elimination of the correlated scalar subquery and the second table scan:
selectid  order  from  detail
--------  -----  ----  ----------------------------
0         0      0     SCAN TABLE reviews AS t1
0         0      0     EXECUTE LIST SUBQUERY 1
0         0      0     USE TEMP B-TREE FOR GROUP BY
0         0      0     USE TEMP B-TREE FOR ORDER BY
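The same plans can be pulled programmatically through Python's built-in sqlite3 module, which is handy for checking a query before it ships. A minimal sketch, assuming a toy table (the exact plan text varies between SQLite versions, so only the broad shape matters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE reviews ("Review Title" TEXT, date TEXT)')

# EXPLAIN QUERY PLAN returns one row per plan node; the human-readable
# description is always the last column, whatever the version's row shape.
rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT t1."Review Title", t1.date
    FROM reviews t1
    WHERE t1.date = (SELECT t2.date FROM reviews t2
                     WHERE t2."Review Title" = t1."Review Title"
                     ORDER BY t2.date DESC LIMIT 1)
""").fetchall()

plan = "\n".join(str(row[-1]) for row in rows)
print(plan)  # look for the CORRELATED SCALAR SUBQUERY node and the second SCAN
```

The plan depends only on the schema and the query, not the data, so an empty table is enough to spot the correlated subquery.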
SQLite allows a more lax approach to querying than other databases. The above query doesn't work at all in PostgreSQL due to a constraint on how group by is used, requiring every selected column to appear in the group by, sit inside an aggregate function, or be functionally dependent on the grouped columns:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
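SQLite's side of this laxness is easy to demonstrate with toy data (the table and values below are made up for illustration): it happily accepts a bare, ungrouped column in the select list, and when the query also contains max(), SQLite documents that the bare columns are taken from the row holding the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (id INTEGER, date TEXT, title TEXT);
    INSERT INTO reviews VALUES
        (1, '2018-01-01', 'old title'),
        (1, '2018-01-05', 'new title');
""")

# Postgres would reject this outright: title is neither grouped nor
# aggregated. SQLite runs it, and because max() is present the bare
# title column comes from the row with the maximum date.
row = conn.execute(
    "SELECT id, max(date), title FROM reviews GROUP BY id"
).fetchone()
print(row)  # (1, '2018-01-05', 'new title')
```

Without a min() or max() in the select list, the bare columns come from an arbitrary row of each group, which is worth keeping in mind when leaning on this shortcut.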
So while the diagnostics available within SQLite may not be as good as those of bigger database systems, I still think it is unmatched for small projects like this one. In an effort to round out my knowledge of the databases I commonly use, I attempted to translate the optimization to a Postgres instance of the same data and came up with the following:
select t1."ID",
       t1."Review Title",
       t1."Review Creation Date",
       t1.date,
       t1."Author Full Name",
       t1."Idle For"
from reviews t1
join (select "ID", max(date) as date
      from reviews
      group by "ID") t2
  using ("ID")
where t1.date = t2.date
  and t1."Author Full Name" in ('Nolan Prescott',
                                'Coworker A',
                                'Coworker B',
                                'Coworker C')
order by t1.date desc;
It is necessarily more involved to construct a second, uncorrelated query and join against it in lieu of using a group by directly. That said, I think the fundamentals aren't too surprising once you get over the weirdness of joining a table on itself. The before-and-after difference with Postgres is only about a 300x improvement; happily, the smaller speedup is because Postgres was so much faster in the first place.
POSTGRESQL QUERY PLAN
----------------------------------------------------------------------------------------------------------
Unique
-> Sort
Sort Key: t1.date DESC, t1."ID", t1."Review Title", t1."Review Creation Date", t1."Author Full Name", t1."Idle For"
-> Seq Scan on reviews t1
Filter: (("Author Full Name" = ANY ('{"Coworker A","Coworker B","Coworker C","Nolan Prescott"}'::text[]))
AND (date = (SubPlan 1)))
SubPlan 1
-> Limit loops=1795
-> Sort loops=1795
Sort Key: t2.date DESC
-> Seq Scan on reviews t2 loops=1795
Filter: ("Review Title" = t1."Review Title")
Planning time: 0.282 ms
Execution time: 1322.200 ms
POSTGRESQL QUERY PLAN
----------------------------------------------------------------------------------------------------------
Sort
Sort Key: t1.date DESC
-> Hash Join
Hash Cond: ((t1."ID" = reviews."ID") AND (t1.date = (max(reviews.date))))
-> Seq Scan on reviews t1
Filter: ("Author Full Name" = ANY ('{"Coworker A","Coworker B","Coworker C","Nolan Prescott"}'::text[]))
-> Hash
-> HashAggregate
Group Key: reviews."ID"
-> Seq Scan on reviews
Planning time: 0.129 ms
Execution time: 4.488 ms
I've munged a bit of information from an explain analyze in with the sparser explain output above to capture the important bit of the first plan: the number of loops involved (1,795, versus only 1 in the optimized version). Testing the Postgres-compatible query in SQLite reveals similar performance (if not a hair better), which probably makes it the preferred solution, since it is compatible with multiple (probably all) databases.
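That equivalence is easy to sanity-check from Python's built-in sqlite3 module. The sketch below uses a toy schema with made-up rows (the real table has many more columns) and confirms that the correlated original and the uncorrelated join rewrite return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews ("Review Title" TEXT, date TEXT, "Author Full Name" TEXT);
    INSERT INTO reviews VALUES
        ('Fix login bug', '2018-01-01', 'Nolan Prescott'),
        ('Fix login bug', '2018-01-05', 'Nolan Prescott'),  -- duplicate, newer
        ('Add caching',   '2018-02-01', 'Coworker A'),
        ('Add caching',   '2018-02-03', 'Coworker A');      -- duplicate, newer
""")

# The correlated form: the inner SELECT runs once per outer row.
correlated = conn.execute("""
    SELECT t1."Review Title", t1.date
    FROM reviews t1
    WHERE t1.date = (SELECT t2.date FROM reviews t2
                     WHERE t2."Review Title" = t1."Review Title"
                     ORDER BY t2.date DESC LIMIT 1)
    ORDER BY t1.date DESC
""").fetchall()

# The portable rewrite: aggregate once, then join the table against it.
rewritten = conn.execute("""
    SELECT t1."Review Title", t1.date
    FROM reviews t1
    JOIN (SELECT "Review Title", max(date) AS date
          FROM reviews GROUP BY "Review Title") t2
      USING ("Review Title")
    WHERE t1.date = t2.date
    ORDER BY t1.date DESC
""").fetchall()

print(correlated == rewritten)  # True: both keep only the newest row per title
```

Both queries produce the same de-duplicated result; only the amount of work the engine does to get there differs.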
I am really enjoying getting back into SQL more directly, having spent much of my recent time feeling straitjacketed by an ORM. For going a little more in-depth I've found Mastering PostgreSQL in Application Development to be a pretty good introduction to some of the "advanced" features available in Postgres, especially when paired with PostgreSQL Exercises. The documentation that accompanies Postgres is still daunting, though I'm told it is excellent.