I've recently rekindled my interest in relational databases and SQL, spurred on by a number of issues with query performance. As a tractable example of poor performance in an existing system I returned to a problem I've written about before, visualizing code review metrics.
Due to my own poor planning in the early stages of collecting data, I ended up with a lousy data model that duplicates much of the data for every entry in a single table. As a solution to part of that problem I ended up writing the following query to de-duplicate rows by selecting the most recent "date" to filter on.
SELECT t1."Review Title",
       t1."Review Creation Date",
       strftime("%Y-%m-%d", t1.date),
       t1."Author Full Name"
FROM reviews t1
WHERE t1.date = (SELECT t2.date
                 FROM reviews t2
                 WHERE t2."Review Title" = t1."Review Title"
                 ORDER BY t2.date DESC
                 LIMIT 1)
  AND t1."Author Full Name" IN
      ("Nolan Prescott", "Coworker A", "Coworker B", "Coworker C")
ORDER BY "date" DESC;
Now, with some time to stew on it and more data arriving every day, there is an obvious performance problem in the query: the nested select is a correlated subquery. This is a fancy term I've recently learned that describes a situation like the following:
...
FROM reviews t1
WHERE t1.date = (SELECT t2.date FROM reviews t2
                 WHERE t2."Review Title" = t1."Review Title"
...
By referencing the outer query's t1 inside the nested query, the database ends up re-evaluating the subquery for each row of the outer query, which makes things really slow. Using the sqlite3 command-line interface you can enable timing as follows:
sqlite> .timer on
sqlite> SELECT t1."Review Title",
...
Run Time: real 6.329 user 6.328000 sys 0.000000
Over 6 seconds to execute for a table with only 8,000 rows: atrocious! Trying to dig a bit deeper and validate my diagnosis, I used explain query plan to get the following, which makes a bit of sense if you squint hard enough at the docs:
selectid  order  from  detail
--------  -----  ----  ----------------------------------------
0         0      0     SCAN TABLE reviews AS t1
0         0      0     EXECUTE CORRELATED SCALAR SUBQUERY 1
1         0      0     SCAN TABLE reviews AS t2
1         0      0     USE TEMP B-TREE FOR ORDER BY
0         0      0     EXECUTE LIST SUBQUERY 2
0         0      0     USE TEMP B-TREE FOR ORDER BY
While it isn't quite as good as an explain plan from Postgres, it at least confirms the correlated subquery on line 2. With this new information in hand it becomes possible to optimize the query; the most obvious goal is simply to remove the need for the correlated subquery. In the above case the entire subquery exists only to de-dupe reviews by selecting the most recent entry, so let's try a group by instead:
sqlite> .timer on
sqlite> select "ID",
               "Review Title",
               "Review Creation Date",
               strftime("%Y-%m-%d", date),
               "Author Full Name",
               "Idle For"
        from reviews
        where "Author Full Name" in
              ("Nolan Prescott",
               "Coworker A",
               "Coworker B",
               "Coworker C")
        group by "ID"
        order by "date" desc;
...
Run Time: real 0.010 user 0.008000 sys 0.004000
And just like that the query is over 600x faster. The query plan for the revised query is as follows; the most notable part is of course the elimination of the correlated scalar subquery and the second table scan:
selectid  order  from  detail
--------  -----  ----  ----------------------------
0         0      0     SCAN TABLE reviews AS t1
0         0      0     EXECUTE LIST SUBQUERY 1
0         0      0     USE TEMP B-TREE FOR GROUP BY
0         0      0     USE TEMP B-TREE FOR ORDER BY
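The same plans can be pulled programmatically through Python's built-in sqlite3 module, which is handy for checking a query before it ships. A minimal sketch, assuming a toy table (the exact plan text varies between SQLite versions, so only the broad shape matters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE reviews ("Review Title" TEXT, date TEXT)')

# EXPLAIN QUERY PLAN returns one row per plan node; the human-readable
# description is always the last column, whatever the version's row shape.
rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT t1."Review Title", t1.date
    FROM reviews t1
    WHERE t1.date = (SELECT t2.date FROM reviews t2
                     WHERE t2."Review Title" = t1."Review Title"
                     ORDER BY t2.date DESC LIMIT 1)
""").fetchall()

plan = "\n".join(str(row[-1]) for row in rows)
print(plan)  # look for the CORRELATED SCALAR SUBQUERY node and the second SCAN
```

The plan depends only on the schema and the query, not the data, so an empty table is enough to spot the correlated subquery.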
SQLite allows a more lax approach to querying than other databases. The above query doesn't work at all in PostgreSQL due to a constraint on how group by is used, requiring every selected column to appear in the group by, sit inside an aggregate function, or be functionally dependent on the grouped columns:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
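SQLite's side of this laxness is easy to demonstrate with toy data (the table and values below are made up for illustration): it happily accepts a bare, ungrouped column in the select list, and when the query also contains max(), SQLite documents that the bare columns are taken from the row holding the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (id INTEGER, date TEXT, title TEXT);
    INSERT INTO reviews VALUES
        (1, '2018-01-01', 'old title'),
        (1, '2018-01-05', 'new title');
""")

# Postgres would reject this outright: title is neither grouped nor
# aggregated. SQLite runs it, and because max() is present the bare
# title column comes from the row with the maximum date.
row = conn.execute(
    "SELECT id, max(date), title FROM reviews GROUP BY id"
).fetchone()
print(row)  # (1, '2018-01-05', 'new title')
```

Without a min() or max() in the select list, the bare columns come from an arbitrary row of each group, which is worth keeping in mind when leaning on this shortcut.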
So while the diagnostics available within SQLite may not be as good as those of bigger database systems, I still think it is unmatched for small projects like this one. In an effort to round out my knowledge of the databases I commonly use, I attempted to translate the optimization to a Postgres instance of the same data and came up with the following:
select t1."ID",
       t1."Review Title",
       t1."Review Creation Date",
       t1.date,
       t1."Author Full Name",
       t1."Idle For"
from reviews t1
join (select "ID", max(date) as date
      from reviews
      group by "ID") t2
  using ("ID")
where t1.date = t2.date
  and t1."Author Full Name" in ('Nolan Prescott',
                                'Coworker A',
                                'Coworker B',
                                'Coworker C')
order by t1.date desc;
It is necessarily more involved to construct a second, uncorrelated query and join against it in lieu of using a group by directly. That said, I think the fundamentals aren't too surprising once you get over the weirdness of joining a table on itself. The before-and-after difference with Postgres is only about a 300x improvement; happily, the smaller speedup is because Postgres was so much faster in the first place.
POSTGRESQL QUERY PLAN
----------------------------------------------------------------------------------------------------------
Unique
-> Sort
Sort Key: t1.date DESC, t1."ID", t1."Review Title", t1."Review Creation Date", t1."Author Full Name", t1."Idle For"
-> Seq Scan on reviews t1
Filter: (("Author Full Name" = ANY ('{"Coworker A","Coworker B","Coworker C","Nolan Prescott"}'::text[]))
AND (date = (SubPlan 1)))
SubPlan 1
-> Limit loops=1795
-> Sort loops=1795
Sort Key: t2.date DESC
-> Seq Scan on reviews t2 loops=1795
Filter: ("Review Title" = t1."Review Title")
Planning time: 0.282 ms
Execution time: 1322.200 ms
POSTGRESQL QUERY PLAN
----------------------------------------------------------------------------------------------------------
Sort
Sort Key: t1.date DESC
-> Hash Join
Hash Cond: ((t1."ID" = reviews."ID") AND (t1.date = (max(reviews.date))))
-> Seq Scan on reviews t1
Filter: ("Author Full Name" = ANY ('{"Coworker A","Coworker B","Coworker C","Nolan Prescott"}'::text[]))
-> Hash
-> HashAggregate
Group Key: reviews."ID"
-> Seq Scan on reviews
Planning time: 0.129 ms
Execution time: 4.488 ms
I've munged a bit of information from an explain analyze in with the sparser explain output above to capture the important bit of the first plan: the number of loops involved (1,795, versus only 1 in the optimized version). Testing the Postgres-compatible query in SQLite reveals similar performance (if not a hair better), which probably makes it the preferred solution, since it is compatible with multiple (probably all) databases.
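That equivalence is easy to sanity-check from Python's built-in sqlite3 module. The sketch below uses a toy schema with made-up rows (the real table has many more columns) and confirms that the correlated original and the uncorrelated join rewrite return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews ("Review Title" TEXT, date TEXT, "Author Full Name" TEXT);
    INSERT INTO reviews VALUES
        ('Fix login bug', '2018-01-01', 'Nolan Prescott'),
        ('Fix login bug', '2018-01-05', 'Nolan Prescott'),  -- duplicate, newer
        ('Add caching',   '2018-02-01', 'Coworker A'),
        ('Add caching',   '2018-02-03', 'Coworker A');      -- duplicate, newer
""")

# The correlated form: the inner SELECT runs once per outer row.
correlated = conn.execute("""
    SELECT t1."Review Title", t1.date
    FROM reviews t1
    WHERE t1.date = (SELECT t2.date FROM reviews t2
                     WHERE t2."Review Title" = t1."Review Title"
                     ORDER BY t2.date DESC LIMIT 1)
    ORDER BY t1.date DESC
""").fetchall()

# The portable rewrite: aggregate once, then join the table against it.
rewritten = conn.execute("""
    SELECT t1."Review Title", t1.date
    FROM reviews t1
    JOIN (SELECT "Review Title", max(date) AS date
          FROM reviews GROUP BY "Review Title") t2
      USING ("Review Title")
    WHERE t1.date = t2.date
    ORDER BY t1.date DESC
""").fetchall()

print(correlated == rewritten)  # True: both keep only the newest row per title
```

Both queries produce the same de-duplicated result; only the amount of work the engine does to get there differs.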
I am really enjoying getting back into SQL more directly, having spent much of my recent time feeling straitjacketed by an ORM. For going a little more in-depth I've found Mastering PostgreSQL in Application Development to be a pretty good introduction to some of the "advanced" features available in Postgres, especially when paired with PostgreSQL Exercises. The documentation that accompanies Postgres is still daunting, though I'm told it is excellent.