A Database Indexing Pitfall

2024-10-27

Here's an interesting thing I found this week. In different relational databases you might find that despite indexing a column a sequential scan is still required for querying. Here's a hint for PostgreSQL and SQLite.

PostgreSQL

In my particular case I had an identifier that is a delimited string with an implied hierarchy, like 001-002-003 or 099-002-003. The whole thing is unique but the problem query was for a partial match such as "everything in 099-002". With a table like this:

create table foo(some_id text);

That meant the query was like this:

select some_id 
 where some_id 
  like '099-002%';

In the case of PostgreSQL despite an index:

create index some_id_idx on foo (some_id);

The query requires a sequential scan:

# explain analyze select * from foo where some_id like '099-002-%' limit 1;
                                 QUERY PLAN
-----------------------------------------------------------------------------
  Limit(cost=0.00..179.06 rows=1 width=14) (actual time=262.986..262.987 rows=0 loops=1)
    ->Seq Scan on foo(cost=0.00..89528.00 rows=500 width=14) (actual time=262.983..262.983 rows=0 loops=1)
         Filter: ((some_id)::text ~~ '099-002-%'::text)
         Rows Removed by Filter: 5000000
Planning Time: 0.325 ms
Execution Time: 263.025 ms

The missing piece here was the collation on the database and consequent default collation on the table and column -- it was en_US.UTF-8. What this means is the default btree index on the some_id column isn't used for prefix matching in the LIKE query. I've mentioned it before, the PostgreSQL documentation is excellent but there can be a chicken and egg problem of knowing what to look for:

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard “C” locale.

The solution then is easy enough, the relevant operator class needs to be added to an index:

create index some_id_text_pattern_idx on foo (some_id text_pattern_ops);

And immediately the more expected thing can be observed:

# explain analyze select * from foo where some_id like '099-002-%' limit 1;
                                 QUERY PLAN
---------------------------------------------------------------------------------
Limit(cost=0.43..2.72 rows=1 width=48) (actual time=0.077..0.078 rows=0 loops=1)
  ->Index Only Scan using some_id_text_pattern_idx on foo(cost=0.43..57130.93 rows=25000 width=48) (actual time=0.074..0.075 rows=0 loops=1)
       Index Cond: ((some_id ~>=~ '099-002-'::text) AND (some_id ~<~ '099-002.'::text))
       Filter: ((some_id)::text ~~ '099-002-%'::text)
       Heap Fetches: 0
Planning Time: 0.845 ms
Execution Time: 0.124 ms

SQLite

There is a slightly different issue of the same sort in SQLite. Here I generate ten million records of sample data to try slowing things down sufficiently to demonstrate:

create table foo(some_id text);
create index some_id_idx on foo (some_id);

with recursive numbers as (
  select 0 as num
  union all
  select num + 1 from numbers where num < 10000000
)
insert into foo
select 
  printf('%03d-%03d-%03d', 
	 (num / 1000000) % 1000,
	 (num / 1000) % 1000,
	 num % 1000)
from numbers;

Executing a prefix match query exhibits a similar problem with the database scanning the table despite the index:

sqlite> .eqp on
sqlite> .timer on
sqlite> select some_id from foo where some_id like '009-996-%' limit 1;
QUERY PLAN
`--SCAN foo
┌─────────────┐
│   some_id   │
├─────────────┤
│ 009-996-000 │
└─────────────┘
Run Time: real 0.442 user 0.439346 sys 0.000000

In the case of SQLite though the issue is due to the default behavior of LIKE which is case insensitive and consequently defeats the index. If instead you opt into case sensitivity things behave as expected:

sqlite> PRAGMA case_sensitive_like=1;

sqlite> select some_id from foo where some_id like '009-996-%' limit 1;
QUERY PLAN
`--SEARCH foo USING COVERING INDEX some_id_idx (some_id>? AND some_id<?)
┌─────────────┐
│   some_id   │
├─────────────┤
│ 009-996-000 │
└─────────────┘
Run Time: real 0.000 user 0.000187 sys 0.000021

Alternatively, it is also possible to create the index as case insensitive:

create index some_id_idx_nocase on foo (some_id collate nocase);

Computed Indexes

The final thought I had is a bit more domain specific but in the case I've been describing the identifier is a composite piece of data represented as a string. The use of a LIKE query is to gather everything under the hierarchy of the first two pieces of data. More useful is probably expressing the queries in this way -- assuming it isn't too easy to simply change the table structure this would seem like a good use for a view:

sqlite> create view disaggregated(part_a, part_b, part_c) as 
    select substr(some_id, 1, 3) part_a, 
           substr(some_id, 5, 3) part_b, 
           substr(some_id, 9, 3) part_c 
      from foo;
sqlite> 
sqlite> select * from disaggregated limit 5;
┌────────┬────────┬────────┐
│ part_a │ part_b │ part_c │
├────────┼────────┼────────┤
│ 000    │ 000    │ 000    │
│ 000    │ 000    │ 001    │
│ 000    │ 000    │ 002    │
│ 000    │ 000    │ 003    │
│ 000    │ 000    │ 004    │
└────────┴────────┴────────┘

But then queries will miss out on the benefits of indexing because of the calls to substr over the indexed column. It turns out in SQLite it is entirely possible to index over computed fields like this:

sqlite> create index view_idx_a on foo(substr(some_id, 1, 3));
sqlite> create index view_idx_b on foo(substr(some_id, 5, 3));
sqlite> create index view_idx_c on foo(substr(some_id, 9, 3));

sqlite> -- this last example is a compound index over two computed columns!
sqlite> create index view_idx_ab on foo(substr(some_id, 1, 3), substr(some_id, 5, 3));

That's all it takes to start benefiting from both the more intuitive querying and the performance of the indexes:

sqlite> select * from disaggregated where part_a = '009' and part_b = '010' limit 5;
QUERY PLAN
`--SEARCH foo USING INDEX view_idx_ab (<expr>=? AND <expr>=?)
┌────────┬────────┬────────┐
│ part_a │ part_b │ part_c │
├────────┼────────┼────────┤
│ 009    │ 010    │ 000    │
│ 009    │ 010    │ 001    │
│ 009    │ 010    │ 002    │
│ 009    │ 010    │ 003    │
│ 009    │ 010    │ 004    │
└────────┴────────┴────────┘
Run Time: real 0.000 user 0.000218 sys 0.000028