SQLite full-text search (and TCL)

2023-01-07

Another neat trick or two built into SQLite. I previously worked up a few different ways of building a reverse index in the service of searching a corpus like this site. I later worked out a better way of doing full-text search using a stemming algorithm. SQLite can do all of that out of the box, which aligns nicely with another experiment I did rewriting my site generator in TCL.

Site Generation with SQLite

The generator is mostly unchanged from the last time I poked around with it: TCL globs files and does some rough parsing to stuff records into SQLite, from which the latest posts are pulled out for the generated pages (index, archive, atom feed).
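
In rough strokes the loop looks something like this sketch, where parse_post is a hypothetical stand-in for the real parsing:

package require sqlite3
sqlite3 db posts.db

foreach path [glob posts/*.html] {
    # parse_post is a stand-in for the real parsing; assume it returns
    # the slug, title, date, leader, and body for one post
    lassign [parse_post $path] path_slug title date leader body
    db eval {
        insert or replace into post(path, path_slug, title, date, leader, body)
        values (:path, :path_slug, :title, :date, :leader, :body)
    }
}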

Full-Text Search

Previously there were two tables and only one of them was very interesting: the one containing the data parsed out of the posts (including the entire post body). The relevant addition to enable full-text search is a single virtual table using FTS5. One thing I found especially neat is that the virtual table can take another table as the source of search content, so the only additional data ends up being the search indexes. In my case I wanted to make both the post title and the body searchable (technically the title is already included in the "body", but I wanted to try a multi-column search corpus).

    create table if not exists post(path text primary key,
                                    path_slug,
                                    title,
                                    date datetime,
                                    leader,
                                    body);

    create table if not exists prior(id integer primary key,
                                     dt datetime default 0);

    create virtual table if not exists post_fts using fts5(path, title, body, content=post);

Just like that, I now get a MATCH keyword that accepts a mini query language. The syntax is pretty intuitive to me, but then I don't mind regular expressions or APL, so who knows?

Where my previous attempts at text search arrived at solutions that, at best, implicitly supported either OR clauses or AND clauses, SQLite's query language gives you both and more. So any of the following (a few illustrative examples of the syntax) do the right thing:
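
    -- both words must appear (AND is implicit)
    select path from post_fts where post_fts match 'sqlite tcl';

    -- either word may appear
    select path from post_fts where post_fts match 'sqlite OR tcl';

    -- an exact phrase
    select path from post_fts where post_fts match '"full-text search"';

    -- prefix matching
    select path from post_fts where post_fts match 'gener*';

    -- restrict the match to a single column
    select path from post_fts where post_fts match 'title:sqlite';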

Automatic Maintenance

It would be a pain for the search index to get out of sync with the content of the database, so do as the documentation suggests and slap a few triggers onto the primary table so that the index is refreshed whenever it changes:

    create trigger if not exists post_ai after insert on post begin
      insert into post_fts(rowid, path, title, body) values (new.rowid, new.path, new.title, new.body);
    end;

    create trigger if not exists post_ad after delete on post begin
      insert into post_fts(post_fts, rowid, path, title, body) values('delete', old.rowid, old.path, old.title, old.body);
    end;

    create trigger if not exists post_au after update on post begin
      insert into post_fts(post_fts, rowid, path, title, body) values('delete', old.rowid, old.path, old.title, old.body);
      insert into post_fts(rowid, path, title, body) values (new.rowid, new.path, new.title, new.body);
    end;
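
Should the index ever drift out of sync anyway, FTS5 can rebuild it wholesale from the content table with a special insert (the 'rebuild' command from the FTS5 documentation):

    insert into post_fts(post_fts) values('rebuild');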

With the virtual table and triggers added to the schema (in 11 lines of code), everything continues to work as it did previously, but with support for full-text search added. How neat is that? The data overhead to support all of this is about +60% over the database without full-text search: without the new schema the database was 1352 KiB, and with it I am at 2196 KiB. I am fine with that amount of additional disk usage to support the ~130 posts here at the time of writing!

CGI with TCL

Of course, adding the ability to search is separate from actually exposing that new capability. My whole site is static, and until now the generation process has been focused on keeping it that way. I am not really interested in moving things to a dynamic model; what I am interested in is tacking a modest bit of dynamic behavior onto my static site. For this I reached for CGI.

Why CGI? With a capable server the "search page" can be as little as 6 lines of HTML:

<html>
  <form action="/cgi-bin/search" method="POST" enctype="multipart/form-data">
    <input type="text" name="terms" placeholder="your search terms...">
    <input type="submit">
  </form>
</html>

All it has to do is shovel an input field to the appropriate script, which ends up being another twenty-odd lines of code:

#!/usr/bin/env tclsh
package require sqlite3
package require ncgi

# open read-only and refuse to create the database if it is missing
sqlite3 db ../posts.db -create false -readonly true

# pull the "terms" field out of the submitted form
::ncgi::parse
set terms [::ncgi::value terms]

puts {Content-Type: text/html

<head><meta charset="utf-8"></head>
<h2>Search Results for:</h2>
<ul>
}
# a bad query (e.g. malformed MATCH syntax) throws; catch it rather than erroring out
if { [ catch {
    db eval {
        select post.title, path_slug, date
          from post_fts
          join post using(path)
         where post_fts.body match :terms
         order by date desc
    } {
        puts "<li><a href=\"/$path_slug\">$title</a></li>"
    }
} ] } { puts {<h1>you screwed up</h1>} }
puts {</ul>}
db close

It is wild how easy that is compared to the other available options. Additionally, since this whole site is static, in the event the database is trashed or the script becomes unmanageable the rest of it continues to work.

In a small concession to securing my database I've opted to make the connection read-only and to decline to create the database if it does not exist. As seen in the creation process, this isn't too important because the database is only used for search. I do think a read-only connection can aid in locking down the CGI process, though, depending on how it is run. I'm thinking specifically of how the CGI process (and probably the web server as a consequence) could be a permission-less dynamic user with read-only access to the host system, appropriate resource limits, and execution limited to a finite set of CGI programs.
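
As a sketch of that idea in systemd terms (untested; the server binary and paths here are hypothetical, and the option names come from systemd.exec(5)):

[Service]
# whatever CGI-capable web server ends up in front of the scripts
ExecStart=/usr/local/bin/cgi-httpd
# run as an ephemeral, permission-less user
DynamicUser=yes
# present a read-only view of the filesystem
ProtectSystem=strict
# the directory holding posts.db and the CGI programs
ReadOnlyPaths=/srv/posts
# modest resource limits
MemoryMax=64M
NoNewPrivileges=yes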

A Cool Side-Effect

While I've been sitting on this post for a few days, I was surprised to see some recent discussion online about CGI. People started rehashing the same old arguments, amongst which is the overhead of process-per-request and all that entails. I don't have any concerns for my own use here, but I did become curious about how I might measure memory use, for example if I wanted to compare my interpreted program with a compiled version in situ.

I found this delightfully easy to do because CGI is just executing a program, so why not interpose an intermediary that measures memory usage? /usr/bin/time, for instance. Renaming the real script to /cgi-bin/real-search, /cgi-bin/search becomes a small wrapper:

#!/usr/bin/sh

/usr/bin/time --verbose cgi-bin/real-search

The result is a limited but useful summary of the CGI script's execution, written to stderr, while the response itself is unchanged:

127.0.0.1 - - [07/Jan/2023 16:21:16] "POST /cgi-bin/search HTTP/1.1" 200 -
	Command being timed: "cgi-bin/real-search"
	User time (seconds): 0.02
	System time (seconds): 0.00
	Percent of CPU this job got: 93%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 8256
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 933
	Voluntary context switches: 1
	Involuntary context switches: 1
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

I can imagine a similar setup with a more robust profiling tool, but this is a usable result for approximately 2 minutes of work.
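
For example (a sketch along the same lines, assuming valgrind is installed on the host), the wrapper could swap in a heap profiler instead:

#!/usr/bin/sh

# massif writes its heap profile to the named file; the HTTP response is unchanged
valgrind --tool=massif --massif-out-file=/tmp/search-massif.%p cgi-bin/real-search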

What's the Holdup?

All of the above works like a charm. I have not, however, included the CGI search page in this site yet (I'm thinking it could be a nice addition to the archive page eventually). The last real impediment is Nginx, which does not support CGI. It has support for FastCGI, but that looks like another kettle of fish entirely and not one I am too keen to dig into. For a while now I've been chafing at a few minor things with Nginx, especially after learning more about HAProxy. I've had the idea that I might switch from Nginx as a reverse proxy and web server to HAProxy for proxying plus a dedicated web server for serving HTML; it hasn't become a priority, so I haven't done it. Now there is an argument for serving CGI as well.

My Nginx configuration isn't too complicated but for two things which I haven't spent the time looking into.

No one is (yet!) clamoring for search on my site, so I can put off deciding for a while longer. I have enjoyed being able to search while writing this post, though, and found it immediately useful. When writing posts I tend to use a local Python web server to ensure my internal links work:

python3 -m http.server --dir posts/

To test out my CGI scripts as well, it turns out I can opt into the behavior with the same module:

python3 -m http.server --dir posts/ --cgi
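
With that running, the search endpoint can be exercised without a browser at all; curl's -F flag sends the same multipart/form-data POST the form produces (8000 being http.server's default port):

curl -F terms=sqlite http://localhost:8000/cgi-bin/search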