Static sites are neat, but what if I wanted features of a dynamic website? Searching, for example, could prove interesting or useful but is typically done server-side. Here I detail a fun experiment in implementing search client-side for this very blog.
It's hard to beat the shell for quick and dirty text processing; this is the pipeline I first prototyped while exploring a kind of "keyword" extraction algorithm. It doesn't (yet) handle anything like scoring or weighting. In fact, calling it "keyword extraction" is probably too generous: I latched onto the word when I envisioned it as a means to replace tagging blog posts (which it should work for).
tr -c '[:alpha:]' ' ' <post.txt |  # replace anything non-alphabetic with a space
tr ' ' '\n' |                      # one word per line
tr A-Z a-z |                       # lowercase everything
sed '/^[a-z]$/d' |                 # drop single-letter words
sort | uniq |                      # sorted, unique terms
comm -23 - stop-words.txt |        # drop stop words (comm needs both inputs sorted)
sed '/^$/d'                        # strip any leftover blank lines
The list of stop words came from this post. In the interest of getting started, I haven't bothered checking how comprehensive it is and may revisit it later. For now the complete list is as follows:
| stopwords | | | | | |
|---|---|---|---|---|---|
| a | but | full | moreover | seeming | together | 
| about | by | further | most | seems | too | 
| above | call | get | mostly | serious | top | 
| across | can | give | move | several | toward | 
| after | cannot | go | much | she | towards | 
| afterwards | cant | had | must | should | twelve | 
| again | co | has | my | show | twenty | 
| against | computer | hasnt | myself | side | two | 
| all | con | have | name | since | un | 
| almost | could | he | namely | sincere | under | 
| alone | couldnt | hence | neither | six | until | 
| along | cry | her | never | sixty | up | 
| already | de | here | nevertheless | so | upon | 
| also | describe | hereafter | next | some | us | 
| although | detail | hereby | nine | somehow | very | 
| always | do | herein | no | someone | via | 
| am | done | hereupon | nobody | something | was | 
| among | down | hers | none | sometime | we | 
| amongst | due | herself | noone | sometimes | well | 
| amoungst | during | him | nor | somewhere | were | 
| amount | each | himself | not | still | what | 
| an | eg | his | nothing | such | whatever | 
| and | eight | how | now | system | when | 
| another | either | however | nowhere | take | whence | 
| any | eleven | hundred | of | ten | whenever | 
| anyhow | else | i | off | than | where | 
| anyone | elsewhere | ie | often | that | whereafter | 
| anything | empty | if | on | the | whereas | 
| anyway | enough | in | once | their | whereby | 
| anywhere | etc | inc | one | them | wherein | 
| are | even | indeed | only | themselves | whereupon | 
| around | ever | interest | onto | then | wherever | 
| as | every | into | or | thence | whether | 
| at | everyone | is | other | there | which | 
| back | everything | it | others | thereafter | while | 
| be | everywhere | its | otherwise | thereby | whither | 
| became | except | itself | our | therefore | who | 
| because | few | keep | ours | therein | whoever | 
| become | fifteen | last | ourselves | thereupon | whole | 
| becomes | fify | latter | out | these | whom | 
| becoming | fill | latterly | over | they | whose | 
| been | find | least | own | thick | why | 
| before | fire | less | part | thin | will | 
| beforehand | first | ltd | per | third | with | 
| behind | five | made | perhaps | this | within | 
| being | for | many | please | those | without | 
| below | former | may | put | though | would | 
| beside | formerly | me | rather | three | yet | 
| besides | forty | meanwhile | re | through | you | 
| between | found | might | same | throughout | your | 
| beyond | four | mill | see | thru | yours | 
| bill | from | mine | seem | thus | yourself | 
| both | front | more | seemed | to | yourselves | 
| bottom | | | | | |
As a quick check on the size or scale of search terms, I ran the above script over every post on this blog and was left with approximately 4800 unique "terms". While a little high, this is within the realm of possibility for pushing down to the client. Interested in pursuing this further, I re-implemented the above shell pipeline in Python approximately as follows:
import re

with open('post.md') as f:
    post = f.read()

# stopwords_list is the stop-word table from above
stopwords = set(stopwords_list)

def interesting_words(text):
    # replace anything non-alphabetic with a space
    only_alpha = re.sub(r'[^A-Za-z]', ' ', text)
    # split() handles runs of whitespace, so no need to collapse them first
    words = only_alpha.split()
    # deduplicate, ignoring case
    return {word.lower() for word in words}

unique_words = interesting_words(post) - stopwords
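If you want to reproduce the ~4800-term check with the Python version instead of the shell script, a sketch might look like this (the `posts/` directory and `*.md` glob are assumptions about my layout, not gospel):

```python
from pathlib import Path

# collect the interesting words across every post on the blog
all_terms = set()
for path in Path('posts').glob('*.md'):  # hypothetical source layout
    all_terms |= interesting_words(path.read_text()) - stopwords

print(len(all_terms))  # roughly 4800 for this blog
```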
I admittedly don't know much about this kind of search algorithm, so my first attempt may be a little lame. Fundamentally, "search" will be a two-step look-up: first from a single term to a list of post IDs, and then from those IDs to the posts themselves. There's certainly room for improvement in handling "phrases", but I'd like to see how the base case works first.
from collections import defaultdict

# pseudo-code that requires linking into the static-site generation
term_map = defaultdict(list)
for post_id, post in enumerate(all_posts):
    for term in post['unique_words']:
        term_map[term].append(post_id)
Which results in something like:
| term | IDs | 
|---|---|
| accumulator | 3 | 
| actually | 1,3 | 
| add | 1,2,3 | 
| addr | … | 
and a second table mapping IDs to links:

| ID | Link |
|---|---|
| 1 | 2017/little-languages.html | 
| 2 | … | 
| 3 | … | 
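Writing the two tables out as a single JSON document is then only a few lines. This is just a sketch; the file name and top-level keys are my illustration here, not necessarily what the generator actually emits:

```python
import json

# hypothetical: each post knows the path it was rendered to
links = {post_id: post['path'] for post_id, post in enumerate(all_posts)}

with open('search.json', 'w') as f:
    json.dump({'terms': term_map, 'links': links}, f)
```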
The reason for the indirection is another quick experiment: I originally inserted the paths directly into the term table, and switching to the ID look-up used roughly 80% less space, taking the total from more than 500KB to just over 100KB. This kind of duplication might be spared over-the-wire by gzip, but parsing half a megabyte of JSON client-side is too ugly, even for me. You can check the resulting JSON here.
You can view the source of the search "client" here. But the gist is:
GET/parse search.json
>> prompt for input
results = []
for word in search:
    if word in search_corpus:
        results.append(search_corpus[word])
Obviously the single-term limitation is the most glaring flaw. I haven't yet imagined a clean solution for taking the intersection of multiple term look-ups, but fundamentally it doesn't seem very complicated. Implementing "whole-phrase" look-ups seems much harder given the decomposition I've done on the source text.
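For what it's worth, the intersection half seems as simple as something like the following (Python here for consistency, though the actual client is JavaScript; `search_corpus` is the parsed terms table):

```python
def multi_term_search(terms, search_corpus):
    # one set of post IDs per term the corpus actually contains
    id_sets = [set(search_corpus[term]) for term in terms if term in search_corpus]
    if not id_sets:
        return set()
    # a post matches only if every term points at it
    return set.intersection(*id_sets)
```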
I think implementing stemming could shrink the search corpus, because right now there are a number of similar words that don't add anything. To make stemming work, I would also need to improve the term look-ups to match on keys incompletely (this is fairly easy).
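A naive incomplete-key match could be a prefix scan over the term table; a sketch (a sorted key list plus binary search would be the obvious refinement):

```python
def prefix_lookup(prefix, term_map):
    # linear scan: collect post IDs for every term starting with the prefix
    ids = set()
    for term, post_ids in term_map.items():
        if term.startswith(prefix):
            ids.update(post_ids)
    return ids
```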
I've committed the relevant code to generate the "search corpus" on a branch of my static site generator, mostly as a reference. I make no claims to it being good, only that it works insofar as it gets the above demo working.