
Text Search on a Static Blog

2017-05-14

Static sites are neat, but what if I wanted features of a dynamic website? Searching, for example, could prove interesting or useful but is typically done server-side. Here I detail a fun experiment in implementing search client-side for this very blog.

Keyword extraction

It's hard to beat the shell for quick and dirty text processing; this is the pipeline I first prototyped for exploring a kind of "keyword" extraction algorithm. It doesn't (yet) handle anything like scoring or weighting. In fact, calling it "keyword extraction" is probably too generous. I latched onto the word when I envisioned it as a means to replace tagging blog posts (which it should work for).

tr -c '[:alpha:]' ' ' <post.txt \
| tr ' ' '\n' \
| tr A-Z a-z  \
| sed '/^[a-z]$/d' \
| sort | uniq \
| comm -23 - stop-words.txt \
| sed '/^$/d'

The list of stop words came from this post. In the interest of getting started, I haven't bothered checking how comprehensive the list is and may revisit it later. For now the complete list is as follows:

stopwords          
a but full moreover seeming together
about by further most seems too
above call get mostly serious top
across can give move several toward
after cannot go much she towards
afterwards cant had must should twelve
again co has my show twenty
against computer hasnt myself side two
all con have name since un
almost could he namely sincere under
alone couldnt hence neither six until
along cry her never sixty up
already de here nevertheless so upon
also describe hereafter next some us
although detail hereby nine somehow very
always do herein no someone via
am done hereupon nobody something was
among down hers none sometime we
amongst due herself noone sometimes well
amoungst during him nor somewhere were
amount each himself not still what
an eg his nothing such whatever
and eight how now system when
another either however nowhere take whence
any eleven hundred of ten whenever
anyhow else i off than where
anyone elsewhere ie often that whereafter
anything empty if on the whereas
anyway enough in once their whereby
anywhere etc inc one them wherein
are even indeed only themselves whereupon
around ever interest onto then wherever
as every into or thence whether
at everyone is other there which
back everything it others thereafter while
be everywhere its otherwise thereby whither
became except itself our therefore who
because few keep ours therein whoever
become fifteen last ourselves thereupon whole
becomes fify latter out these whom
becoming fill latterly over they whose
been find least own thick why
before fire less part thin will
beforehand first ltd per third with
behind five made perhaps this within
being for many please those without
below former may put though would
beside formerly me rather three yet
besides forty meanwhile re through you
between found might same throughout your
beyond four mill see thru yours
bill from mine seem thus yourself
both front more seemed to yourselves
bottom          

As a quick check on the size or scale of search terms, I ran the above script over every post on this blog and was left with approximately 4800 unique "terms". While a little high, this is within the realm of possibility for pushing down to the client. Interested in pursuing this further, I re-implemented the above shell pipeline in Python approximately as follows:

import re

with open('post.md') as f:
    post = f.read()

# stopwords_list is the stop word list shown above
stopwords = set(stopwords_list)

def interesting_words(text):
    only_alpha = re.sub(r'[^A-Za-z]', ' ', text)
    single_space = re.sub(r'\s+', ' ', only_alpha)
    words = single_space.split()
    unique_lowercase_words = {word.lower() for word in words}
    return unique_lowercase_words

unique_words = interesting_words(post) - stopwords

I admittedly don't know much about this kind of search algorithm, so my first attempt may be a little lame. Fundamentally, "search" will be a two-step look-up attempting to match a single term to a list of post IDs and then from those IDs to the posts themselves. There's certainly room for improvement in handling "phrases" but I'd like to see how a base case works first.

# pseudo-code that requires linking into the static-site generation
for postID, post in enumerate(all_posts):
    for term in post[unique_words]:
        term_map[term].append(postID)

Which results in something like:

term        IDs
accumulator 3
actually    1,3
add         1,2,3
addr

ID Link
1  2017/little-languages.html
2
3

The reason for the indirection is the result of another quick experiment: I inserted the paths directly into the lookup table, rather than going through an ID lookup, and found the two-step version used about 80% less space, taking the total from more than 500KB to just over 100KB. This kind of duplication might be spared over-the-wire with gzip, but parsing half a megabyte of JSON client-side is too ugly, even for me. You can check the resulting JSON here.
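
Roughly, generating the corpus amounts to building those two tables and dumping them to JSON. Here is a minimal sketch, assuming all_posts is a list of (path, unique_words) pairs produced during site generation; the names, and the 'terms'/'ids' keys, are illustrative rather than the actual generator's:

import json
from collections import defaultdict

def build_corpus(all_posts):
    # first table: term -> list of post IDs
    term_map = defaultdict(list)
    # second table: post ID -> path
    id_map = {}
    for post_id, (path, unique_words) in enumerate(all_posts, start=1):
        id_map[post_id] = path
        for term in unique_words:
            term_map[term].append(post_id)
    return {'terms': term_map, 'ids': id_map}

with open('search.json', 'w') as f:
    json.dump(build_corpus(all_posts), f)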

Client Side

You can view the source of the search "client" here. But the gist is:

GET/parse search.json

>> prompt for input
results = []

for word in search:
  if word in search_corpus:
    results.append(search_corpus[word])
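
The actual client is a small bit of JavaScript, but the same two-step lookup reads naturally as Python. A sketch, assuming the 'terms'/'ids' layout from the corpus-building sketch above (an assumption about the JSON, not the real file):

import json

def search(query, corpus):
    # first lookup: word -> post IDs; second lookup: ID -> link
    links = set()
    for word in query.lower().split():
        for post_id in corpus['terms'].get(word, []):
            links.add(corpus['ids'][str(post_id)])  # JSON object keys load as strings
    return sorted(links)

with open('search.json') as f:
    corpus = json.load(f)

print(search('accumulator', corpus))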

Try It Out

Potential Improvements

Obviously the single-term search limitation is the most glaring flaw. I haven't yet imagined a clean solution to take the intersection of multiple term look-ups, but fundamentally it doesn't seem very complicated (see the sketch below). Implementing "whole-phrase" lookups seems much harder with the decomposition I've done on the source text.
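
For instance, a multi-term query could intersect the ID lists before resolving them to links. A rough sketch, again assuming the 'terms'/'ids' corpus layout used above:

def search_all(query, corpus):
    # one set of post IDs per query word
    id_sets = [set(corpus['terms'].get(word, []))
               for word in query.lower().split()]
    if not id_sets:
        return []
    # keep only posts containing every query word
    matching = set.intersection(*id_sets)
    return sorted(corpus['ids'][str(post_id)] for post_id in matching)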

I think implementing stemming [1] could reduce the size of the search corpus, because right now there are a number of similar words that don't add anything. Similarly, I think in order to implement stemming I would need to improve the term look-ups to match on keys incompletely (this is fairly easy; see the sketch below).
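
Incomplete key matching could be as simple as a prefix scan over the term map. A sketch under the same assumed layout:

def search_prefix(prefix, corpus):
    # match any indexed term that starts with the query prefix
    matching = set()
    for term, post_ids in corpus['terms'].items():
        if term.startswith(prefix.lower()):
            matching.update(post_ids)
    return sorted(corpus['ids'][str(post_id)] for post_id in matching)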

I've committed the relevant code to generate the "search corpus" on a branch of my static site generator, mostly as a reference. I make no claims to it being good, only that it works insofar as it gets the above demo working.


  1. Wikipedia: Stemming, Stanford NLP