Text Search on a Static Blog

2017-05-14

Static sites are neat, but what if I wanted features of a dynamic website? Searching, for example, could prove interesting or useful but is typically done server-side. Here I detail a fun experiment in implementing search client-side for this very blog.

Keyword extraction

It's hard to beat the shell for quick and dirty text processing, this is the pipeline I first prototyped for exploring a kind of "keyword" extraction algorithm. It doesn't (yet) handle anything like scoring or weighting. In fact, calling it "keyword extraction" is probably too generous. I latched onto the word when I envisioned it as a means to replace tagging blog posts (which it should work for).

tr -c '[:alpha:]' ' ' <post.txt \
| tr ' ' '\n' \
| tr A-Z a-z  \
| sed '/^[a-z]$/d' \
| sort | uniq \
| comm -23 - stop-words.txt \
| sed '/^$/d'

The list of stop words came from this post. In the interest of getting started, I haven't bothered checking how comprehensive this list is and may re-visit it later. For now the complete list is as follows:

stopwords
a	but	full	moreover	seeming	together
about	by	further	most	seems	too
above	call	get	mostly	serious	top
across	can	give	move	several	toward
after	cannot	go	much	she	towards
afterwards	cant	had	must	should	twelve
again	co	has	my	show	twenty
against	computer	hasnt	myself	side	two
all	con	have	name	since	un
almost	could	he	namely	sincere	under
alone	couldnt	hence	neither	six	until
along	cry	her	never	sixty	up
already	de	here	nevertheless	so	upon
also	describe	hereafter	next	some	us
although	detail	hereby	nine	somehow	very
always	do	herein	no	someone	via
am	done	hereupon	nobody	something	was
among	down	hers	none	sometime	we
amongst	due	herself	noone	sometimes	well
amoungst	during	him	nor	somewhere	were
amount	each	himself	not	still	what
an	eg	his	nothing	such	whatever
and	eight	how	now	system	when
another	either	however	nowhere	take	whence
any	eleven	hundred	of	ten	whenever
anyhow	else	i	off	than	where
anyone	elsewhere	ie	often	that	whereafter
anything	empty	if	on	the	whereas
anyway	enough	in	once	their	whereby
anywhere	etc	inc	one	them	wherein
are	even	indeed	only	themselves	whereupon
around	ever	interest	onto	then	wherever
as	every	into	or	thence	whether
at	everyone	is	other	there	which
back	everything	it	others	thereafter	while
be	everywhere	its	otherwise	thereby	whither
became	except	itself	our	therefore	who
because	few	keep	ours	therein	whoever
become	fifteen	last	ourselves	thereupon	whole
becomes	fify	latter	out	these	whom
becoming	fill	latterly	over	they	whose
been	find	least	own	thick	why
before	fire	less	part	thin	will
beforehand	first	ltd	per	third	with
behind	five	made	perhaps	this	within
being	for	many	please	those	without
below	former	may	put	though	would
beside	formerly	me	rather	three	yet
besides	forty	meanwhile	re	through	you
between	found	might	same	throughout	your
beyond	four	mill	see	thru	yours
bill	from	mine	seem	thus	yourself
both	front	more	seemed	to	yourselves
bottom

As a quick check on the size or scale of search terms, I ran the above script over every post on this blog and was left with approximately 4800 unique "terms". While a little high, this is within the realm of possibility for pushing down to the client. Interested in pursuing this further, I re-implemented the above shell pipeline in Python approximately as follows:

with open('post.md') as f:
    post = f.read()

stopwords = set(stopwords_list)

def interesting_words(text):
    only_alpha = re.sub(r'[^A-Za-z]', ' ', text)
    single_space = re.sub(r'\s+', ' ', only_alpha)
    words = single_space.split()
    unique_lowercase_words = {word.lower() for word in words}
    return unique_lowercase_words

unique_words = interesting_words(post) - stopwords

I admittedly don't know much about this kind of search algorithm, so my first attempt may be a little lame. Fundamentally, "search" will be a two-step look-up attempting to match a single term to a list of post IDs and then from those IDs to the posts themselves. There's certainly room for improvement in handling "phrases" but I'd like to see how a base case works first.

# pseudo-code the requires linking into the static-site generation
for postID, post in enumerate(all_posts)
    for term in post[unique_words]:
        term_map[term].append(postID)

Which results in something like:

term	IDs
accumulator	3
actually	1,3
add	1,2,3
addr	…

ID	Link
1	2017/little-languages.html
2	…
3	…

The reason for the indirection is a result of another quick experiment. I inserted the paths directly into the lookup table, rather than by ID lookup and found the second lookup used 80% less space, taking the totals from more than 500KB to just over 100KB. This kind of duplication might be spared over-the-wire with gzip, but parsing half a megabyte of JSON client-side is too ugly, even for me. You can check the resulting JSON here.

Client Side

You can view the source of the search "client" here. But the gist is:

GET/parse search.json

>> prompt for input
results = []

for word in search
  if word in search_corpus:
    results.append(search_corpus[word])

Try It Out

Potential Improvements

Obviously the single-term search limitation is the most glaring flaw. I haven't yet imagined a clean solution to take the intersection of 1+ term look-ups, but fundamentally it doesn't seem very complicated. Implementing "whole-phrase" lookups seems much harder with the decomposition I've done on the source text.

I think implementing stemming¹ could improve the size of the search corpus because right now there are a number of similar words that don't add anything. Similarly, I think in order to implement stemming I would need to improve the term-lookups to match on keys incompletely (this is fairly easy).

I've commited the relevant code to generate the "search corpus" on a branch of my static site generator, mostly as a reference. I make no claims to it being good, only that it works in so far as to get the above demo working.

Wikipedia: Stemming, Stanford, NLP