Static sites are neat, but what if I wanted some of the features of a dynamic website? Search, for example, could prove useful but is typically done server-side. Here I detail a fun experiment in implementing search client-side for this very blog.
It's hard to beat the shell for quick-and-dirty text processing; this is the pipeline I first prototyped to explore a kind of "keyword" extraction algorithm. It doesn't (yet) handle anything like scoring or weighting. In fact, calling it "keyword extraction" is probably too generous: I latched onto the word when I envisioned it as a means to replace tagging blog posts (which it should work for).
tr -c '[:alpha:]' ' ' <post.txt |  # replace every non-letter with a space
tr ' ' '\n' |                      # split into one token per line
tr A-Z a-z |                       # lowercase everything
sed '/^[a-z]$/d' |                 # drop single-letter tokens
sort | uniq |                      # sorted, unique terms
comm -23 - stop-words.txt |        # drop stop words (stop-words.txt must be sorted)
sed '/^$/d'                        # drop the empty line left over from splitting
The list of stop words came from this post. In the interest of getting started, I haven't bothered checking how comprehensive the list is and may revisit it later. For now, the complete list is as follows:
stopwords | | | | | |
---|---|---|---|---|---|
a | but | full | moreover | seeming | together |
about | by | further | most | seems | too |
above | call | get | mostly | serious | top |
across | can | give | move | several | toward |
after | cannot | go | much | she | towards |
afterwards | cant | had | must | should | twelve |
again | co | has | my | show | twenty |
against | computer | hasnt | myself | side | two |
all | con | have | name | since | un |
almost | could | he | namely | sincere | under |
alone | couldnt | hence | neither | six | until |
along | cry | her | never | sixty | up |
already | de | here | nevertheless | so | upon |
also | describe | hereafter | next | some | us |
although | detail | hereby | nine | somehow | very |
always | do | herein | no | someone | via |
am | done | hereupon | nobody | something | was |
among | down | hers | none | sometime | we |
amongst | due | herself | noone | sometimes | well |
amoungst | during | him | nor | somewhere | were |
amount | each | himself | not | still | what |
an | eg | his | nothing | such | whatever |
and | eight | how | now | system | when |
another | either | however | nowhere | take | whence |
any | eleven | hundred | of | ten | whenever |
anyhow | else | i | off | than | where |
anyone | elsewhere | ie | often | that | whereafter |
anything | empty | if | on | the | whereas |
anyway | enough | in | once | their | whereby |
anywhere | etc | inc | one | them | wherein |
are | even | indeed | only | themselves | whereupon |
around | ever | interest | onto | then | wherever |
as | every | into | or | thence | whether |
at | everyone | is | other | there | which |
back | everything | it | others | thereafter | while |
be | everywhere | its | otherwise | thereby | whither |
became | except | itself | our | therefore | who |
because | few | keep | ours | therein | whoever |
become | fifteen | last | ourselves | thereupon | whole |
becomes | fify | latter | out | these | whom |
becoming | fill | latterly | over | they | whose |
been | find | least | own | thick | why |
before | fire | less | part | thin | will |
beforehand | first | ltd | per | third | with |
behind | five | made | perhaps | this | within |
being | for | many | please | those | without |
below | former | may | put | though | would |
beside | formerly | me | rather | three | yet |
besides | forty | meanwhile | re | through | you |
between | found | might | same | throughout | your |
beyond | four | mill | see | thru | yours |
bill | from | mine | seem | thus | yourself |
both | front | more | seemed | to | yourselves |
bottom | | | | | |
As a quick check on the scale of the search terms, I ran the above script over every post on this blog and was left with approximately 4800 unique "terms". While a little high, this is within the realm of possibility for pushing down to the client. Interested in pursuing this further, I re-implemented the shell pipeline in Python, approximately as follows:
import re

with open('post.md') as f:
    post = f.read()

# stopwords_list is the stop-word list from the table above
stopwords = set(stopwords_list)

def interesting_words(text):
    # Replace every non-letter with a space, like `tr -c '[:alpha:]' ' '`
    only_alpha = re.sub(r'[^A-Za-z]', ' ', text)
    # Collapse runs of whitespace so the split yields clean tokens
    single_space = re.sub(r'\s+', ' ', only_alpha)
    words = single_space.split()
    # Lowercase and deduplicate in one pass
    unique_lowercase_words = {word.lower() for word in words}
    return unique_lowercase_words

unique_words = interesting_words(post) - stopwords
I admittedly don't know much about this kind of search algorithm, so my first attempt may be a little lame. Fundamentally, "search" will be a two-step look-up: first match a single term to a list of post IDs, then map those IDs to the posts themselves. There's certainly room for improvement in handling "phrases", but I'd like to see how a base case works first.
# pseudo-code that requires linking into the static-site generation
from collections import defaultdict

term_map = defaultdict(list)
for postID, post in enumerate(all_posts):
    for term in post['unique_words']:
        term_map[term].append(postID)
Which results in something like:
term | IDs |
---|---|
accumulator | 3 |
actually | 1,3 |
add | 1,2,3 |
addr | … |

ID | Link |
---|---|
1 | 2017/little-languages.html |
2 | … |
3 | … |
The indirection is the result of another quick experiment: I tried inserting the paths directly into the lookup table rather than going through an ID, and found the two-step look-up used 80% less space, taking the total from more than 500KB to just over 100KB. This kind of duplication might be spared over-the-wire with gzip, but parsing half a megabyte of JSON client-side is too ugly, even for me. You can check the resulting JSON here.
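For concreteness, here's a hedged sketch of how the two tables might get dumped to a single search.json; the `terms`/`links` field names and the `id_to_link` mapping are illustrative guesses on my part, not the real schema.

```python
import json

# Hypothetical serialization of the two lookup tables described above;
# the field names and `id_to_link` dict are illustrative, not the real schema.
corpus = {
    'terms': term_map,    # term -> list of post IDs
    'links': id_to_link,  # post ID -> relative path
}
with open('search.json', 'w') as f:
    json.dump(corpus, f, separators=(',', ':'))  # compact separators keep the file small
```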
You can view the source of the search "client" here. But the gist is:
GET/parse search.json
>> prompt for input
results = []
for word in search:
    if word in search_corpus:
        results.append(search_corpus[word])
Obviously, the single-term search limitation is the most glaring flaw. I haven't yet imagined a clean solution for taking the intersection of multiple term look-ups, but fundamentally it doesn't seem very complicated; a rough sketch follows. Implementing "whole-phrase" look-ups seems much harder with the decomposition I've done on the source text.
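As a starting point, here's a minimal sketch of what an AND-style multi-term search could look like over the two tables described above; the function and its parameter names are my own, not from the actual client.

```python
# Hypothetical AND-style search over the corpus described above.
# `term_map` maps term -> list of post IDs; `id_to_link` maps ID -> path.
def search(query, term_map, id_to_link):
    terms = [t.lower() for t in query.split()]
    if not terms:
        return []
    id_sets = []
    for term in terms:
        if term not in term_map:
            return []  # a term with no posts means nothing matches all terms
        id_sets.append(set(term_map[term]))
    # Intersect the per-term ID sets, then resolve IDs back to links
    matching = set.intersection(*id_sets)
    return [id_to_link[post_id] for post_id in sorted(matching)]
```

An OR-style variant would be the same shape with `set.union` in place of the intersection.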
I think implementing stemming could reduce the size of the search corpus, because right now there are a number of similar words that don't add anything. Similarly, to implement stemming I would need the term look-ups to match keys partially rather than exactly (this is fairly easy; a sketch follows).
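For instance, the partial matching could be as crude as a prefix scan over the term map; this is only a sketch under that assumption, and `prefix_lookup` is a name I made up.

```python
# Hypothetical prefix look-up: a stemmed query like "implement" would also
# match "implementing" and "implements" in the term map.
def prefix_lookup(prefix, term_map):
    post_ids = set()
    for term, ids in term_map.items():
        if term.startswith(prefix):
            post_ids.update(ids)
    return post_ids
```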
I've committed the relevant code to generate the "search corpus" on a branch of my static site generator, mostly as a reference. I make no claims to it being good, only that it works insofar as it gets the above demo working.