[nolan@nprescott.com] $  cat weblog archive feed

Full-Text Search

2021-03-29

Yesterday I implemented an English language stemming algorithm after reading about full-text search. Might as well round-out the project and implement the search as well.

The mechanism for decomposing a query string into search terms is pretty basic:

  1. normalize letter case
  2. split string into words
  3. remove stop words
  4. stem words
  5. deduplicate terms
I hadn't really considered how simple that would be but I'm happy I did the stemming first! In total this took about an hour, most of which was spent waffling about string splitting(!).

String Splitting

This one was funny if only because of how unimportant it is. Plenty of languages have a means to do this built-in but Common Lisp doesn't, there is a community implementation which is probably worth using but there were a confusing number of options based on a cursory look. Instead I went with the more implementation dependent sb-unicode:words, which pretty much does the same as something like Python's String.split but is part of SBCL only.

(defun split (str)
  (let ((ignored (list #\? #\! #\. #\, #\; #\" #\Space #\Newline #\Tab #\Return)))
    (loop for word in (sb-unicode:words str)
          unless (member word ignored :test #'string=)
            collect word)))

Stop Words

I had actually already written this for the stemmer before removing it, I adapted it here to be used as a predicate:

(defun stop-word? (word)
  (let ((stop-words (list "a" "about" "above" "after" "again" "against" "all" "am" "an" "and" "any"
                          "are" "as" "at" "be" "because" "been" "before" "being" "below" "between"
                          "both" "but" "by" "can" "did" "do" "does" "doing" "don" "down" "during" "each"
                          "few" "for" "from" "further" "had" "has" "have" "having" "he" "her" "here"
                          "hers" "herself" "him" "himself" "his" "how" "i" "if" "in" "into" "is" "it"
                          "its" "itself" "just" "me" "more" "most" "my" "myself" "no" "nor" "not" "now"
                          "of" "off" "on" "once" "only" "or" "other" "our" "ours" "ourselves" "out"
                          "over" "own" "s" "same" "she" "should" "so" "some" "such" "t" "than" "that"
                          "the" "their" "theirs" "them" "themselves" "then" "there" "these" "they"
                          "this" "those" "through" "to" "too" "under" "until" "up" "very" "was" "we"
                          "were" "what" "when" "where" "which" "while" "who" "whom" "why" "will" "with"
                          "you" "your" "yours" "yourself" "yourselves")))
    (member word stop-words :test #'string=)))

Stemming

My stemmer operates on a single word, so I need a function to operate on a list of words:

(defun stem-tokens (words)
  (loop for word in words collect (stem word)))

All Together

(defun analyze (str)
  (let* ((lowercased (string-downcase str))
         (individual-words (split lowercased))
         (interesting-words (remove-if #'stop-word? individual-words))
         (stemmed (stem-tokens interesting-words))
         (analyzed (remove-duplicates stemmed :test #'string=)))
    analyzed))

The usage is pretty uninteresting:

CL-USER> (analyze "Capitol of Texas")
("capitol" "texa")

CL-USER> (analyze "wage labour and capital")
("wage" "labour" "capit")

CL-USER> (analyze "bureau of labor statistics")
("bureau" "labor" "statist")

CL-USER> (analyze "plate of donuts")
("plate" "donut")

Index and Document Store

While the index might permanently reside in memory I think it is obvious the data store would instead be on-disk. I also think the specifics should be determined by the use. In this spirit I didn't implement a real data store, or pursue the example of parsing abstracts from Wikipedia. As such, you'll have to survive my obviously toy example. I think it still proves interesting enough that I'm writing it up here.

The "documents" here are artificial so I've created a very minimal structure:

(defstruct (document) id text)

In the case of my prior work I think the text would be post bodies or in the case of the reference project it'd be Wikipedia abstracts. Here I'll just create a tiny data store (a list) of documents with some text I've written:

(defparameter *DATASTORE*
  (list (make-document :id 1 :text "The volume of the great lakes is more than a glass of water.")
        (make-document :id 2 :text "A donut on a glass plate. Only the donuts.")
        (make-document :id 3 :text "A lake shaped like a donut")
        (make-document :id 4 :text "Catfish rule the lakes and rivers with an iron fin")
        (make-document :id 5 :text "Doing donuts in an empty parking lot")
        (make-document :id 6 :text "Mountains of pancakes. Rivers of syrup. Land O Lakes butter.")))

As mentioned, the index could conceivably remain in-memory, so I kept the hash table approach, where the keys are strings and the values will be lists of integer IDs:

(defparameter *INDEX* (make-hash-table :test #'equal))

Populating the index is as simple as iterating a list of document (or my data store) and adding the ID per document to their unique terms (the result of the analyze function):

(defun index (documents)
  (loop for doc in documents do
    (loop for token in (analyze (document-text doc)) do
      (pushnew (document-id doc) (gethash token *INDEX* '())))))

The pushnew function adds a value (the document ID) to a "place" if it isn't already present, achieving a kind of "set add" functionality. The "place" here is the value-side of a hash table, with a default of an empty list. This is conceptually similar to a defaultdict in Python.

Querying

Querying the index means analyzing the query string and getting the corresponding document IDs from the index data structure (hash table).

(defun query (str)
  (let ((indices (loop for token in (analyze str)
                       when (gethash token *INDEX*)
                         collect it)))
    (reduce #'intersection (or indices (list NIL)))))

I have to admit how pleased I was with this simple little function; it demonstrates a number of neat features of Common Lisp. First is the ability to test a value within the loop-macro to conditionally collect it. I only just discovered this and it obviated the need for me to roll this myself with local variables through let-bindings.

Due to the structure of the hash table, the result value is a list of lists like: ((2 3) (3) (3)). In order to return the intersection of documents for a given query string I wanted to use intersection, but ran afoul of the case where a single list was returned: ((2 3)). In what felt like a nice and functional fix I use reduce to handle the variable number of arguments with some special-case handling if there are no results (the or function will produce a list of a single empty list if there were no matching indexes).

Querying the data store I've included above returns a single flat list of document IDs with matched text.

CL-USER> (query "plate of donuts")
(2)

CL-USER> (query "donuts")
(5 3 2)

CL-USER> (query "glass")
(2 1)

I think the desired result depends on the data store and use-case, I can imagine a scenario where the IDs returned are primary keys to a database to which a SQL query might be made. Or maybe the IDs map to hyperlinks like in my first foray. Here I wrapped the query function to grab the text from the data store to see how things were working in practice.

(defun fts (query-string)
  (loop for doc-id in (query query-string)
        do (format t "ID: ~s TEXT: ~a~%"
                   doc-id
                   (document-text (nth (- doc-id 1) *DATASTORE*)))))

What might be obvious is that I didn't really consider ID look-ups in my data store, so I have to use ID-1 because I wrote 1-based indexes. No matter. Here's the result:

CL-USER> (fts "great lakes")
ID: 1 TEXT: The volume of the great lakes is more than a glass of water.

CL-USER> (fts "lakes and rivers")
ID: 4 TEXT: Catfish rule the lakes and rivers with an iron fin
ID: 6 TEXT: Mountains of pancakes. Rivers of syrup. Land O Lakes butter.

CL-USER> (fts "donut")
ID: 5 TEXT: Doing donuts in an empty parking lot
ID: 3 TEXT: A lake shaped like a donut
ID: 2 TEXT: A donut on a glass plate. Only the donuts.

CL-USER> (fts "plate of donuts")
ID: 2 TEXT: A donut on a glass plate. Only the donuts.

Thoughts

This was a nice finishing touch to the stemming algorithm project. Although it's a bit spread out above, the whole full-text search is about a screenful of code. Paired with the stemmer, the result is zero-dependency full-text search in about 500 lines of code. Further enhancements might be taken from this similar looking full-text search example which touches on frequency and relevancy.