indexpost archiveatom feed syndication feed icon

Language Stemming

2021-03-28

I read Let's build a Full-Text Search engine with interest and ended up on a tangent to implement an English language stemmer.

I think what interested me especially was seeing how close my previous efforts led me, albeit without all the necessary background. While I created a similar bit of indirection I didn't actually know it was called an inverted index. It was satisfying to find the comparison to an index in a book since that is exactly what I had in mind at the time.

I previously said:

I think implementing stemming could improve the size of the search corpus because right now there are a number of similar words that don't add anything. Similarly, I think in order to implement stemming I would need to improve the term-lookups to match on keys incompletely (this is fairly easy).

It hadn't occurred to me to build the index off of the stemmed words instead. That is a nice trick that would solve the problem more cleanly than anything I had in mind. Also interesting is the idea of returning the intersection of results for multiple search terms instead of any kind of "compound term".

Stemming, Briefly

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Not unlike my own previous attempts the original post says the following:

Implementing a stemmer is a non-trivial task, it's not covered in this post. We'll take one of the existing modules

While I don't really know the Go programming language I read the linked stemming software with interest. I found that when paired with the Snowball documentation it was almost understandable!

Checking My Understanding

It is one thing to think I understand something and entirely different to know so. In pursuit of this I thought it was time to circle back through implementing things myself. Who is to say what is or isn't trivial, right?

Simultaneously, I've been thinking about reducing software dependencies and how difficult (or easy!) that might be in practice. While I've thought about it before I don't think I have enough experience yet to understand the broader implications of vendoring-in or porting software. Here was a perfect opportunity to gain some experience.

The Result

As might be expected, I did manage to port a subset of the linked Snowball Go module to Common Lisp. I limited my efforts to just the English stemmer. I first had a nearly direct port before realizing there were a number of extraneous parts when considering only English — consequently I stripped those out.

The process was generally painless and took a bit more than an afternoon. Most interesting in retrospect has been the effect of the source-style on the resulting port. I'm not unhappy with result but the code does not read as very idiomatic to me. I read the Go-style and picked a basic CLOS-oriented approach, never once considering a more aggressive refactor out of the gate. I think the end result of this though is a bit of a blindered view of how things might have been.

I was happy enough with how brief the translation was but noticed afterward that there is an older port of the Porter (1) algorithm to Common Lisp which is similar enough to be compared and found it surprisingly short. The author clearly read the C source code and consequently the string handling has a more "pointer-y" feel to it.

I think, if I were to pursue this further, it would be interesting to begin refactoring incrementally now that there is a working implementation. Unfortunately, I don't know that I have sufficient experience for a tasteful Common Lisp port. I'm at the point where I can see that my implementation is on the bloated side of things but I don't see an immediate approach to shaving it down.

Thoughts

It is nice to peek into the black-box of libraries and gain a better understanding of what is involved. I'm still in the camp of "Why not do it yourself?". Sure, this was a small (if not entirely trivial) self-contained problem; but how many things might fit that category?

With a direct port from a more limited language (Go) into a more expressive language (Common Lisp) I think the demonstrated rate (call it a few hundred lines of code a day) sounds about right. Further refinement or refactoring could take weeks, but isn't really vital. If we're talking about doing this sort of thing for "real" projects I don't yet have ironclad arguments for or against but I think I'll probably keep pursuing this tack in my own projects.