Common Lisp Refactoring

2021-04-01

While I may have hinted that I wasn't going to; I went ahead and tried cleaning up my recent port of a Go module to Common Lisp.

Mostly Uninteresting

I won't bore anyone with too many details, most of the changes revolve around using more idiomatic "phrases" rather than the near copy-paste of Go control flow. Specifically I ended up removing many, many NILs after a few changes for better defaults (and consequently probably more implicit flow). The full code is still available here.

A Single Macro

One thing that irked me as soon as I finished the first pass and got the test running was the hideous levels of duplication around all the string comparisons. As an example, this is what one portion looked like:

(cond 
  ((string= str "skis") "ski")
  ((string= str "skies") "sky")
  ((string= str "dying") "die")
  ((string= str "lying") "lie")
  ((string= str "tying") "tie")
  ...)

While it is easy enough in this case to make a function that accomplishes the same with less typing it wasn't long before I set my eyes on how similar the above was to cases like this:

(cond 
  ((string= str "foo") "bar")
  ((string= str "foos") "bar")
  ((string= str "fooses") "bar")
  ...)

Where it would obviously be preferable to collapse them for cases where the result is the same. There's an obvious way to do this for lists with the loop macro:

(loop for s in '()
   thereis (string= comparison-str s))

The only downside to this is that it limits the handling of single values like the first case, where even if they don't have to be wrapped in a list explicitly there would still be an unnecessary loop to handle it. I eventually realized I really wanted a lightly decorated cond to handle any of strings, lists of strings, or booleans. The boolean case is for the potential default values like on a normal cond.

I wrote the following macro to try and address each of those cases in turn without overhead for each call:

(defmacro string-switch (arg (&rest key-values))
  `(cond
     ,@(loop for (k v) in key-values
          collect
            `(,(etypecase k
                 (CONS `(loop for s in ',k thereis (string= ,arg s)))
                 (STRING `(string= ,arg ,k))
                 (BOOLEAN `,k))
               ,v))))

The basic idea is to dispatch on the type of the first argument in the pairs (called "key-values") above. The idea is that the overhead for the first case, where it maps from a string to a another string is low, but the addition of more match-able strings or resulting expressions instead of strings remains consistent. As an example demonstrating each case:

(string-switch "comparison string"
  (("foo" "bar")
   (("baz" "qux") "blah")
   (t (if (> 3 5)
          'any-expression
          'any-at-all))))

Expanding the above may help to explain, the result looks like this:

(COND ((STRING= "comparison string" "foo") "bar")
      ((LOOP FOR S IN '("baz" "qux")
          THEREIS (STRING= "comparison string" S))
       "blah")
      (T
       (IF (> 3 5)
           'ANY-EXPRESSION
           'ANY-AT-ALL)))

With each of these cases covered it fell into place that a number of different call sites could be unified so that much of the string handling at least looks alike. I'm unsure if this is really good style, but the effect is pleasant enough to me.

Demonstration

As a concrete example, below are two pieces of functionally equivalent code. The first, without the macro, the second with.

(defun step-1a (word)
  (let ((suffix (first-suffix word (list "sses" "ied" "ies" "us" "ss" "s"))))
    (cond ((loop for s in '("ssess") thereis (string= suffix s))
           (progn
             (replace-suffix word suffix "ss")
             t))
          ((loop for s in '("ies" "ied") thereis (string= suffix s))
           (let ((replacement-string ""))
             (if (> (length (str word)) 4)
                 (setf replacement-string "i")
                 (setf replacement-string "ie"))
             (replace-suffix word suffix replacement-string)
             t))
          ((loop for s in '("us" "ss") thereis (string= suffix s))
           nil)
          ((loop for s in '("s") thereis (string= suffix s))
	   ;; Delete if the preceding word part contains a vowel not
	   ;; immediately before the s (so gas and this retain the s,
	   ;; gaps and kiwis lose it)
           (loop for i across (subseq (str word) 0 (max 0 (- (length (str word)) 2)))
                 if (is-lower-vowel i)
                   do (remove-last-n-chars word (length suffix))
                      (return t)))
          (t nil))))

(defun step-1a (word)
  (let ((suffix (first-suffix word (list "sses" "ied" "ies" "us" "ss" "s")))
        (word-len (length (str word))))
    (string-switch suffix
                   (("sses" (replace-suffix word suffix "ss"))

                    (("ies" "ied") (if (> word-len 4)
                                       (replace-suffix word suffix "i")
                                       (replace-suffix word suffix "ie")))

                    (("us" "ss") nil)

                    ;; Delete if the preceding word part contains a vowel not
                    ;; immediately before the s (so gas and this retain the s,
                    ;; gaps and kiwis lose it)
                    ("s" (loop for char across (subseq (str word) 0 (max 0 (- word-len 2)))
                            if (lower-vowel? char)
                            return (strip-suffix word suffix)))))))

Of course there are incidental fixes as well, the first snippet is (approximately) what I had and the second is what exists as of this writing. The point is only that I think the second is better.

Thoughts

I'm sure there are plenty of arguments to be made against writing a macro for this. Sure, most of the above could be handled with functions but I think the macro is necessary to defer evaluation in at least a few cases and reduce the "noise" of argument nesting and unpacking.

For now this feels like a logical conclusion to my work on this port. I think this is a local maximum for enhancements to be made without a more invasive refactoring of data structures and the string handling. I'm not unhappy with how it looks and it seems more readable to me now (though I did write it); however I think there is serious room for improvement, especially considering the wealth of options in macros and readtables.

Doing more reading in the Snowball documentation I think the DSL that has been developed there is pretty good! If there were any further improvements to be made here it would be to make the algorithm look more like Snowball. Specifically, you might compare how I have written step 1b to this, from the linked page:

    define Step_1b as (
        [substring] among (
            'eed' 'eedly'
                (R1 <-'ee')
            'ed' 'edly' 'ing' 'ingly'
                (
                test gopast v  delete
                test substring among(
                    'at' 'bl' 'iz'
                         (<+ 'e')
                    'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
                    // ignoring double c, h, j, k, q, v, w, and x
                         ([next]  delete)
                    ''   (atmark p1  test shortv  <+ 'e')
                )
            )
        )
    )