
Your Test Framework is Making Things Worse


I have been using pytest for a few different projects recently, and while I've found it very usable, I have also found it to be very complicated.

In attempting to understand how a specific interaction worked, I ended up digging into the source, and what I found surprised me:

Language  files  blank  comment   code
Python      132   7082    12081  26760

Twenty-six thousand lines of code in order to — write tests? I'm not entirely naive; I understand many of the conveniences that pytest provides, but I have to wonder if we aren't causing ourselves the very problems we are trying to solve. Like, "I have to import 565 lines of code in order to format the output of my test run in the same manner as Java's JUnit" — just, why?

Inevitably it is because there is some second- or third-order tool that can consume the output without our having to write a parser ourselves. But how complicated would that be, if the intent was only to solve our own problem? How much would you actually need if you had to rewrite these things yourself?

While it has been a while since I used Forth for much, one thing I remember fondly is the test harness. The actual test "framework" is about 50 lines of very sparse code. I don't want to give the wrong impression: the two projects are not close to equivalent. I have no doubt that pytest has infinitely more in it, is capable of doing just about anything, and makes it look easy, at least on the surface. While it might be difficult to understand the Forth test harness without some familiarity with Forth, it can at least be informative to browse the tests that accompany it in the repository.

All of it is testable with a 50-line test framework. While this is surely one aspect of a language culture that maintains what is almost an allergy to "over-engineering", I think it is a hint that things don't have to be this hard. Are Forth programs simply more testable, a reflection of the language itself? Or has the Python test harness grown inordinately large by trying to be one-size-fits-all? I'm not actually sure; I would guess it is a bit of both.

How It Works

T{ 1 2 3 * + -> 7 }T

There are three words to understand here:

  T{  begins a test, recording the current depth of the stack
  ->  marks the end of the code under test, setting aside whatever results it left on the stack
  }T  compares those results against the expected values and reports any mismatch

And as you might expect, error reporting is pretty sparse, but because of how focused the tests are they're workable:

T{ 1 2 3 * + -> 999 }T
INCORRECT RESULT: T{ 1 2 3 * + -> 999 }T ok

T{ 1 2 3 * + -> 1 2 3 }T
WRONG NUMBER OF RESULTS: T{ 1 2 3 * + -> 1 2 3 }T ok
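To make the mechanism concrete, here is a small Python model of the harness — my own construction for illustration, not a translation of the actual Forth code. T{ snapshots the stack depth, -> sets aside the results of the code under test, and }T compares them against the expected values, producing the same three outcomes shown above:

```python
class Harness:
    def __init__(self):
        self.stack = []       # the data stack of the toy interpreter
        self.depth = 0
        self.saved = []

    def t_open(self):         # T{  remember how deep the stack is
        self.depth = len(self.stack)

    def arrow(self):          # ->  set aside the results of the code under test
        self.saved = self.stack[self.depth:]
        del self.stack[self.depth:]

    def t_close(self):        # }T  compare saved results with expected values
        expected = self.stack[self.depth:]
        del self.stack[self.depth:]
        if len(self.saved) != len(expected):
            return "WRONG NUMBER OF RESULTS"
        if self.saved != expected:
            return "INCORRECT RESULT"
        return "ok"

# T{ 1 2 3 * + -> 7 }T
h = Harness()
h.t_open()
h.stack += [1, 2, 3]
h.stack[-2:] = [h.stack[-2] * h.stack[-1]]   # *
h.stack[-2:] = [h.stack[-2] + h.stack[-1]]   # +
h.arrow()
h.stack.append(7)
print(h.t_close())   # prints: ok
```

That's the entire trick: because everything flows through the stack, "compare actual to expected" is a list comparison, and the whole harness fits in your head.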

An Example

The above might be a cute example of how a ridiculously simple "test" can be done in Forth, but as a demonstration of its utility I thought to apply it to another problem I've written about before: a linear congruential generator.

As a refresher, I was matching the following function signature:

unsigned char *LCG(unsigned char *data, int dataLength, unsigned char initialValue)


: lcg ( n -- n )

VARIABLE value
VARIABLE length
VARIABLE output

: generator ( data length value -- addr length )
  value !
  length !
  length @ ALLOCATE throw output !

  length @ 0 ?DO
    value @ lcg          \ step the generator
    DUP value !          \ remember the new state
    over I + C@ XOR      \ XOR with the I'th input byte
    output @ I + C!      \ store the I'th output byte
  LOOP

  drop                   \ discard the input address
  output @ length @ ;
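The same construction in Python shows why a single word works in both directions: the output is just the input XORed with a keystream derived only from the initial value, so applying it a second time cancels out. The LCG constants below are illustrative assumptions (chosen so that lcg maps 2 to 13, as in the single-value test), not necessarily the ones from the original post:

```python
def lcg(n):
    # hypothetical constants; the round-trip property holds for any choice
    return (5 * n + 3) % 256

def generator(data, initial_value):
    out = bytearray()
    value = initial_value
    for byte in data:
        value = lcg(value)        # step the generator
        out.append(byte ^ value)  # XOR the keystream byte into the data
    return bytes(out)

ciphertext = generator(b"apple", 55)
assert generator(ciphertext, 55) == b"apple"   # the same call inverts itself
```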

From there, I can write several tests: first to verify the lcg word works for single values, then, based on the original prompt, to assert that the generator works in both directions, and finally to demonstrate a negative case:

data                  dataLength  initialValue  result
apple                 5           55            \xF3\x93\x68\x2D\xCB
\xF3\x93\x68\x2D\xCB  5           55            apple
TESTING lcg works for single values

T{ 02 lcg -> 13 }T ok

TESTING generator in both forward and reverse

T{ s" apple" 5 55 
   s\" \xF3\x93\x68\x2D\xCB" str= -> true
}T ok

T{ s\" \xF3\x93\x68\x2D\xCB" 5 55
   s" apple" str= -> true
}T ok

TESTING a negative case, string does not match generated value

T{ s\" \xF3\x93\x68\x2D\xCB" 5 55
   s" foo bar" str= -> false
}T ok

Why It Works

In Python, all sorts of design flaws can be masked under the guise of testability provided by a framework that allows you to mock out the dependencies of your dependencies, or monkey-patch a library call at run-time.

While Python purports to support the idea that "There should be one-- and preferably only one --obvious way to do it," I have found Forth much more hard-line about what is and isn't supported by libraries or the core language. Rather than accommodating tight coupling and masking it with a framework, the tests are intentionally simple because the interfaces are simple. Forth's stack-based programming forces a very particular approach to problem-solving that tends toward doing the obvious thing.