
Your Test Framework is Making Things Worse


I have been using pytest for a few different projects recently, and while I've found it very usable, I have also found it to be very complicated.

In attempting to understand how a specific interaction worked, I ended up digging into the source, and what I found surprised me:

Language  files  blank  comment   code
Python      132   7082    12081  26760

Twenty-six thousand lines of code in order to — write tests? I'm not entirely naive; I understand many of the conveniences that pytest provides, but I have to wonder if we aren't causing ourselves the very problems we are trying to solve. Like, "I have to import 565 lines of code in order to format the output of my test run in the same manner as Java's JUnit" — just, why?

Inevitably it is because there is some second- or third-order tool that can consume the output without our having to write a parser ourselves. But how complicated would that be, if the intent was only to solve our own problem? How much would you actually need if you had to rewrite these things yourself?

While it has been a while since I used Forth for much, one thing I remember fondly is the test harness. The actual test "framework" is about 50 lines of very sparse code. I don't want to give the wrong impression: the two projects are not close to equivalent. I have no doubt that pytest has infinitely more in it, is capable of doing just about anything, and makes it look easy, at least on the surface. While it might be difficult to understand the Forth test harness without some familiarity with Forth, it can at least be informative to browse the tests that accompany it in the repository.

All of it is testable with a 50-line test framework. While this is surely one aspect of a language culture that maintains what is almost an allergy to "over-engineering", I think it is a hint that things don't have to be this hard. Are Forth programs simply more testable, a reflection of the language itself? Or has the Python test harness grown inordinately large by trying to be one-size-fits-all? I'm not actually sure; I would guess it is a bit of both.

How It Works

T{ 1 2 3 * + -> 7 }T

There are three words to understand here:

  T{  begins a test, recording the current depth of the stack
  ->  marks the end of the code under test, setting aside whatever results it left on the stack
  }T  compares those results against the expected values and reports any mismatch

And as you might expect, error reporting is pretty sparse, but because of how focused the tests are they're workable:

T{ 1 2 3 * + -> 999 }T
INCORRECT RESULT: T{ 1 2 3 * + -> 999 }T ok

T{ 1 2 3 * + -> 1 2 3 }T
WRONG NUMBER OF RESULTS: T{ 1 2 3 * + -> 1 2 3 }T ok
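To make the mechanism concrete, here is a small Python model of the harness — my own construction for illustration, not a translation of the actual Forth code. T{ snapshots the stack depth, -> sets aside the results of the code under test, and }T compares them against the expected values, producing the same three outcomes shown above:

```python
class Harness:
    def __init__(self):
        self.stack = []       # the data stack of the toy interpreter
        self.depth = 0
        self.saved = []

    def t_open(self):         # T{  remember how deep the stack is
        self.depth = len(self.stack)

    def arrow(self):          # ->  set aside the results of the code under test
        self.saved = self.stack[self.depth:]
        del self.stack[self.depth:]

    def t_close(self):        # }T  compare saved results with expected values
        expected = self.stack[self.depth:]
        del self.stack[self.depth:]
        if len(self.saved) != len(expected):
            return "WRONG NUMBER OF RESULTS"
        if self.saved != expected:
            return "INCORRECT RESULT"
        return "ok"

# T{ 1 2 3 * + -> 7 }T
h = Harness()
h.t_open()
h.stack += [1, 2, 3]
h.stack[-2:] = [h.stack[-2] * h.stack[-1]]   # *
h.stack[-2:] = [h.stack[-2] + h.stack[-1]]   # +
h.arrow()
h.stack.append(7)
print(h.t_close())   # prints: ok
```

That's the entire trick: because everything flows through the stack, "compare actual to expected" is a list comparison, and the whole harness fits in your head.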

An Example

The above might be a cute example of how a ridiculously simple "test" can be done in Forth, but as a demonstration of its utility I thought to apply it to another problem I've written about before: a linear congruential generator.

As a refresher, I was matching the following function signature:

unsigned char *LCG(unsigned char *data, int dataLength, unsigned char initialValue)


: lcg ( n -- n )

VARIABLE value
VARIABLE length
VARIABLE output

: generator ( data length value -- addr length )
  value !
  length !
  length @ ALLOCATE throw output !

  length @ 0 ?DO
    value @ lcg          \ step the generator
    DUP value !          \ remember the new state
    over I + C@ XOR      \ XOR with the I'th input byte
    output @ I + C!      \ store the I'th output byte
  LOOP

  drop                   \ discard the input address
  output @ length @ ;
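The same construction in Python shows why a single word works in both directions: the output is just the input XORed with a keystream derived only from the initial value, so applying it a second time cancels out. The LCG constants below are illustrative assumptions (chosen so that lcg maps 2 to 13, as in the single-value test), not necessarily the ones from the original post:

```python
def lcg(n):
    # hypothetical constants; the round-trip property holds for any choice
    return (5 * n + 3) % 256

def generator(data, initial_value):
    out = bytearray()
    value = initial_value
    for byte in data:
        value = lcg(value)        # step the generator
        out.append(byte ^ value)  # XOR the keystream byte into the data
    return bytes(out)

ciphertext = generator(b"apple", 55)
assert generator(ciphertext, 55) == b"apple"   # the same call inverts itself
```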

From there, I can write several tests: first to verify the lcg word works for single values, then, based on the original prompt, to assert that the generator works in both directions, and finally to demonstrate a negative case:

data                  dataLength  initialValue  result
apple                 5           55            \xF3\x93\x68\x2D\xCB
\xF3\x93\x68\x2D\xCB  5           55            apple
TESTING lcg works for single values

T{ 02 lcg -> 13 }T ok

TESTING generator in both forward and reverse

T{ s" apple" 5 55 
   s\" \xF3\x93\x68\x2D\xCB" str= -> true
}T ok

T{ s\" \xF3\x93\x68\x2D\xCB" 5 55
   s" apple" str= -> true
}T ok

TESTING a negative case, string does not match generated value

T{ s\" \xF3\x93\x68\x2D\xCB" 5 55
   s" foo bar" str= -> false
}T ok

Why It Works

In Python, all sorts of design flaws can be masked under the guise of testability provided by a framework that allows you to mock out the dependencies of your dependencies, or monkey-patch a library call at run-time.

While Python purports to support the idea that "There should be one-- and preferably only one --obvious way to do it," I have found Forth much more hard-line about what is and isn't supported by libraries or the core language. Rather than accommodating tight coupling and masking it with a framework, the tests are intentionally simple because the interfaces are simple. Forth's stack-based programming forces a very particular approach to problem-solving that tends toward doing the obvious thing.