[nolan@nprescott.com] $>  cat blog archive feed

Linear Congruence Generator (in Forth)

2016-08-05

I've spent some time interviewing recently, the result of which is an interesting assortment of questions and problems. Here's one I had not seen before, a Linear Congruence Generator.

Why?

I did a bit of research to find that linear congruence generators are often used as pseudo-random number generators. This page explains in greater depth, but it seems LCGs are common enough partly because of how easy they are to implement. The example given below is not great in terms of randomness because it is only 8-bit. The linked page gives some context for challenges in scaling to 32-bit lengths and handling integer overflows, the same challenges apply in Forth, and I will be ignoring them to simplify the syntax (no use of CELLS or 2VARIABLE etc.).

The Prompt

A common technique for obfuscating data is to use exclusive‐or (xor) and some key; it is inexpensive and symmetric. A problem occurs when this is used on file formats like portable executable where there are lots of nulls since xor'ing nulls with your key ends up writing your key out.

A slightly more complex algorithm is to implement a linear congruence generator (LCG) to generate a key stream which will be xor'ed with the data. The generic form of an LCG is:

Value = (M * Value + A) % n

where M is the multiplicative constant, A is the additive constant, and n is the modulus. These three values are the parameters of the LCG.

For this problem I will be using an 8‐bit LCG (i.e. n = 256), with M = 0xa5 and A = 0xc9. The LCG is initialized with a value and stepped to produce the key stream. The next byte of key is read from the LCG each step.

For example, if the initial value of the LCG is 0x02, then the next value, which would be the first key byte, would be:

(0xa5 * 0x02 + 0xc9) % 256 = 0x13.

The function should adhere to the following signature:

unsigned char *LCG(unsigned char *data, int dataLength, unsigned char initialValue)

Example Tests:

data dataLength initialValue result
“apple 5 0x55 “\xF3\x93\x68\x2D\xCB”
“\xF3\x93\x68\x2D\xCB” 5 0x55 "apple"

Getting in the Right Frame of Mind

The first thing I notice, having read Starting Forth and reading Thinking Forth is that it makes no sense to mix two number bases unnecessarily. There is no value in keeping a single base-10 number (256) simply because it is a common expression for 8-bit computing. So I'll go ahead and change that to hex, where it is the equally friendly 100.

With that out of the way, the whole operation is simply:

HEX ( change base )  ok
A5 02 *  ok
C9 +  ok
100 MOD ( the result is now on the stack, I'll just pop it to verify )  ok
. 13  ok

Moving Beyond the "Stack Calculator" Style

Huh, well that was pretty simple. I've broken it down in to a single arithmetic step per line, with the final line popping the stack to check the result (which matches the example given). The next step should probably be to create a logical word for this, along with a stack comment.

HEX 
: lcg ( n -- n )
  A5 *
  C9 +
  100 MOD ;

This doesn't yet match the requested function signature (which I am obviously taking some liberties with, using Forth):

unsigned char *LCG(unsigned char *data, int dataLength, unsigned char initialValue)

I will be using a Forth-ism and place the starting address and length of the result on the stack, as a means of returning a "string". Forth doesn't support strings as first-class objects and treats them more like C (arrays). The easiest way to then check the result is with the words type or dump, type is best with text strings, dump is more useful for the hex-encoded reverse operation.

VARIABLE value
VARIABLE length
VARIABLE output

: stepper ( data length value -- addr length )
  value !
  length !
  length @ ALLOCATE throw output !

  0 ?DO
    value @ lcg
    DUP value !
    over I + @ XOR
    output I + !
  LOOP DROP

  output length @ ;

The above code correctly handles allocating memory for the output string, but I haven't bothered to deal with freeing on completion. I think the idea is pretty straight-forward, simply check for an existing output and free before re-allocating or alternatively, allocate a dummy value, possibly a zero-length and re-size within stepper.

Alpha to Hex

s" apple" 5 55  ok
stepper  ok
.s <2> 1093834C8 5  ok
dump 
1093834C8: F3 93 68 2D  CB          -                           ..h-.
 ok

Reverse, Hex to Alpha

s\" \xF3\x93\x68\x2D\xCB" 5 55  ok
stepper ok
.s <2> 1093834C8 5  ok
dump 
1093834C8: 61 70 70 6C  65          -                           apple
 ok

The one bit that tripped me up at first was Gforth's syntax for escaping hex characters within a string syntax. Rather than the string word s" I had to find s\" which allows for the hex escape character \x. I am using dump rather than type in the example above because type doesn't play nicely with trying to print the hex F3 93 68 2D CB.

Thoughts

This was a fun way to dig into Forth a bit after reading about it in books. The syntax is jarring at first but is a low barrier to entry with the language. The bigger barrier is getting into a frame of mind to think about memory addresses directly and iterating over them without creating unnecessary temporary variables. I am not sure I've got the hang of that yet, my above code does a bit of stack juggling, which isn't terrible but I think I could improve it by factoring it out further.

As far as pseudo-random number generators are concerned, I don't think I deeply understand the math involved. I have seen Mersenne Twisters used as a popular means of PRNG but the math involved is even more opaque than the LCG. I think I might need to bone-up on the necessary maths background before gaining further insight on this end.

[nolan@nprescott.com] $> █