[nolan@nprescott.com] $>  cat blog archive feed

A Brainfuck Interpreter in Forth

2016-09-13

Recently interested in learning more about virtual machines and interpreters, I decided trying to implement one was a good way to learn more.

What is brainfuck?

Any character not ><+-.,[] is ignored.

Brainfuck is represented by an array with 30,000 cells initialized to zero and a data pointer pointing at the current cell.

There are eight commands:

[ and ] form a while loop. Obviously, they must be balanced.

Why Forth

More than anything, I am interested in learning more about Forth as a language and have heard a lot about how it is ideally suited to building your own application specific DSLs. Brainfuck is a silly toy problem with a very small mental footprint, it's pretty well defined and leaves me only with the decisions made in how to implement it.

Stumbling Blocks

I am very new to programming Forth and struggled after getting 3/4ths the way through implementing an interpreter when I realized the implications of brainfuck's looping structure. Originally, I had read the spec and concluded I could use Forth's native looping capability, specifically BEGIN WHILE REPEAT; I had 6 of the 8 operations working before it dawned on me that those looping commands rely on being compiled within a word and can't be invoked from the interpreter. As I understand it, this is a result of how Forth compiles loops down to jumps between memory addresses.

A Decision

The only way that I could see to resolve this mismatch in my mental model for the interpreter was to implement my own looping controls and tracking nested loops manually, which sounded like a lot of bother.

Instead, I switched gears and decided I'd try and instead compile brainfuck programs down to Forth words and let the Forth compiler do the right thing1.

The Interpreter (Compiler)

30000 CONSTANT mem-size
CREATE bf-mem mem-size ALLOT

: init ( -- addr n ) bf-mem mem-size 0 FILL   \ zero out memory
                     bf-mem 0 ;

\ the value of the current BF memory cell is held on the stack until a "shift"
\ occurs, when it is written to the address also being carried on the stack
\ (prior to incrementing the address)

: shift> ( addr n -- addr n ) OVER C! 1+ DUP C@ ;
: <shift ( addr n -- addr n ) OVER C! 1- DUP C@ ;

For those unfamiliar, the word for ! is set and @ is fetch, a set operation takes a number and then an address. Shifts are then best described as: set value (carried on the stack), move 1 position, read value at the new position onto the stack.

: dispatch ( c -- )
   CASE
       [CHAR] + OF POSTPONE   1+       ENDOF
       [CHAR] - OF POSTPONE   1-       ENDOF
       [CHAR] > OF POSTPONE   shift>   ENDOF
       [CHAR] < OF POSTPONE   <shift   ENDOF

       [CHAR] . OF POSTPONE   DUP
                   POSTPONE   EMIT     ENDOF

       [CHAR] , OF POSTPONE   DROP
                   POSTPONE   KEY      ENDOF

       [CHAR] [ OF POSTPONE   BEGIN
                   POSTPONE   DUP
                   POSTPONE   WHILE    ENDOF

       [CHAR] ] OF POSTPONE   REPEAT   ENDOF
  ENDCASE ;

The case statement within dispatch was really the first thing I wrote, but proved difficult to wrap my head around the changes necessary when I moved to a "compiler". The word POSTPONE is necessary to (essentially) in-line a definition, rather than executing it, because its content is compiled into a new word and not immediately run.

: bf->forth ( addr n -- )
    POSTPONE init
    BOUNDS DO
        I C@ dispatch
    LOOP
    POSTPONE SWAP  POSTPONE C!  \ final memory cell write
    POSTPONE ;
;

I found the word BOUNDS used online and it's a neat little utility word that I'll need to remember. It takes an (address, length) pair, like those used for strings and converts it into a format consumable by the DO LOOP syntax. DO takes an (end, start) pair and LOOPs an index from start to end. So (1000, 5) becomes (1005, 1000) and it is then possible to LOOP over each character in a string. The word I is used to copy the current index within a loop onto the stack.

The result is each iteration of the loop places the address of a character onto the stack, which is then fetched and passed to dispatch. The last line, POSTPONE ; is used to finalize (exit) the generated Forth word.

: bf : ( name ) CHAR PARSE bf->forth ;

: debug bf-mem 100 DUMP ;

I was feeling really good about my progress until I came to this last step, defining an arbitrary Forth word using the code body generated from bf->forth. I finally wrangled a kind stranger on IRC to explain how on the fly definitions can work. As I understand it: The word : (colon) creates a new word, which is taken to be the next word in the input buffer: bf a-new-word would define a-new-word. The use of CHAR PARSE is definitely weird, but as best I understand it, CHAR consumes the given delimiter (as in " +++>++.<." would be ") and places its value on the stack, which is used by PARSE to define the range within the input buffer. The result looks like a string (address, length) pair, but I think the difference is it is referenced from the input buffer (TIB) rather than the PAD.

I'll freely admit to needing all the help I got on that last step. The good news is, conceptually it isn't fundamental to understanding what is happening. I may yet go back and factor it into simply taking Forth strings as input.

Tests

Any credit for the following test programs is owed to Daniel Cristofani.

Calculate Squares

bf squares " ++++[>+++++<-]>[<+++++>-]+<+[>[>+>+<<-]++>>[<<+>>-]>>>[-]++>[-]+>>>+[[-]++++++>>>]<<<[[<++++++++<++>>-]+<.<[>----<-]<]<<[>>>>>[>>>[-]+++++++++<[>-<-]+++++++++>[-[<->-]+[<<<]]<[>+<-]>]<<-]<<-]"
1
4
9
16
25
36
49
64
81
100
121
144
...

Fibonacci Numbers

bf fib " >++++++++++>+>+[[+++++[>++++++++<-]>.<++++++[>--------<-]+<<<]>.>>[[-]<[>+<-]>>[<<+>+>-]<[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>[-]>+>+<<<-[>+<-]]]]]]]]]]]+>>>]<<<]"
1
1
2
3
5
8
13
21
34
55
89
144
233
377
...

rot13

bf rot13 " ,[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>>+++++[<----->-]<<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>>+++++[<----->-]<<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]>.[-]<,]"
rot13 klmxyz

This is the result of typing (interactively) xyzklm, which is "rotated" 13 letters as part of the simple substitution cipher. I include it only because it is supposed to be a good test of a given brainfuck implementation's ability to handle deeply nested loops.

Thoughts

I should probably go back and work out how to write a real interpreter, now that this brainfuck to Forth compilation is working enough to satisfy my curiosity. Viewing the compiled-to-Forth brainfuck code is a bit of a laugh:

bf spells " +[[->]-[-<]>-]>.>>>>.<<<<-.>>-.>.<<.>>>>-.<<<<<++.>>++." ok
spells brainfuck ok
see spells
: spells
  init 1+
  BEGIN  dup
  WHILE  BEGIN  dup
         WHILE  1- shift>
         REPEAT
         1-
         BEGIN  dup
         WHILE  1- <shift
         REPEAT
         shift> 1-
  REPEAT
  shift> dup emit shift> shift> shift> shift> dup emit <shift <shift <shift <shift 1- dup emit shift> shift> 1- dup emit shift> dup emit <shift <shift dup emit shift> shift> shift> shift> 1- dup emit
  <shift <shift <shift <shift <shift 1+ 1+ dup emit shift> shift> 1+ 1+ dup emit swap c! ; ok

It is such a direct translation of what the brainfuck code does that I can see how simple a brainfuck to C compilation step would be. Similarly, I have the eerie feeling that a naive "compiler" could be written in sed or similar with 1-to-1 text replacements.

I think a real interpreter is possible if only I abandon my original thoughts of leveraging Forth loops and instead use a (single?) outer "run-forever" loop that tracks the currently executing operation and maintains its own references to nested loops. The only real danger I can think of is exploding the stack with loops too deeply nested, but Gforth on my machine seems to support depths somewhere in the range of 2000 items.


  1. I initially had a more complicated system of tracking the brainfuck pointer within a separate variable and SWAPping heavily before I saw this example which looked pretty much exactly like mine but "flowed" better by using the stack to handle memory addresses.
[nolan@nprescott.com] $> █