Recently interested in learning more about virtual machines and interpreters, I decided trying to implement one was a good way to learn more.
Any character not
><+-.,[]
is ignored.Brainfuck is represented by an array with 30,000 cells initialized to zero and a data pointer pointing at the current cell.
There are eight commands:
+
: Increments the value at the current cell by one.-
: Decrements the value at the current cell by one.>
: Moves the data pointer to the next cell (cell on the right).<
: Moves the data pointer to the previous cell (cell on the left)..
: Prints the ASCII value at the current cell (i.e. 65 = 'A').,
: Reads a single input character into the current cell.[
: If the value at the current cell is zero, skips to the corresponding ] . Otherwise, move to the next instruction.]
: If the value at the current cell is zero, move to the next instruction. Otherwise, move backwards in the instructions to the corresponding [ .[ and ] form a while loop. Obviously, they must be balanced.
More than anything, I am interested in learning more about Forth as a language and have heard a lot about how it is ideally suited to building your own application specific DSLs. Brainfuck is a silly toy problem with a very small mental footprint, it's pretty well defined and leaves me only with the decisions made in how to implement it.
I am very new to programming Forth and struggled after getting 3/4ths the way
through implementing an interpreter when I realized the implications of
brainfuck's looping structure. Originally, I had read the spec and concluded I
could use Forth's native looping capability, specifically BEGIN
WHILE
REPEAT
; I had 6 of the 8 operations working before it dawned on me that those
looping commands rely on being compiled within a word and can't be invoked from
the interpreter. As I understand it, this is a result of how Forth compiles
loops down to jumps between memory addresses.
The only way that I could see to resolve this mismatch in my mental model for the interpreter was to implement my own looping controls and tracking nested loops manually, which sounded like a lot of bother.
Instead, I switched gears and decided I'd try and instead compile brainfuck programs down to Forth words and let the Forth compiler do the right thing1.
30000 CONSTANT mem-size
CREATE bf-mem mem-size ALLOT
: init ( -- addr n ) bf-mem mem-size 0 FILL \ zero out memory
bf-mem 0 ;
\ the value of the current BF memory cell is held on the stack until a "shift"
\ occurs, when it is written to the address also being carried on the stack
\ (prior to incrementing the address)
: shift> ( addr n -- addr n ) OVER C! 1+ DUP C@ ;
: <shift ( addr n -- addr n ) OVER C! 1- DUP C@ ;
For those unfamiliar, the word for !
is set
and @
is fetch
, a set
operation takes a number and then an address. Shifts are then best described
as: set value (carried on the stack), move 1 position, read value at the new
position onto the stack.
: dispatch ( c -- )
CASE
[CHAR] + OF POSTPONE 1+ ENDOF
[CHAR] - OF POSTPONE 1- ENDOF
[CHAR] > OF POSTPONE shift> ENDOF
[CHAR] < OF POSTPONE <shift ENDOF
[CHAR] . OF POSTPONE DUP
POSTPONE EMIT ENDOF
[CHAR] , OF POSTPONE DROP
POSTPONE KEY ENDOF
[CHAR] [ OF POSTPONE BEGIN
POSTPONE DUP
POSTPONE WHILE ENDOF
[CHAR] ] OF POSTPONE REPEAT ENDOF
ENDCASE ;
The case statement within dispatch
was really the first thing I wrote, but
proved difficult to wrap my head around the changes necessary when I moved to a
"compiler". The word POSTPONE
is necessary to (essentially) in-line a
definition, rather than executing it, because its content is compiled into a
new word and not immediately run.
: bf->forth ( addr n -- )
POSTPONE init
BOUNDS DO
I C@ dispatch
LOOP
POSTPONE SWAP POSTPONE C! \ final memory cell write
POSTPONE ;
;
I found the word BOUNDS
used online and it's a neat little utility word that
I'll need to remember. It takes an (address, length) pair, like those used for
strings and converts it into a format consumable by the DO
LOOP
syntax. DO
takes an (end, start) pair and LOOP
s an index from start to
end. So (1000, 5) becomes (1005, 1000) and it is then possible to LOOP
over
each character in a string. The word I
is used to copy the current index
within a loop onto the stack.
The result is each iteration of the loop places the address of a character onto
the stack, which is then fetched and passed to dispatch. The last line,
POSTPONE ;
is used to finalize (exit) the generated Forth word.
: bf : ( name ) CHAR PARSE bf->forth ;
: debug bf-mem 100 DUMP ;
I was feeling really good about my progress until I came to this last step,
defining an arbitrary Forth word using the code body generated from
bf->forth
. I finally wrangled a kind stranger on IRC to explain how on the
fly definitions can work. As I understand it: The word :
(colon) creates a
new word, which is taken to be the next word in the input buffer: bf a-new-word
would define a-new-word
. The use of CHAR PARSE
is definitely
weird, but as best I understand it, CHAR
consumes the given delimiter (as in
" +++>++.<."
would be "
) and places its value on the stack, which is used
by PARSE
to define the range within the input buffer. The result looks like a
string (address, length) pair, but I think the difference is it is referenced
from the input buffer (TIB
) rather than the PAD
.
I'll freely admit to needing all the help I got on that last step. The good news is, conceptually it isn't fundamental to understanding what is happening. I may yet go back and factor it into simply taking Forth strings as input.
Any credit for the following test programs is owed to Daniel Cristofani.
bf squares " ++++[>+++++<-]>[<+++++>-]+<+[>[>+>+<<-]++>>[<<+>>-]>>>[-]++>[-]+>>>+[[-]++++++>>>]<<<[[<++++++++<++>>-]+<.<[>----<-]<]<<[>>>>>[>>>[-]+++++++++<[>-<-]+++++++++>[-[<->-]+[<<<]]<[>+<-]>]<<-]<<-]"
1
4
9
16
25
36
49
64
81
100
121
144
...
bf fib " >++++++++++>+>+[[+++++[>++++++++<-]>.<++++++[>--------<-]+<<<]>.>>[[-]<[>+<-]>>[<<+>+>-]<[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>[-]>+>+<<<-[>+<-]]]]]]]]]]]+>>>]<<<]"
1
1
2
3
5
8
13
21
34
55
89
144
233
377
...
bf rot13 " ,[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>>+++++[<----->-]<<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>>+++++[<----->-]<<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>++++++++++++++<-[>+<-]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]>.[-]<,]"
rot13 klmxyz
This is the result of typing (interactively) xyzklm
, which is "rotated" 13
letters as part of the
simple substitution cipher. I include it
only because it is supposed to be a good test of a given brainfuck
implementation's ability to handle deeply nested loops.
I should probably go back and work out how to write a real interpreter, now that this brainfuck to Forth compilation is working enough to satisfy my curiosity. Viewing the compiled-to-Forth brainfuck code is a bit of a laugh:
bf spells " +[[->]-[-<]>-]>.>>>>.<<<<-.>>-.>.<<.>>>>-.<<<<<++.>>++." ok
spells brainfuck ok
see spells
: spells
init 1+
BEGIN dup
WHILE BEGIN dup
WHILE 1- shift>
REPEAT
1-
BEGIN dup
WHILE 1- <shift
REPEAT
shift> 1-
REPEAT
shift> dup emit shift> shift> shift> shift> dup emit <shift <shift <shift <shift 1- dup emit shift> shift> 1- dup emit shift> dup emit <shift <shift dup emit shift> shift> shift> shift> 1- dup emit
<shift <shift <shift <shift <shift 1+ 1+ dup emit shift> shift> 1+ 1+ dup emit swap c! ; ok
It is such a direct translation of what the brainfuck code does that I can see
how simple a brainfuck to C compilation step would be. Similarly, I have the
eerie feeling that a naive "compiler" could be written in sed
or similar with
1-to-1 text replacements.
I think a real interpreter is possible if only I abandon my original thoughts of leveraging Forth loops and instead use a (single?) outer "run-forever" loop that tracks the currently executing operation and maintains its own references to nested loops. The only real danger I can think of is exploding the stack with loops too deeply nested, but Gforth on my machine seems to support depths somewhere in the range of 2000 items.
SWAP
ping heavily before I saw this example which looked pretty much exactly like mine but "flowed" better by using the stack to handle memory addresses.