Implementing a Simple Message Protocol

2016-09-25

Reviewing an interview question posed to me recently. This time, implementing a loosely specified message protocol.

The Prompt

Implement encode() and decode() for a simple wire protocol per the prototypes below, given the following requirements:

Assume you have several ASCII files, where each file contains a single message. A file/message may contain any valid ASCII sequence.

You may assume all filenames are absolute.

encode() must read a single specified file/message from disk and place your encoded version of the message in a provided bytearray.

Sequential calls to encode() with the same bytearray must be supported.

decode() must read a single encoded message from the provided bytearray and write it to the specified file.

The contents of the file written by decode() for a given message should be identical to the corresponding file read by encode().

Sequential calls to decode() with the same bytearray (without interleaved encode() calls) must be supported.

You can think of the bytearray as substituting for a network protocol that’s transmitting the messages over the network in arbitrarily-sized packets.

Provide your encode() and decode() implementations, document any additional assumptions you made, and include any testing or infrastructure code that you feel is helpful. In your implementation, use only standard Python (please specify which version). Do not use the pickle modules.

Prototypes:
def encode ( inputFilename, outputBytearray ):

def decode ( inputBytearray, outputFilename ):

Assumptions

The solution works under both Python 3 and 2. I limit message lengths to 32-bits which means messages can't exceed 4GB; consequently I also assume messages will fit in memory. This isn't critical to how the protocol works, just a side-effect of using Python's read method.

The instructions weren't clear on whether the function prototypes could be modified (for self use) so I didn't modify them. Similarly, I didn't bother building out much infrastructure for sharing a bytearray between input and output - these two things preclude me from advising this for "real world use".

Finally, I didn't bother with any special error handling, if a non-ASCII file is encoded, Python will simply error out at the inappropriate character. This seemed as good as any custom handling I might implement.

Solution

Writing my solution, I realized this was a perfect example of what drew people to Python in the first place. The approach is (I think) identical to what a C program could do, packing structures with encoded byte arrays, the major difference in Python is simplified syntax and brevity.

import struct


def encode(inputFilename, outputBytearray):
    """Appends to outputBytearray a length-header and message of given length,
    maximum message length is 32-bits. Length-header is a 4 byte, big-endian
    struct, see:

    https://docs.python.org/3/library/struct.html#byte-order-size-and-alignment

    """
    message_contents = read_file(inputFilename)
    encoded_message = message_contents.encode('ascii')
    message_length = struct.pack('!I', len(encoded_message))
    outputBytearray += message_length
    outputBytearray += encoded_message

The specification wasn't clear on things like endian-ness for the protocol, so I use Python's suggested "network byte order" when packing the structs. The messages are limited to 4GB in size because of the use of I, which denotes unsigned int (standard size 4 bytes). There are alternatives, including double or long long but 4GB seems as arbitrarily satisfying as anything else.

An alternative protocol might use delimited messages, which could be streamed without padding a header with length information. I haven't thought too deeply about this but it seems it would suffer a different set of issues, along the lines of null-terminated strings in C.

def decode(inputBytearray, outputFilename):
    """returns a modified bytearray with the message and length-header that were
    read removed

    example:
      arr = bytearray()
      encode('input.txt', arr)
      arr = decode(arr, 'output.txt')

    `arr` will be empty after the assignment from `decode`

    """
    header_length = 4
    message_length, = struct.unpack('!I', inputBytearray[:header_length])
    inputBytearray = inputBytearray[header_length:]
    message = inputBytearray[:message_length]
    inputBytearray = inputBytearray[message_length:]
    write_file(outputFilename, message.decode('ascii'))
    return inputBytearray

I really didn't like the function prototype for decode, it seems well-suited for TDD and poorly suited for real use. My use of reassignment over the input byte array feels like a cheap hack but I think performance is slightly better than doing similarly with a Python string - byte arrays in Python are mutable, which (might) limit re-copying.

Helper Functions

The above solution uses two small functions that I felt improved readability by extracting them from the body of encode and decode:

def read_file(filename):
    with open(filename) as message:
        return message.read()

def write_file(filename, contents):
    with open(filename, 'w') as message:
        message.write(contents)

Testing It Out

When I submitted my solution, I submitted the following tests:

import filecmp

arr = bytearray()
encode('a.txt', arr)
encode('b.txt', arr)
arr = decode(arr, 'c.txt')
arr = decode(arr, 'd.txt')

assert filecmp.cmp('a.txt', 'c.txt', shallow=False) == True
assert filecmp.cmp('b.txt', 'd.txt', shallow=False) == True
assert len(arr) == 0

filecmp is a useful little module that is acting as a simple system test² against a "known good" result (the source file). Two files are read sequentially into a byte array and then decoded into separate files, where they are compared to their original. Due to the reassignment to arr the byte array should be exhausted after the two decode operations.

A Less Naive Test

I decided a more interesting test would be to build a client for this protocol after the fact; because really - the whole point of a protocol is standardization, right?

Without bothering to build out the necessary infrastructure for socket programming (because that doesn't sound interesting) I saved an encoded file¹ in Python with the following and set about building a decoder client in Forth.

>>> arr = bytearray()
>>> encode('/tmp/hamlet-ascii.txt', arr)
>>> with open('/tmp/encoded-test.myfile', 'wb') as f:
...     f.write(arr)
...

A quick read of the file into GForth reveals a slight hiccup:

s" /tmp/encoded-test.myfile" slurp-file  ok
drop 100 dump
105B5E000: 00 02 CE C6  2A 2A 2A 54 - 68 65 20 50  72 6F 6A 65  ....***The Proje
105B5E010: 63 74 20 47  75 74 65 6E - 62 65 72 67  27 73 20 45  ct Gutenberg's E
105B5E020: 74 65 78 74  20 6F 66 20 - 53 68 61 6B  65 73 70 65  text of Shakespe
105B5E030: 61 72 65 27  73 20 46 69 - 72 73 74 20  46 6F 6C 69  are's First Foli
105B5E040: 6F 2A 2A 2A  0D 0A 2A 2A - 2A 2A 2A 2A  2A 2A 2A 2A  o***..**********
105B5E050: 2A 2A 2A 2A  2A 2A 2A 2A - 2A 2A 2A 54  68 65 20 54  ***********The T
105B5E060: 72 61 67 65  64 69 65 20 - 6F 66 20 48  61 6D 6C 65  ragedie of Hamle
105B5E070: 74 2A 2A 2A  2A 2A 2A 2A - 2A 2A 2A 2A  2A 2A 2A 2A  t***************
105B5E080: 2A 2A 2A 2A  2A 2A 0D 0A - 0D 0A 54 68  69 73 20 69  ******....This i
105B5E090: 73 20 6F 75  72 20 33 72 - 64 20 65 64  69 74 69 6F  s our 3rd editio

Those first 4 bytes, the ones I packed with Python's struct module, using "network order" endian-ness? They're big endian and my computer is little endian. This had me stumped until I figured out how to do the proper amount of bit-twiddling to coerce the bytes into their correct endian-ness.

The following Forth word works, but it's pretty ugly, too much stack juggling makes it hard to follow (>r and r> are moving values from the value stack to the return stack and vice versa for temporary storage). One overlooked detail which set me back ages was not setting the base explicitly for the shift operations. I had mistakenly typed 24 and 8 which were then implicitly used as their hex value equivalent. The solution is to simply force a notation ($ is hex here, an alternative would be %24 for decimal) rather than leaving it up to the current number system.

: big->little  DUP $FF   AND     $18 lshift >r
               DUP $FF00 AND     $8  lshift r>
               OR  >r
               DUP $FF0000 AND   $8  rshift >r
               $FF000000   AND   $18 rshift r> r>
               OR OR ;

542A2A2AC6CE0200 big->little .s <1> 2CEC6  ok

Get on With It

This was all a bit of a diversion from the original goal of testing the message protocol encoding which can be demonstrated to "work" in a simple test:

s" /tmp/encoded-test.myfile" SLURP-FILE DROP  ok
s" /tmp/decoded-test.txt" w/o CREATE-FILE THROW VALUE out-file
DUP @ big->little SWAP 4 + SWAP out-file WRITE-FILE
out-file CLOSE-FILE THROW

and finally, in the shell:

/tmp $ diff hamlet-ascii.txt decoded.txt
/tmp $

Reveals the output file, run through an encode process in Python and saved to disk, then loaded, decoded and written out from a Forth protocol decoder client is identical to its source file.

I had a devil of a time tracking down a gap in my knowledge of Forth's file-write capabilities and limitations - namely, it is important to close or flush a file after a write operation or else 4096 bytes will remain in a write buffer and never make it to disk.

How Very Circuitous

I actually had a good time with this particular question, it seemed a good test of "can you program?" while allowing some level of self-direction (as opposed to say, fizzbuzz). Circling back to build a client in Forth was an interesting diversion that answered a few questions about working in Forth that I hadn't before thought to ask.

I am using a copy of Shakespeare's Hamlet from Project Gutenberg, but watch out, you'll have to strip the byte-order-mark if you want to reproduce my example
I take some inspiration here from Philip Guo