Reviewing an interview question posed to me recently. This time, implementing a loosely specified message protocol.
Implement
encode()
anddecode()
for a simple wire protocol per the prototypes below, given the following requirements:
Assume you have several ASCII files, where each file contains a single message. A file/message may contain any valid ASCII sequence.
You may assume all filenames are absolute.
encode()
must read a single specified file/message from disk and place your encoded version of the message in a provided bytearray.Sequential calls to
encode()
with the same bytearray must be supported.
decode()
must read a single encoded message from the provided bytearray and write it to the specified file.The contents of the file written by
decode()
for a given message should be identical to the corresponding file read by encode().Sequential calls to decode() with the same bytearray (without interleaved
encode()
calls) must be supported.You can think of the bytearray as substituting for a network protocol that’s transmitting the messages over the network in arbitrarily-sized packets.
Provide your
encode()
anddecode()
implementations, document any additional assumptions you made, and include any testing or infrastructure code that you feel is helpful. In your implementation, use only standard Python (please specify which version). Do not use the pickle modules.Prototypes:
def encode ( inputFilename, outputBytearray ): def decode ( inputBytearray, outputFilename ):
The solution works under both Python 3 and 2. I limit message lengths to
32-bits which means messages can't exceed 4GB; consequently I also assume
messages will fit in memory. This isn't critical to how the protocol works,
just a side-effect of using Python's read
method.
The instructions weren't clear on whether the function prototypes could be
modified (for self
use) so I didn't modify them. Similarly, I didn't bother
building out much infrastructure for sharing a bytearray between input and
output - these two things preclude me from advising this for "real world use".
Finally, I didn't bother with any special error handling, if a non-ASCII file is encoded, Python will simply error out at the inappropriate character. This seemed as good as any custom handling I might implement.
Writing my solution, I realized this was a perfect example of what drew people to Python in the first place. The approach is (I think) identical to what a C program could do, packing structures with encoded byte arrays, the major difference in Python is simplified syntax and brevity.
import struct
def encode(inputFilename, outputBytearray):
"""Appends to outputBytearray a length-header and message of given length,
maximum message length is 32-bits. Length-header is a 4 byte, big-endian
struct, see:
https://docs.python.org/3/library/struct.html#byte-order-size-and-alignment
"""
message_contents = read_file(inputFilename)
encoded_message = message_contents.encode('ascii')
message_length = struct.pack('!I', len(encoded_message))
outputBytearray += message_length
outputBytearray += encoded_message
The specification wasn't clear on things like endian-ness for the protocol, so
I use Python's suggested
"network byte order"
when packing the structs. The messages are limited to 4GB in size because of
the use of I
, which denotes unsigned int
(standard size 4 bytes). There are
alternatives, including double or long long but 4GB seems as arbitrarily
satisfying as anything else.
An alternative protocol might use delimited messages, which could be streamed without padding a header with length information. I haven't thought too deeply about this but it seems it would suffer a different set of issues, along the lines of null-terminated strings in C.
def decode(inputBytearray, outputFilename):
"""returns a modified bytearray with the message and length-header that were
read removed
example:
arr = bytearray()
encode('input.txt', arr)
arr = decode(arr, 'output.txt')
`arr` will be empty after the assignment from `decode`
"""
header_length = 4
message_length, = struct.unpack('!I', inputBytearray[:header_length])
inputBytearray = inputBytearray[header_length:]
message = inputBytearray[:message_length]
inputBytearray = inputBytearray[message_length:]
write_file(outputFilename, message.decode('ascii'))
return inputBytearray
I really didn't like the function prototype for decode
, it seems well-suited
for TDD and poorly suited for real use. My use of reassignment over the input
byte array feels like a cheap hack but I think performance is slightly better
than doing similarly with a Python string - byte arrays in Python are mutable,
which (might) limit re-copying.
The above solution uses two small functions that I felt improved readability by
extracting them from the body of encode
and decode
:
def read_file(filename):
with open(filename) as message:
return message.read()
def write_file(filename, contents):
with open(filename, 'w') as message:
message.write(contents)
When I submitted my solution, I submitted the following tests:
import filecmp
arr = bytearray()
encode('a.txt', arr)
encode('b.txt', arr)
arr = decode(arr, 'c.txt')
arr = decode(arr, 'd.txt')
assert filecmp.cmp('a.txt', 'c.txt', shallow=False) == True
assert filecmp.cmp('b.txt', 'd.txt', shallow=False) == True
assert len(arr) == 0
filecmp
is a useful little module that is acting as a simple system test2
against a "known good" result (the source file). Two files are read
sequentially into a byte array and then decoded into separate files, where they
are compared to their original. Due to the reassignment to arr
the byte array
should be exhausted after the two decode
operations.
I decided a more interesting test would be to build a client for this protocol after the fact; because really - the whole point of a protocol is standardization, right?
Without bothering to build out the necessary infrastructure for socket programming (because that doesn't sound interesting) I saved an encoded file1 in Python with the following and set about building a decoder client in Forth.
>>> arr = bytearray()
>>> encode('/tmp/hamlet-ascii.txt', arr)
>>> with open('/tmp/encoded-test.myfile', 'wb') as f:
... f.write(arr)
...
A quick read of the file into GForth reveals a slight hiccup:
s" /tmp/encoded-test.myfile" slurp-file ok
drop 100 dump
105B5E000: 00 02 CE C6 2A 2A 2A 54 - 68 65 20 50 72 6F 6A 65 ....***The Proje
105B5E010: 63 74 20 47 75 74 65 6E - 62 65 72 67 27 73 20 45 ct Gutenberg's E
105B5E020: 74 65 78 74 20 6F 66 20 - 53 68 61 6B 65 73 70 65 text of Shakespe
105B5E030: 61 72 65 27 73 20 46 69 - 72 73 74 20 46 6F 6C 69 are's First Foli
105B5E040: 6F 2A 2A 2A 0D 0A 2A 2A - 2A 2A 2A 2A 2A 2A 2A 2A o***..**********
105B5E050: 2A 2A 2A 2A 2A 2A 2A 2A - 2A 2A 2A 54 68 65 20 54 ***********The T
105B5E060: 72 61 67 65 64 69 65 20 - 6F 66 20 48 61 6D 6C 65 ragedie of Hamle
105B5E070: 74 2A 2A 2A 2A 2A 2A 2A - 2A 2A 2A 2A 2A 2A 2A 2A t***************
105B5E080: 2A 2A 2A 2A 2A 2A 0D 0A - 0D 0A 54 68 69 73 20 69 ******....This i
105B5E090: 73 20 6F 75 72 20 33 72 - 64 20 65 64 69 74 69 6F s our 3rd editio
Those first 4 bytes, the ones I packed with Python's struct
module, using
"network order" endian-ness? They're big endian and my computer is
little endian. This had me stumped until I figured out how to do the proper
amount of bit-twiddling to coerce the bytes into their correct endian-ness.
The following Forth word works, but it's pretty ugly, too much stack juggling
makes it hard to follow (>r
and r>
are moving values from the value stack
to the return stack and vice versa for temporary storage). One overlooked
detail which set me back ages was not setting the base explicitly for the shift
operations. I had mistakenly typed 24 and 8 which were then implicitly used as
their hex value equivalent. The solution is to simply force a notation ($
is
hex here, an alternative would be %24
for decimal) rather than leaving it up
to the current number system.
: big->little DUP $FF AND $18 lshift >r
DUP $FF00 AND $8 lshift r>
OR >r
DUP $FF0000 AND $8 rshift >r
$FF000000 AND $18 rshift r> r>
OR OR ;
542A2A2AC6CE0200 big->little .s <1> 2CEC6 ok
This was all a bit of a diversion from the original goal of testing the message protocol encoding which can be demonstrated to "work" in a simple test:
s" /tmp/encoded-test.myfile" SLURP-FILE DROP ok
s" /tmp/decoded-test.txt" w/o CREATE-FILE THROW VALUE out-file
DUP @ big->little SWAP 4 + SWAP out-file WRITE-FILE
out-file CLOSE-FILE THROW
and finally, in the shell:
/tmp $ diff hamlet-ascii.txt decoded.txt
/tmp $
Reveals the output file, run through an encode process in Python and saved to disk, then loaded, decoded and written out from a Forth protocol decoder client is identical to its source file.
I had a devil of a time tracking down a gap in my knowledge of Forth's file-write capabilities and limitations - namely, it is important to close or flush a file after a write operation or else 4096 bytes will remain in a write buffer and never make it to disk.
I actually had a good time with this particular question, it seemed a good test of "can you program?" while allowing some level of self-direction (as opposed to say, fizzbuzz). Circling back to build a client in Forth was an interesting diversion that answered a few questions about working in Forth that I hadn't before thought to ask.