I read about some new debugging features recently added to Hare and it wasn't long before I found myself browsing around the mailing list. One recent thread that caught my eye has to do with the performance of a basic Hare program. These are a few notes on diagnosing where all those cycles are spent in a minimal program.
The mailing list thread pertains to the performance of a short program that reads a file line by line and replaces any occurrences of a string within each line. The conversation so far pointed at the overhead of allocating strings as part of the search and replace, and at the indirection of memio buffers. Idly curious, I tried a few things to rule out sources of the problem. I carved up the example until it had no memio, no string replacement and no file writing; what was left is basically a simplistic implementation of the Unix wc command. Still it seemed suspiciously slow, not exactly 80/20, but I was certainly starting to feel puzzled about where the time was going. The original thread mentions Dart being 3 times faster, but I didn't immediately see a good way to install Dart, so I wrote a similar program in Python, figuring it is slow enough to be a good benchmark. What I found is that, especially for large files, Hare was something like 10x slower. Here is the pared-down Hare program:
use fmt;
use io;
use bufio;
use os;
use types;

export fn main() void = {
    const file = os::open(os::args[1])!;
    const scanner = bufio::newscanner(file, types::SIZE_MAX);
    defer bufio::finish(&scanner);
    let i = 0;
    for (true) {
        let line = match (bufio::scan_line(&scanner)) {
        case const str =>
            i = i + 1;
            yield void;
        case io::EOF =>
            break;
        };
    };
    fmt::println(i)!;
};
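For comparison, the Python version was nothing more elaborate than a line-counting loop along these lines (a reconstruction of the idea rather than the exact script I timed):

# Rough Python equivalent of the pared-down Hare program: count lines in a file.
import sys

def main() -> None:
    count = 0
    with open(sys.argv[1], "rb") as f:
        for _ in f:
            count += 1
    print(count)

if __name__ == "__main__":
    main()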
perf
First, what does perf say? I am anything but a performance expert, but I do have a copy of Systems Performance handy. It makes a compelling case for perf, so I gave it a spin with the following:
$ perf record ./my-hare-wc test.txt
Here is the result of perf report:
Samples: 296 of event 'cycles:Pu', Event count (approx.): 304119332
Overhead  Command     Shared Object  Symbol
  92.17%  my-hare-wc  my-hare-wc     [.] rt.memmove
   4.49%  my-hare-wc  my-hare-wc     [.] encoding.utf8.validate
   2.06%  my-hare-wc  my-hare-wc     [.] bytes.index_byte
   0.70%  my-hare-wc  my-hare-wc     [.] strings.fromutf8
   0.35%  my-hare-wc  my-hare-wc     [.] bufio.scan_string
   0.21%  my-hare-wc  my-hare-wc     [.] bufio.scan_line
   0.02%  my-hare-wc  [unknown]      [k] 0xffffffff99fe0d41
   0.00%  my-hare-wc  my-hare-wc     [.] rt.memset
   0.00%  my-hare-wc  [unknown]      [k] 0xffffffff9a0012a0
   0.00%  my-hare-wc  my-hare-wc     [.] initfunc.0
   0.00%  my-hare-wc  [unknown]      [k] 0xffffffff9909c8e7
   0.00%  my-hare-wc  [unknown]      [k] 0xffffffff993adbef
Obviously that 92% figure against rt.memmove sounds interesting. It is possible to dig deeper into the disassembly (perf annotate) to see how the time breaks down. This isn't too interesting and only confirms that memmove does in fact spend nearly all of its time... moving memory.
Samples: 296 of event 'cycles:Pu', 4000 Hz, Event count (approx.): 304119332, DSO: my-hare-wc
rt.memmove  /tmp/hare-perf/my-hare-wc [Percent: local period]
Percent│        sub    $0x1,%rdx
       │        movzbl (%rsi),%eax
       │        mov    %al,(%rdi)
       │        add    $0x1,%rdi
       │        add    $0x1,%rsi
       │        mov    %rsi,%rax
       │        mov    %rsi,%rcx
       │      ↑ jmp    137
       │ 16f: ↓ jmp    1bd
       │ 171:   mov    %rax,%rsi
       │        mov    %rdi,%rax
       │ 177:   cmp    $0x8,%rdx
       │      ↓ jb     197
       │        mov    (%rsi),%rax
  4.18 │        mov    %rax,(%rdi)
  2.28 │        add    $0x8,%rdi
       │        add    $0x8,%rsi
       │        sub    $0x8,%rdx
       │        mov    %rdi,%rax
       │        mov    %rsi,%rcx
       │      ↑ jmp    177
       │ 197:   mov    %rax,%rdi
       │        mov    %rcx,%rsi
  0.38 │ 19d:   cmp    $0x0,%rdx
       │      ↓ jbe    1bd
       │        movzbl (%rsi),%eax
 92.39 │        mov    %al,(%rdi)
       │        add    $0x1,%rdi
       │        add    $0x1,%rsi
       │        sub    $0x1,%rdx
       │      ↑ jmp    19d
       │ 1b6: → call   rt.memcpy
       │      ↓ jmp    1bd
       │ 1bd:   leave
  0.38 │      ← ret
I must admit I'm a little overwhelmed by the number of options available to perf, and I'm not sure how to get more granular information, such as stack traces, out of my Hare program. Because I'm more interested in learning about Hare than in any specific tooling, I decided to throw an old favorite at the problem to see if there are any more obvious clues: Valgrind.
$ valgrind --tool=callgrind ./my-hare-wc test.txt
$ gprof2dot --format=callgrind --output=out.dot callgrind.out.118125
$ dot -Tsvg out.dot -o graph.svg
I really do love how easy this is. I had already browsed through the Hare sources and this neatly confirms some of my suspicions. The callgraph above matches the flow through the bufio::scanner module, with my sample program's scan_line being an indirection over scan_string and scan_bytes. There looks to be little overhead of interest in those calls: scan_bytes is dominated by scan_shift, which is itself consumed by the call to rt.memmove. That certainly feels like it could point to a cause.
Here is the entirety of scan_shift:

// Shifts the buffer towards the start, discarding bytes which were read out.
fn scan_shift(scan: *scanner) void = {
    const n = scan.readout;
    if (n == 0) {
        return;
    };
    scan.buffer[..len(scan.buffer) - n] = scan.buffer[n..];
    scan.readout = 0;
    scan.pending -= n;
};
Probably useful to note: "rt" refers to the Hare runtime and (I think) depends on the platform in some cases. I went looking for where the call to memmove in the above code could come from, and found a likely-looking call in harec (the Hare compiler) in harec/src/gen.c, in a function called gen_expr_assign_slice.
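As a mental model (Python rather than Hare, purely for illustration), the slice assignment in scan_shift is doing the equivalent of this, which is exactly the sort of copy that lowers to a memmove:

# Illustration only: shift the unread tail of a buffer back to the start,
# the same shape of copy that scan_shift's slice assignment performs.
buf = bytearray(b"already-read|still-pending....")
n = len(b"already-read|")           # bytes that have been "read out"
buf[:len(buf) - n] = buf[n:]        # copy the pending bytes to the front
print(bytes(buf[:len(buf) - n]))    # b'still-pending....'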
My impression of what is happening is that scan.buffer is a buffer of length 4096, based on the definition of newscanner. Buffered reads of the file incrementally fill this buffer, and the read position is tracked by shifting the already-read bytes off the front. The amount of memory moved depends on how much data was "read out" and on the size of the buffer. I think this could explain the comparative slowness if scan_shift really is inducing some kind of multiplicative factor on the memory IO by continually moving data around within this buffer. In an effort to test this idea out, it should be possible to either increase the read-out size or decrease the size of the buffer; either one should result in less memory being moved per scan and (if the hypothesis holds) make the program execute faster.
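To put rough numbers on that intuition, here is a back-of-the-envelope model (plain Python, and an assumption on my part: scan_shift runs roughly once per scanned line and moves len(buffer) - readout bytes, as the slice assignment above suggests):

# Crude model of the per-line cost: bytes shifted is roughly the buffer size
# minus however many bytes were read out for the line.
def bytes_shifted(buffer_size: int, readout: int) -> int:
    return max(buffer_size - readout, 0)

print(bytes_shifted(4096, 10))  # 4086 bytes moved for a small line
print(bytes_shifted(1024, 10))  # 1014 bytes moved for a small line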
This turns out to be pretty easy to test thanks to an alternative to newscanner that is already available: newscanner_static. It requires providing your own buffer instead of having one created behind the scenes. My test then looks like this:
use fmt;
use io;
use bufio;
use os;

export fn main() void = {
    const buf: []u8 = alloc([0...], 4096);
    defer free(buf);
    const file = os::open(os::args[1])!;
    const scanner = bufio::newscanner_static(file, buf);
    let i = 0;
    for (true) {
        let line = match (bufio::scan_line(&scanner)) {
        case const str =>
            i = i + 1;
            yield void;
        case io::EOF =>
            break;
        };
    };
    fmt::println(i)!;
};
Functionally the above should behave the same, because the default buffer size is 4096, and indeed running it shows no change in runtime or output. The real test comes when I reduce the buffer size from 4096 to 1024. If my hunch isn't totally wrong, the calls made in scan_shift should move less memory. As an example, if the buffer is 4096 bytes and 10 bytes are read out, I expect 4086 bytes to be memmoved; if the buffer is instead 1024 bytes and 10 bytes are read out, only 1014 bytes should be memmoved. The proof is in the pudding; here's the execution time of both:
$ time ./4096-buf test.txt
42417
0.079 secs
$ time ./1024-buf test.txt
42417
0.033 secs
Once again, I'm only pretty sure I haven't misunderstood what is happening here. Of course, reducing the buffer size isn't really a great solution; make it too small and you increase the chances of program inputs exceeding the buffer size and crashing the program. Here is the second half of my suggested test: I've created two new test files, one containing 8 bytes per line and the other containing 1024 bytes per line, each of them 1.0M (a sketch for generating files like these follows the timings):
$ time ./4096-buf 1024-per-line.txt
1024
0.015 secs
$ time ./4096-buf 8-per-line.txt
131072
0.050 secs
$ time ./1024-buf 1024-per-line.txt
1024
0.013 secs
$ time ./1024-buf 8-per-line.txt
131072
0.026 secs
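The test files themselves are nothing special; inputs like them can be generated with a few lines of Python (this assumes the bytes-per-line figure includes the trailing newline, which matches the line counts above):

# Generate a ~1 MiB file of fixed-width lines, e.g.:
#   python3 genlines.py 8 8-per-line.txt
#   python3 genlines.py 1024 1024-per-line.txt
import sys

bytes_per_line, path = int(sys.argv[1]), sys.argv[2]
with open(path, "wb") as f:
    for _ in range((1024 * 1024) // bytes_per_line):
        f.write(b"a" * (bytes_per_line - 1) + b"\n")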
At the risk of believing my own arguments too much, I think this further confirms the idea. Longer lines improve performance because fewer moves are made within the buffer internally, while shorter lines hurt the larger buffer more because each shift moves a greater volume of data.
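Running the same crude model over the four combinations at least orders the same way as the timings, though the ratios don't line up because shifting is only part of the work (reads, UTF-8 validation and everything else are ignored here):

# Estimate total bytes shifted for a 1 MiB file of fixed-width lines,
# assuming one shift per line moving buffer_size - bytes_per_line bytes.
FILE_SIZE = 1024 * 1024

def total_shifted(buffer_size: int, bytes_per_line: int) -> int:
    lines = FILE_SIZE // bytes_per_line
    return lines * max(buffer_size - bytes_per_line, 0)

for buffer_size in (4096, 1024):
    for bytes_per_line in (1024, 8):
        mb = total_shifted(buffer_size, bytes_per_line) / 1e6
        print(f"buffer {buffer_size}, {bytes_per_line} bytes/line: ~{mb:.0f} MB shifted")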
While I'm happy enough to have identified some of the why behind the program's performance, I'm not exactly sure how to proceed yet. At the moment the last message in the original mailing list thread mentions:
... I suspect poor codegen for aggregate returns as being at the very least the easiest next step for improving perf
And I haven't any idea what that means!
For this particular case I think it might be possible to reduce the calls to memmove if the buffer's position were tracked explicitly rather than implicitly, instead of shifting the read contents out entirely. I'm imagining some kind of offset attribute that is used for indexing into the buffer for reads without ever actually moving memory (except in those cases where the buffer fills up and so on). The entirety of the standard library's bufio/scanner.ha seems pretty slim and approachable, but I am not exactly an expert either. It seems a little like the readout attribute on the scanner may already fulfill a similar purpose, so I may well be missing something. Probably I should ask around on IRC.
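To make the offset idea concrete, here is a sketch in Python (not Hare, and not how bufio actually works; the offset attribute and the compaction policy are my own invention for illustration): lines are handed out by advancing an offset into the buffer, and the only wholesale copy happens when the buffer needs to be refilled.

# Sketch of a line scanner that tracks a read offset instead of shifting the
# buffer after every line. The only memmove-style copy is the compaction in
# _fill(), which happens at most once per buffer refill.
import sys

class OffsetLineScanner:
    def __init__(self, f, bufsize=4096):
        self.f = f
        self.buf = bytearray(bufsize)
        self.start = 0  # hypothetical offset attribute: first unread byte
        self.end = 0    # one past the last buffered byte

    def _fill(self):
        # Compact the unread tail to the front, then top the buffer up.
        if self.start > 0:
            self.buf[:self.end - self.start] = self.buf[self.start:self.end]
            self.end -= self.start
            self.start = 0
        if self.end == len(self.buf):
            raise ValueError("line longer than buffer")  # same limit as the static scanner
        n = self.f.readinto(memoryview(self.buf)[self.end:])
        self.end += n
        return n

    def scan_line(self):
        while True:
            nl = self.buf.find(b"\n", self.start, self.end)
            if nl != -1:
                line = bytes(self.buf[self.start:nl])
                self.start = nl + 1  # advance the offset; no copying
                return line
            if self._fill() == 0:
                if self.start == self.end:
                    return None  # EOF
                line = bytes(self.buf[self.start:self.end])
                self.start = self.end
                return line  # final line without a trailing newline

if __name__ == "__main__":
    count = 0
    with open(sys.argv[1], "rb", buffering=0) as f:
        scanner = OffsetLineScanner(f)
        while scanner.scan_line() is not None:
            count += 1
    print(count)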