
Profiling Hare Performance

2024-02-04

I read about some new debugging features recently added to Hare and it wasn't long before I found myself browsing around the mailing list. One recent thread that caught my eye has to do with the performance of a basic Hare program. These are a few notes on diagnosing where all those cycles are spent in a minimal program.

The mailing list thread pertains to the performance of a short program that reads a file line by line and replaces any occurrences of a string within each line. The conversation so far suggested two suspects: the overhead of allocating strings as part of the search and replace, and the indirection of in-memory IO buffers. Idly curious, I tried a few things to rule out possible sources of the problem. I carved up the example until it had no memio, no string replacement, and no file writing, basically paring it down to a simplistic implementation of the Unix wc command. Still it seemed suspiciously slow. Not exactly 80/20, but I was certainly starting to puzzle over where the time was going. While the original thread mentions Dart being 3 times faster, I didn't immediately see a good way to install Dart, so I wrote a similar program in Python, figuring it is slow enough to make a fair baseline. What I found is that, especially for large files, Hare showed something like a 10x slowdown.

the program

use bufio;
use fmt;
use io;
use os;
use types;

export fn main() void = {
	const file = os::open(os::args[1])!;
	const scanner = bufio::newscanner(file, types::SIZE_MAX);
	defer bufio::finish(&scanner);

	let i = 0;

	for (true) {
		match (bufio::scan_line(&scanner)!) {
		case const str =>
			i += 1;
		case io::EOF =>
			break;
		};
	};

	fmt::println(i)!;
};
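
For reference, I built it with something like the following (wc.ha is my own file name; adjust to taste):

$ hare build -o my-hare-wc wc.ha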

perf

First, what does perf say? I am anything but a performance expert, but I do have a copy of System Performance handy. It makes a compelling case for perf, so I gave it a spin with the following:

$ perf record ./my-hare-wc test.txt

Here is the result of perf report:

Samples: 296  of event 'cycles:Pu', Event count (approx.): 304119332
Overhead  Command         Shared Object     Symbol
  92.17%  my-hare-wc  my-hare-wc    [.] rt.memmove
   4.49%  my-hare-wc  my-hare-wc    [.] encoding.utf8.validate
   2.06%  my-hare-wc  my-hare-wc    [.] bytes.index_byte
   0.70%  my-hare-wc  my-hare-wc    [.] strings.fromutf8
   0.35%  my-hare-wc  my-hare-wc    [.] bufio.scan_string
   0.21%  my-hare-wc  my-hare-wc    [.] bufio.scan_line
   0.02%  my-hare-wc  [unknown]     [k] 0xffffffff99fe0d41
   0.00%  my-hare-wc  my-hare-wc    [.] rt.memset
   0.00%  my-hare-wc  [unknown]     [k] 0xffffffff9a0012a0
   0.00%  my-hare-wc  my-hare-wc    [.] initfunc.0
   0.00%  my-hare-wc  [unknown]     [k] 0xffffffff9909c8e7
   0.00%  my-hare-wc  [unknown]     [k] 0xffffffff993adbef

Obviously that 92% number on rt.memmove sounds interesting. It is possible to dig deeper into the disassembly to see how the time breaks down. This isn't too interesting and only confirms that memmove does in fact spend nearly all its time... moving memory.
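
For anyone following along, I pulled up the annotated view below from the perf report TUI (a annotates the selected symbol); something like this should work too:

$ perf annotate rt.memmove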

Samples: 296  of event 'cycles:Pu', 4000 Hz, Event count (approx.): 304119332, DSO: my-hare-wc
rt.memmove  /tmp/hare-perf/my-hare-wc [Percent: local period]
Percent│       sub    $0x1,%rdx
       │       movzbl (%rsi),%eax
       │       mov    %al,(%rdi)
       │       add    $0x1,%rdi
       │       add    $0x1,%rsi
       │       mov    %rsi,%rax
       │       mov    %rsi,%rcx
       │     ↑ jmp    137
       │16f: ↓ jmp    1bd
       │171:   mov    %rax,%rsi
       │       mov    %rdi,%rax
       │177:   cmp    $0x8,%rdx
       │     ↓ jb     197
       │       mov    (%rsi),%rax
  4.18 │       mov    %rax,(%rdi)
  2.28 │       add    $0x8,%rdi
       │       add    $0x8,%rsi
       │       sub    $0x8,%rdx
       │       mov    %rdi,%rax
       │       mov    %rsi,%rcx
       │     ↑ jmp    177
       │197:   mov    %rax,%rdi
       │       mov    %rcx,%rsi
  0.38 │19d:   cmp    $0x0,%rdx
       │     ↓ jbe    1bd
       │       movzbl (%rsi),%eax
 92.39 │       mov    %al,(%rdi)
       │       add    $0x1,%rdi
       │       add    $0x1,%rsi
       │       sub    $0x1,%rdx
       │     ↑ jmp    19d
       │1b6: → call   rt.memcpy
       │     ↓ jmp    1bd
       │1bd:   leave
  0.38 │     ← ret

Valgrind

I must admit I'm a little overwhelmed by the number of options available to perf, and I'm not sure how to get more granular information, such as stack traces, out of my Hare program. Because I'm more interested in learning about Hare than any specific tooling, I decided to throw an old favorite at the problem to see if there are any more obvious clues: Valgrind.

$ valgrind --tool=callgrind ./my-hare-wc test.txt
$ gprof2dot --format=callgrind --output=out.dot callgrind.out.118125
$ dot -Tsvg out.dot -o graph.svg
[graph.svg: the rendered callgraph, transcribed]

rt.start_linux → rt.start_ha → main 100.00%
main
└─ bufio.scan_line 99.97% (42418×)
   └─ bufio.scan_string 99.83% (42418×)
      ├─ bufio.scan_bytes 95.57% (42418×)
      │  ├─ bufio.scan_shift 92.61% (42418×)
      │  │  └─ rt.memmove 92.47% self (42417×)
      │  └─ bytes.index 2.44% (42788×)
      │     └─ bytes.index_slice 2.30% (42788×)
      │        └─ bytes.index_byte 2.18% self (42788×)
      └─ strings.fromutf8 3.79% (42417×)
         └─ encoding.utf8.validate 3.51% self (42417×)

I really do love how easy this is. I had already browsed through the Hare sources and this neatly confirms some of my suspicions. The callgraph above matches the flow through the bufio::scanner module, with my sample program's scan_line being an indirection over scan_string and scan_bytes. There looks to be little overhead of interest in those calls themselves: scan_bytes is dominated by scan_shift, which is itself consumed by the call to rt.memmove. That certainly feels like it could point to a cause.

Here is the entirety of scan_shift:

// Shifts the buffer towards the start, discarding bytes which were read out.
fn scan_shift(scan: *scanner) void = {
	const n = scan.readout;
	if (n == 0) {
		return;
	};
	scan.buffer[..len(scan.buffer) - n] = scan.buffer[n..];
	scan.readout = 0;
	scan.pending -= n;
};

Probably useful to note: "rt" refers to the Hare runtime, which (I think) is platform-dependent in some cases. I went looking for where the call to memmove in the code above comes from, and found a likely-looking candidate in harec (the Hare compiler), in harec/src/gen.c, in the function gen_expr_assign_slice, which generates code for slice assignments like the one on scan.buffer.

My impression of what is happening is that scan.buffer is a buffer of length 4096, based on the definition of newscanner. The buffered reads of the file incrementally fill this buffer, with the read position tracked by shifting the already-read bytes off the front. The amount of memory moved depends on the size of the data "read out" and the size of the buffer. I think this could explain the comparative slowness if scan_shift really is inducing some kind of multiplicative factor on the memory IO by continually moving data around within this buffer. To test the idea out, it should be possible to either increase the read-out size or decrease the size of the buffer; either one should result in less memory being moved overall and (if the theory holds) make the program execute faster.
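
Back-of-envelope, and assuming one scan_shift per delivered line (which matches the callgrind counts above), each shift copies roughly len(scan.buffer) - (line length + 1 for the newline) bytes, giving something like:

   total bytes moved ≈ number of lines × (buffer size - average line size)

For short lines that approaches lines × buffer size, which can dwarf the size of the input file itself.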

This turns out to be pretty easy to test thanks to an alternative to newscanner that is already available: newscanner_static. It requires providing your own buffer instead of having one created behind the scenes. My test then looks like this:

use bufio;
use fmt;
use io;
use os;

export fn main() void = {
	const buf: []u8 = alloc([0...], 4096);
	defer free(buf);
	const file = os::open(os::args[1])!;
	const scanner = bufio::newscanner_static(file, buf);

	let i = 0;

	for (true) {
		match (bufio::scan_line(&scanner)!) {
		case const str =>
			i += 1;
		case io::EOF =>
			break;
		};
	};

	fmt::println(i)!;
};

Functionally the above should behave the same, because the default buffer size is also 4096, and indeed running it results in no change in runtime or output. (The defer bufio::finish is gone as well; as I understand it, finish only frees the scanner's internally allocated buffer, and here I own the buffer and free it myself.) The real test comes when I reduce the buffer size from 4096 to 1024. If my hunch isn't totally wrong, the calls made in scan_shift will move less memory. As an example: if the buffer is the default 4096 and a 10-byte line is read out, I expect 4086 bytes to be memmoved; if the buffer is instead 1024, it seems as though only 1014 bytes should be memmoved. The proof is in the pudding, here's the execution time of both:

$ time ./4096-buf test.txt 
42417
0.079 secs

$ time ./1024-buf test.txt 
42417
0.033 secs

Once again, I'm only pretty sure I haven't misunderstood what is happening here. Of course, reducing the buffer size isn't really a great solution; make it too small and you increase the chances of program inputs exceeding the buffer size and crashing the program. Here is the second half of my suggested test: I've created two new test files, one containing 8 bytes per line and the other containing 1024 bytes per line, each of them 1.0M in size:

$ time ./4096-buf 1024-per-line.txt 
1024
0.015 secs

$ time ./4096-buf 8-per-line.txt 
131072
0.050 secs

$ time ./1024-buf 1024-per-line.txt 
1024
0.013 secs

$ time ./1024-buf 8-per-line.txt 
131072
0.026 secs

At the risk of believing my own arguments too much, I think this further confirms the idea. Longer lines improve performance because fewer bytes are shuffled around within the buffer internally, and shorter lines hurt the larger buffer more because of the greater volume of data moved on every shift.
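
As a rough sanity check on that claim, plugging these files into the same back-of-envelope model (one shift per line, treating the 8 and 1024 as the full per-line size including the newline, and ignoring refills and other fixed costs):

   4096 buffer, 8-byte lines:    131072 × (4096 - 8)  ≈ 536MB moved
   1024 buffer, 8-byte lines:    131072 × (1024 - 8)  ≈ 133MB moved
   4096 buffer, 1024-byte lines: 1024 × (4096 - 1024) ≈ 3.1MB moved
   1024 buffer, 1024-byte lines: 1024 × (1024 - 1024) ≈ 0MB moved

Half a gigabyte of memmove traffic to count the lines of a 1.0M file would go a long way toward explaining the original complaint, and the ordering of these estimates matches the ordering of the timings above.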

What Now?

While I'm happy enough to have identified some of the why behind this program's performance, I'm not exactly sure how to proceed yet. At the moment the last message in the original mailing list thread mentions:

... I suspect poor codegen for aggregate returns as being at the very least the easiest next step for improving perf

And I haven't any idea what that means!

For this particular case I think it might be possible to reduce the cost of memmove if the buffer's position were tracked explicitly, rather than implicitly by shifting the read contents off the front entirely. I'm imagining some kind of offset field used to index into the buffer for reads, without ever actually moving memory (except in those cases where the buffer fills up, etc.). The entirety of the standard library's bufio/scanner.ha seems pretty slim and approachable, but I am not exactly an expert either. The readout field on the scanner seems to fulfill a similar purpose already, so I can't rule out that I'm missing something. Probably I should ask around on IRC.
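
To make that idea concrete, here's a minimal sketch of the kind of thing I'm imagining. To be clear, this is hypothetical code of my own, not the stdlib's scanner (lazyscanner and next_line are made-up names), and it ignores lines longer than the buffer, which the real scanner handles by growing its allocation:

use io;
use strings;

// Hypothetical scanner state: an explicit window [start, end) into the
// buffer instead of shifting consumed bytes off the front every time.
type lazyscanner = struct {
	src: io::handle,
	buffer: []u8,
	start: size, // offset of the first unconsumed byte
	end: size,   // offset one past the last buffered byte
};

fn next_line(s: *lazyscanner) (const str | io::EOF | io::error) = {
	for (true) {
		// Look for a newline (0x0a) in the unconsumed region.
		for (let i = s.start; i < s.end; i += 1) {
			if (s.buffer[i] == 0x0a) {
				const line = s.buffer[s.start..i];
				s.start = i + 1; // consume without moving any memory
				return strings::fromutf8(line)!;
			};
		};
		// No complete line buffered: compact once, then refill. This
		// slice assignment is the only memmove, and it is amortized
		// over every line that fits in the buffer.
		if (s.start > 0) {
			s.buffer[..s.end - s.start] = s.buffer[s.start..s.end];
			s.end -= s.start;
			s.start = 0;
		};
		match (io::read(s.src, s.buffer[s.end..])?) {
		case let n: size =>
			s.end += n;
		case io::EOF =>
			if (s.start == s.end) {
				return io::EOF;
			};
			const line = s.buffer[s.start..s.end];
			s.start = s.end; // hand back the final unterminated line
			return strings::fromutf8(line)!;
		};
	};
};

With 8-byte lines and a 4096-byte buffer that would be one memmove per ~512 lines rather than one per line, though whether this squares with the scanner's other responsibilities (and with readout) is exactly the part I'd want to ask about.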