I've got the day off today, what better way to spend it than debugging Emacs?
I recently tried out the mail client Wanderlust for reading NNTP (usenet) and while it works out of the box there was a one issue nagging at me. When retrieving a new (previously unread) article from the server the minibuffer would display a warning message:
Blocking call to accept-process-output with quit inhibited!!
There was no negative effect that I could see, messages were retrieved and nothing was hung up. It was annoying me to see the message every single time though. I wanted to know what it meant and I wanted to fix it.
The first order of business is to find where it comes from; I happen to have a source checkout of Emacs already on this machine (don't you?) and wagered the message was specific enough I'd find it by grepping the source tree. Sure enough there was exactly one result:
./src/process.c:5214: message1 ("Blocking call to accept-process-output with quit inhibited!!");
Somewhat unfortunately the message is coming out of the C code which is necessarily removed from much of the user-facing portion of Emacs. I had hoped it was coming out of some lisp code I might break into more readily but that was always a faint hope with process handling. In order to find out what was triggering this from inside Wanderlust I'd need to get a lisp callstack from a C breakpoint.
Because Emacs is an excellent project this turns out to be
relatively straightforward. Inside the src
directory is
a file .gdbinit
which defines a function for exactly
this purpose called xbacktrace
. The work then is
getting Emacs built locally. To accomplish this I just
invoked sudo apt build-dep emacs; configure --without-libgmp;
make
. With that done I could start debugging. First I need a
breakpoint on the C code mentioned above:
(gdb) break process.c:5214
Breakpoint 3 at 0x1df5d6: file process.c, line 5214.
Then, run Emacs to trigger the behavior. For me this meant launching Wanderlust and reading a new article:
(gdb) run
Starting program: /home/nolan/sources/emacs/src/emacs
Thread 1 "emacs" hit Breakpoint 3, wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=0, do_display=false, wait_for_cell=0x0, wait_proc=0x5555576deb30, just_wait_proc=0) at process.c:5214
5214 message1 ("Blocking call to accept-process-output with quit inhibited!!");
Finally, look at the traceback for the paused process:
(gdb) xbacktrace
"accept-process-output" (0xffffa028)
"elmo-nntp-read-response" (0xffffa258)
"elmo-nntp-message-fetch" (0xffffa470)
0x5768ed38 PVEC_COMPILED
"apply" (0xffffa720)
"luna-call-next-method" (0xffffa8e0)
"luna-apply-generic" (0xffffaab0)
"elmo-message-fetch-plugged" (0xffffac88)
0x575a66b0 PVEC_COMPILED
"apply" (0xffffaf40)
"luna-call-next-method" (0xffffb100)
"luna-apply-generic" (0xffffb2c8)
"elmo-message-fetch-internal" (0xffffb4c8)
0x574108a0 PVEC_COMPILED
"apply" (0xffffb800)
"luna-call-next-method" (0xffffb9b8)
"elmo-message-fetch" (0xffffbb38)
"apply" (0xffffbc50)
"luna-call-next-method" (0xffffbe10)
"luna-apply-generic" (0xffffbfd8)
"elmo-message-fetch" (0xffffc1b8)
"elmo-message-mime-entity-internal" (0xffffc408)
"elmo-message-mime-entity" (0xffffc6c8)
"wl-message-buffer-display" (0xffffc9c8)
"wl-message-buffer-prefetch-subr" (0xffffcc98)
"apply" (0xffffcdb8)
"timer-event-handler" (0xffffcfe8)
I'm a little amazed at how well that worked. There isn't much to the
traceback but I didn't need much; I only wanted to know the calling
code so I could start poking at it. I invoked "continue" inside gdb
to resume Emacs before pulling up the definition
for elmo-nntp-read-response
. Having little idea what I
was doing exactly, I browsed the Emacs bug archive and found
a similar
looking issue in a different mode that suggested the use
of with-local-quit
around the call
to accept-process-output
.
Here is where it feels like Emacs really shines, having jumped to
the definition of the function I edited it in-place and then
evaluated the definition. Re-checking Wanderlust triggered my gdb
breakpoint again but this time the traceback showed a new function
producing the message: elmo-nntp-read-body
. I went
through the same process again, continuing gdb, jumping to the
definition of the new function and patching it in-place.
Finally, re-testing shows I've fixed my problem. No more warnings! The entirety of the change is as follows:
--- elmo-nntp.el 2021-02-26 10:11:46.906783322 -0500
+++ elmo-nntp.el 2021-03-05 13:20:54.728571637 -0500
@@ -355,7 +355,7 @@
(while (null (progn (goto-char match-end)
(search-forward "\r\n" nil t)))
(setq match-end (max (1- (point-max)) elmo-nntp-read-point))
- (accept-process-output process))
+ (with-local-quit (accept-process-output process)))
(setq match-end (point)
last elmo-nntp-read-point
elmo-nntp-read-point match-end)
@@ -401,8 +401,8 @@
(while (null (progn (goto-char (1- point))
(search-forward "\n.\r\n" nil t)))
(setq point (max (- (point-max) 2) elmo-nntp-read-point))
- (accept-process-output
- (elmo-network-session-process-internal session)))
+ (with-local-quit (accept-process-output
+ (elmo-network-session-process-internal session))))
(setq point (point))
(set-buffer outbuf)
(erase-buffer)
The problem it turns out is a result of timer handling within Emacs. The documentation is good but the gist is this:
If a timer function needs to allow quitting, it should usewith-local-quit
(see Quitting). For example, if a timer function callsaccept-process-output
to receive output from an external process, that call should be wrapped insidewith-local-quit
, to ensure that C-g works if the external process hangs.
I was curious how this problem arose and decided to do some
spelunking through history. With process.c
already open
to the line that produces the message it is a simple matter to check
the history with Emacs' built-in version control
(vc-annotate
). What I found was that the interrupt
handling message was added here almost 11 years ago:
commit 2b0a91e78f83fb446cc38efb99399e83ad2cded3
Author: Stefan Monnier <monnier@iro.umontreal.ca>
Date: Mon Apr 12 22:07:48 2010 -0400
Try to solve the problem of spurious EOF chars in long lines of text
sent to interactive subprocesses.
* sysdep.c (child_setup_tty): Do not enable ICANON any more.
(system_process_attributes): Remove unused var `ttotal'.
* process.c (send_process): Don't bother breaking long line with EOF
chars when talking to ttys any more.
(wait_reading_process_output): Output a warning when called in such
a way that it could block without being interruptible.
Digging into Wanderlust for similar history on the invocations
to accept-process-output
that I ended up patching
showed they
were much
older. It looks as though most of the calls were present at the
time of the initial commit of the project (or at least that history
which is tracked in git) — making them over 21 years old. So the
root cause seems to be changing best practices in the course of
decades long history.
I'm having a hard time putting to words how delightful I found this process. As someone who spends too much time at a computer debugging software I've never had such an easy time investigating a problem using quality tools. I mentioned I have the day off, it turns out I now have the rest of the day off — this went so well I'm done before lunch.