[nolan@nprescott.com] $  cat weblog archive feed

Debugging Emacs

2021-03-05

I've got the day off today, what better way to spend it than debugging Emacs?

I recently tried out the mail client Wanderlust for reading NNTP (usenet) and while it works out of the box there was a one issue nagging at me. When retrieving a new (previously unread) article from the server the minibuffer would display a warning message:

Blocking call to accept-process-output with quit inhibited!!

There was no negative effect that I could see, messages were retrieved and nothing was hung up. It was annoying me to see the message every single time though. I wanted to know what it meant and I wanted to fix it.

The first order of business is to find where it comes from; I happen to have a source checkout of Emacs already on this machine (don't you?) and wagered the message was specific enough I'd find it by grepping the source tree. Sure enough there was exactly one result:

./src/process.c:5214:    message1 ("Blocking call to accept-process-output with quit inhibited!!");

Somewhat unfortunately the message is coming out of the C code which is necessarily removed from much of the user-facing portion of Emacs. I had hoped it was coming out of some lisp code I might break into more readily but that was always a faint hope with process handling. In order to find out what was triggering this from inside Wanderlust I'd need to get a lisp callstack from a C breakpoint.

Because Emacs is an excellent project this turns out to be relatively straightforward. Inside the src directory is a file .gdbinit which defines a function for exactly this purpose called xbacktrace. The work then is getting Emacs built locally. To accomplish this I just invoked sudo apt build-dep emacs; configure --without-libgmp; make. With that done I could start debugging. First I need a breakpoint on the C code mentioned above:

(gdb) break process.c:5214
Breakpoint 3 at 0x1df5d6: file process.c, line 5214.

Then, run Emacs to trigger the behavior. For me this meant launching Wanderlust and reading a new article:

(gdb) run
Starting program: /home/nolan/sources/emacs/src/emacs 

Thread 1 "emacs" hit Breakpoint 3, wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=0, do_display=false, wait_for_cell=0x0, wait_proc=0x5555576deb30, just_wait_proc=0) at process.c:5214
5214        message1 ("Blocking call to accept-process-output with quit inhibited!!");

Finally, look at the traceback for the paused process:

(gdb) xbacktrace
"accept-process-output" (0xffffa028)
"elmo-nntp-read-response" (0xffffa258)
"elmo-nntp-message-fetch" (0xffffa470)
0x5768ed38 PVEC_COMPILED
"apply" (0xffffa720)
"luna-call-next-method" (0xffffa8e0)
"luna-apply-generic" (0xffffaab0)
"elmo-message-fetch-plugged" (0xffffac88)
0x575a66b0 PVEC_COMPILED
"apply" (0xffffaf40)
"luna-call-next-method" (0xffffb100)
"luna-apply-generic" (0xffffb2c8)
"elmo-message-fetch-internal" (0xffffb4c8)
0x574108a0 PVEC_COMPILED
"apply" (0xffffb800)
"luna-call-next-method" (0xffffb9b8)
"elmo-message-fetch" (0xffffbb38)
"apply" (0xffffbc50)
"luna-call-next-method" (0xffffbe10)
"luna-apply-generic" (0xffffbfd8)
"elmo-message-fetch" (0xffffc1b8)
"elmo-message-mime-entity-internal" (0xffffc408)
"elmo-message-mime-entity" (0xffffc6c8)
"wl-message-buffer-display" (0xffffc9c8)
"wl-message-buffer-prefetch-subr" (0xffffcc98)
"apply" (0xffffcdb8)
"timer-event-handler" (0xffffcfe8)

I'm a little amazed at how well that worked. There isn't much to the traceback but I didn't need much; I only wanted to know the calling code so I could start poking at it. I invoked "continue" inside gdb to resume Emacs before pulling up the definition for elmo-nntp-read-response. Having little idea what I was doing exactly, I browsed the Emacs bug archive and found a similar looking issue in a different mode that suggested the use of with-local-quit around the call to accept-process-output.

Here is where it feels like Emacs really shines, having jumped to the definition of the function I edited it in-place and then evaluated the definition. Re-checking Wanderlust triggered my gdb breakpoint again but this time the traceback showed a new function producing the message: elmo-nntp-read-body. I went through the same process again, continuing gdb, jumping to the definition of the new function and patching it in-place.

Finally, re-testing shows I've fixed my problem. No more warnings! The entirety of the change is as follows:

--- elmo-nntp.el	2021-02-26 10:11:46.906783322 -0500
+++ elmo-nntp.el	2021-03-05 13:20:54.728571637 -0500
@@ -355,7 +355,7 @@
 	(while (null (progn (goto-char match-end)
 			    (search-forward "\r\n" nil t)))
 	  (setq match-end (max (1- (point-max)) elmo-nntp-read-point))
-	  (accept-process-output process))
+	  (with-local-quit (accept-process-output process)))
 	(setq match-end (point)
 	      last elmo-nntp-read-point
 	      elmo-nntp-read-point match-end)
@@ -401,8 +401,8 @@
       (while (null (progn (goto-char (1- point))
 			  (search-forward "\n.\r\n" nil t)))
 	(setq point (max (- (point-max) 2) elmo-nntp-read-point))
-	(accept-process-output
-	 (elmo-network-session-process-internal session)))
+	(with-local-quit (accept-process-output
+	 (elmo-network-session-process-internal session))))
       (setq point (point))
       (set-buffer outbuf)
       (erase-buffer)

What Did I Change?

The problem it turns out is a result of timer handling within Emacs. The documentation is good but the gist is this:

If a timer function needs to allow quitting, it should use with-local-quit (see Quitting). For example, if a timer function calls accept-process-output to receive output from an external process, that call should be wrapped inside with-local-quit, to ensure that C-g works if the external process hangs.

A Little History

I was curious how this problem arose and decided to do some spelunking through history. With process.c already open to the line that produces the message it is a simple matter to check the history with Emacs' built-in version control (vc-annotate). What I found was that the interrupt handling message was added here almost 11 years ago:

commit 2b0a91e78f83fb446cc38efb99399e83ad2cded3
Author: Stefan Monnier <monnier@iro.umontreal.ca>
Date:   Mon Apr 12 22:07:48 2010 -0400

    Try to solve the problem of spurious EOF chars in long lines of text
    sent to interactive subprocesses.
    * sysdep.c (child_setup_tty): Do not enable ICANON any more.
    (system_process_attributes): Remove unused var `ttotal'.
    * process.c (send_process): Don't bother breaking long line with EOF
    chars when talking to ttys any more.
    (wait_reading_process_output): Output a warning when called in such
    a way that it could block without being interruptible.

Digging into Wanderlust for similar history on the invocations to accept-process-output that I ended up patching showed they were much older. It looks as though most of the calls were present at the time of the initial commit of the project (or at least that history which is tracked in git) — making them over 21 years old. So the root cause seems to be changing best practices in the course of decades long history.

Thoughts

I'm having a hard time putting to words how delightful I found this process. As someone who spends too much time at a computer debugging software I've never had such an easy time investigating a problem using quality tools. I mentioned I have the day off, it turns out I now have the rest of the day off — this went so well I'm done before lunch.