Debugging a Weird Python Slowdown

I noticed some latency in a unit test on a single machine. It turned into a bit of a rabbit hole that took some time to get to the bottom of. I think I at least understand some libraries and tools better as a result though!

I noticed a weird thing with a unit test recently. It was obvious because the usual feel of the tests is this:

My real test actually does some amount of work but I kept paring it back further and further to try to narrow down where the problem lay. The most minimal case I've come up with looks like this:

The application code was similarly pared back until it was a nearly empty websocket endpoint that did no useful work:

The tests are executed on the same version of Python (3.12.2), on the same operating system (Fedora 39), on the same architecture (x86_64). The problem machine has 8 cores and is using 6.7.10-200.fc39.x86_64 and the non-problem machine has 16 cores and is using kernel 6.6.8-200.fc39.x86_64. In an effort to narrow down what was happening during all those slow runs I ran the program under strace to capture system calls being made. strace is notorious for slowing down programs, even the man page says so:

There isn't always such a discrepancy under strace though and I was able to capture several runs that are representatively slow. Interestingly even in the fast case the biggest cause of slowness remains the same. Here is a full strace output for a slow run:

During the test run the CPU usage for the program hits over 100% in top; Starting to suspect some scheduling problem or partial deadlock of some sort I tried pinning the process to use a single CPU with taskset:

Thinking back to that time I was able to debug emacs by setting a breakpoint in some C code to gather a traceback into elisp code I went rummaging around the standard library to see if I might find something similar. What I have turned up so far is the faulthandler module, specifically the dump_traceback_later method. It allows for a traceback to be dumped for all threads after a configurable timeout, my test now looks like this:

What appears to be happening is user-error on my part. The FastAPI testclient is really a thin wrapper over the starlette library's testclient. That uses a context manager to launch a thread, running an event loop, which in this test is calling select to service events. I think the reason why I'm seeing it only on one computer has to do with the difference in processor speed between the different machines. The fact that the test client spawns a thread to emulate a server isn't too surprising (I was immediately reminded of a very similar construct I wrote about for an HTTP server). What has caught me off guard has been the interaction between the two asyncio event loops in this case. I'm at least pretty sure the event loops are to blame, I've been wending my way through all the layers of indirection with fastapi (which is really starlette) calling anyio calling asyncio and am a little dizzy.

In an effort to better confirm some of my ideas I figured I should keep digging for some better tools. While the output of the faulthandler isn't useless it lacks plenty of detail to wholly explain what is going on when the problem occurs. I dug around some more to find information on integrating gdb with my python program. A bit of a missing piece though was how I was supposed to capture the moment when the problem was occuring without an error happening. Rather than spend too long trying to think of something smart I added a third test client to the test so that test runs take 30+ seconds to complete. While the test is running and I was pretty confident nothing terribly useful was happening I sent a SIGQUIT using ctrl-\ which causes a core dump.

With a core dump available it isn't too hard to then just pass it along to gdb like: coredumpctl debug

What to say about gdb? It is hugely featureful and oftentimes cryptic to a layman such as myself. I mostly limited myself to:

┌─/usr/src/debug/python3.12-3.12.2-2.fc39.x86_64/Python/condvar.h────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│       59 #define PyMUTEX_UNLOCK(mut)     pthread_mutex_unlock(mut)                                                                                                                                                                         │
│       60                                                                                                                                                                                                                                   │
│       61 #define PyCOND_INIT(cond)       _PyThread_cond_init(cond)                                                                                                                                                                         │
│       62 #define PyCOND_FINI(cond)       pthread_cond_destroy(cond)                                                                                                                                                                        │
│       63 #define PyCOND_SIGNAL(cond)     pthread_cond_signal(cond)                                                                                                                                                                         │
│       64 #define PyCOND_BROADCAST(cond)  pthread_cond_broadcast(cond)                                                                                                                                                                      │
│       65 #define PyCOND_WAIT(cond, mut)  pthread_cond_wait((cond), (mut))                                                                                                                                                                  │
│       66                                                                                                                                                                                                                                   │
│       67 /* return 0 for success, 1 on timeout, -1 on error */                                                                                                                                                                             │
│       68 Py_LOCAL_INLINE(int)                                                                                                                                                                                                              │
│       69 PyCOND_TIMEDWAIT(PyCOND_T *cond, PyMUTEX_T *mut, long long us)                                                                                                                                                                    │
│       70 {                                                                                                                                                                                                                                 │
│       71     struct timespec abs_timeout;                                                                                                                                                                                                  │
│       72     _PyThread_cond_after(us, &abs_timeout);                                                                                                                                                                                       │
│  >    73     int ret = pthread_cond_timedwait(cond, mut, &abs_timeout);                                                                                                                                                                    │
│       74     if (ret == ETIMEDOUT) {                                                                                                                                                                                                       │
│       75         return 1;                                                                                                                                                                                                                 │
│       76     }                                                                                                                                                                                                                             │
│       77     if (ret) {                                                                                                                                                                                                                    │
│       78         return -1;                                                                                                                                                                                                                │
│       79     }                                                                                                                                                                                                                             │
│       80     return 0;                                                                                                                                                                                                                     │
│       81 }                                                                                                                                                                                                                                 │
│       82                                                                                                                                                                                                                                   │
│       83 #elif defined(NT_THREADS)                                                                                                                                                                                                         │
│       84 /*                                                                                                                                                                                                                                │
│       85  * Windows (XP, 2003 server and later, as well as (hopefully) CE) support                                                                                                                                                         │
│       86  *                                                                                                                                                                                                                                │
│       87  * Emulated condition variables ones that work with XP and later, plus                                                                                                                                                            │
│       88  * example native support on VISTA and onwards.                                                                                                                                                                                   │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7f74894b37 In: take_gil                                                                                                                                                                          L73   PC: 0x7f7488fe44c4 
#6 Waiting for the GIL
#14 Frame 0x7f74894c1da0, for file /usr/lib64/python3.12/asyncio/selector_events.py, line 152, in _write_to_self (self=<_UnixSelectorEventLoop(_timer_cancelled_count=0, _closed=False, _stopping=False, _ready=<collections.deque at remote 0
x7f7479053650>, _scheduled=[], _default_executor=None, _internal_fds=1, _thread_id=140138136995520, _clock_resolution=<float at remote 0x7f7479563460>, _exception_handler=None, _debug=False, slow_callback_duration=<float at remote 0x7f748
8ab5690>, _current_handle=None, _task_factory=None, _coroutine_origin_tracking_enabled=False, _coroutine_origin_tracking_saved_depth=None, _asyncgens=<WeakSet(data=set(), _remove=<function at remote 0x7f7478eb5850>, _pending_removals=[],
_iterating=set()) at remote 0x7f7478ea1460>, _asyncgens_shutdown_called=False, _executor_shutdown_called=False, _selector=<EpollSelector(_fd_to_key={7: <SelectorKey at remote 0x7f7478e99a90>}, _map=<_SelectorMapping(_selector=<...>) at re
mote 0x7f7478ea0dd0>, _selector=<select.epoll at remote 0x7f7479563280>) at ...(truncated)
    csock.send(b'\0')
#14 Frame 0x7f74894c1d08, for file /usr/lib64/python3.12/asyncio/base_events.py, line 844, in call_soon_threadsafe (self=<_UnixSelectorEventLoop(_timer_cancelled_count=0, _closed=False, _stopping=False, _ready=<collections.deque at remote
 0x7f7479053650>, _scheduled=[], _default_executor=None, _internal_fds=1, _thread_id=140138136995520, _clock_resolution=<float at remote 0x7f7479563460>, _exception_handler=None, _debug=False, slow_callback_duration=<float at remote 0x7f7
488ab5690>, _current_handle=None, _task_factory=None, _coroutine_origin_tracking_enabled=False, _coroutine_origin_tracking_saved_depth=None, _asyncgens=<WeakSet(data=set(), _remove=<function at remote 0x7f7478eb5850>, _pending_removals=[]
, _iterating=set()) at remote 0x7f7478ea1460>, _asyncgens_shutdown_called=False, _executor_shutdown_called=False, _selector=<EpollSelector(_fd_to_key={7: <SelectorKey at remote 0x7f7478e99a90>}, _map=<_SelectorMapping(_selector=<...>) at
remote 0x7f7478ea0dd0>, _selector=<select.epoll at remote 0x7f7479563280>) a...(truncated)
    self._write_to_self()
#14 Frame 0x7f74894c1c60, for file /home/nprescott/project/new-venv/lib64/python3.12/site-packages/anyio/_backends/_asyncio.py, line 2205, in run_sync_from_thread (cls=<ABCMeta(__module__='anyio._backends._asyncio', run=<classmethod at re
mote 0x7f7478ea0a50>, current_token=<classmethod at remote 0x7f7478ea0910>, current_time=<classmethod at remote 0x7f7478ea08c0>, cancelled_exception_class=<classmethod at remote 0x7f7478ea06e0>, checkpoint=<classmethod at remote 0x7f7478e
a1a40>, checkpoint_if_cancelled=<classmethod at remote 0x7f7478ea1a90>, cancel_shielded_checkpoint=<classmethod at remote 0x7f7478ea1ae0>, sleep=<classmethod at remote 0x7f7478ea1b30>, create_cancel_scope=<classmethod at remote 0x7f7478ea
--Type <RET> for more, q to quit, c to continue without paging--

I don't know that I've entirely satisfied my curiousity with this though. I can see that the different threads are invoking things like call_soon_threadsafe but I don't understand if they are not truly dead-locked why it takes so long to progress. I have a hunch the different multitasking patterns in use are fighting with each other behind the scenes. While asyncio is supposed to achieve cooperative multitasking, threads are independently managed by the scheduler. Digging around through all the code in play I came up with a bit of stream of consciousness:

Suddenly things feel like they're clicking into place for me. A call like asyncio.sleep(0) is a very typical way to defer execution back to the event loop so as to service any other pending coroutines — you don't have to actually sleep you just have to signal that you have finished some logical piece of work (cooperating with other tasks explicitly). I suspect though that the limited number of other tasks on this threaded event loop in the thread is leading to a funky too-fast cycle through the event loop, maybe causing it to spin not-quite-endlessly on the check for self._receive_queue.empty() which is taking a lock. I'm not exactly an expert in asyncio but I do happen to know that asyncio.sleep(0) is special-coded distinctly from an actual sleep call. Based on some of what I saw in gdb I think one of these operations is inappropriately holding onto the GIL for too long as a result of over-eagerly looping for events.

I can't think of an elegant way of quickly verifying this idea so we'll have to settle for a hack. If it is the case that yielding to the event loop is happening too fast, we can simply introduce a little latency to stall it ourselves. The first thing that comes to mind is to override the _asgi_receive method on the WebSocketTestSession class.

In principle this means the test should be slower because we've introduced an artificial sleep. What about in practice?

All that really remains are some questions that I don't think I can fully answer mostly having to do with the intent of some of these design choices. Obviously I don't think the websocket test client is really intended to be repeatedly invoked like this despite the fact that it usually seems to work. I'm not sure if that isn't designed into the test client or I just haven't found the right approach. I opened a discussion on the starlette repository, figuring if nothing else I might save someone some time if they go searching for the problem in the future.