Crash report
What happened?
AI Disclaimer: this issue was drafted by Claude Code, which also created and ran the reproducers. Backtraces were generated by the reporter, who also edited and approved of the draft.
Summary
Modules/faulthandler.c mutates its process-global state in _PyRuntime.faulthandler without synchronization. On free-threaded builds this produces a reproducible abort from pure-Python, thread-only scripts:
- Concurrent dump_traceback_later() / cancel_dump_traceback_later() corrupt the watchdog cancel_event/running lock handshake.
Bug 1: non-atomic enabled flags in enable()/disable() (tracked in #151363)
Bug 2: watchdog lock-handshake race in dump_traceback_later()
The dump_traceback_later / cancel_dump_traceback_later / faulthandler_thread handshake uses two PyThread_type_locks and assumes a single orchestrating thread holds cancel_event:
// arming (dump_traceback_later_impl)
if (thread.running == NULL)
    thread.running = PyThread_allocate_lock();        // :843
if (thread.cancel_event == NULL) {
    thread.cancel_event = PyThread_allocate_lock();   // :850
    PyThread_acquire_lock(thread.cancel_event, 1);    // :858 (main holds it)
}
...
cancel_dump_traceback_later();   // release cancel_event :739, (re)acquire :746

// cancel_dump_traceback_later()
PyThread_release_lock(thread.cancel_event);           // :739
PyThread_acquire_lock(thread.running, 1);             // wait for watchdog
PyThread_release_lock(thread.running);
PyThread_acquire_lock(thread.cancel_event, 1);        // :746
With the GIL disabled, two threads racing arm/cancel break this:
- both see cancel_event == NULL → both PyThread_allocate_lock() (one lock leaks), and the survivor's acquire(cancel_event, 1) blocks on an already-held lock; and
- release/acquire of cancel_event/running happen from the wrong thread, so a lock is released that the releasing thread does not hold.
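The second failure mode can be replayed deterministically with a toy model. This is only a sketch: HandshakeModel and the replay_* helpers are hypothetical names, threading.Lock stands in for PyThread_type_lock, and only the cancel_event half of the handshake is modelled (no watchdog thread, no running lock). The interleaving is replayed from a single thread rather than raced, so it shows why the ordering is fatal, not how often it occurs.

```python
import threading


class HandshakeModel:
    """Toy stand-in for the cancel_event half of the handshake (not CPython code)."""

    def __init__(self):
        self.cancel_event = threading.Lock()
        # The arming thread takes cancel_event once, as at :858.
        self.cancel_event.acquire()

    def cancel_release(self):
        # First step of cancel_dump_traceback_later(), as at :739.
        self.cancel_event.release()

    def cancel_reacquire(self):
        # Last step of cancel_dump_traceback_later(), as at :746.
        self.cancel_event.acquire()


def replay_racy_interleaving():
    """Caller B releases after caller A released but before A re-acquired."""
    model = HandshakeModel()
    model.cancel_release()          # caller A: release
    try:
        model.cancel_release()      # caller B: release of a lock nobody holds
    except RuntimeError as exc:
        return str(exc)
    return None


def replay_serialized():
    """Two cancels run back to back, each completing before the next starts."""
    model = HandshakeModel()
    for _ in range(2):
        model.cancel_release()
        model.cancel_reacquire()
    return model.cancel_event.locked()


if __name__ == "__main__":
    print("racy interleaving:", replay_racy_interleaving())
    print("serialized, cancel_event still held:", replay_serialized())
```

In the model the double release surfaces as a RuntimeError; in the C code the corresponding condition is the fatal PyMutex_Unlock error shown under "Observed" below.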
Reproducer:
import faulthandler, os, threading, time

f = open(os.devnull, "w")
stop = False

def arm():
    while not stop:
        faulthandler.dump_traceback_later(1000.0, file=f)  # long timeout: never fires

def cancel():
    while not stop:
        faulthandler.cancel_dump_traceback_later()

ts = [threading.Thread(target=arm) for _ in range(4)]
ts += [threading.Thread(target=cancel) for _ in range(4)]
for t in ts: t.start()
time.sleep(10)
stop = True
for t in ts: t.join()
print("done")
Observed (free-threaded):
Fatal Python error: PyMutex_Unlock: unlocking mutex that is not locked
Python runtime state: initialized
Stack (most recent call first):
File "/home/danzin/projects/jit_cpython/repro_ft_finding1_watchdog.py", line 57 in arm
File "/home/danzin/projects/ft_cpython/Lib/threading.py", line 1160 in run
File "/home/danzin/projects/ft_cpython/Lib/threading.py", line 1218 in _bootstrap_inner
File "/home/danzin/projects/ft_cpython/Lib/threading.py", line 1180 in _bootstrap
Thread 6 "Thread-3 (arm)" received signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (threadid=<optimized out>, signo=6) at ./nptl/pthread_kill.c:89
#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:100
#3 0x00007ffff7c45b7e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff7c288ec in __GI_abort () at ./stdlib/abort.c:77
#5 0x00005555560851b4 in fatal_error_exit (status=status@entry=-1) at Python/pylifecycle.c:3516
#6 0x0000555556084e7d in fatal_error (fd=fd@entry=2, header=header@entry=1, prefix=prefix@entry=0x5555565155a0 <__func__.PyMutex_Unlock> "PyMutex_Unlock",
msg=msg@entry=0x555556514ea0 <str> "unlocking mutex that is not locked", status=status@entry=-1) at Python/pylifecycle.c:3741
#7 0x0000555556080780 in _Py_FatalErrorFunc (func=0x5555565155a0 <__func__.PyMutex_Unlock> "PyMutex_Unlock", msg=0x555556514ea0 <str> "unlocking mutex that is not locked")
at Python/pylifecycle.c:3764
#8 0x000055555605e237 in PyMutex_Unlock (m=<optimized out>) at Python/lock.c:664
#9 0x000055555618ab9a in cancel_dump_traceback_later () at ./Modules/faulthandler.c:739
#10 0x000055555618da1c in faulthandler_dump_traceback_later_impl (module=0x7bffb633a790, timeout_obj=0x7bffb611aba0, repeat=0, file=<optimized out>, exit=0, max_threads=100)
at ./Modules/faulthandler.c:870
#11 faulthandler_dump_traceback_later (module=0x7bffb633a790, args=0x7bffaeeddc90, args@entry=0x7bffaeedde68, nargs=nargs@entry=1, kwnames=kwnames@entry=0x7bffb6328710)
at ./Modules/clinic/faulthandler.c.h:439
#12 0x0000555555c2099b in cfunction_vectorcall_FASTCALL_KEYWORDS (func=func@entry=0x7bffb657a9d0, args=args@entry=0x7bffaeedde68, nargsf=nargsf@entry=9223372036854775809,
kwnames=kwnames@entry=0x7bffb6328710) at Objects/methodobject.c:465
#13 0x0000555555ad1e10 in _PyObject_VectorcallTstate (tstate=0x7bffb423a010, callable=0x7bffb657a9d0, args=0x7bffaeedde68, nargsf=9223372036854775809, kwnames=0x7bffb6328710)
at ./Include/internal/pycore_call.h:144
#14 0x0000555555ebc8db in _Py_VectorCallInstrumentation_StackRefSteal (callable=callable@entry=..., arguments=0x7e8ff700d408, total_args=2, kwnames=kwnames@entry=...,
call_instrumentation=false, frame=frame@entry=0x7e8ff700d3a8, this_instr=0x7bffc00d035a, tstate=0x7bffb423a010) at Python/ceval.c:766
Same binary with -X gil=1: clean — 53k arm + 16M cancel iterations, no error.
Unlike the known _Py_DumpTracebackThreads frame-reading races (#116008, #131580, #140815), Bug 2 is reproduced with a long timeout so the watchdog never fires — the abort is purely in the cancel_event/running lock handshake (unlocking an unheld PyMutex), not in frame reading. It's a self-contained lock-discipline bug, fixable independently of the frame-traversal limitations those issues describe.
Suggested direction
The enable/register/watchdog write paths predate free threading; the FT hardening that landed (gh-128400) covered only the traceback-read path. The sibling signalmodule.c was hardened for the same reason in gh-109693 (67e8d41, "Use pyatomic.h for signal module") and uses _Py_atomic_* throughout; faulthandler.c currently contains no atomics. Py_MOD_GIL_NOT_USED was added to faulthandler in the blanket gh-116322 rollout (c2627d6) without a module-specific shared-state audit.
Suggestion:
- Add a single module-level PyMutex around the state-mutating entry points (enable, disable, register, unregister, dump_traceback_later, cancel_dump_traceback_later), none of which are hot paths or run in signal-handler context, and make the enabled flags atomic for the signal-handler read.
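As a Python-level analogue of that serialization, the reproducer's arm and cancel loops can be routed through one shared lock. This is a sketch under stated assumptions, not the proposed C change: _state_lock, arm_serialized, cancel_serialized and run_stress are hypothetical names, and whether caller-side locking is enough to avoid the abort on a free-threaded build is an assumption that was not verified here.

```python
import faulthandler
import os
import threading
import time

# Hypothetical user-level lock; stands in for the suggested module-level PyMutex.
_state_lock = threading.Lock()


def arm_serialized(timeout, file):
    """Arm the watchdog while holding the shared lock."""
    with _state_lock:
        faulthandler.dump_traceback_later(timeout, file=file)


def cancel_serialized():
    """Cancel the watchdog while holding the shared lock."""
    with _state_lock:
        faulthandler.cancel_dump_traceback_later()


def run_stress(seconds=0.5, workers=4):
    """Run the reproducer's arm/cancel loops through the serialized wrappers."""
    stop = threading.Event()
    counts = {"arm": 0, "cancel": 0}
    counts_lock = threading.Lock()

    with open(os.devnull, "w") as sink:
        def arm():
            n = 0
            while not stop.is_set():
                arm_serialized(1000.0, sink)  # long timeout: never fires
                n += 1
            with counts_lock:
                counts["arm"] += n

        def cancel():
            n = 0
            while not stop.is_set():
                cancel_serialized()
                n += 1
            with counts_lock:
                counts["cancel"] += n

        threads = [threading.Thread(target=arm) for _ in range(workers)]
        threads += [threading.Thread(target=cancel) for _ in range(workers)]
        for t in threads:
            t.start()
        time.sleep(seconds)
        stop.set()
        for t in threads:
            t.join()
        # Leave no watchdog armed before the sink file is closed.
        cancel_serialized()
    return counts


if __name__ == "__main__":
    print(run_stress())
```

A lock held by callers only protects code that goes through these wrappers; any direct call to the faulthandler functions bypasses it, which is why the suggestion places the mutex inside the module.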
cc @vstinner
Found using cpython-review-toolkit with Claude Opus 4.8, using the /cpython-review-toolkit:explore Modules/faulthandler.c command.
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.16.0a0 free-threading build (heads/main:a7885b46f15, Jun 14 2026, 09:19:51) [Clang 21.1.8 (6ubuntu1)]