Upstream issue: background processes are reaped at the turn boundary, and the agent is never re-pinged when long-running work completes
Target repo: anthropics/claude-agent-sdk-python
Filed by: chuenthe
Status: drafted for filing
Summary
When an agent starts a long-running shell command via the Bash tool — including with
run_in_background: true — the child process is killed at the turn boundary (or at the
per-turn stall timeout, ~1800s), and there is no mechanism for the SDK to re-invoke
("re-ping") the agent when that work finishes. The two halves compound:
- Reaping: background commands are children of the per-turn subprocess's process
group, so they receive the group's termination signal when the turn ends or the
stall timeout fires. They do not survive to completion. Observed exit code: 144
(128 + signal 16) on WSL2.
- No re-ping: even when a background command does outlive a turn, there is no
callback/notification path that wakes the agent on completion. The agent has no way
to self-report when the job is done; it must be externally prompted to poll.
The net effect: any unit of work that legitimately outlives a single turn (multi-minute
builds, long test suites, deploys, pipeline resume operations) cannot be driven by the
Bash tool. It either dies silently mid-run or completes invisibly.
Why this is confusing in practice
The reaping masquerades as an application bug. A deterministic workload appears to
"crash" at the same step every time — but the step is simply whatever happens to be
running at ~1800s of wall-clock. The failure is environment-/lifecycle-driven, not
logic-driven. Signature of an SDK-lifecycle death (vs. a real failure): the run ends
with a generic non-zero exit and session_id empty / tokens_used == 0 — i.e. the
process vanished rather than reasoning its way to a failure.
Expected behavior
One or both of:
- A durable background primitive:
run_in_background: true should detach the child
from the turn's process group (e.g. setsid() / new session) so it survives turn
teardown, with the SDK retaining a handle.
- A completion re-ping: when a tracked background command exits, the SDK should
re-invoke the agent with the result (exit code, captured output), the same way
harness-tracked work is documented to re-invoke the agent automatically. Today that
re-ping does not fire for Bash-tool background work that crosses a turn boundary.
Actual behavior
run_in_background: true does not detach; the child shares the turn's
process-group lifecycle and is killed on turn end / stall timeout.
- No completion notification is delivered across a turn boundary; the agent cannot
self-report and must be told to poll, which is impossible if the turn has already
closed.
Reproduction
- From an Agent SDK session, start a command that runs longer than the turn /
stall-timeout window, e.g. sleep 2000 && echo done, with run_in_background: true.
- Let the turn end (or wait for the stall timeout).
- Observe: the process is terminated (exit 144);
echo done never runs; no
notification re-invokes the agent on the (non-)completion.
Environment
- Claude Agent SDK (Python), running as an always-on orchestrator that spawns one SDK
subprocess per turn.
- WSL2 (Ubuntu) on Windows; also reproduced for non-background long commands.
- Per-turn stall timeout configured at 1800s; the kill aligns with that boundary.
Workaround we use today
We launch long jobs in a detached tmux session created by a separate process
(not as a child of the SDK subprocess). The tmux session survives turn teardown and even
a full orchestrator restart. On exit, a wrapper writes an event file that a file-watcher
picks up and turns into a fresh agent turn — effectively re-implementing the missing
re-ping ourselves. This works but requires out-of-band event infrastructure that a raw
Agent SDK consumer would not have.
Suggested fix
- Detach
run_in_background children into their own session/process group so they are
not collected by the per-turn cleanup.
- Add a first-class completion callback that re-invokes the agent with the finished
command's exit status and output when a background job ends after the turn has closed.
Upstream issue: background processes are reaped at the turn boundary, and the agent is never re-pinged when long-running work completes
Target repo:
anthropics/claude-agent-sdk-pythonFiled by: chuenthe
Status: drafted for filing
Summary
When an agent starts a long-running shell command via the Bash tool — including with
run_in_background: true— the child process is killed at the turn boundary (or at theper-turn stall timeout, ~1800s), and there is no mechanism for the SDK to re-invoke
("re-ping") the agent when that work finishes. The two halves compound:
group, so they receive the group's termination signal when the turn ends or the
stall timeout fires. They do not survive to completion. Observed exit code: 144
(128 + signal 16) on WSL2.
callback/notification path that wakes the agent on completion. The agent has no way
to self-report when the job is done; it must be externally prompted to poll.
The net effect: any unit of work that legitimately outlives a single turn (multi-minute
builds, long test suites, deploys, pipeline resume operations) cannot be driven by the
Bash tool. It either dies silently mid-run or completes invisibly.
Why this is confusing in practice
The reaping masquerades as an application bug. A deterministic workload appears to
"crash" at the same step every time — but the step is simply whatever happens to be
running at ~1800s of wall-clock. The failure is environment-/lifecycle-driven, not
logic-driven. Signature of an SDK-lifecycle death (vs. a real failure): the run ends
with a generic non-zero exit and
session_idempty /tokens_used == 0— i.e. theprocess vanished rather than reasoning its way to a failure.
Expected behavior
One or both of:
run_in_background: trueshould detach the childfrom the turn's process group (e.g.
setsid()/ new session) so it survives turnteardown, with the SDK retaining a handle.
re-invoke the agent with the result (exit code, captured output), the same way
harness-tracked work is documented to re-invoke the agent automatically. Today that
re-ping does not fire for Bash-tool background work that crosses a turn boundary.
Actual behavior
run_in_background: truedoes not detach; the child shares the turn'sprocess-group lifecycle and is killed on turn end / stall timeout.
self-report and must be told to poll, which is impossible if the turn has already
closed.
Reproduction
stall-timeout window, e.g.
sleep 2000 && echo done, withrun_in_background: true.echo donenever runs; nonotification re-invokes the agent on the (non-)completion.
Environment
subprocess per turn.
Workaround we use today
We launch long jobs in a detached
tmuxsession created by a separate process(not as a child of the SDK subprocess). The tmux session survives turn teardown and even
a full orchestrator restart. On exit, a wrapper writes an event file that a file-watcher
picks up and turns into a fresh agent turn — effectively re-implementing the missing
re-ping ourselves. This works but requires out-of-band event infrastructure that a raw
Agent SDK consumer would not have.
Suggested fix
run_in_backgroundchildren into their own session/process group so they arenot collected by the per-turn cleanup.
command's exit status and output when a background job ends after the turn has closed.