Skip to content

Background processes reaped at turn boundary; no re-ping when long-running work completes #1030

@chuenthe

Description

@chuenthe

Upstream issue: background processes are reaped at the turn boundary, and the agent is never re-pinged when long-running work completes

Target repo: anthropics/claude-agent-sdk-python
Filed by: chuenthe
Status: drafted for filing


Summary

When an agent starts a long-running shell command via the Bash tool — including with
run_in_background: true — the child process is killed at the turn boundary (or at the
per-turn stall timeout, ~1800s), and there is no mechanism for the SDK to re-invoke
("re-ping") the agent when that work finishes
. The two halves compound:

  1. Reaping: background commands are children of the per-turn subprocess's process
    group, so they receive the group's termination signal when the turn ends or the
    stall timeout fires. They do not survive to completion. Observed exit code: 144
    (128 + signal 16) on WSL2.
  2. No re-ping: even when a background command does outlive a turn, there is no
    callback/notification path that wakes the agent on completion. The agent has no way
    to self-report when the job is done; it must be externally prompted to poll.

The net effect: any unit of work that legitimately outlives a single turn (multi-minute
builds, long test suites, deploys, pipeline resume operations) cannot be driven by the
Bash tool. It either dies silently mid-run or completes invisibly.

Why this is confusing in practice

The reaping masquerades as an application bug. A deterministic workload appears to
"crash" at the same step every time — but the step is simply whatever happens to be
running at ~1800s of wall-clock. The failure is environment-/lifecycle-driven, not
logic-driven. Signature of an SDK-lifecycle death (vs. a real failure): the run ends
with a generic non-zero exit and session_id empty / tokens_used == 0 — i.e. the
process vanished rather than reasoning its way to a failure.

Expected behavior

One or both of:

  • A durable background primitive: run_in_background: true should detach the child
    from the turn's process group (e.g. setsid() / new session) so it survives turn
    teardown, with the SDK retaining a handle.
  • A completion re-ping: when a tracked background command exits, the SDK should
    re-invoke the agent with the result (exit code, captured output), the same way
    harness-tracked work is documented to re-invoke the agent automatically. Today that
    re-ping does not fire for Bash-tool background work that crosses a turn boundary.

Actual behavior

  • run_in_background: true does not detach; the child shares the turn's
    process-group lifecycle and is killed on turn end / stall timeout.
  • No completion notification is delivered across a turn boundary; the agent cannot
    self-report and must be told to poll, which is impossible if the turn has already
    closed.

Reproduction

  1. From an Agent SDK session, start a command that runs longer than the turn /
    stall-timeout window, e.g. sleep 2000 && echo done, with run_in_background: true.
  2. Let the turn end (or wait for the stall timeout).
  3. Observe: the process is terminated (exit 144); echo done never runs; no
    notification re-invokes the agent on the (non-)completion.

Environment

  • Claude Agent SDK (Python), running as an always-on orchestrator that spawns one SDK
    subprocess per turn.
  • WSL2 (Ubuntu) on Windows; also reproduced for non-background long commands.
  • Per-turn stall timeout configured at 1800s; the kill aligns with that boundary.

Workaround we use today

We launch long jobs in a detached tmux session created by a separate process
(not as a child of the SDK subprocess). The tmux session survives turn teardown and even
a full orchestrator restart. On exit, a wrapper writes an event file that a file-watcher
picks up and turns into a fresh agent turn — effectively re-implementing the missing
re-ping ourselves. This works but requires out-of-band event infrastructure that a raw
Agent SDK consumer would not have.

Suggested fix

  • Detach run_in_background children into their own session/process group so they are
    not collected by the per-turn cleanup.
  • Add a first-class completion callback that re-invokes the agent with the finished
    command's exit status and output when a background job ends after the turn has closed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions