Background processes reaped at turn boundary; no re-ping when long-running work completes

# Upstream issue: background processes are reaped at the turn boundary, and the agent is never re-pinged when long-running work completes

**Target repo:** `anthropics/claude-agent-sdk-python`
**Filed by:** chuenthe
**Status:** drafted for filing

---

## Summary

When an agent starts a long-running shell command via the Bash tool — including with
`run_in_background: true` — the child process is killed at the turn boundary (or at the
per-turn stall timeout, ~1800s), and there is **no mechanism for the SDK to re-invoke
("re-ping") the agent when that work finishes**. The two halves compound:

1. **Reaping:** background commands are children of the per-turn subprocess's process
   group, so they receive the group's termination signal when the turn ends or the
   stall timeout fires. They do not survive to completion. Observed exit code: **144**
   (128 + signal 16) on WSL2.
2. **No re-ping:** even when a background command *does* outlive a turn, there is no
   callback/notification path that wakes the agent on completion. The agent has no way
   to self-report when the job is done; it must be externally prompted to poll.

The net effect: any unit of work that legitimately outlives a single turn (multi-minute
builds, long test suites, deploys, pipeline resume operations) cannot be driven by the
Bash tool. It either dies silently mid-run or completes invisibly.

## Why this is confusing in practice

The reaping **masquerades as an application bug**. A deterministic workload appears to
"crash" at the same step every time — but the step is simply whatever happens to be
running at ~1800s of wall-clock. The failure is environment-/lifecycle-driven, not
logic-driven. Signature of an SDK-lifecycle death (vs. a real failure): the run ends
with a generic non-zero exit and **`session_id` empty / `tokens_used == 0`** — i.e. the
process vanished rather than reasoning its way to a failure.

## Expected behavior

One or both of:

- A **durable background primitive**: `run_in_background: true` should detach the child
  from the turn's process group (e.g. `setsid()` / new session) so it survives turn
  teardown, with the SDK retaining a handle.
- A **completion re-ping**: when a tracked background command exits, the SDK should
  re-invoke the agent with the result (exit code, captured output), the same way
  harness-tracked work is documented to re-invoke the agent automatically. Today that
  re-ping does not fire for Bash-tool background work that crosses a turn boundary.

## Actual behavior

- `run_in_background: true` does **not** detach; the child shares the turn's
  process-group lifecycle and is killed on turn end / stall timeout.
- No completion notification is delivered across a turn boundary; the agent cannot
  self-report and must be told to poll, which is impossible if the turn has already
  closed.

## Reproduction

1. From an Agent SDK session, start a command that runs longer than the turn /
   stall-timeout window, e.g. `sleep 2000 && echo done`, with `run_in_background: true`.
2. Let the turn end (or wait for the stall timeout).
3. Observe: the process is terminated (exit 144); `echo done` never runs; no
   notification re-invokes the agent on the (non-)completion.

## Environment

- Claude Agent SDK (Python), running as an always-on orchestrator that spawns one SDK
  subprocess per turn.
- WSL2 (Ubuntu) on Windows; also reproduced for non-background long commands.
- Per-turn stall timeout configured at 1800s; the kill aligns with that boundary.

## Workaround we use today

We launch long jobs in a **detached `tmux` session** created by a *separate* process
(not as a child of the SDK subprocess). The tmux session survives turn teardown and even
a full orchestrator restart. On exit, a wrapper writes an event file that a file-watcher
picks up and turns into a fresh agent turn — effectively re-implementing the missing
re-ping ourselves. This works but requires out-of-band event infrastructure that a raw
Agent SDK consumer would not have.

## Suggested fix

- Detach `run_in_background` children into their own session/process group so they are
  not collected by the per-turn cleanup.
- Add a first-class completion callback that re-invokes the agent with the finished
  command's exit status and output when a background job ends after the turn has closed.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Background processes reaped at turn boundary; no re-ping when long-running work completes #1030

Upstream issue: background processes are reaped at the turn boundary, and the agent is never re-pinged when long-running work completes

Summary

Why this is confusing in practice

Expected behavior

Actual behavior

Reproduction

Environment

Workaround we use today

Suggested fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Background processes reaped at turn boundary; no re-ping when long-running work completes #1030

Description

Upstream issue: background processes are reaped at the turn boundary, and the agent is never re-pinged when long-running work completes

Summary

Why this is confusing in practice

Expected behavior

Actual behavior

Reproduction

Environment

Workaround we use today

Suggested fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions