fix(workspace-server): diagnose and self-heal startup failures#2675
fix(workspace-server): diagnose and self-heal startup failures#2675posthog[bot] wants to merge 1 commit into
Conversation
spawnChild() previously swallowed every cause of a failed boot and reported a blanket "failed to become healthy within 5000ms" with no child stderr, and polled the full 5s even after the child had already died. - Capture the child's exit/error events and a bounded stderr tail, and include them in the thrown error so env-validation exits (process.exit(2)), EADDRINUSE, and bundle/spawn failures are diagnosable. - Fail fast: race readiness against child death and abort polling the instant the child exits instead of waiting out the timeout against a dead port. - Bounded respawn (3 attempts) on early exit, picking a fresh port each time, which self-heals the findFreePort() TOCTOU port-steal race. - Widen HEALTH_POLL_TIMEOUT_MS to 15s for slow cold starts; fail-fast means a healthy server still returns the moment it is reachable. Generated-By: PostHog Code Task-Id: 717fc35c-098e-4bb2-9f43-0d650a4dd635
|
React Doctor found no issues in the changed files. 🎉 Reviewed by React Doctor for commit |
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
apps/code/src/main/services/workspace-server/service.ts:167-172
**All early exits treated as retriable regardless of exit code**
Every early child exit is thrown as `WorkspaceServerStartError(msg, true)` — unconditionally retriable — which means a deterministic failure like env-validation (`process.exit(2)`) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as `EADDRINUSE` and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on `exit.code === 2` (or checking `exit.code !== null && exit.code !== 1`) would let deterministic failures fail fast while still recovering the TOCTOU race.
### Issue 2 of 2
apps/code/src/main/services/workspace-server/service.ts:224-226
**Abort signal is not applied to the inter-poll sleep**
After the child exits and `abort.abort()` fires, a fetch that is in-flight will throw `AbortError` (caught), but the `await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS))` that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before `signal.aborted` is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain `setTimeout` with one that is cleared on abort (or checking `signal.aborted` before sleeping) would make the abort truly immediate.
Reviews (1): Last reviewed commit: "fix(workspace-server): diagnose and self..." | Re-trigger Greptile |
| if (outcome.kind === "exit") { | ||
| throw new WorkspaceServerStartError( | ||
| describeEarlyExit(outcome.exit, stderrTail), | ||
| true, | ||
| ); | ||
| } |
There was a problem hiding this comment.
All early exits treated as retriable regardless of exit code
Every early child exit is thrown as WorkspaceServerStartError(msg, true) — unconditionally retriable — which means a deterministic failure like env-validation (process.exit(2)) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as EADDRINUSE and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on exit.code === 2 (or checking exit.code !== null && exit.code !== 1) would let deterministic failures fail fast while still recovering the TOCTOU race.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/workspace-server/service.ts
Line: 167-172
Comment:
**All early exits treated as retriable regardless of exit code**
Every early child exit is thrown as `WorkspaceServerStartError(msg, true)` — unconditionally retriable — which means a deterministic failure like env-validation (`process.exit(2)`) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as `EADDRINUSE` and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on `exit.code === 2` (or checking `exit.code !== null && exit.code !== 1`) would let deterministic failures fail fast while still recovering the TOCTOU race.
How can I resolve this? If you propose a fix, please make it concise.| } catch {} | ||
| await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS)); | ||
| } |
There was a problem hiding this comment.
Abort signal is not applied to the inter-poll sleep
After the child exits and abort.abort() fires, a fetch that is in-flight will throw AbortError (caught), but the await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS)) that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before signal.aborted is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain setTimeout with one that is cleared on abort (or checking signal.aborted before sleeping) would make the abort truly immediate.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/workspace-server/service.ts
Line: 224-226
Comment:
**Abort signal is not applied to the inter-poll sleep**
After the child exits and `abort.abort()` fires, a fetch that is in-flight will throw `AbortError` (caught), but the `await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS))` that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before `signal.aborted` is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain `setTimeout` with one that is cleared on abort (or checking `signal.aborted` before sleeping) would make the abort truly immediate.
How can I resolve this? If you propose a fix, please make it concise.
Problem
Desktop users hit a hard startup failure where the workspace-server child never becomes healthy, silently disabling all git, fs, and process capabilities for the session. The only signal was an opaque
workspace-server failed to become healthy within 5000ms— the child's real failure (env-validationprocess.exit(2), a stolen port →EADDRINUSE, a bundle/spawn error) was swallowed, andpollHealth()waited out the full timeout even after the child had already died. A tight 5s budget plus a TOCTOU window infindFreePort()(it releases the port before the child re-binds) made this both undiagnosable and unrecoverable.This was a brand-new regression surfaced in error tracking (first seen 2026-06-15), introduced by the package-split refactor in #2442.
Changes
In
apps/code/.../workspace-server/service.ts:exit/errorevents and a bounded stderr tail, and include exit code / signal / stderr in the thrown error.HEALTH_POLL_TIMEOUT_MSto 15s; fail-fast means a healthy server still returns the moment it's reachable, so this only helps genuinely-slow starts.Adds unit tests for the happy path, the diagnostic-error path (exit code + stderr surfaced, retries exhausted), and respawn recovery on a fresh port.
How did you test this?
vitest runon the newservice.test.ts— 3 tests pass.biome checkon the changed files — clean.tsctypecheck — the changed file produces no errors (remaining repo typecheck errors are pre-existing unbuilt-dependency issues, identical on cleanmain).Automatic notifications
Created with PostHog Code from an inbox report.