Skip to content

fix(workspace-server): diagnose and self-heal startup failures#2675

Draft
posthog[bot] wants to merge 1 commit into
mainfrom
posthog-code/workspace-server-startup-diagnostics
Draft

fix(workspace-server): diagnose and self-heal startup failures#2675
posthog[bot] wants to merge 1 commit into
mainfrom
posthog-code/workspace-server-startup-diagnostics

Conversation

@posthog

@posthog posthog Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Problem

Desktop users hit a hard startup failure where the workspace-server child never becomes healthy, silently disabling all git, fs, and process capabilities for the session. The only signal was an opaque workspace-server failed to become healthy within 5000ms — the child's real failure (env-validation process.exit(2), a stolen port → EADDRINUSE, a bundle/spawn error) was swallowed, and pollHealth() waited out the full timeout even after the child had already died. A tight 5s budget plus a TOCTOU window in findFreePort() (it releases the port before the child re-binds) made this both undiagnosable and unrecoverable.

This was a brand-new regression surfaced in error tracking (first seen 2026-06-15), introduced by the package-split refactor in #2442.

Changes

In apps/code/.../workspace-server/service.ts:

  • Surface the real cause: capture the child's exit/error events and a bounded stderr tail, and include exit code / signal / stderr in the thrown error.
  • Fail fast: race readiness against child death and abort health polling the instant the child exits, instead of waiting out the timeout against a dead port.
  • Self-heal the port race: bounded respawn (3 attempts) on early exit, choosing a fresh free port each attempt — which naturally recovers a stolen port.
  • Tolerate cold starts: widen HEALTH_POLL_TIMEOUT_MS to 15s; fail-fast means a healthy server still returns the moment it's reachable, so this only helps genuinely-slow starts.

Adds unit tests for the happy path, the diagnostic-error path (exit code + stderr surfaced, retries exhausted), and respawn recovery on a fresh port.

How did you test this?

  • vitest run on the new service.test.ts — 3 tests pass.
  • biome check on the changed files — clean.
  • tsc typecheck — the changed file produces no errors (remaining repo typecheck errors are pre-existing unbuilt-dependency issues, identical on clean main).

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Created with PostHog Code from an inbox report.

spawnChild() previously swallowed every cause of a failed boot and reported
a blanket "failed to become healthy within 5000ms" with no child stderr,
and polled the full 5s even after the child had already died.

- Capture the child's exit/error events and a bounded stderr tail, and
  include them in the thrown error so env-validation exits (process.exit(2)),
  EADDRINUSE, and bundle/spawn failures are diagnosable.
- Fail fast: race readiness against child death and abort polling the instant
  the child exits instead of waiting out the timeout against a dead port.
- Bounded respawn (3 attempts) on early exit, picking a fresh port each time,
  which self-heals the findFreePort() TOCTOU port-steal race.
- Widen HEALTH_POLL_TIMEOUT_MS to 15s for slow cold starts; fail-fast means a
  healthy server still returns the moment it is reachable.

Generated-By: PostHog Code
Task-Id: 717fc35c-098e-4bb2-9f43-0d650a4dd635
@github-actions

Copy link
Copy Markdown

React Doctor found no issues in the changed files. 🎉

Reviewed by React Doctor for commit ca8289b.

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/code/src/main/services/workspace-server/service.ts:167-172
**All early exits treated as retriable regardless of exit code**

Every early child exit is thrown as `WorkspaceServerStartError(msg, true)` — unconditionally retriable — which means a deterministic failure like env-validation (`process.exit(2)`) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as `EADDRINUSE` and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on `exit.code === 2` (or checking `exit.code !== null && exit.code !== 1`) would let deterministic failures fail fast while still recovering the TOCTOU race.

### Issue 2 of 2
apps/code/src/main/services/workspace-server/service.ts:224-226
**Abort signal is not applied to the inter-poll sleep**

After the child exits and `abort.abort()` fires, a fetch that is in-flight will throw `AbortError` (caught), but the `await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS))` that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before `signal.aborted` is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain `setTimeout` with one that is cleared on abort (or checking `signal.aborted` before sleeping) would make the abort truly immediate.

Reviews (1): Last reviewed commit: "fix(workspace-server): diagnose and self..." | Re-trigger Greptile

Comment on lines +167 to +172
if (outcome.kind === "exit") {
throw new WorkspaceServerStartError(
describeEarlyExit(outcome.exit, stderrTail),
true,
);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 All early exits treated as retriable regardless of exit code

Every early child exit is thrown as WorkspaceServerStartError(msg, true) — unconditionally retriable — which means a deterministic failure like env-validation (process.exit(2)) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as EADDRINUSE and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on exit.code === 2 (or checking exit.code !== null && exit.code !== 1) would let deterministic failures fail fast while still recovering the TOCTOU race.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/workspace-server/service.ts
Line: 167-172

Comment:
**All early exits treated as retriable regardless of exit code**

Every early child exit is thrown as `WorkspaceServerStartError(msg, true)` — unconditionally retriable — which means a deterministic failure like env-validation (`process.exit(2)`) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as `EADDRINUSE` and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on `exit.code === 2` (or checking `exit.code !== null && exit.code !== 1`) would let deterministic failures fail fast while still recovering the TOCTOU race.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 224 to 226
} catch {}
await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Abort signal is not applied to the inter-poll sleep

After the child exits and abort.abort() fires, a fetch that is in-flight will throw AbortError (caught), but the await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS)) that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before signal.aborted is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain setTimeout with one that is cleared on abort (or checking signal.aborted before sleeping) would make the abort truly immediate.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/workspace-server/service.ts
Line: 224-226

Comment:
**Abort signal is not applied to the inter-poll sleep**

After the child exits and `abort.abort()` fires, a fetch that is in-flight will throw `AbortError` (caught), but the `await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS))` that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before `signal.aborted` is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain `setTimeout` with one that is cleared on abort (or checking `signal.aborted` before sleeping) would make the abort truly immediate.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants