fix(workspace-server): diagnose and self-heal startup failures by posthog[bot] · Pull Request #2675 · PostHog/code

posthog · 2026-06-15T12:47:05Z

Problem

Desktop users hit a hard startup failure where the workspace-server child never becomes healthy, silently disabling all git, fs, and process capabilities for the session. The only signal was an opaque workspace-server failed to become healthy within 5000ms — the child's real failure (env-validation process.exit(2), a stolen port → EADDRINUSE, a bundle/spawn error) was swallowed, and pollHealth() waited out the full timeout even after the child had already died. A tight 5s budget plus a TOCTOU window in findFreePort() (it releases the port before the child re-binds) made this both undiagnosable and unrecoverable.

This was a brand-new regression surfaced in error tracking (first seen 2026-06-15), introduced by the package-split refactor in #2442.

Changes

In apps/code/.../workspace-server/service.ts:

Surface the real cause: capture the child's exit/error events and a bounded stderr tail, and include exit code / signal / stderr in the thrown error.
Fail fast: race readiness against child death and abort health polling the instant the child exits, instead of waiting out the timeout against a dead port.
Self-heal the port race: bounded respawn (3 attempts) on early exit, choosing a fresh free port each attempt — which naturally recovers a stolen port.
Tolerate cold starts: widen HEALTH_POLL_TIMEOUT_MS to 15s; fail-fast means a healthy server still returns the moment it's reachable, so this only helps genuinely-slow starts.

Adds unit tests for the happy path, the diagnostic-error path (exit code + stderr surfaced, retries exhausted), and respawn recovery on a fresh port.

How did you test this?

vitest run on the new service.test.ts — 3 tests pass.
biome check on the changed files — clean.
tsc typecheck — the changed file produces no errors (remaining repo typecheck errors are pre-existing unbuilt-dependency issues, identical on clean main).

Automatic notifications

Publish to changelog?
Alert Sales and Marketing teams?

Created with PostHog Code from an inbox report.

spawnChild() previously swallowed every cause of a failed boot and reported a blanket "failed to become healthy within 5000ms" with no child stderr, and polled the full 5s even after the child had already died. - Capture the child's exit/error events and a bounded stderr tail, and include them in the thrown error so env-validation exits (process.exit(2)), EADDRINUSE, and bundle/spawn failures are diagnosable. - Fail fast: race readiness against child death and abort polling the instant the child exits instead of waiting out the timeout against a dead port. - Bounded respawn (3 attempts) on early exit, picking a fresh port each time, which self-heals the findFreePort() TOCTOU port-steal race. - Widen HEALTH_POLL_TIMEOUT_MS to 15s for slow cold starts; fail-fast means a healthy server still returns the moment it is reachable. Generated-By: PostHog Code Task-Id: 717fc35c-098e-4bb2-9f43-0d650a4dd635

github-actions · 2026-06-15T12:47:36Z

React Doctor found no issues in the changed files. 🎉

_{Reviewed by React Doctor for commit ca8289b.}

greptile-apps · 2026-06-15T12:50:11Z

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/code/src/main/services/workspace-server/service.ts:167-172
**All early exits treated as retriable regardless of exit code**

Every early child exit is thrown as `WorkspaceServerStartError(msg, true)` — unconditionally retriable — which means a deterministic failure like env-validation (`process.exit(2)`) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as `EADDRINUSE` and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on `exit.code === 2` (or checking `exit.code !== null && exit.code !== 1`) would let deterministic failures fail fast while still recovering the TOCTOU race.

### Issue 2 of 2
apps/code/src/main/services/workspace-server/service.ts:224-226
**Abort signal is not applied to the inter-poll sleep**

After the child exits and `abort.abort()` fires, a fetch that is in-flight will throw `AbortError` (caught), but the `await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS))` that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before `signal.aborted` is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain `setTimeout` with one that is cleared on abort (or checking `signal.aborted` before sleeping) would make the abort truly immediate.

_{Reviews (1): Last reviewed commit: "fix(workspace-server): diagnose and self..." | Re-trigger Greptile}

greptile-apps · 2026-06-15T12:50:15Z

+    if (outcome.kind === "exit") {
+      throw new WorkspaceServerStartError(
+        describeEarlyExit(outcome.exit, stderrTail),
+        true,
+      );
+    }


All early exits treated as retriable regardless of exit code

Every early child exit is thrown as WorkspaceServerStartError(msg, true) — unconditionally retriable — which means a deterministic failure like env-validation (process.exit(2)) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as EADDRINUSE and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on exit.code === 2 (or checking exit.code !== null && exit.code !== 1) would let deterministic failures fail fast while still recovering the TOCTOU race.

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/code/src/main/services/workspace-server/service.ts Line: 167-172 Comment: **All early exits treated as retriable regardless of exit code** Every early child exit is thrown as `WorkspaceServerStartError(msg, true)` — unconditionally retriable — which means a deterministic failure like env-validation (`process.exit(2)`) will exhaust all 3 attempts before surfacing the error, adding two extra port-allocation + spawn cycles and ~300 ms of delay. The self-described dominant case (port race) surfaces as `EADDRINUSE` and is genuinely transient, but exit code 2 explicitly means "the server will never start with this configuration" and is not port-dependent. Differentiating on `exit.code === 2` (or checking `exit.code !== null && exit.code !== 1`) would let deterministic failures fail fast while still recovering the TOCTOU race. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-15T12:50:16Z

    } catch {}
    await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS));
  }


Abort signal is not applied to the inter-poll sleep

After the child exits and abort.abort() fires, a fetch that is in-flight will throw AbortError (caught), but the await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS)) that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before signal.aborted is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain setTimeout with one that is cleared on abort (or checking signal.aborted before sleeping) would make the abort truly immediate.

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/code/src/main/services/workspace-server/service.ts Line: 224-226 Comment: **Abort signal is not applied to the inter-poll sleep** After the child exits and `abort.abort()` fires, a fetch that is in-flight will throw `AbortError` (caught), but the `await new Promise((r) => setTimeout(r, HEALTH_POLL_INTERVAL_MS))` that follows is not abort-aware. This adds an unconditional 100 ms delay to every abort path before `signal.aborted` is re-checked at the top of the loop. In the fast-exit scenario this is negligible, but replacing the plain `setTimeout` with one that is cleared on abort (or checking `signal.aborted` before sleeping) would make the abort truly immediate. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps Bot reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(workspace-server): diagnose and self-heal startup failures#2675

fix(workspace-server): diagnose and self-heal startup failures#2675
posthog[bot] wants to merge 1 commit into
mainfrom
posthog-code/workspace-server-startup-diagnostics

posthog Bot commented Jun 15, 2026

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

greptile-apps Bot commented Jun 15, 2026

Uh oh!

greptile-apps Bot Jun 15, 2026

Uh oh!

greptile-apps Bot Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Conversation

posthog Bot commented Jun 15, 2026

Problem

Changes

How did you test this?

Automatic notifications

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

greptile-apps Bot commented Jun 15, 2026

Uh oh!

greptile-apps Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants