feat(orchestrator): startup hygiene — stale per-PID files + orphan detection #8

Open
evannadeau wants to merge 2 commits into SpawnBox-dev:main from evannadeau:feat/startup-hygiene

@evannadeau

Summary

Two additive startup-time hygiene improvements in plugins/orchestrator/mcp/server.ts. Both run once per MCP startup, both are detection/cleanup (never auto-kill), both are no-ops in the steady state.

1. Reap stale per-PID active-session-<pid> files (b7f43b7)

Per-PID active-session-<pid> files (introduced in 0.30.19+) make session_id lookup race-free under concurrent sessions, but nothing has been reaping them when the owning claude process exits. On a developer machine with many short-lived sessions per day, they accumulate indefinitely — 8 stale files observed in one project on 2026-05-13, from claude PIDs long since dead.

The stale files are cosmetic, since the legacy single active-session file remains the primary lookup, but as they pile up the directory listing gets slower and eventually becomes a real cost.

This patch adds a startup sweep that walks <project>/.orchestrator-state/, matches files of shape active-session-<pid>, probes liveness via process.kill(pid, 0), and unlinks dead-PID entries. The sweep is cross-platform, cheap, idempotent, and race-safe (it only unlinks files whose PID is verified gone). Lost races with concurrent sessions are tolerated; the next startup retries.
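
For reviewers who want the shape of the sweep without opening the diff, here is a minimal sketch of the approach described above. It is not the code in server.ts: the names reapStaleSessionFiles and isProcessAlive, the filename regex, the PID range guard, and the choice to treat EPERM as "alive" are all assumptions made for illustration. The liveness probe is injectable so the demo at the bottom is deterministic.

```typescript
import { mkdtempSync, readdirSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed pattern: the PR only states that files have the shape active-session-<pid>.
const SESSION_FILE = /^active-session-(\d+)$/;

/** Signal 0 runs the existence/permission check without delivering a signal. */
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // Assumption: EPERM means the process exists but is owned by someone else,
    // so it counts as alive. Any other failure (e.g. ESRCH) counts as gone.
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Unlinks active-session-<pid> files whose PID is dead; returns the removed names. */
function reapStaleSessionFiles(
  stateDir: string,
  isAlive: (pid: number) => boolean = isProcessAlive,
): string[] {
  let entries: string[];
  try {
    entries = readdirSync(stateDir);
  } catch {
    return []; // No state directory yet: silent no-op.
  }
  const reaped: string[] = [];
  for (const name of entries) {
    const match = SESSION_FILE.exec(name);
    if (!match) continue; // The legacy "active-session" file is left alone.
    const pid = Number(match[1]);
    if (pid <= 0 || pid > 0x7fffffff) continue; // Not a probeable PID; skip it.
    if (isAlive(pid)) continue;
    try {
      unlinkSync(join(stateDir, name));
      reaped.push(name);
    } catch {
      // Lost a race with a concurrent session; the next startup retries.
    }
  }
  return reaped;
}

// Demo in a throwaway directory: one legacy file, one file for this (live)
// process, and one file for a PID that the injected probe reports as dead.
const demoDir = mkdtempSync(join(tmpdir(), "orchestrator-state-"));
writeFileSync(join(demoDir, "active-session"), "legacy");
writeFileSync(join(demoDir, `active-session-${process.pid}`), "live");
writeFileSync(join(demoDir, "active-session-2147483646"), "stale");
const demoReaped = reapStaleSessionFiles(demoDir, (pid) => pid === process.pid);
console.log(`reaped ${demoReaped.length} stale file(s): ${demoReaped.join(", ")}`);
```

Running the sweep a second time over the same directory removes nothing, which is the idempotence property claimed above. The sketch does not guard against PID reuse: a recycled PID would keep its stale file around until that unrelated process exits.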

2. Warn about likely-orphan sibling MCPs (a28388e)

Complements the existing orphan-bun watchdog (which catches "parent dies while I'm alive" for the current process). The watchdog only protects processes that LOADED the watchdog code — older bun processes whose in-memory bytecode predates a fix do not benefit, and can survive forever if their original parent claude died without triggering whatever watchdog they happen to be running.

Concretely observed 2026-05-13: an orphan bun survived ~30 minutes across multiple watchdog tick intervals before manual kill -9. The on-disk dist/server.js had been rebuilt while the orphan was running, so any subsequent watchdog improvements were invisible to it.

This patch adds a startup-time scan (Linux only) that walks /proc for bun processes whose cmdline references orchestrator/dist/server.js and whose parent chain contains no live claude process within 8 hops. Suspects are logged with diagnostic guidance:

[orchestrator] startup hygiene: detected N likely-orphan sibling MCP process(es): pid=A,B,C.
Their parent claude is no longer in the process tree, suggesting they outlived their owning session and may be running stale bytecode whose watchdog never fired.
Diagnose with 'pstree -ps <pid>'; clean up with 'kill -9 <pid>' if confirmed orphan.

Detection only — does NOT auto-kill. Sibling MCPs may co-own infrastructure shared across live sessions (the python sidecar is deliberately shared via .sidecar-port — killing an unrelated bun could take down a live session's embeddings). The operator decides whether to clean up.
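
The scan described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in server.ts: the helper names (parseStat, readProcTable, hasClaudeAncestor, findLikelyOrphans, formatOrphanWarning) are hypothetical, and matching processes by an executable name of exactly "bun" or "claude" is an assumption about how the real check identifies them. The ancestor walk operates on a snapshot map so it can be exercised without a live /proc.

```typescript
import { readdirSync, readFileSync } from "node:fs";

interface ProcInfo {
  ppid: number;
  comm: string; // executable name from /proc/<pid>/stat
  cmdline: string; // argv joined with spaces
}

const MAX_HOPS = 8; // hop limit quoted in the PR description
const MCP_MARKER = "orchestrator/dist/server.js";

// Assumption: the MCP runs under an executable named "bun" and the owning
// session under one named "claude". The real matching rules may differ.
const isOrchestratorMcp = (p: ProcInfo) => p.comm === "bun" && p.cmdline.includes(MCP_MARKER);
const isClaude = (p: ProcInfo) => p.comm === "claude";

/** Parses /proc/<pid>/stat; comm is parenthesised and may contain ')' or spaces. */
function parseStat(stat: string): { comm: string; ppid: number } | null {
  const open = stat.indexOf("(");
  const close = stat.lastIndexOf(")");
  if (open < 0 || close < open) return null;
  const fields = stat.slice(close + 2).split(" "); // [state, ppid, ...]
  const ppid = Number(fields[1]);
  return Number.isInteger(ppid) ? { comm: stat.slice(open + 1, close), ppid } : null;
}

/** Snapshots /proc into a pid -> ProcInfo map (empty where /proc is absent). */
function readProcTable(): Map<number, ProcInfo> {
  const table = new Map<number, ProcInfo>();
  let names: string[];
  try {
    names = readdirSync("/proc");
  } catch {
    return table;
  }
  for (const name of names) {
    if (!/^\d+$/.test(name)) continue;
    try {
      const stat = parseStat(readFileSync(`/proc/${name}/stat`, "utf8"));
      if (!stat) continue;
      const cmdline = readFileSync(`/proc/${name}/cmdline`, "utf8").split("\0").join(" ").trim();
      table.set(Number(name), { ...stat, cmdline });
    } catch {
      // The process exited mid-scan; skip it.
    }
  }
  return table;
}

/** True if a claude process appears among the first MAX_HOPS ancestors of pid. */
function hasClaudeAncestor(pid: number, table: Map<number, ProcInfo>): boolean {
  let current = table.get(pid)?.ppid;
  for (let hop = 0; hop < MAX_HOPS && current !== undefined; hop++) {
    const parent = table.get(current);
    if (!parent) return false; // Chain ends: the ancestor is gone (or is pid 0).
    if (isClaude(parent)) return true;
    current = parent.ppid;
  }
  return false;
}

/** Returns PIDs of sibling MCPs with no claude ancestor. Detection only. */
function findLikelyOrphans(table: Map<number, ProcInfo>, selfPid: number): number[] {
  const suspects: number[] = [];
  table.forEach((info, pid) => {
    if (pid !== selfPid && isOrchestratorMcp(info) && !hasClaudeAncestor(pid, table)) {
      suspects.push(pid);
    }
  });
  return suspects.sort((a, b) => a - b);
}

/** Builds the warning text quoted in this PR description. */
function formatOrphanWarning(pids: number[]): string {
  return [
    `[orchestrator] startup hygiene: detected ${pids.length} likely-orphan sibling MCP process(es): pid=${pids.join(",")}.`,
    "Their parent claude is no longer in the process tree, suggesting they outlived their owning session and may be running stale bytecode whose watchdog never fired.",
    "Diagnose with 'pstree -ps <pid>'; clean up with 'kill -9 <pid>' if confirmed orphan.",
  ].join("\n");
}

// Startup wiring: log suspects, never kill them.
const orphanSuspects = findLikelyOrphans(readProcTable(), process.pid);
if (orphanSuspects.length > 0) console.error(formatOrphanWarning(orphanSuspects));
```

Reading the parent PID from the fields after the last ')' in the stat line avoids misparsing executable names that contain spaces or parentheses. Taking one snapshot up front means the walk sees a consistent process table, at the cost of missing processes that start or exit during the scan.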

Windows is unchanged — killOlderDuplicateMcps already handles a related case (siblings sharing our parent claude). Pure orphans on Windows are rare because parent death typically reaps children.

Why "detection, not auto-kill"

The sidecar reuse pattern at startSidecar() lines 338–350 deliberately shares the python embedding server across MCPs. An auto-kill could take down a sidecar that a live session depends on. Surfacing the problem at startup gives the operator the information without the risk.

Files changed

  • plugins/orchestrator/mcp/server.ts — 2 new functions + 1 startup block wiring them in
  • plugins/orchestrator/dist/server.js — rebuilt via bun run build

Imports updated: added readdirSync and unlinkSync to the existing node:fs import line.

Tested

  • bun run typecheck — clean
  • bun test — 516 pass / 0 fail / 38 files / 1207 assertions (no test changes)
  • Manually verified the diagnostic recipe: 8 stale active-session-<pid> files in a real workspace were correctly identified as dead and could be removed by the same logic the reaper applies.

Test plan

  • Fresh install on a machine with no .orchestrator-state/ directory — both functions should be silent no-ops.
  • Fresh install on a machine where prior sessions left stale active-session-<pid> files — verify the startup log line reports the reap count and the files are gone.
  • Multiple live Claude Code sessions in the same project — verify the warner does NOT report them as orphans (each has a live claude ancestor).
  • Orphan reproduction (kill a parent claude process while the bun keeps running) — verify the warner reports the surviving bun PID at the next session startup.

🤖 Generated with Claude Code

evannadeau and others added 2 commits May 13, 2026 19:21
Per-PID active-session-<pid> files (introduced in 0.30.19+) make
session_id lookup race-free under concurrent sessions, but nothing has
been reaping them when the owning claude process exits. On a developer
machine with many short-lived sessions per day, they accumulate
indefinitely — 8 stale files observed in one project on 2026-05-13,
from claude PIDs long since dead.

The files are cosmetic in the sense that the legacy single
`active-session` file remains the primary lookup, but a slow directory
listing eventually becomes a real cost on a hot-spot workstation.

This patch adds a startup sweep that walks `<project>/.orchestrator-state/`,
matches files of shape `active-session-<pid>`, probes liveness via
`process.kill(pid, 0)`, and unlinks dead-PID entries. The probe is
cross-platform via Node's API. Cheap, idempotent, race-safe (we only
unlink files whose PID is verifiably gone). Lost races with concurrent
sessions are tolerated — next startup retries.

Runs once at MCP startup, unconditionally (even when the
no-claude-ancestor branch is about to exit, so future startups benefit).

Tested: bun run typecheck clean, bun test 516 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Complements the existing orphan-bun watchdog (which catches "parent
dies while I'm alive" cases for the current process). The watchdog
only protects processes that LOADED the watchdog code - older bun
processes whose in-memory bytecode predates a fix do not benefit from
that fix, and can survive forever if their original parent claude
died without triggering whatever watchdog they happen to be running.

Concretely: on a developer machine that pulls plugin updates, an MCP
process loaded at time T1 may still be alive after the on-disk
`dist/server.js` is rebuilt at T2 > T1. If the parent claude that
spawned T1's bun dies after T2, the T1 bun's in-memory watchdog code
is the version from T1 - any later improvements to watchdog detection
are invisible to it. We observed this 2026-05-13: an orphan bun
survived ~30 minutes across multiple watchdog tick intervals before
manual cleanup via `kill -9`.

This patch adds a startup-time scan (Linux only) that walks /proc for
bun processes whose cmdline references `orchestrator/dist/server.js`
and whose parent chain contains no live `claude` process within 8
hops. Suspects are logged with diagnostic guidance; we do NOT
auto-kill, because sibling MCPs may co-own infrastructure shared
across live sessions (the python sidecar is deliberately shared via
`.sidecar-port`). Detection surfaces the issue; the operator decides.

Windows is unchanged - killOlderDuplicateMcps already handles a
related case (siblings sharing our parent claude). Pure orphans on
Windows are rare because parent death typically reaps children.

Tested: bun run typecheck clean, bun test 516 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>