feat(orchestrator): startup hygiene - stale per-PID files + orphan detection #8
Open
evannadeau wants to merge 2 commits into
Per-PID `active-session-<pid>` files (introduced in 0.30.19+) make session_id lookup race-free under concurrent sessions, but nothing reaps them when the owning claude process exits. On a developer machine with many short-lived sessions per day they accumulate indefinitely: 8 stale files were observed in one project on 2026-05-13, all from claude PIDs long since dead. The files are cosmetic in the sense that the legacy single `active-session` file remains the primary lookup, but a slow directory listing eventually becomes a real cost on a hot-spot workstation.

This patch adds a startup sweep that walks `<project>/.orchestrator-state/`, matches files of shape `active-session-<pid>`, probes liveness via `process.kill(pid, 0)`, and unlinks dead-PID entries. The probe is cross-platform via Node's API. The sweep is cheap, idempotent, and race-safe (we only unlink files whose PID is verifiably gone). Lost races with concurrent sessions are tolerated; the next startup retries.

Runs once at MCP startup, unconditionally (even when the no-claude-ancestor branch is about to exit, so future startups benefit).

Tested: bun run typecheck clean, bun test 516 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
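For reviewers who want the shape of the sweep without opening the diff, here is a minimal sketch of the technique this commit message describes. It is not the code in `server.ts`: the names `isPidAlive` and `reapStaleSessionFiles`, the return value, and the choice to treat every error other than `ESRCH` as "still alive" are assumptions made for illustration. The demo at the bottom runs against a throwaway temp directory and assumes PID 2147483646 is not in use.

```typescript
import { mkdtempSync, readdirSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: reports whether a process with this PID appears to exist.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 only checks existence/permission; nothing is delivered
    return true;
  } catch (err) {
    // ESRCH means "no such process". Any other failure (e.g. EPERM: the process
    // exists but belongs to someone else) is treated as alive, so that only
    // verifiably dead PIDs are ever reaped.
    return (err as { code?: string }).code !== "ESRCH";
  }
}

// Hypothetical name and signature; returns how many stale files were removed.
function reapStaleSessionFiles(stateDir: string): number {
  let names: string[];
  try {
    names = readdirSync(stateDir);
  } catch {
    return 0; // directory missing or unreadable: nothing to do
  }
  let reaped = 0;
  for (const name of names) {
    const match = /^active-session-(\d+)$/.exec(name);
    if (!match) continue; // leaves the legacy `active-session` file alone
    if (isPidAlive(Number(match[1]))) continue;
    try {
      unlinkSync(join(stateDir, name));
      reaped++;
    } catch {
      // Lost a race with a concurrent session; the next startup retries.
    }
  }
  return reaped;
}

// Demo: one file for this (live) process, one for a PID assumed dead, one legacy file.
const demoDir = mkdtempSync(join(tmpdir(), "orchestrator-state-"));
const ASSUMED_DEAD_PID = 2147483646; // assumption: far above any PID actually in use
writeFileSync(join(demoDir, `active-session-${process.pid}`), "live");
writeFileSync(join(demoDir, `active-session-${ASSUMED_DEAD_PID}`), "stale");
writeFileSync(join(demoDir, "active-session"), "legacy");
const reaped = reapStaleSessionFiles(demoDir);
const remaining = readdirSync(demoDir).sort();
console.log(`reaped ${reaped}; remaining: ${remaining.join(", ")}`);
```

Treating unknown errors as "alive" keeps the sweep conservative: a recycled PID or a permission error leaves a stale file behind for a later startup rather than risking the removal of a file that a live session still owns.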
Complements the existing orphan-bun watchdog (which catches "parent dies while I'm alive" cases for the current process). The watchdog only protects processes that LOADED the watchdog code: older bun processes whose in-memory bytecode predates a fix do not benefit from that fix, and can survive forever if their original parent claude died without triggering whatever watchdog they happen to be running.

Concretely: on a developer machine that pulls plugin updates, an MCP process loaded at time T1 may still be alive after the on-disk `dist/server.js` is rebuilt at T2 > T1. If the parent claude that spawned T1's bun dies after T2, the T1 bun's in-memory watchdog code is still the version from T1, so any later improvements to watchdog detection are invisible to it. We observed this on 2026-05-13: an orphan bun survived ~30 minutes across multiple watchdog tick intervals before manual cleanup via `kill -9`.

This patch adds a startup-time scan (Linux only) that walks `/proc` for bun processes whose cmdline references `orchestrator/dist/server.js` and whose parent chain contains no live `claude` process within 8 hops. Suspects are logged with diagnostic guidance; we do NOT auto-kill, because sibling MCPs may co-own infrastructure shared across live sessions (the python sidecar is deliberately shared via `.sidecar-port`). Detection surfaces the issue; the operator decides.

Windows is unchanged: `killOlderDuplicateMcps` already handles a related case (siblings sharing our parent claude). Pure orphans on Windows are rare because parent death typically reaps children.

Tested: bun run typecheck clean, bun test 516 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
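The scan described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the names `ProcInfo`, `parseStat`, `readProcTable`, `hasClaudeAncestor` and `findLikelyOrphans` are hypothetical, and so is the choice to recognise processes by the `comm` field of `/proc/<pid>/stat` being exactly `bun` or `claude` and to skip the scanning process itself. The ancestor walk is kept separate from the `/proc` reader so it can be exercised on a synthetic process table; the `/x/...` paths in the demo are placeholders.

```typescript
import { readdirSync, readFileSync } from "node:fs";

interface ProcInfo {
  ppid: number;
  comm: string;    // executable name from /proc/<pid>/stat
  cmdline: string; // argv joined with spaces
}

// /proc/<pid>/stat is laid out as "pid (comm) state ppid ...". comm may itself
// contain spaces or parentheses, so split on the LAST ")".
function parseStat(stat: string): { comm: string; ppid: number } | null {
  const open = stat.indexOf("(");
  const close = stat.lastIndexOf(")");
  if (open < 0 || close < open) return null;
  const fields = stat.slice(close + 2).split(" "); // [state, ppid, ...]
  const ppid = Number(fields[1]);
  return Number.isInteger(ppid) ? { comm: stat.slice(open + 1, close), ppid } : null;
}

// Snapshot the process table (Linux only); returns an empty map where /proc is absent.
function readProcTable(): Map<number, ProcInfo> {
  const table = new Map<number, ProcInfo>();
  let entries: string[];
  try {
    entries = readdirSync("/proc");
  } catch {
    return table;
  }
  for (const entry of entries) {
    if (!/^\d+$/.test(entry)) continue;
    try {
      const parsed = parseStat(readFileSync(`/proc/${entry}/stat`, "utf8"));
      if (!parsed) continue;
      const argv = readFileSync(`/proc/${entry}/cmdline`, "utf8").split("\0");
      table.set(Number(entry), { comm: parsed.comm, ppid: parsed.ppid, cmdline: argv.join(" ").trim() });
    } catch {
      // The process exited between readdir and read; skip it.
    }
  }
  return table;
}

// Walk up the parent chain for at most maxHops looking for a `claude` process.
function hasClaudeAncestor(pid: number, table: Map<number, ProcInfo>, maxHops = 8): boolean {
  let current = table.get(pid)?.ppid;
  for (let hop = 0; hop < maxHops && current !== undefined && current > 1; hop++) {
    const parent = table.get(current);
    if (!parent) return false; // chain broken: the parent is gone
    if (parent.comm === "claude") return true; // assumption: match on comm
    current = parent.ppid;
  }
  return false;
}

// Detection only: returns suspect PIDs and never signals them.
function findLikelyOrphans(table: Map<number, ProcInfo>, selfPid: number): number[] {
  const suspects: number[] = [];
  table.forEach((info, pid) => {
    if (pid === selfPid || info.comm !== "bun") return;
    if (!info.cmdline.includes("orchestrator/dist/server.js")) return;
    if (!hasClaudeAncestor(pid, table)) suspects.push(pid);
  });
  return suspects;
}

// Demo on a synthetic table (all PIDs and paths are made up).
const demoTable = new Map<number, ProcInfo>([
  [100, { ppid: 1, comm: "claude", cmdline: "claude" }],
  [101, { ppid: 100, comm: "bun", cmdline: "bun /x/orchestrator/dist/server.js" }], // live parent
  [200, { ppid: 1, comm: "bun", cmdline: "bun /x/orchestrator/dist/server.js" }],   // reparented to init
  [301, { ppid: 1, comm: "bun", cmdline: "bun /x/other.js" }],                      // unrelated bun
]);
const suspects = findLikelyOrphans(demoTable, 999);
console.log("likely orphans:", suspects);
```

On a Linux host the same check would run as `findLikelyOrphans(readProcTable(), process.pid)`. Returning the suspect list, rather than acting on it, mirrors the detection-only stance above: what to do with a suspect is left to the operator.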
Summary

Two additive startup-time hygiene improvements in `plugins/orchestrator/mcp/server.ts`. Both run once per MCP startup, both are detection/cleanup (never auto-kill), both are no-ops in the steady state.

1. Reap stale per-PID `active-session-<pid>` files (b7f43b7)

Per-PID `active-session-<pid>` files (introduced in 0.30.19+) make session_id lookup race-free under concurrent sessions, but nothing reaps them when the owning claude process exits. On a developer machine with many short-lived sessions per day they accumulate indefinitely: 8 stale files were observed in one project on 2026-05-13, all from claude PIDs long since dead.

They are cosmetic (the legacy single `active-session` file remains the primary lookup), but a slow directory listing eventually becomes a real cost.

This patch adds a startup sweep that walks `<project>/.orchestrator-state/`, matches files of shape `active-session-<pid>`, probes liveness via `process.kill(pid, 0)`, and unlinks dead-PID entries. Cross-platform; cheap; idempotent; race-safe (only unlinks files whose PID is verified gone). Lost races with concurrent sessions are tolerated; the next startup retries.

2. Warn about likely-orphan sibling MCPs (a28388e)

Complements the existing orphan-bun watchdog (which catches "parent dies while I'm alive" for the current process). The watchdog only protects processes that LOADED the watchdog code: older bun processes whose in-memory bytecode predates a fix do not benefit, and can survive forever if their original parent claude died without triggering whatever watchdog they happen to be running.
Concretely observed 2026-05-13: an orphan bun survived ~30 minutes across multiple watchdog tick intervals before manual `kill -9`. The on-disk `dist/server.js` had been rebuilt while the orphan was running, so any subsequent watchdog improvements were invisible to it.

This patch adds a startup-time scan (Linux only) that walks `/proc` for bun processes whose cmdline references `orchestrator/dist/server.js` and whose parent chain contains no live `claude` process within 8 hops. Suspects are logged with diagnostic guidance.

Detection only: does NOT auto-kill. Sibling MCPs may co-own infrastructure shared across live sessions (the python sidecar is deliberately shared via `.sidecar-port`, so killing an unrelated bun could take down a live session's embeddings). The operator decides whether to clean up.

Windows is unchanged: `killOlderDuplicateMcps` already handles a related case (siblings sharing our parent claude). Pure orphans on Windows are rare because parent death typically reaps children.

Why "detection, not auto-kill"
The sidecar reuse pattern at `startSidecar()` lines 338–350 deliberately shares the python embedding server across MCPs. An auto-kill could take down a sidecar that a live session depends on. Surfacing the problem at startup gives the operator the information without the risk.

Files changed
- `plugins/orchestrator/mcp/server.ts`: 2 new functions + 1 startup block wiring them in
- `plugins/orchestrator/dist/server.js`: rebuilt via `bun run build`

Imports updated: added `readdirSync` and `unlinkSync` to the existing `node:fs` import line.

Tested
- `bun run typecheck`: clean
- `bun test`: 516 pass / 0 fail / 38 files / 1207 assertions (no test changes)
- […] `active-session-<pid>` files in a real workspace were correctly identified as dead and could be removed by the same logic the reaper applies.

Test plan
- […] `.orchestrator-state/` directory: both functions should be silent no-ops.
- […] `active-session-<pid>` files: verify the startup log line reports the reap count and the files are gone.
- […] `claude` ancestor).

🤖 Generated with Claude Code