fix(orch): reclaim reviewer/validator/dead-manager checkouts mid-objective#62
Merged
Conversation
…ctive Mid-active-objective workspace reclaim only freed a finished *worker* checkout once a PR had been published from its branch — it never touched reviewer, validator, or dead-manager checkouts. On a long-lived deno objective those dominate: a single failed reviewer's WPT + build checkout was 47G, and a manager that re-spawns adversarial reviews piled up a dozen multi-GB isolated checkouts that sat on disk for the objective's whole life. That filled the worker box to 100% and every new session (including manager respawns) died at 'clone from cache: No space left on device', stranding the objective on its exhausted manager budget. A reviewer/validator never authors a PR and a respawned manager always gets a fresh checkout (never inherits the old one), so those checkouts are spent the moment the session goes terminal — no publish gate needed. The publish gate now applies only to roles that author a branch/PR (implementer, custom, researcher); everything else terminal is reclaimed, still guarded by the in-use and pending-dependent-inherits checks.
nathanwhit
added a commit
that referenced
this pull request
Jun 24, 2026
…deadlock (#63) * fix(orch): keep reviewer checkouts with unmerged commits during reclaim A reviewer is told to commit any fix it makes, but that commit lives ONLY on its review branch in an ephemeral checkout — never pushed or merged. The mid-objective reclaim (#62) tore those checkouts down the moment the reviewer went terminal, silently destroying the work. (This bit us live: a reviewer's compiled-run follow-up commit was lost when its checkout was reclaimed.) reclaimSpentCheckouts now keeps any non-publishing checkout whose branch has advanced past its base (checkoutHasUnmergedCommits): disk is the cheaper loss. Published-PR checkouts stay exempt — their commits are safely on the PR. The probe is conservative: it only reclaims when there is provably nothing to lose (dir gone / not a repo / HEAD == base) and keeps the checkout on any uncertain probe of a still-present dir. * fix(exec): close stdin for empty feed-and-wait writes (manager-wedge deadlock) An empty Stdin was indistinguishable from an interactive session, so the executor left the stdin pipe open — and over SSH allocated a -tt pty. A feed-and-wait write with empty content (tee'ing a blanked memory file) then hung forever waiting for an EOF that never came. A single 0-byte repo-memory row wedged seedRepoMemory on resume, which deadlocked EVERY manager's StartRun and took the whole fleet down after a redeploy. Add Command.CloseStdin: it forces the pipe written-and-closed even when Stdin is empty, and suppresses the SSH pty so the EOF reaches the remote reader. writeWorkspaceFile (the tee that seeds memory) sets it, so an empty memory file now writes cleanly instead of hanging.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The worker box filled to 100% disk and every new session — including manager respawns — died at
clone from cache: fatal: ... No space left on device, stranding objective a8641186 on its exhausted manager budget ("manager failed repeatedly").Root cause: mid-active-objective workspace reclaim (
reclaimSpentCheckouts, formerlyreclaimSpentWorkerCheckouts) only freed a finished worker checkout once a PR had been published from its branch. It never reclaimed reviewer, validator, or dead-manager checkouts. On a long-lived deno objective those dominate:Fix
A reviewer/validator never authors a PR, and a respawned manager always gets a fresh checkout (never inherits the old one) — so those checkouts are spent the moment the session goes terminal, no publish gate needed. The publish gate now applies only to roles that author a branch/PR (
implementer,custom,researcher) viarolePublishesPR; every other terminal isolated checkout is reclaimed, still guarded by the existing in-use and pending-dependent-inherits checks.The live manager is excluded automatically: an active manager is never terminal.
Test
TestReclaimWorkspaces_EagerWorkerCheckoutsupdated — the old test asserted a terminal manager checkout must be kept (the bug); it now models the live manager asrunning(kept via in-use) and adds cases proving a terminal reviewer and a dead manager checkout are reclaimed.Companion to #60 (skip oversized submodules shrinks each checkout); this PR stops them accumulating.