fix(orch): reclaim reviewer/validator/dead-manager checkouts mid-objective#62

Merged: nathanwhit merged 1 commit into main from reclaim-non-worker-checkouts on Jun 24, 2026
@nathanwhit

Owner

Problem

The worker box filled to 100% disk and every new session — including manager respawns — died at clone from cache: fatal: ... No space left on device, stranding objective a8641186 on its exhausted manager budget ("manager failed repeatedly").

Root cause: mid-active-objective workspace reclaim (reclaimSpentCheckouts, formerly reclaimSpentWorkerCheckouts) only freed a finished worker checkout once a PR had been published from its branch. It never reclaimed reviewer, validator, or dead-manager checkouts. On a long-lived deno objective those dominate:

  • a single failed reviewer's WPT + rust-build checkout was 47G
  • a manager that re-spawned adversarial reviews piled up 12 multi-GB isolated reviewer checkouts, all sitting on disk for the objective's entire life

Fix

A reviewer/validator never authors a PR, and a respawned manager always gets a fresh checkout (never inherits the old one) — so those checkouts are spent the moment the session goes terminal, no publish gate needed. The publish gate now applies only to roles that author a branch/PR (implementer, custom, researcher) via rolePublishesPR; every other terminal isolated checkout is reclaimed, still guarded by the existing in-use and pending-dependent-inherits checks.

The live manager is excluded automatically: an active manager is never terminal.
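
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the role names and `rolePublishesPR` come from the description, while the `Checkout` struct, its fields, and `shouldReclaim` are hypothetical stand-ins for whatever `reclaimSpentCheckouts` actually inspects.

```go
package main

import "fmt"

// Role names follow the PR description; the type itself is an assumption.
type Role string

const (
	RoleImplementer Role = "implementer"
	RoleCustom      Role = "custom"
	RoleResearcher  Role = "researcher"
	RoleReviewer    Role = "reviewer"
	RoleValidator   Role = "validator"
	RoleManager     Role = "manager"
)

// Checkout is a hypothetical, simplified view of one isolated workspace.
type Checkout struct {
	Role           Role
	Terminal       bool // owning session has finished or died
	InUse          bool // a live session is still running in it
	PendingInherit bool // a pending dependent will inherit it
	PRPublished    bool // a PR was published from its branch
}

// rolePublishesPR reports whether a role authors a branch/PR.
func rolePublishesPR(r Role) bool {
	switch r {
	case RoleImplementer, RoleCustom, RoleResearcher:
		return true
	}
	return false
}

// shouldReclaim applies the guards first, then the publish gate only to
// roles that author a PR. Every other terminal checkout is spent.
func shouldReclaim(c Checkout) bool {
	if !c.Terminal || c.InUse || c.PendingInherit {
		return false
	}
	if rolePublishesPR(c.Role) {
		return c.PRPublished
	}
	return true
}

func main() {
	checkouts := []Checkout{
		{Role: RoleReviewer, Terminal: true},
		{Role: RoleManager, Terminal: true}, // dead manager
		{Role: RoleManager, InUse: true},    // live manager, never terminal
		{Role: RoleImplementer, Terminal: true},
		{Role: RoleImplementer, Terminal: true, PRPublished: true},
	}
	for _, c := range checkouts {
		fmt.Printf("%-12s terminal=%-5v published=%-5v reclaim=%v\n",
			c.Role, c.Terminal, c.PRPublished, shouldReclaim(c))
	}
}
```

Checking the guards before the role keeps the live manager safe without a special case: it is never terminal, so it never reaches the publish gate.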

Test

TestReclaimWorkspaces_EagerWorkerCheckouts updated — the old test asserted a terminal manager checkout must be kept (the bug); it now models the live manager as running (kept via in-use) and adds cases proving a terminal reviewer and a dead manager checkout are reclaimed.

Companion to #60 (skipping oversized submodules shrinks each checkout); this PR stops the checkouts from accumulating.

…ctive

Mid-active-objective workspace reclaim only freed a finished *worker*
checkout once a PR had been published from its branch — it never touched
reviewer, validator, or dead-manager checkouts. On a long-lived deno
objective those dominate: a single failed reviewer's WPT + build checkout
was 47G, and a manager that re-spawned adversarial reviews piled up a dozen
multi-GB isolated checkouts that sat on disk for the objective's whole life.
That filled the worker box to 100% and every new session (including manager
respawns) died at 'clone from cache: No space left on device', stranding the
objective on its exhausted manager budget.

A reviewer/validator never authors a PR and a respawned manager always gets
a fresh checkout (never inherits the old one), so those checkouts are spent
the moment the session goes terminal — no publish gate needed. The publish
gate now applies only to roles that author a branch/PR (implementer, custom,
researcher); everything else terminal is reclaimed, still guarded by the
in-use and pending-dependent-inherits checks.
@nathanwhit nathanwhit merged commit 99638e8 into main Jun 24, 2026
1 check passed
nathanwhit added a commit that referenced this pull request Jun 24, 2026
…deadlock (#63)

* fix(orch): keep reviewer checkouts with unmerged commits during reclaim

A reviewer is told to commit any fix it makes, but that commit lives ONLY on
its review branch in an ephemeral checkout — never pushed or merged. The
mid-objective reclaim (#62) tore those checkouts down the moment the reviewer
went terminal, silently destroying the work. (This bit us live: a reviewer's
compiled-run follow-up commit was lost when its checkout was reclaimed.)

reclaimSpentCheckouts now keeps any non-publishing checkout whose branch has
advanced past its base (checkoutHasUnmergedCommits): disk is the cheaper loss.
Published-PR checkouts stay exempt — their commits are safely on the PR. The
probe is conservative: it only reclaims when there is provably nothing to lose
(dir gone / not a repo / HEAD == base) and keeps the checkout on any uncertain
probe of a still-present dir.
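
The conservative rule above might look something like the sketch below. It is an illustration under assumptions: the `probe` struct, `errProbeFailed`, and `provablyNothingToLose` are hypothetical names, and the real `checkoutHasUnmergedCommits` presumably gathers these facts by running git against the checkout rather than receiving them as a struct.

```go
package main

import (
	"errors"
	"fmt"
)

// errProbeFailed is a placeholder for any failure while inspecting a dir.
var errProbeFailed = errors.New("probe failed")

// probe is a hypothetical summary of what inspecting a checkout found.
type probe struct {
	DirExists bool
	IsRepo    bool
	Head      string // commit at the branch tip, "" if unknown
	Base      string // commit the branch started from, "" if unknown
	Err       error  // set when inspecting a present dir was inconclusive
}

// provablyNothingToLose returns true only in the three cases the commit
// message lists: dir gone, not a repo, or HEAD == base. Anything
// uncertain about a still-present dir keeps the checkout.
func provablyNothingToLose(p probe) bool {
	if !p.DirExists {
		return true
	}
	if p.Err != nil {
		return false // uncertain: disk is the cheaper loss
	}
	if !p.IsRepo {
		return true
	}
	if p.Head == "" || p.Base == "" {
		return false // could not resolve a commit: treat as uncertain
	}
	return p.Head == p.Base
}

func main() {
	cases := map[string]probe{
		"dir gone":        {},
		"not a repo":      {DirExists: true},
		"head equals base": {DirExists: true, IsRepo: true, Head: "abc", Base: "abc"},
		"branch advanced": {DirExists: true, IsRepo: true, Head: "def", Base: "abc"},
		"probe error":     {DirExists: true, Err: errProbeFailed},
	}
	for name, p := range cases {
		fmt.Printf("%-17s reclaim=%v\n", name, provablyNothingToLose(p))
	}
}
```

Defaulting every unclear case to "keep" means a flaky probe costs disk space rather than a reviewer's commit.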

* fix(exec): close stdin for empty feed-and-wait writes (manager-wedge deadlock)

An empty Stdin was indistinguishable from an interactive session, so the
executor left the stdin pipe open — and over SSH allocated a -tt pty. A
feed-and-wait write with empty content (tee'ing a blanked memory file) then
hung forever waiting for an EOF that never came. A single 0-byte repo-memory
row wedged seedRepoMemory on resume, which deadlocked EVERY manager's StartRun
and took the whole fleet down after a redeploy.

Add Command.CloseStdin: it forces the pipe written-and-closed even when Stdin
is empty, and suppresses the SSH pty so the EOF reaches the remote reader.
writeWorkspaceFile (the tee that seeds memory) sets it, so an empty memory
file now writes cleanly instead of hanging.
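
The stdin half of this fix can be modeled with an in-memory pipe. The sketch below is an assumption-laden illustration, not the executor's code: only the `Stdin` and `CloseStdin` field names come from the commit message, while `feedStdin` and `deliver` are hypothetical helpers, and the SSH pty suppression is not modeled at all.

```go
package main

import (
	"fmt"
	"io"
)

// Command mirrors only the two fields discussed above; the real struct
// is assumed to carry more.
type Command struct {
	Stdin      []byte
	CloseStdin bool
}

// feedStdin writes c.Stdin to w and reports whether it closed w.
// Empty Stdin without CloseStdin is treated as interactive and the pipe
// is left open, which is the behavior that caused the hang.
func feedStdin(c Command, w io.WriteCloser) (closed bool, err error) {
	if len(c.Stdin) == 0 && !c.CloseStdin {
		return false, nil
	}
	if len(c.Stdin) > 0 {
		if _, err := w.Write(c.Stdin); err != nil {
			w.Close()
			return true, err
		}
	}
	return true, w.Close()
}

// deliver feeds c into a pipe whose reader waits for EOF, standing in
// for the remote tee. If feedStdin leaves the pipe open, the reader
// would wait forever, so deliver reports that instead of blocking.
func deliver(c Command) (data string, sawEOF bool, err error) {
	pr, pw := io.Pipe()
	done := make(chan []byte, 1)
	go func() {
		b, _ := io.ReadAll(pr)
		done <- b
	}()
	closed, err := feedStdin(c, pw)
	if !closed {
		pw.Close() // release the reader; the wedged executor never did
		<-done
		return "", false, err
	}
	return string(<-done), true, err
}

func main() {
	for _, c := range []Command{
		{},                       // empty, looks interactive
		{CloseStdin: true},       // empty feed-and-wait write
		{Stdin: []byte("notes")}, // ordinary write
	} {
		data, eof, err := deliver(c)
		fmt.Printf("stdin=%q closeStdin=%v -> data=%q sawEOF=%v err=%v\n",
			c.Stdin, c.CloseStdin, data, eof, err)
	}
}
```

An explicit flag is needed because a zero-length `Stdin` alone cannot distinguish "write nothing and finish" from "no input supplied".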