fix(orch): reclaim reviewer/validator/dead-manager checkouts mid-objective by nathanwhit · Pull Request #62 · nathanwhit/orcha

nathanwhit · 2026-06-24T08:14:17Z

Problem

The worker box filled to 100% disk and every new session — including manager respawns — died at clone from cache: fatal: ... No space left on device, stranding objective a8641186 on its exhausted manager budget ("manager failed repeatedly").

Root cause: mid-active-objective workspace reclaim (reclaimSpentCheckouts, formerly reclaimSpentWorkerCheckouts) only freed a finished worker checkout once a PR had been published from its branch. It never reclaimed reviewer, validator, or dead-manager checkouts. On a long-lived deno objective those dominate:

a single failed reviewer's WPT + rust-build checkout was 47G
a manager that re-spawned adversarial reviews piled up 12 multi-GB isolated reviewer checkouts, all sitting on disk for the objective's entire life

Fix

A reviewer/validator never authors a PR, and a respawned manager always gets a fresh checkout (never inherits the old one) — so those checkouts are spent the moment the session goes terminal, no publish gate needed. The publish gate now applies only to roles that author a branch/PR (implementer, custom, researcher) via rolePublishesPR; every other terminal isolated checkout is reclaimed, still guarded by the existing in-use and pending-dependent-inherits checks.

The live manager is excluded automatically: an active manager is never terminal.

Test

TestReclaimWorkspaces_EagerWorkerCheckouts updated — the old test asserted a terminal manager checkout must be kept (the bug); it now models the live manager as running (kept via in-use) and adds cases proving a terminal reviewer and a dead manager checkout are reclaimed.

Companion to #60 (skip oversized submodules shrinks each checkout); this PR stops them accumulating.

…ctive Mid-active-objective workspace reclaim only freed a finished *worker* checkout once a PR had been published from its branch — it never touched reviewer, validator, or dead-manager checkouts. On a long-lived deno objective those dominate: a single failed reviewer's WPT + build checkout was 47G, and a manager that re-spawns adversarial reviews piled up a dozen multi-GB isolated checkouts that sat on disk for the objective's whole life. That filled the worker box to 100% and every new session (including manager respawns) died at 'clone from cache: No space left on device', stranding the objective on its exhausted manager budget. A reviewer/validator never authors a PR and a respawned manager always gets a fresh checkout (never inherits the old one), so those checkouts are spent the moment the session goes terminal — no publish gate needed. The publish gate now applies only to roles that author a branch/PR (implementer, custom, researcher); everything else terminal is reclaimed, still guarded by the in-use and pending-dependent-inherits checks.

…deadlock (#63) * fix(orch): keep reviewer checkouts with unmerged commits during reclaim A reviewer is told to commit any fix it makes, but that commit lives ONLY on its review branch in an ephemeral checkout — never pushed or merged. The mid-objective reclaim (#62) tore those checkouts down the moment the reviewer went terminal, silently destroying the work. (This bit us live: a reviewer's compiled-run follow-up commit was lost when its checkout was reclaimed.) reclaimSpentCheckouts now keeps any non-publishing checkout whose branch has advanced past its base (checkoutHasUnmergedCommits): disk is the cheaper loss. Published-PR checkouts stay exempt — their commits are safely on the PR. The probe is conservative: it only reclaims when there is provably nothing to lose (dir gone / not a repo / HEAD == base) and keeps the checkout on any uncertain probe of a still-present dir. * fix(exec): close stdin for empty feed-and-wait writes (manager-wedge deadlock) An empty Stdin was indistinguishable from an interactive session, so the executor left the stdin pipe open — and over SSH allocated a -tt pty. A feed-and-wait write with empty content (tee'ing a blanked memory file) then hung forever waiting for an EOF that never came. A single 0-byte repo-memory row wedged seedRepoMemory on resume, which deadlocked EVERY manager's StartRun and took the whole fleet down after a redeploy. Add Command.CloseStdin: it forces the pipe written-and-closed even when Stdin is empty, and suppresses the SSH pty so the EOF reaches the remote reader. writeWorkspaceFile (the tee that seeds memory) sets it, so an empty memory file now writes cleanly instead of hanging.

nathanwhit merged commit 99638e8 into main Jun 24, 2026
1 check passed

nathanwhit mentioned this pull request Jun 24, 2026

fix: keep unmerged reviewer commits during reclaim + fix empty-stdin deadlock #63

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(orch): reclaim reviewer/validator/dead-manager checkouts mid-objective#62

fix(orch): reclaim reviewer/validator/dead-manager checkouts mid-objective#62
nathanwhit merged 1 commit into
mainfrom
reclaim-non-worker-checkouts

nathanwhit commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nathanwhit commented Jun 24, 2026

Problem

Fix

Test

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant