fix(orch): exempt managers from per-target worker capacity by nathanwhit · Pull Request #67 · nathanwhit/orcha

nathanwhit · 2026-06-25T01:16:47Z

Symptom

On objective d3bdf790 (PR #35461), a user PR comment went unanswered for >5 min. Events showed pr_feedback_ingested but no followup_spawned after it — unlike every prior feedback round, which spawns a follow-up within ~2 min. The comment only got a follow-up ~9 min later, and only because a local slot happened to free.

Root cause

SyncPRFeedback runs IngestFeedback → ProcessFeedback → spawnPRFollowup, whose first step is SelectTarget(TargetRequest{}). It returned ErrNoTarget because every target was unplaceable:

local (cap 4): available_sessions = 0 — full of three idle interactive managers, each holding a per-target slot for its whole life.
Vultr (cap 16, 15 free): load_per_core = 1.89 ≥ the default -max-load-per-core 1.5, so the load gate skipped it.

spawnPRFollowup returns that error without spawning, marking the feedback handled, or emitting any event — so it is silently swallowed and retried invisibly every 1-min monitor tick (ingest dedups → no event; spawn keeps failing → no event). Nothing in the events DB or orcha stdout surfaces it.

The design gap

orcha already exempts managers from the global worker-concurrency cap (Store.CountActiveWorkerSessions skips RoleManager; Scheduler.Tick lets managers bypass the budget), precisely because "managers sit idle most of their life ... counting them starves new work." But it left them charged against per-target capacity (available_sessions). So sum(capacity) (here 4+16=20) is a hard ceiling on managers and workers combined: once 20 managers are resident, no worker slot remains anywhere and the fleet deadlocks.
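The mismatch between the two accounting rules can be illustrated with a small model. This is a sketch under assumptions: the Session struct, the role string values, and both function names are hypothetical, and only the capacities (4 and 16) and the RoleManager name come from the description.

```go
package main

import "fmt"

type Role string

const (
	RoleManager Role = "manager" // string values are assumed
	RoleWorker  Role = "worker"
)

// Session is a hypothetical stand-in for a placed session.
type Session struct {
	Role   Role
	Target string
}

// countActiveWorkers models the global-cap rule: managers are not counted.
func countActiveWorkers(sessions []Session) int {
	n := 0
	for _, s := range sessions {
		if s.Role != RoleManager {
			n++
		}
	}
	return n
}

// freeSlotsOldAccounting models the pre-fix per-target rule:
// every session, manager or not, consumes one slot on its target.
func freeSlotsOldAccounting(capacity map[string]int, sessions []Session) map[string]int {
	free := make(map[string]int, len(capacity))
	for name, c := range capacity {
		free[name] = c
	}
	for _, s := range sessions {
		free[s.Target]--
	}
	return free
}

func main() {
	capacity := map[string]int{"local": 4, "vultr": 16}
	var sessions []Session
	for i := 0; i < 4; i++ {
		sessions = append(sessions, Session{Role: RoleManager, Target: "local"})
	}
	for i := 0; i < 16; i++ {
		sessions = append(sessions, Session{Role: RoleManager, Target: "vultr"})
	}
	free := freeSlotsOldAccounting(capacity, sessions)
	fmt.Println(countActiveWorkers(sessions), free["local"], free["vultr"]) // 0 0 0
}
```

With 20 resident managers the global budget reports zero active workers, yet no target has a free slot, which is the deadlock described above.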

Fix

Extend the existing global-cap exemption to per-target capacity:

TargetRequest.IgnoreWorkerCapacity (set by targetRequestFor for managers) makes SelectTarget skip the AvailableSessions <= 0 check.
PlaceSession / releaseTargetSlot skip the claim/release for managers — both keyed on a single sessionExemptFromCapacity(sess) predicate (role == manager) so claim and release can never disagree.
Managers stay bounded by the load gate and the per-objective respawn limit.

available_sessions is an incrementally-maintained counter, so make it self-healing: Store.ReconcileTargetSlots rebuilds it from live non-terminal, non-manager occupancy, called once in RecoverInterrupted at startup (before the scheduler's first tick). This both heals crash drift and frees the slots currently-running managers claimed under the old accounting when this deploys.
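A reconciliation pass of this kind might look like the sketch below. It is an in-memory model with assumed types; the real Store.ReconcileTargetSlots presumably runs against the database. Clamping the result at zero is an extra assumption of this sketch, not something the description states.

```go
package main

import "fmt"

type Role string

const (
	RoleManager Role = "manager" // string values are assumed
	RoleWorker  Role = "worker"
)

// Session is a hypothetical row: role, target, and whether it has reached a terminal state.
type Session struct {
	Role     Role
	Target   string
	Terminal bool
}

type Target struct {
	Name              string
	Capacity          int
	AvailableSessions int
}

// reconcileTargetSlots recomputes each counter from live occupancy,
// ignoring terminal sessions and managers.
func reconcileTargetSlots(targets []*Target, sessions []Session) {
	occupied := make(map[string]int)
	for _, s := range sessions {
		if s.Terminal || s.Role == RoleManager {
			continue
		}
		occupied[s.Target]++
	}
	for _, t := range targets {
		free := t.Capacity - occupied[t.Name]
		if free < 0 {
			free = 0 // assumption: never report negative capacity
		}
		t.AvailableSessions = free
	}
}

func main() {
	// Counter says 0 free: three managers and one worker were charged under the old accounting.
	local := &Target{Name: "local", Capacity: 4, AvailableSessions: 0}
	sessions := []Session{
		{Role: RoleManager, Target: "local"},
		{Role: RoleManager, Target: "local"},
		{Role: RoleManager, Target: "local"},
		{Role: RoleWorker, Target: "local"},
		{Role: RoleWorker, Target: "local", Terminal: true},
	}
	reconcileTargetSlots([]*Target{local}, sessions)
	first := local.AvailableSessions
	reconcileTargetSlots([]*Target{local}, sessions) // second run changes nothing
	fmt.Println(first, local.AvailableSessions)      // 3 3
}
```

Because the counter is recomputed from scratch rather than adjusted, the pass does not depend on the counter's previous value, which is what makes it idempotent.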

Tests

New internal/orch/capacity_test.go:

manager places on a full target while a worker is refused
placing/releasing a manager doesn't move available_sessions; a worker does
ReconcileTargetSlots excludes managers + terminal rows, heals a drifted counter, and is idempotent

go test ./... and go vet ./internal/... pass.

Not in this PR (follow-up)

The silent ErrNoTarget swallow in spawnPRFollowup: a placement-starved follow-up should emit a throttled audit event so the failure shows up in the events DB instead of being retried invisibly. Flagged for a small follow-up.

Interactive managers are long-lived per-objective supervisors that sit idle most of their life, yet each held a per-target session slot for its whole life. With sum(capacity) slots across the fleet (e.g. 4+16=20), an objective's PR feedback could not be placed once idle managers filled a target and the only other target was load-gated. That is the wedge that left a PR comment unanswered: SelectTarget -> ErrNoTarget, swallowed silently in spawnPRFollowup and retried invisibly every monitor tick until a slot happened to free.

The codebase already exempts managers from the GLOBAL worker-concurrency cap (CountActiveWorkerSessions / Scheduler.Tick) for exactly this reason, but left them charged against PER-TARGET capacity. Extend the same exemption: managers place without claiming a worker slot (keyed on the durable session role so claim and release stay in lockstep), and SelectTarget ignores worker capacity for a manager request. Managers stay bounded by the load gate and the per-objective respawn limit.

available_sessions is an incrementally-maintained counter, so make it self-healing: ReconcileTargetSlots rebuilds it from live non-terminal, non-manager occupancy at startup (before the scheduler's first tick), healing both crash drift and the slots currently-running managers claimed under the old accounting.

nathanwhit merged commit 5e20e1b into main Jun 25, 2026
1 check passed

nathanwhit merged 1 commit into main from manager-capacity-exempt
