
fix(orch): exempt managers from per-target worker capacity #67

Merged: nathanwhit merged 1 commit into main from manager-capacity-exempt on Jun 25, 2026
@nathanwhit


Symptom

On objective d3bdf790 (PR #35461), a user PR comment went unanswered for >5 min. Events showed pr_feedback_ingested but no followup_spawned after it — unlike every prior feedback round, which spawns a follow-up within ~2 min. The comment only got a follow-up ~9 min later, and only because a local slot happened to free.

Root cause

SyncPRFeedback runs IngestFeedback → ProcessFeedback → spawnPRFollowup, whose first step is SelectTarget(TargetRequest{}). It returned ErrNoTarget because every target was unplaceable:

  • local (cap 4): available_sessions = 0 — full of three idle interactive managers, each holding a per-target slot for its whole life.
  • Vultr (cap 16, 15 free): load_per_core = 1.89 ≥ the default -max-load-per-core 1.5, so the load gate skipped it.
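The two gates above can be replayed in a few lines. This is a minimal sketch, not orcha's actual SelectTarget: the Target struct, the selectTargetOld helper, and its signature are hypothetical, and the load value for local is assumed because the PR does not state it (it does not matter, since the capacity check rejects local first).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoTarget mirrors the sentinel named in the PR; its real definition is not shown there.
var ErrNoTarget = errors.New("no placeable target")

// Target is a hypothetical, simplified view of one placement target.
type Target struct {
	Name              string
	AvailableSessions int
	LoadPerCore       float64
}

// selectTargetOld sketches the pre-fix gates: a target is skipped when it has
// no free slot, or when its load per core is at or above the limit.
func selectTargetOld(targets []Target, maxLoadPerCore float64) (Target, error) {
	for _, t := range targets {
		if t.AvailableSessions <= 0 {
			continue // capacity gate
		}
		if t.LoadPerCore >= maxLoadPerCore {
			continue // load gate
		}
		return t, nil
	}
	return Target{}, ErrNoTarget
}

func main() {
	fleet := []Target{
		{Name: "local", AvailableSessions: 0, LoadPerCore: 0.2}, // load value assumed
		{Name: "vultr", AvailableSessions: 15, LoadPerCore: 1.89},
	}
	_, err := selectTargetOld(fleet, 1.5)
	fmt.Println(errors.Is(err, ErrNoTarget)) // prints: true
}
```

With these numbers neither target passes both gates, even though Vultr has 15 free slots.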

spawnPRFollowup returns that error without spawning a follow-up, marking the feedback handled, or emitting any event, so the failure is silently swallowed and retried invisibly on every 1-min monitor tick (ingest dedups → no event; spawn keeps failing → no event). Nothing in the events DB or orcha stdout surfaces it.

The design gap

orcha already exempts managers from the global worker-concurrency cap (Store.CountActiveWorkerSessions skips RoleManager; Scheduler.Tick lets managers bypass the budget), precisely because "managers sit idle most of their life ... counting them starves new work." But it left them charged against per-target capacity (available_sessions). So sum(capacity) (here 4+16=20) is a hard ceiling on resident managers — 20 managers ⇒ zero worker slots ⇒ fleet deadlock.

Fix

Extend the existing global-cap exemption to per-target capacity:

  • TargetRequest.IgnoreWorkerCapacity (set by targetRequestFor for managers) makes SelectTarget skip the AvailableSessions <= 0 check.
  • PlaceSession / releaseTargetSlot skip the claim/release for managers — both keyed on a single sessionExemptFromCapacity(sess) predicate (role == manager) so claim and release can never disagree.
  • Managers stay bounded by the load gate and the per-objective respawn limit.
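The shape of the fix listed above could look roughly like the following. It is a sketch under stated assumptions: the identifier names TargetRequest.IgnoreWorkerCapacity, targetRequestFor, sessionExemptFromCapacity, releaseTargetSlot, RoleManager, and ErrNoTarget come from the PR, but every signature, the Session and Target structs, RoleWorker, and the lowercase selectTarget and placeSession stand-ins are hypothetical simplifications.

```go
package main

import (
	"errors"
	"fmt"
)

type Role string

const (
	RoleManager Role = "manager"
	RoleWorker  Role = "worker" // assumed name; the PR only names RoleManager
)

// Session and Target are hypothetical, simplified stand-ins.
type Session struct{ Role Role }

type Target struct {
	Name              string
	AvailableSessions int
	LoadPerCore       float64
}

type TargetRequest struct{ IgnoreWorkerCapacity bool }

var ErrNoTarget = errors.New("no placeable target")

// sessionExemptFromCapacity is the single predicate shared by claim and
// release, so the two can never disagree about who holds a slot.
func sessionExemptFromCapacity(s Session) bool { return s.Role == RoleManager }

func targetRequestFor(s Session) TargetRequest {
	return TargetRequest{IgnoreWorkerCapacity: sessionExemptFromCapacity(s)}
}

// selectTarget skips the capacity gate for exempt requests but always
// applies the load gate, so managers stay bounded by load.
func selectTarget(targets []*Target, req TargetRequest, maxLoad float64) (*Target, error) {
	for _, t := range targets {
		if !req.IgnoreWorkerCapacity && t.AvailableSessions <= 0 {
			continue
		}
		if t.LoadPerCore >= maxLoad {
			continue
		}
		return t, nil
	}
	return nil, ErrNoTarget
}

// placeSession claims a slot only for non-exempt sessions.
func placeSession(t *Target, s Session) {
	if !sessionExemptFromCapacity(s) {
		t.AvailableSessions--
	}
}

// releaseTargetSlot returns a slot only for non-exempt sessions.
func releaseTargetSlot(t *Target, s Session) {
	if !sessionExemptFromCapacity(s) {
		t.AvailableSessions++
	}
}

func main() {
	local := &Target{Name: "local", AvailableSessions: 0, LoadPerCore: 0.2}
	fleet := []*Target{local}

	_, workerErr := selectTarget(fleet, targetRequestFor(Session{Role: RoleWorker}), 1.5)
	_, managerErr := selectTarget(fleet, targetRequestFor(Session{Role: RoleManager}), 1.5)
	fmt.Println(workerErr != nil, managerErr == nil) // prints: true true
}
```

Routing both the claim and the release through one predicate is what keeps the counter balanced: a session that never decremented available_sessions never increments it either.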

available_sessions is an incrementally-maintained counter, so make it self-healing: Store.ReconcileTargetSlots rebuilds it from live non-terminal, non-manager occupancy, called once in RecoverInterrupted at startup (before the scheduler's first tick). This both heals crash drift and frees the slots currently-running managers claimed under the old accounting when this deploys.
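The rebuild described above can be illustrated in memory. This sketch is not Store.ReconcileTargetSlots itself, which presumably works against the database: the structs and the reconcileTargetSlots helper are hypothetical, and both the formula (capacity minus counted occupancy) and the clamp at zero are assumptions about how the counter is derived.

```go
package main

import "fmt"

// Session and Target are hypothetical, simplified stand-ins for store rows.
type Session struct {
	Target   string
	Role     string
	Terminal bool
}

type Target struct {
	Name              string
	Capacity          int
	AvailableSessions int
}

// reconcileTargetSlots recomputes each target's counter from scratch,
// counting only live (non-terminal), non-manager sessions as occupancy.
// It ignores the previous counter value, so running it twice is a no-op.
func reconcileTargetSlots(targets []*Target, sessions []Session) {
	occupied := make(map[string]int)
	for _, s := range sessions {
		if s.Terminal || s.Role == "manager" {
			continue
		}
		occupied[s.Target]++
	}
	for _, t := range targets {
		free := t.Capacity - occupied[t.Name]
		if free < 0 {
			free = 0 // assumed clamp for over-committed targets
		}
		t.AvailableSessions = free
	}
}

func main() {
	// Counter as left by the old accounting: three managers and one worker
	// each hold a slot on a capacity-4 target.
	local := &Target{Name: "local", Capacity: 4, AvailableSessions: 0}
	sessions := []Session{
		{Target: "local", Role: "manager"},
		{Target: "local", Role: "manager"},
		{Target: "local", Role: "manager"},
		{Target: "local", Role: "worker"},
		{Target: "local", Role: "worker", Terminal: true},
	}
	reconcileTargetSlots([]*Target{local}, sessions)
	fmt.Println(local.AvailableSessions) // prints: 3
}
```

Because the result depends only on capacity and current occupancy, the same pass covers both cases the PR mentions: drift left by a crash and slots that managers claimed before the deploy.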

Tests

New internal/orch/capacity_test.go:

  • manager places on a full target while a worker is refused
  • placing/releasing a manager doesn't move available_sessions; a worker does
  • ReconcileTargetSlots excludes managers + terminal rows, heals a drifted counter, and is idempotent

go test ./... and go vet ./internal/... pass.

Not in this PR (follow-up)

The silent ErrNoTarget swallow in spawnPRFollowup is not addressed here: a placement-starved follow-up should emit a throttled audit event so that the failure is visible. Flagged for a small follow-up.
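One possible shape for that throttling, offered only as an illustration of the idea and not as the planned follow-up: the auditThrottle type, its allow method, the 5-minute interval, and the event name followup_placement_starved are all hypothetical. Time is passed in as Unix seconds to keep the sketch deterministic.

```go
package main

import "fmt"

// auditThrottle lets at most one event per key through per interval.
type auditThrottle struct {
	intervalSec int64
	last        map[string]int64 // key -> Unix seconds of last emitted event
}

func newAuditThrottle(intervalSec int64) *auditThrottle {
	return &auditThrottle{intervalSec: intervalSec, last: make(map[string]int64)}
}

// allow reports whether an event for key may be emitted at nowUnix,
// and records the emission time when it returns true.
func (a *auditThrottle) allow(key string, nowUnix int64) bool {
	if prev, ok := a.last[key]; ok && nowUnix-prev < a.intervalSec {
		return false
	}
	a.last[key] = nowUnix
	return true
}

func main() {
	th := newAuditThrottle(300) // assumed: one audit event per 5 minutes
	// Ten 1-minute monitor ticks, each failing placement for the same objective.
	for tick := int64(0); tick < 10; tick++ {
		if th.allow("d3bdf790", tick*60) {
			fmt.Printf("tick %d: emit followup_placement_starved\n", tick)
		}
	}
	// Emits on tick 0 and tick 5 only.
}
```

Keying the throttle on the objective keeps one starved objective from hiding another, while still bounding event volume for a follow-up that fails on every tick.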

Interactive managers are long-lived per-objective supervisors that sit
idle most of their life, yet each held a per-target session slot for its
whole life. With sum(capacity) slots across the fleet (e.g. 4+16=20), an
objective's PR feedback could not be placed once idle managers filled a
target and the only other target was load-gated — the wedge that left a
PR comment unanswered: SelectTarget -> ErrNoTarget, swallowed silently in
spawnPRFollowup and retried invisibly every monitor tick until a slot
happened to free.

The codebase already exempts managers from the GLOBAL worker-concurrency
cap (CountActiveWorkerSessions / Scheduler.Tick) for exactly this reason,
but left them charged against PER-TARGET capacity. Extend the same
exemption: managers place without claiming a worker slot (keyed on the
durable session role so claim and release stay in lockstep), and
SelectTarget ignores worker capacity for a manager request. Managers stay
bounded by the load gate and the per-objective respawn limit.

available_sessions is an incrementally-maintained counter, so make it
self-healing: ReconcileTargetSlots rebuilds it from live non-terminal,
non-manager occupancy at startup (before the scheduler's first tick),
healing both crash drift and the slots currently-running managers claimed
under the old accounting.
@nathanwhit merged commit 5e20e1b into main on Jun 25, 2026
1 check passed