fix(orch): exempt managers from per-target worker capacity#67
Merged
Conversation
Interactive managers are long-lived per-objective supervisors that sit idle most of their life, yet each held a per-target session slot for its whole life. With sum(capacity) slots across the fleet (e.g. 4+16=20), an objective's PR feedback could not be placed once idle managers filled a target and the only other target was load-gated — the wedge that left a PR comment unanswered: SelectTarget -> ErrNoTarget, swallowed silently in spawnPRFollowup and retried invisibly every monitor tick until a slot happened to free. The codebase already exempts managers from the GLOBAL worker-concurrency cap (CountActiveWorkerSessions / Scheduler.Tick) for exactly this reason, but left them charged against PER-TARGET capacity. Extend the same exemption: managers place without claiming a worker slot (keyed on the durable session role so claim and release stay in lockstep), and SelectTarget ignores worker capacity for a manager request. Managers stay bounded by the load gate and the per-objective respawn limit. available_sessions is an incrementally-maintained counter, so make it self-healing: ReconcileTargetSlots rebuilds it from live non-terminal, non-manager occupancy at startup (before the scheduler's first tick), healing both crash drift and the slots currently-running managers claimed under the old accounting.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symptom
On objective
d3bdf790(PR #35461), a user PR comment went unanswered for >5 min. Events showedpr_feedback_ingestedbut nofollowup_spawnedafter it — unlike every prior feedback round, which spawns a follow-up within ~2 min. The comment only got a follow-up ~9 min later, and only because alocalslot happened to free.Root cause
SyncPRFeedbackrunsIngestFeedback → ProcessFeedback → spawnPRFollowup, whose first step isSelectTarget(TargetRequest{}). It returnedErrNoTargetbecause every target was unplaceable:local(cap 4):available_sessions = 0— full of three idle interactive managers, each holding a per-target slot for its whole life.Vultr(cap 16, 15 free):load_per_core = 1.89 ≥the default-max-load-per-core 1.5, so the load gate skipped it.spawnPRFollowupreturns that error without spawning, marking the feedback handled, or emitting any event — so it is silently swallowed and retried invisibly every 1-min monitor tick (ingest dedups → no event; spawn keeps failing → no event). Nothing in the events DB or orcha stdout surfaces it.The design gap
orcha already exempts managers from the global worker-concurrency cap (
Store.CountActiveWorkerSessionsskipsRoleManager;Scheduler.Ticklets managers bypass the budget), precisely because "managers sit idle most of their life ... counting them starves new work." But it left them charged against per-target capacity (available_sessions). Sosum(capacity)(here 4+16=20) is a hard ceiling on resident managers — 20 managers ⇒ zero worker slots ⇒ fleet deadlock.Fix
Extend the existing global-cap exemption to per-target capacity:
TargetRequest.IgnoreWorkerCapacity(set bytargetRequestForfor managers) makesSelectTargetskip theAvailableSessions <= 0check.PlaceSession/releaseTargetSlotskip the claim/release for managers — both keyed on a singlesessionExemptFromCapacity(sess)predicate (role == manager) so claim and release can never disagree.available_sessionsis an incrementally-maintained counter, so make it self-healing:Store.ReconcileTargetSlotsrebuilds it from live non-terminal, non-manager occupancy, called once inRecoverInterruptedat startup (before the scheduler's first tick). This both heals crash drift and frees the slots currently-running managers claimed under the old accounting when this deploys.Tests
New
internal/orch/capacity_test.go:available_sessions; a worker doesReconcileTargetSlotsexcludes managers + terminal rows, heals a drifted counter, and is idempotentgo test ./...andgo vet ./internal/...pass.Not in this PR (follow-up)
The silent
ErrNoTargetswallow inspawnPRFollowup— a placement-starved follow-up should emit a throttled audit event so it's visible instead of invisible. Flagged for a small follow-up.