fix(orch): don't let infra-failed managers exhaust the respawn budget #64
Merged
A manager that dies at workspace-prepare without ever running is an infra failure (full disk, unreachable target), not a stuck objective. Yet it counted the same as a genuine failure against the 4-manager respawn cap. A transient host outage therefore burned the whole budget and permanently parked the objective even after the infra healed (this stranded a8641186 behind three disk-full respawns).

`managerRespawnExhausted` now counts only genuine attempts against `maxManagerSessions`: a manager that reached `running`, plus any still in flight (`StartedAt` is stamped only on the transition to running). A separate hard ceiling (`maxManagerSpawns`) on total spawns still bounds an objective whose target is persistently unplaceable, so it escalates instead of respawning forever.

Tests: `TestManagerRespawn_InfraDeathsDontExhaustBudget`, `TestManagerRespawn_HardCeilingBoundsInfraLoop`.
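
For readers without the diff open, the counting rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the `Manager` struct, its `EndedAt` field, the `genuine` method, the `fixtures` helper, and the function signature are all hypothetical. The value of `maxManagerSpawns` is not given in this description, so 12 is a placeholder. Only the 4-session cap and the `StartedAt` semantics come from the text.

```go
package main

import "time"

const (
	// From the description: the existing 4-manager respawn cap.
	maxManagerSessions = 4
	// ASSUMPTION: the real ceiling is not stated here; 12 is a placeholder.
	maxManagerSpawns = 12
)

// Manager is a hypothetical, trimmed-down session record.
type Manager struct {
	StartedAt *time.Time // stamped only on the transition to running
	EndedAt   *time.Time // assumed field: set once the session has terminated
}

// genuine reports whether this spawn should count against the session cap:
// it either reached running or is still in flight. A manager that ended
// without ever starting is treated as an infra death and is not counted.
func (m Manager) genuine() bool {
	inFlight := m.EndedAt == nil
	return m.StartedAt != nil || inFlight
}

// managerRespawnExhausted sketches the two-level bound: a hard ceiling on
// total spawns, and the session cap applied to genuine attempts only.
func managerRespawnExhausted(managers []Manager) bool {
	if len(managers) >= maxManagerSpawns {
		return true // persistently unplaceable target: escalate
	}
	count := 0
	for _, m := range managers {
		if m.genuine() {
			count++
		}
	}
	return count >= maxManagerSessions
}

// fixtures builds a spawn history for illustration: managers that ran and
// ended, managers that died before running, and managers still in flight.
func fixtures(ran, infraDead, inFlight int) []Manager {
	now := time.Now()
	var ms []Manager
	for i := 0; i < ran; i++ {
		ms = append(ms, Manager{StartedAt: &now, EndedAt: &now})
	}
	for i := 0; i < infraDead; i++ {
		ms = append(ms, Manager{EndedAt: &now})
	}
	for i := 0; i < inFlight; i++ {
		ms = append(ms, Manager{})
	}
	return ms
}
```

Under these assumptions, `managerRespawnExhausted(fixtures(0, 3, 0))` is false, so three disk-full deaths leave the budget intact, while `fixtures(4, 0, 0)` and `fixtures(0, 12, 0)` both report exhaustion, the first through the session cap and the second through the hard ceiling.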