Feat/provider overload break #242

Merged
rennf93 merged 2 commits into master from feat/provider-overload-break on Jun 22, 2026

@rennf93 rennf93 commented Jun 22, 2026

Owner

No description provided.

rennf93 added 2 commits June 22, 2026 03:25
…500)

A 429 rate limit already parks a provider — queue its spawns, probe until it
recovers — but a persistent 529/500/503 overload had no such break: the run
died and the orchestrator crash-retried straight back into the overload,
burning tokens in a respawn loop.

Generalize the park to provider-unavailability. On a non-graceful Anthropic
agent exit, match the API's overload markers (overloaded_error /
internal_server_error / "API Error: 5xx") against the dead container's own
output and park the provider with kind="overloaded"; the existing spawn gate
already queues any parked provider, and the probe-resume loop revives the task
when it recovers. Grok keeps its exit-75 path; both now route through one
_park_provider_unavailable helper. Markers are kept specific so an agent that
merely writes about HTTP 500/529 can't trip the break.
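
The marker matching described above could look roughly like the following minimal sketch. The function names `looks_overloaded` and `unavailability_kind` and the exact regular expressions are assumptions made for illustration, not the code in this PR; only the three markers and kind="overloaded" come from the commit message.

```python
import re
from typing import Optional

# Assumed patterns: specific API error tokens rather than bare status numbers,
# so prose that merely mentions "HTTP 500" or "529" does not match.
_OVERLOAD_PATTERNS = (
    re.compile(r"\boverloaded_error\b"),
    re.compile(r"\binternal_server_error\b"),
    re.compile(r"API Error: 5\d{2}\b"),
)


def looks_overloaded(container_output: str) -> bool:
    """Return True if the dead container's output carries an overload marker."""
    return any(p.search(container_output) for p in _OVERLOAD_PATTERNS)


def unavailability_kind(graceful_exit: bool, container_output: str) -> Optional[str]:
    """Return "overloaded" for a non-graceful exit with a marker, else None."""
    if graceful_exit:
        return None
    return "overloaded" if looks_overloaded(container_output) else None
```

A caller would park the provider only when `unavailability_kind` returns a value, and fall back to the normal crash-retry path otherwise.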

Fix the recovery probe to require a 2xx: it treated any non-429 as recovered,
so a probe that itself got a 529 would have resumed agents straight back into
the overload — wrong for the new path and for a 429 that lifts into a 5xx.
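
The difference between the old and new probe checks can be sketched as below. Both function names are hypothetical, and the sketch assumes the probe exposes a plain integer HTTP status code; the PR's actual probe code is not shown here.

```python
def probe_recovered_legacy(status_code: int) -> bool:
    """Old behaviour as described: anything other than 429 counts as recovered."""
    return status_code != 429


def probe_recovered(status_code: int) -> bool:
    """Fixed behaviour as described: only a 2xx response counts as recovered."""
    return 200 <= status_code < 300
```

With the legacy check a probe that received a 529 would report recovery, which is the case the fix closes.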

Gated by ROBOCO_OVERLOAD_BREAK_ENABLED (default on; off => crash-retry).
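
One way such a default-on flag might be read is sketched here. Only the variable name ROBOCO_OVERLOAD_BREAK_ENABLED and its default come from the commit message; the helper name and the set of values treated as "off" are assumptions.

```python
import os
from typing import Mapping, Optional

# Assumed spellings of "off"; the real parser may accept a different set.
_FALSY = {"0", "false", "no", "off"}


def overload_break_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless the flag is explicitly set to an "off" value."""
    env = os.environ if environ is None else environ
    raw = env.get("ROBOCO_OVERLOAD_BREAK_ENABLED")
    if raw is None:
        return True  # default on
    return raw.strip().lower() not in _FALSY
```

When this returns False, the orchestrator would keep the earlier crash-retry behaviour instead of parking the provider.
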
@github-actions github-actions Bot added the labels dependencies, area: services, tests, area: gateway, and area: orchestrator Jun 22, 2026
@rennf93 rennf93 merged commit 01e10ad into master Jun 22, 2026
7 of 8 checks passed
@github-actions


Thanks for opening your first pull request on RoboCo!

Quick checklist before review (most of these are enforced by CI, but worth a glance):

  • make quality — ruff format check, ruff check, mypy, pytest (≥80% coverage), and the rest of the gate
  • Panel changes pass pnpm lint and pnpm exec tsc --noEmit (run from panel/)
  • No # noqa / # type: ignore shortcuts; pre-existing violations in touched files are fixed
  • Added an entry under ## [Unreleased] in CHANGELOG.md
  • Signed the CLA (the bot will prompt you on this PR)
  • Signed your commits — master requires verified signatures (SSH signing setup)
  • Updated any affected docs under docs/

See CONTRIBUTING.md for the full workflow and the Code of Conduct for the community standards we follow.

Welcome aboard — a maintainer will review shortly.

@github-project-automation github-project-automation Bot moved this from Backlog to Done in RoboCo Kanban Jun 22, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 22, 2026
@rennf93 rennf93 deleted the feat/provider-overload-break branch June 22, 2026 01:49

Labels

  • area: gateway: Touches roboco/services/gateway/ (Choreographer, verb surface)
  • area: orchestrator: Touches roboco/runtime/ (agent spawner, dispatch loops)
  • area: services: Touches roboco/services/ (business logic, side effects)
  • dependencies: pyproject.toml / uv.lock or panel package manifests
  • tests: Test suite changes

Projects

Status: Done

1 participant