reconnect workspace adapters after liveness loss by QuanCheng-QC · Pull Request #526 · openagents-org/openagents

QuanCheng-QC · 2026-06-27T08:39:20Z

Fix workspace adapters getting stuck in a running-but-offline state.

Adds automatic join retry and reconnect after heartbeat failures, while avoiding unsafe remote leave calls during session changes.

vercel · 2026-06-27T08:39:26Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
openagents-workspace	Ready	Preview, Comment	Jun 27, 2026 8:39am

zomux

Thanks for this — the reconnect design is solid (generation-counter model, AND-gated trigger, bounded jittered backoff, cursor/dedup preserved, never POSTs /leave on auto-exit). A few things to address before merge:

1. (Required) Control events issued during an outage are silently dropped on reconnect.
_runConnectedSession calls _skipExistingControlEvents() on every session start, including reconnects, which advances _lastControlId to head. So a /restart, /stop, or set_mode a user issues while the agent is disconnected is skipped after reconnect and never acted on. This is asymmetric with message events, which you correctly preserve via _lastEventId. Please gate _skipExistingControlEvents() on this._firstConnect (mirroring how _skipExistingEvents() is already gated), so stale control events are only skipped on the first connect after a fresh daemon start, not on in-process reconnects.
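To make the suggested gating concrete, here is a minimal sketch rather than the actual base.js code. Only the identifiers quoted above come from the PR; the class name AdapterSketch, the synchronous shape, the `server` object with `eventHead`/`controlHead`, and the `pendingControlIds` helper are assumptions made for illustration.

```javascript
// Hypothetical stand-in for the adapter. The real methods are presumably
// async and talk to the workspace API; this sketch only models the cursors.
class AdapterSketch {
  constructor(server) {
    this.server = server; // assumed shape: { eventHead, controlHead }
    this._firstConnect = true;
    this._lastEventId = 0;
    this._lastControlId = 0;
  }

  _skipExistingEvents() {
    this._lastEventId = this.server.eventHead;
  }

  _skipExistingControlEvents() {
    this._lastControlId = this.server.controlHead;
  }

  _runConnectedSession() {
    if (this._firstConnect) {
      // Fresh daemon start: anything that predates this process is stale.
      this._skipExistingEvents();
      this._skipExistingControlEvents();
      this._firstConnect = false;
    }
    // In-process reconnect: both cursors are left untouched, so control
    // events issued during the outage are still pending.
    return this.pendingControlIds();
  }

  // Hypothetical helper: ids of control events not yet acted on.
  pendingControlIds() {
    const ids = [];
    for (let id = this._lastControlId + 1; id <= this.server.controlHead; id++) {
      ids.push(id);
    }
    return ids;
  }
}

const server = { eventHead: 10, controlHead: 3 };
const adapter = new AdapterSketch(server);
const firstSession = adapter._runConnectedSession(); // history skipped
server.controlHead = 5; // two control events arrive while disconnected
const secondSession = adapter._runConnectedSession(); // reconnect sees them
console.log(firstSession, secondSession);
```

With the gate in place, the reconnect session still sees control events 4 and 5, which matches how message events are already preserved via _lastEventId.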

2. (Recommended) Classify HTTP 429 (and probably 408) as retryable.
_classifyError currently routes unexpected 4xx (incl. 429 rate-limit) to ambiguous, which is capped at 5 attempts then terminal join_failed. During a mass fleet reconnect after a server restart (the #492 scenario this helps with), 429s could exhaust the ambiguous budget and permanently drop agents that should keep retrying. Treat 429/408 as retryable.
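A rough sketch of the classification change, not the PR's actual _classifyError. The function name classifyStatus, the set name, and the treatment of 5xx as retryable are assumptions for illustration; the only behaviour taken from the review is that 429 and 408 move to retryable while other unexpected 4xx stay ambiguous.

```javascript
// Statuses that signal a transient condition on an otherwise valid request:
// 408 Request Timeout and 429 Too Many Requests.
const RETRYABLE_4XX = new Set([408, 429]);

// Hypothetical helper: maps an HTTP status to a retry class.
function classifyStatus(status) {
  if (RETRYABLE_4XX.has(status)) return "retryable";
  if (status >= 500) return "retryable"; // assumption: server errors keep retrying
  if (status >= 400) return "ambiguous"; // capped budget, then join_failed
  return "ok";
}

console.log(classifyStatus(429), classifyStatus(408), classifyStatus(418));
```

Keeping 429 and 408 out of the ambiguous bucket means a burst of rate-limit responses during a fleet-wide reconnect does not consume the 5-attempt budget.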

3. (Please verify) Control poller isn't generation-scoped.
_controlPollerLoop loops on _running only, not on gen. On reconnect, the finally block awaits the per-session control poller after calling _wakeControlPoller(); under a slow control endpoint this could stall re-join. A gen === this._activeGen guard in the control loop (or starting the poller once in the supervisor instead of per session) would make this robust. The 728-line test stubs _controlPollerLoop to a no-op, so this path is currently untested; a test exercising the real loop on reconnect would help.
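One way the generation guard could look, as a sketch under assumptions rather than the real loop. _controlPollerLoop, _running and _activeGen are the names used above; the class name, the injected `pollOnce` function, the `polls` array and the `demoGenerationGuard` helper are hypothetical.

```javascript
// Hypothetical stand-in for the adapter's control poller.
class ControlPollerSketch {
  constructor(pollOnce) {
    this._running = true;
    this._activeGen = 0;
    this._pollOnce = pollOnce; // assumed: async function performing one poll
    this.polls = []; // generation of each poll that was acted on
  }

  async _controlPollerLoop(gen) {
    // Exit when the daemon stops OR when a newer session generation exists.
    while (this._running && gen === this._activeGen) {
      await this._pollOnce();
      // Re-check after the await: the generation may have moved on while
      // the request was in flight, in which case the result is discarded.
      if (gen !== this._activeGen) break;
      this.polls.push(gen);
    }
  }
}

// Starts a loop for generation 0, then simulates a reconnect by bumping
// the active generation while the daemon itself keeps running.
async function demoGenerationGuard() {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  const poller = new ControlPollerSketch(() => sleep(1));
  const loop = poller._controlPollerLoop(0);
  await sleep(20);
  poller._activeGen = 1;
  await loop; // resolves even though _running is still true
  return poller;
}
```

The guard only lets the old loop exit once its current poll returns. A request already in flight against a slow endpoint still has to finish or be aborted, so the stall concern may additionally need a timeout or cancellation on the poll itself.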

Also note: this branch is now CONFLICTING against develop (base.js has changed substantially via the merged #527 readiness work). It'll need a rebase. Happy to re-review promptly once these are addressed.

reconnect workspace adapters after liveness loss

0f83f3a

zomux requested changes Jun 29, 2026

QuanCheng-QC wants to merge 1 commit into develop from bugfix/workspace-adapter-reconnect
