Skip to content

reconnect workspace adapters after liveness loss#526

Open
QuanCheng-QC wants to merge 1 commit into
developfrom
bugfix/workspace-adapter-reconnect
Open

reconnect workspace adapters after liveness loss#526
QuanCheng-QC wants to merge 1 commit into
developfrom
bugfix/workspace-adapter-reconnect

Conversation

@QuanCheng-QC

Copy link
Copy Markdown
Collaborator

Fix workspace adapters getting stuck running but offline.

Adds automatic join retry and reconnect after heartbeat failures, while avoiding unsafe remote leave calls during session changes.

@vercel

vercel Bot commented Jun 27, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
openagents-workspace Ready Ready Preview, Comment Jun 27, 2026 8:39am

Request Review

@zomux zomux left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the reconnect design is solid (generation-counter model, AND-gated trigger, bounded jittered backoff, cursor/dedup preserved, never POSTs /leave on auto-exit). A few things to address before merge:

1. (Required) Control events issued during an outage are silently dropped on reconnect.
_runConnectedSession calls _skipExistingControlEvents() on every session start, including reconnects, which advances _lastControlId to head. So a /restart, /stop, or set_mode a user issues while the agent is disconnected is skipped after reconnect and never acted on. This is asymmetric with message events, which you correctly preserve via _lastEventId. Please gate _skipExistingControlEvents() on this._firstConnect (mirroring how _skipExistingEvents() is already gated), so stale control events are only skipped on the first connect after a fresh daemon start, not on in-process reconnects.

2. (Recommended) Classify HTTP 429 (and probably 408) as retryable.
_classifyError currently routes unexpected 4xx (incl. 429 rate-limit) to ambiguous, which is capped at 5 attempts then terminal join_failed. During a mass fleet reconnect after a server restart (the #492 scenario this helps with), 429s could exhaust the ambiguous budget and permanently drop agents that should keep retrying. Treat 429/408 as retryable.

3. (Please verify) Control poller isn't generation-scoped.
_controlPollerLoop loops on _running only, not gen. On reconnect the finally awaits the per-session control poller after _wakeControlPoller(); under a slow control endpoint this could stall re-join. A gen === this._activeGen guard in the control loop (or starting it once in the supervisor instead of per-session) would make this robust. The 728-line test stubs _controlPollerLoop to a no-op, so this path is currently untested — a test exercising the real loop on reconnect would help.

Also note: this branch is now CONFLICTING against develop (base.js has changed substantially via the merged #527 readiness work). It'll need a rebase. Happy to re-review promptly once these are addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants