Summary
A busy multi-agent workspace can enter a self-amplifying loop after restart: agents resume old channel work, emit large numbers of transient status / thinking / todos / queued status messages, poll /v1/events concurrently, exhaust the backend DB connection pool, and in some cases cause the local agent daemon or browser/frontend to become unusable due to memory and request pressure.
This is not just stale UI noise. The underlying issue is that transient execution state is represented as durable workspace.message.posted events, while the runtime queue and frontend message state do not appear to have strong bounds, TTLs, or reconciliation for orphaned work.
Scenario
This happened in a long-lived self-hosted workspace with several connected coding agents, using roles similar to:
- project-manager / orchestrator
- engineer
- QA
- reviewer
- architect
- requirements/test agents
The workspace had multiple active channels with historical collaboration traffic. Some agents had unfinished task reminders and some CLI-backed agents could not reliably produce a final response. After restarting the workspace services and agent daemon, the agents resumed old channels and started emitting process messages again.
The observed loop was roughly:
- The agent daemon starts and multiple agents join the same workspace.
- Each agent begins polling message/control/tool-result endpoints and sending heartbeats.
- Agents resume old channel context and/or plan reminders.
- The orchestrator agent dispatches follow-up work to other agents.
- Some agents are busy, fail, or produce no response, so new messages are queued.
- The adapter emits durable status events such as:
  - message queued -- will process after current task
  - processing queued message
  - thinking...
  - Still processing...
  - No response generated. Please try again.
  - todos status messages
- These events are persisted and later reloaded as workspace history.
- On restart, remaining work/reminders cause the same agents to generate fresh process events again.
In the affected workspace, deleting old transient message events reduced load temporarily, but restarting the agent daemon regenerated a large number of new process events within minutes. That suggests the issue is not only accumulated history; it is also the lack of runtime guards/reconciliation when agents restart into a busy historical workspace.
Observed symptoms
- The workspace page shows many historical Queued: items and process messages.
- Restarting the agent daemon can regenerate hundreds/thousands of transient status / thinking / todos events quickly.
- Backend logs show DB connection pool exhaustion under concurrent polling/heartbeat/event traffic, following this pattern:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 40 overflow 8 reached, connection timed out
POST /v1/heartbeat -> 500
GET /v1/events?...type=workspace.message.posted&limit=500&after=... -> 500
- The agent daemon may need to be killed during shutdown or recovery.
- Browser/frontend can become difficult to load because it rehydrates and renders a large amount of transient history.
- The system can become unusable unless the agent daemon is stopped or the workspace history/state is manually cleaned.
Expected behavior
A workspace with stale/incomplete transient execution state should remain recoverable after restart.
At minimum:
- Queued work should have a bounded lifetime or explicit terminal state.
- Runtime per-channel queues should not grow without limits.
- Agents should not indefinitely replay or regenerate transient status messages from old reminders after restart.
- Frontend status bars should not infer active runtime queue state from unbounded historical message events alone.
- Backend polling should degrade gracefully under many agents and not exhaust the SQLAlchemy connection pool.
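One way to read "degrade gracefully" on the client side is for each poller to slow down when the backend starts returning 5xx, instead of retrying at a fixed rate. The sketch below is only an illustration of exponential backoff with jitter. `next_poll_delay`, its parameters, and the default cap are hypothetical names and values, not taken from the connector or frontend code.

```python
import random


def next_poll_delay(base_s, consecutive_failures, cap_s=60.0, rng=random.random):
    """Return the delay in seconds before the next poll.

    Hypothetical helper: with no recent failures the poller keeps its base
    interval; after failures the upper bound doubles per failure (up to
    cap_s) and the actual delay is picked at random between base_s and that
    bound, so many agents do not retry in lockstep.
    """
    if consecutive_failures <= 0:
        return base_s
    ceiling = min(cap_s, base_s * (2 ** consecutive_failures))
    return base_s + rng() * (ceiling - base_s)
```

A poller would reset `consecutive_failures` to zero after any successful response and increment it after a 500 or a timeout.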
Actual behavior
Transient execution state is stored as durable message events. If the expected terminal event is missing, or if agents resume old reminders on startup, the workspace can repeatedly recreate process/status messages. Multiple agents then poll the backend concurrently and can overload /v1/events, while the frontend keeps rehydrating/rendering historical transient messages.
Suspected code paths
Potentially related areas:
- packages/agent-connector/src/adapters/base.js
  - Per-channel _channelQueues[channel] receives messages while the channel is busy.
  - Queue entries appear to be in-memory, with no obvious max size, TTL, or backpressure.
  - queued_message is emitted as a durable status event; queue_status: processed is only emitted when the queue is later drained.
- workspace/frontend/components/chat/thread-status-bar.tsx
  - Derives Queued: rows by scanning loaded message history for metadata.queue_id + metadata.queued_message and hiding only if a matching queue_status: processed exists.
  - This can treat orphaned historical status events as currently active queue state.
- workspace/frontend/hooks/use-polling.ts
  - Message state grows by appending new messages; there does not appear to be a sliding window for status/process messages.
  - Polling loops can amplify load if many events arrive quickly.
- workspace/backend/app/routers/events.py
  - /v1/events is heavily polled by agents/frontend.
  - Under multi-agent restart/load, backend logs can show SQLAlchemy connection pool exhaustion.
- Todo/reminder/system-message handling
  - On daemon restart, agents may continue processing historical plan reminders/system todos, causing fresh process-message generation even after old transient events are cleaned up.
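The status-bar derivation described above can be modeled in a few lines to show why orphaned entries never disappear. This is a simplified Python model, not the actual TSX code. The `metadata.queue_id`, `metadata.queued_message`, and `queue_status: processed` fields come from the description above; the `created_at` timestamp, the extra terminal statuses, and the `ttl_s` parameter are assumptions added for illustration.

```python
# "processed" is the only terminal status described in this issue; the other
# three are hypothetical statuses a fix might introduce.
TERMINAL_STATUSES = {"processed", "failed", "expired", "cancelled"}


def active_queue_ids(events, now, ttl_s=None):
    """Return queue ids that a status bar would still show as queued.

    events: iterable of dicts with a "metadata" dict and an assumed
    "created_at" timestamp in seconds. With ttl_s=None this mirrors the
    history-scan behavior: an entry stays active until a terminal event
    with the same queue_id is seen.
    """
    queued_at = {}
    closed = set()
    for ev in events:
        meta = ev.get("metadata", {})
        qid = meta.get("queue_id")
        if qid is None:
            continue
        if meta.get("queue_status") in TERMINAL_STATUSES:
            closed.add(qid)
        elif meta.get("queued_message"):
            # Remember when the entry was first queued.
            queued_at.setdefault(qid, ev["created_at"])
    active = []
    for qid, ts in queued_at.items():
        if qid in closed:
            continue
        if ttl_s is not None and now - ts > ttl_s:
            continue  # too old to be plausible live queue state
        active.append(qid)
    return active
```

Without a TTL, a queued event whose terminal event was lost in a restart is reported as active indefinitely. With a TTL, the same event stops being shown once it is older than the limit.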
Possible mitigations
Some ideas that may help:
- Add max queue size, TTL, and backpressure to per-channel runtime queues in the agent connector.
- Add hard timeouts and finally cleanup around channel busy state and queued work.
- Emit terminal states (failed, expired, cancelled) for queued work when an agent exits, restarts, times out, or discards queued messages.
- Do not render active queued state purely from durable message history, or apply a short TTL to queue UI entries.
- Add a separate queue projection/state table or endpoint for active queue state instead of inferring it from event history.
- Bound frontend message state/rendering for transient message types.
- Add startup reconciliation so old reminders/queued work cannot create an unbounded multi-agent replay storm.
- Add backend safeguards/rate limiting/coalescing around agent polling, especially workspace.message.posted with high limits.
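As a rough illustration of the first three ideas (max size, TTL, and explicit terminal states), here is a minimal sketch of a bounded per-channel queue. It is written in Python for brevity even though the connector is JavaScript, and it is not a proposal for the actual implementation. `BoundedChannelQueue`, its method names, and the default limits are hypothetical.

```python
import time
from collections import deque


class BoundedChannelQueue:
    """Hypothetical per-channel queue with a size cap, a TTL, and a record
    of terminal states for every entry that leaves the queue."""

    def __init__(self, max_size=50, ttl_s=300.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_s = ttl_s
        self._clock = clock
        self._items = deque()        # (queue_id, enqueued_at, message)
        self.terminal_events = []    # (queue_id, status), oldest first

    def offer(self, queue_id, message):
        """Enqueue a message; return False (backpressure) when full."""
        self._expire()
        if len(self._items) >= self.max_size:
            return False
        self._items.append((queue_id, self._clock(), message))
        return True

    def take(self):
        """Pop the oldest live entry, or None if the queue is empty.

        The entry is recorded as "processed" here for simplicity; a real
        adapter would likely record it only after the handler finishes.
        """
        self._expire()
        if not self._items:
            return None
        queue_id, _, message = self._items.popleft()
        self.terminal_events.append((queue_id, "processed"))
        return queue_id, message

    def cancel_all(self):
        """Mark everything still queued as cancelled, e.g. on shutdown."""
        while self._items:
            queue_id, _, _ = self._items.popleft()
            self.terminal_events.append((queue_id, "cancelled"))

    def _expire(self):
        # Entries are in arrival order, so expired ones are at the front.
        now = self._clock()
        while self._items and now - self._items[0][1] > self.ttl_s:
            queue_id, _, _ = self._items.popleft()
            self.terminal_events.append((queue_id, "expired"))
```

The point of the design is that every entry leaves the queue with exactly one terminal status, so a consumer of the event history never has to guess whether an old queued entry is still live.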
Notes
I intentionally omitted local workspace identifiers, URLs, tokens, file system paths, and private agent names. The issue is reproducible conceptually with any long-lived self-hosted workspace containing multiple active agents, stale plan reminders, and at least one agent that can hang or fail to produce a final response.