Multi-agent workspace can enter runaway status/event loop after restart

## Summary

A busy multi-agent workspace can enter a self-amplifying loop after restart. Agents resume old channel work, emit large numbers of transient `status` / `thinking` / `todos` / queued-status messages, and poll `/v1/events` concurrently. This exhausts the backend DB connection pool and, in some cases, makes the local agent daemon or the browser frontend unusable under memory and request pressure.

This is not just stale UI noise. The underlying issue is that transient execution state is represented as durable `workspace.message.posted` events, while the runtime queue and frontend message state do not appear to have strong bounds, TTLs, or reconciliation for orphaned work.

## Scenario

This happened in a long-lived self-hosted workspace with several connected coding agents, using roles similar to:

- project-manager / orchestrator
- engineer
- QA
- reviewer
- architect
- requirements/test agents

The workspace had multiple active channels with historical collaboration traffic. Some agents had unfinished task reminders and some CLI-backed agents could not reliably produce a final response. After restarting the workspace services and agent daemon, the agents resumed old channels and started emitting process messages again.

The observed loop was roughly:

1. The agent daemon starts and multiple agents join the same workspace.
2. Each agent begins polling message/control/tool-result endpoints and sending heartbeats.
3. Agents resume old channel context and/or plan reminders.
4. The orchestrator agent dispatches follow-up work to other agents.
5. Some agents are busy, fail, or never respond, so new messages are queued.
6. The adapter emits durable status events such as:
   - `message queued -- will process after current task`
   - `processing queued message`
   - `thinking...`
   - `Still processing...`
   - `No response generated. Please try again.`
   - `todos` status messages
7. These events are persisted and later reloaded as workspace history.
8. On restart, remaining work/reminders cause the same agents to generate fresh process events again.

In the affected workspace, deleting old transient message events reduced load temporarily, but restarting the agent daemon regenerated a large number of new process events within minutes. That suggests the issue is not only accumulated history; it is also the lack of runtime guards/reconciliation when agents restart into a busy historical workspace.

## Observed symptoms

- The workspace page shows many historical `Queued:` items and process messages.
- Restarting the agent daemon can regenerate hundreds/thousands of transient `status` / `thinking` / `todos` events quickly.
- Backend logs show DB connection pool exhaustion under concurrent polling/heartbeat/event traffic, in a pattern like:

```text
sqlalchemy.exc.TimeoutError: QueuePool limit of size 40 overflow 8 reached, connection timed out
POST /v1/heartbeat -> 500
GET /v1/events?...type=workspace.message.posted&limit=500&after=... -> 500
```

- The agent daemon may need to be killed during shutdown or recovery.
- The browser frontend can become slow or fail to load because it rehydrates and renders a large amount of transient history.
- The system can become unusable unless the agent daemon is stopped or the workspace history/state is manually cleaned.

## Expected behavior

A workspace with stale/incomplete transient execution state should remain recoverable after restart.

At minimum:

- Queued work should have a bounded lifetime or explicit terminal state.
- Runtime per-channel queues should not grow without limits.
- Agents should not indefinitely replay or regenerate transient status messages from old reminders after restart.
- Frontend status bars should not infer active runtime queue state from unbounded historical message events alone.
- Backend polling should degrade gracefully under many agents and not exhaust the SQLAlchemy connection pool.

## Actual behavior

Transient execution state is stored as durable message events. If the expected terminal event is missing, or if agents resume old reminders on startup, the workspace can repeatedly recreate process/status messages. Multiple agents then poll the backend concurrently and can overload `/v1/events`, while the frontend keeps rehydrating/rendering historical transient messages.
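One way the concurrent polling pressure could be softened on the client side is jittered exponential backoff after failed polls, so that many agents restarting together do not retry in lockstep. The following is a minimal sketch, not the project's actual polling code: `next_poll_delay` and all of its parameters are hypothetical names, and the default intervals are placeholders.

```python
import random


def next_poll_delay(consecutive_failures, base_seconds=2.0, cap_seconds=60.0, rng=random):
    """Return how long a poller should wait before its next request.

    Hypothetical helper: after each failed poll (for example a 500 from
    /v1/events) the upper bound doubles, up to cap_seconds, and the actual
    delay is drawn at random between base_seconds and that bound so that
    many agents do not retry at the same instant.
    """
    if consecutive_failures <= 0:
        # Healthy path: poll at the normal interval.
        return base_seconds
    ceiling = min(cap_seconds, base_seconds * (2 ** consecutive_failures))
    return rng.uniform(base_seconds, ceiling)
```

A poller would reset its failure counter to zero after any successful response and increment it after each timeout or 5xx.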

## Suspected code paths

Potentially related areas:

- `packages/agent-connector/src/adapters/base.js`
  - Per-channel `_channelQueues[channel]` receives messages while the channel is busy.
  - Queue entries appear to be in-memory, with no obvious max size, TTL, or backpressure.
  - `queued_message` is emitted as a durable status event; `queue_status: processed` is only emitted when the queue is later drained.

- `workspace/frontend/components/chat/thread-status-bar.tsx`
  - Derives `Queued:` rows by scanning loaded message history for `metadata.queue_id` + `metadata.queued_message` and hiding only if a matching `queue_status: processed` exists.
  - This can treat orphaned historical status events as currently active queue state.

- `workspace/frontend/hooks/use-polling.ts`
  - Message state grows by appending new messages; there does not appear to be a sliding window for status/process messages.
  - Polling loops can amplify load if many events arrive quickly.

- `workspace/backend/app/routers/events.py`
  - `/v1/events` is heavily polled by agents/frontend.
  - Under multi-agent restart/load, backend logs can show SQLAlchemy connection pool exhaustion.

- Todo/reminder/system-message handling
  - On daemon restart, agents may continue processing historical plan reminders/system todos, causing fresh process-message generation even after old transient events are cleaned up.
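
To illustrate the status-bar concern above, here is a minimal sketch of deriving active queue rows from message history while ignoring orphaned entries. It is written in Python for brevity even though the component is TypeScript. The `metadata.queue_id`, `metadata.queued_message`, and `queue_status` names come from the description above; the `created_at` field, the placement of `queue_status` inside `metadata`, the TTL value, and the extra terminal statuses are assumptions.

```python
# Statuses that end a queue entry. "processed" is the one observed today;
# the others are assumed terminal states that do not exist yet.
TERMINAL_STATUSES = {"processed", "failed", "expired", "cancelled"}


def active_queue_entries(messages, now, ttl_seconds=300.0):
    """Return queued messages that are neither terminal nor older than the TTL.

    `messages` is an iterable of dicts with a numeric `created_at` timestamp
    (assumed field) and an optional `metadata` dict.
    """
    terminal_ids = set()
    queued = {}
    for message in messages:
        metadata = message.get("metadata") or {}
        queue_id = metadata.get("queue_id")
        if queue_id is None:
            continue
        if metadata.get("queue_status") in TERMINAL_STATUSES:
            terminal_ids.add(queue_id)
        elif metadata.get("queued_message"):
            queued[queue_id] = message
    return [
        message
        for queue_id, message in queued.items()
        if queue_id not in terminal_ids
        # Orphaned entries with no terminal event stop being shown after the TTL.
        and now - message["created_at"] <= ttl_seconds
    ]
```

With a rule like this, a `Queued:` row whose terminal event never arrived would age out of the status bar instead of being shown indefinitely.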

## Possible mitigations

Some ideas that may help:

1. Add max queue size, TTL, and backpressure to per-channel runtime queues in the agent connector.
2. Add hard timeouts and `finally` cleanup around channel busy state and queued work.
3. Emit terminal states (`failed`, `expired`, `cancelled`) for queued work when an agent exits, restarts, times out, or discards queued messages.
4. Do not render active queued state purely from durable message history, or apply a short TTL to queue UI entries.
5. Add a separate queue projection/state table or endpoint for active queue state instead of inferring it from event history.
6. Bound frontend message state/rendering for transient message types.
7. Add startup reconciliation so old reminders/queued work cannot create an unbounded multi-agent replay storm.
8. Add backend safeguards/rate limiting/coalescing around agent polling, especially `workspace.message.posted` with high limits.
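
Mitigations 1 and 3 could be combined roughly as follows. This is a minimal sketch in Python rather than the connector's JavaScript, and it is not the existing implementation: `BoundedChannelQueues`, the `emit` callback, the `rejected` status, and the default limits are all hypothetical. The point is that every entry that enters the queue leaves it with exactly one terminal status.

```python
import time
from collections import deque


class BoundedChannelQueues:
    """Per-channel FIFO queues with a size cap, a TTL, and terminal statuses."""

    def __init__(self, max_size=20, ttl_seconds=300.0, emit=None, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        # emit(queue_id, status) stands in for posting a status event (assumed hook).
        self._emit = emit or (lambda queue_id, status: None)
        self._clock = clock
        self._queues = {}

    def enqueue(self, channel, queue_id, body):
        """Queue a message; return False (backpressure) when the channel is full."""
        queue = self._queues.setdefault(channel, deque())
        self._expire(queue)
        if len(queue) >= self.max_size:
            self._emit(queue_id, "rejected")
            return False
        queue.append((queue_id, body, self._clock()))
        self._emit(queue_id, "queued")
        return True

    def dequeue(self, channel):
        """Return the next live (queue_id, body) pair, or None if nothing is pending."""
        queue = self._queues.get(channel)
        if not queue:
            return None
        self._expire(queue)
        if not queue:
            return None
        queue_id, body, _ = queue.popleft()
        self._emit(queue_id, "processed")
        return queue_id, body

    def shutdown(self):
        """Give every remaining entry a terminal status before the agent exits."""
        for queue in self._queues.values():
            while queue:
                self._emit(queue.popleft()[0], "cancelled")

    def _expire(self, queue):
        # Entries are in arrival order, so stale ones are always at the front.
        now = self._clock()
        while queue and now - queue[0][2] > self.ttl_seconds:
            self._emit(queue.popleft()[0], "expired")
```

Calling `shutdown()` from the daemon's exit path (ideally inside a `finally` block, per mitigation 2) would keep a restart from leaving queued entries without a terminal event.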

## Notes

I intentionally omitted local workspace identifiers, URLs, tokens, file system paths, and private agent names. The issue should be reproducible in principle with any long-lived self-hosted workspace containing multiple active agents, stale plan reminders, and at least one agent that can hang or fail to produce a final response.

