Web UI silently hydrates long conversations to oldest 200 messages — newest messages disappear after refresh or mid-session SSE recovery (DB is OK, UI bug) #1531

@ztech-gthb

Summary

For any conversation with more than 200 persisted messages, the Web UI shows only the oldest 200 and silently drops everything newer. Three independent defects in core, server, and web add up to the visible behaviour:

  1. DB query returns oldest 200 (ORDER BY created_at ASC LIMIT 200)
  2. HTTP API hardcodes a default limit of 200 and a maximum of 500, with no cursor-based pagination
  3. UI's stuck-placeholder recovery re-runs the same query and overwrites the React message state — including live messages that arrived via SSE during the current session

The DB preserves all messages. The loss is purely UI-side. But it is silent: no error, no "loaded 200 of N" indicator, no "load more" affordance, and no day separator that could tip the user off about a temporal jump.
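To make the window concrete, here is a minimal in-memory sketch of what an oldest-first query with a limit returns. The `Row` shape and the `oldestFirstWindow` helper are hypothetical and only mimic `ORDER BY created_at ASC LIMIT 200`; they are not the actual code in messages.ts.

```typescript
// Hypothetical row shape; only the ordering column matters for this sketch.
interface Row {
  id: number;
  createdAt: number; // epoch milliseconds
}

// Mimics ORDER BY created_at ASC LIMIT n on an in-memory array.
function oldestFirstWindow(rows: Row[], limit: number): Row[] {
  return [...rows].sort((a, b) => a.createdAt - b.createdAt).slice(0, limit);
}

// A 250-message conversation; ids 1..250 are in chronological order.
const conversation: Row[] = Array.from({ length: 250 }, (_, i) => ({
  id: i + 1,
  createdAt: 1000 * (i + 1),
}));

const visible = oldestFirstWindow(conversation, 200);
const lastVisibleId = visible[visible.length - 1].id;
const droppedCount = conversation.length - visible.length;

// The newest 50 messages (ids 201..250) never reach the UI.
console.log(`last visible id: ${lastVisibleId}, dropped: ${droppedCount}`);
```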

Important: this can happen without a page refresh

Most users would assume the cut only occurs after a hard refresh. Defect (3) makes it happen mid-session as well: any condition that triggers onLockChange(false) while a streaming placeholder still has empty content (isStreaming: true, content: '') leads ChatInterface.tsx:443-461 to fetch from the server and replace local state. Conditions that trigger this:

  • conversation_lock=false SSE event without preceding text events (e.g. transient SSE drop, deterministic command path)
  • workflow_status SSE event with completed | failed | cancelled (useSSE.ts:176-185)
  • The mount-time fetch race the surrounding comment explicitly documents: "On the first message of a new conversation, navigate() causes a component remount and a fresh SSE connection; if text events were emitted before the connection established, only the lock-release event is received. We detect this race and re-fetch via REST."

When the user is in a long-running conversation and a workflow completes, or an SSE connection blips during a busy period (e.g. multiple workspace-sync events firing in quick succession from #1516), the React state can be silently rewound to the oldest 200 messages without any user action.

Trigger conditions

For (1)+(2) (refresh path):

  • Conversation has > 200 persisted messages, AND
  • User refreshes page, opens conversation in new tab, or otherwise causes ChatInterface to mount

For (3) (mid-session path), additionally:

  • React state contains at least one streaming placeholder (isStreaming: true, content: '') at the moment of an onLockChange(false) event
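The mid-session condition can be written as a small predicate. This is a sketch under assumptions: `UiMessage`, `hasStuckPlaceholder`, and `shouldRefetchOnLockChange` are hypothetical names, and only the `isStreaming` and `content` fields come from the description above.

```typescript
// Assumed minimal message shape; isStreaming and content are the fields named in this issue.
interface UiMessage {
  id: string;
  content: string;
  isStreaming?: boolean;
}

// True when at least one streaming placeholder never received any text.
function hasStuckPlaceholder(messages: UiMessage[]): boolean {
  return messages.some((m) => m.isStreaming === true && m.content === '');
}

// Hypothetical guard: recovery fires only when the lock is released
// while a stuck placeholder is still present in state.
function shouldRefetchOnLockChange(locked: boolean, messages: UiMessage[]): boolean {
  return !locked && hasStuckPlaceholder(messages);
}

const stateWithPlaceholder: UiMessage[] = [
  { id: 'u1', content: 'hello' },
  { id: 'a1', content: '', isStreaming: true },
];
console.log(shouldRefetchOnLockChange(false, stateWithPlaceholder)); // true
```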

Reproduction

Repro A — refresh path:

  1. Have or build a conversation with > 200 messages (any free-text chat or workflow run that emits many SSE text events; >200 turns, or a longer-running workflow with many node messages).
  2. In Web UI, refresh the page (or open the conversation URL in a new tab).
  3. Scroll to the top, then to the bottom: the UI ends at message 200 (chronologically). All messages numbered 201 to N are absent.
  4. Confirm via DB:
       SELECT COUNT(*) FROM remote_agent_messages WHERE conversation_id = '<id>';
       -- returns a count > 200, but the UI shows only 200
  5. Confirm via API (replace <host> with the server address):
       curl -s 'http://<host>/api/conversations/<id>/messages' | jq 'length'            # 200
       curl -s 'http://<host>/api/conversations/<id>/messages?limit=500' | jq 'length'  # up to 500, still ASC

Repro B — mid-session path (no refresh):

  1. Use the same conversation with > 200 messages while actively chatting (live SSE).
  2. Trigger any of:
     • Run a workflow (/workflow run X); when workflow_status: completed | failed | cancelled fires, onLockChange(false) is dispatched.
     • Wait for any transient SSE reconnect during a free-text response with a slow first text event (the comment in code says this is observable on first message of a new conversation, but it generalises).
  3. Observe: any messages currently in React state but outside the server's oldest-200 window vanish from the UI. The "system status" messages (sync events) are interleaved back in, but real chat content newer than the server's oldest-200 window disappears.
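The effect of Repro B can be simulated without a server. The sketch below assumes a simplified state with no system messages; `ChatMsg` and `recoverByReplace` are hypothetical names that only model the replace semantics described in this issue, not the actual ChatInterface.tsx code.

```typescript
// Hypothetical, simplified message shape.
interface ChatMsg {
  id: number;
  content: string;
}

// Replace semantics as described above: the fetched rows become the whole state.
function recoverByReplace(_local: ChatMsg[], fetched: ChatMsg[]): ChatMsg[] {
  return fetched;
}

// 205 persisted messages, ids 1..205 in chronological order.
const allMessages: ChatMsg[] = Array.from({ length: 205 }, (_, i) => ({
  id: i + 1,
  content: `msg ${i + 1}`,
}));

const serverResponse = allMessages.slice(0, 200); // oldest 200, as the API returns today
const localState = allMessages.slice(5); // assumed client state: messages 6..205

const afterRecovery = recoverByReplace(localState, serverResponse);
const lostIds = localState
  .filter((m) => !afterRecovery.some((r) => r.id === m.id))
  .map((m) => m.id);

console.log(lostIds); // ids 201 through 205 are gone from the UI
```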

In a real-world correlated case, this is what happens during the #1516 sync-burst phase: rapid system_status events plus stream timing make stuck-placeholder recovery measurably more likely.

Code anchors

| Location | What it does |
| --- | --- |
| packages/core/src/db/messages.ts:51-66 | listMessages: ORDER BY created_at ASC LIMIT $2. Comment: "List messages for a conversation, oldest first." |
| packages/server/src/routes/api.ts:1265-1287 | GET /api/conversations/:id/messages route. Math.min(Number(c.req.query('limit') ?? '200'), 500). No ?before= / ?after= cursor support. |
| packages/web/src/lib/api.ts:186-190 | getMessages(conversationId, limit = 200): client default 200, no pagination support. |
| packages/web/src/components/chat/ChatInterface.tsx:149-152 | Mount-time fetch: void getMessages(conversationId), with no limit arg, so it uses the default 200. |
| packages/web/src/components/chat/ChatInterface.tsx:441-461 | Stuck-placeholder recovery: refetches via getMessages, then setMessages(prev => ... return hydrated), which replaces state instead of merging. |
| packages/web/src/hooks/useSSE.ts:176-185 | workflow_status with completed, failed, or cancelled dispatches onLockChange(false). |

Why this is especially hard for users to diagnose

  • No visual indicator that messages were truncated. The UI just renders 200 messages. There is no "200 of N" badge, no "load older" button, no error.
  • No day-separator between messages spanning multiple calendar days. A truncation that drops two days of history looks identical to a one-hour gap.
  • Mid-session truncation has no user-visible trigger. The user's last action might have been typing a message; suddenly the chat looks shorter. Without DB access, this is indistinguishable from "I imagined those messages".
  • DB is fine. Once a user does check the DB, all messages are there, which can lead to suspicion that the loss was a sync/reset bug (cf. #1516, where git reset --hard origin/<default_branch> runs on source/ for every message and destroys any local state that has diverged from origin for managed clones). It is easy to confuse the two distinct mechanisms.

Concrete proposal

Three changes, each independently useful:

  1. DB query: change ASC to DESC. messages.ts:51 becomes ORDER BY created_at DESC LIMIT $2. Reverse the result on the client (or in the API handler) so rendering stays chronological. With > 200 messages, the user then sees the newest 200, which is what users expect from any other chat UI. Older messages then need pagination (next point), but they are not the user's primary interaction.
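A minimal in-memory sketch of change (1), assuming a hypothetical `StoredMessage` shape and helper name. It mimics `ORDER BY created_at DESC LIMIT n` followed by a reverse; it is not the actual query code.

```typescript
// Hypothetical row shape for illustration.
interface StoredMessage {
  id: number;
  createdAt: number; // epoch milliseconds
}

// Take the newest `limit` rows (DESC + LIMIT), then reverse so the
// caller still receives them oldest-to-newest for rendering.
function newestWindowChronological(rows: StoredMessage[], limit: number): StoredMessage[] {
  const newestFirst = [...rows].sort((a, b) => b.createdAt - a.createdAt).slice(0, limit);
  return newestFirst.reverse();
}

// 250 messages, ids 1..250 in chronological order.
const stored: StoredMessage[] = Array.from({ length: 250 }, (_, i) => ({
  id: i + 1,
  createdAt: 1000 * (i + 1),
}));

const newestWindow = newestWindowChronological(stored, 200);
console.log(newestWindow[0].id, newestWindow[newestWindow.length - 1].id); // 51 250
```

With this ordering the truncated part is the oldest 50 messages, which is the part pagination can fetch on demand.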

  2. API + client: cursor-based pagination. Add ?before=<message_id>&limit=200 (and optionally ?after=). The Web UI, when scrolling up past the loaded range, fetches the next batch. This makes long conversations fully navigable without the cap.
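One possible shape for the `?before=` cursor, sketched over an in-memory array. `PagedMessage` and `pageBefore` are hypothetical; a real handler would express the same thing as a SQL predicate rather than sorting in memory, and the behaviour for an unknown cursor is an assumption.

```typescript
// Hypothetical row shape for illustration.
interface PagedMessage {
  id: number;
  createdAt: number; // epoch milliseconds
}

// Returns up to `limit` messages immediately older than `beforeId`,
// in chronological order. Without a cursor it returns the newest page.
function pageBefore(rows: PagedMessage[], limit: number, beforeId?: number): PagedMessage[] {
  const chronological = [...rows].sort((a, b) => a.createdAt - b.createdAt);
  let end = chronological.length;
  if (beforeId !== undefined) {
    const idx = chronological.findIndex((m) => m.id === beforeId);
    // Assumption: an unknown cursor yields an empty page rather than a guess.
    if (idx === -1) return [];
    end = idx;
  }
  return chronological.slice(Math.max(0, end - limit), end);
}

// 450 messages, ids 1..450 in chronological order.
const history: PagedMessage[] = Array.from({ length: 450 }, (_, i) => ({
  id: i + 1,
  createdAt: 1000 * (i + 1),
}));

const firstPage = pageBefore(history, 200); // ids 251..450
const secondPage = pageBefore(history, 200, firstPage[0].id); // ids 51..250
const thirdPage = pageBefore(history, 200, secondPage[0].id); // ids 1..50
console.log(firstPage.length, secondPage.length, thirdPage.length); // 200 200 50
```

The client would call this with the id of the oldest message it currently holds each time the user scrolls past the top of the loaded range.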

  3. Stuck-placeholder recovery: merge instead of replace. ChatInterface.tsx:441-461 should not call setMessages(... => return hydrated). It should merge the fetched rows by ID into the existing state, only adding messages that aren't already there. Live SSE-only state must survive recovery. (The current logic preserves system messages explicitly via systemMessages.filter then re-interleave — extending that to all client-only messages is the same shape.)
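A sketch of the merge in (3), assuming message ids are stable between SSE and REST and that each message carries a sortable timestamp. `MergeableMessage` and `mergeById` are hypothetical names; filling in the stuck placeholder itself is not shown here.

```typescript
// Hypothetical message shape; assumes ids match between SSE and REST.
interface MergeableMessage {
  id: string;
  createdAt: number; // epoch milliseconds
  content: string;
}

// Keeps every local message and adds only fetched rows the client lacks,
// then restores chronological order.
function mergeById(local: MergeableMessage[], fetched: MergeableMessage[]): MergeableMessage[] {
  const byId = new Map<string, MergeableMessage>();
  for (const m of local) byId.set(m.id, m);
  for (const m of fetched) {
    if (!byId.has(m.id)) byId.set(m.id, m); // local copy wins on conflict
  }
  return Array.from(byId.values()).sort((a, b) => a.createdAt - b.createdAt);
}

const localMessages: MergeableMessage[] = [
  { id: 'a', createdAt: 1, content: 'first' },
  { id: 'c', createdAt: 3, content: 'live via SSE' },
];
const fetchedMessages: MergeableMessage[] = [
  { id: 'a', createdAt: 1, content: 'first (server copy)' },
  { id: 'b', createdAt: 2, content: 'second' },
];

const merged = mergeById(localMessages, fetchedMessages);
console.log(merged.map((m) => m.id).join(',')); // a,b,c
```

In the component this would be used inside the functional updater, for example `setMessages(prev => mergeById(prev, hydrated))`, so live SSE-only messages such as `c` survive recovery.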

(1) and (3) are the load-bearing fixes. (1) makes the default UX correct without any pagination work; (2) makes "scroll to load older" possible; (3) prevents silent mid-session loss.

Where this is not the problem

  • /workflow list, /workflow reload, and slash commands generally read workflow YAMLs directly from source/.archon/workflows/ — they don't depend on the message-list path and aren't affected.
  • The CLI (packages/cli/src/commands/chat.ts) is single-shot and doesn't render history — also not affected.
  • The DB schema and persistence (messageDb.addMessage) are correct; messages are not lost there.

Relationship to #1516

Independent root causes, but observably correlated. The hard-reset path of #1516 emits a stream of system_status SSE events during sync bursts, which raises the probability of stuck-placeholder recovery (defect 3) firing. Users who experience #1516 frequently see this UI truncation as well, and user reports have conflated the two. Fixing #1516 makes this truncation less probable but does not fix it: defects (1), (2), and (3) remain triggerable independently.


Happy to PR. The minimal fix is (1) + (3): one-line DB change + a merge instead of replace in ChatInterface. (2) is a follow-up to make >500-message conversations fully navigable.
