Skip to content

fix(orchestrator): surface terminal agent-run failures to the chat user (LIA-371)#984

Merged
sliamh11 merged 2 commits into
mainfrom
fix-silent-agent-errors
Jul 3, 2026
Merged

fix(orchestrator): surface terminal agent-run failures to the chat user (LIA-371)#984
sliamh11 merged 2 commits into
mainfrom
fix-silent-agent-errors

Conversation

@sliamh11

@sliamh11 sliamh11 commented Jul 3, 2026

Copy link
Copy Markdown
Owner

Summary

  • Persistent backend errors (auth expiry, container crash, dead session) previously failed completely silently: the user saw the typing indicator flick on and off through ~2.5 minutes of hidden exponential-backoff retries, then total silence, with no way to distinguish "thinking" from "broken."
  • Adds GroupQueue.setOnTerminalFailure (mirrors the existing processMessagesFn single-slot callback pattern), fired once from scheduleRetry's terminal-retry branch after MAX_RETRIES (5) is exhausted, wrapped in try/catch so a misbehaving callback can't destabilize the retry state machine.
  • Wired in message-orchestrator.ts to resolve the channel via the existing findChannel helper and send a terse notice via channel.sendMessage(...).catch(...) — matching the precedent already set by session-commands.ts for slash-command failures, and using .catch (not bare await) since channel.sendMessage now throws on delivery failure per the just-merged fix(channels): throw on send failure instead of swallowing it (LIA-359, LIA-360) #983.

Review process

Went through 3 code-reviewer rounds during development, each fixing a real finding:

  1. GPT backend flagged a fragile .resolves.not.toThrow() matcher in a test — fixed to a plain await.
  2. code-reviewer mutation-tested the resulting test and proved it didn't actually discriminate whether the try/catch was present (an uncaught throw is silently absorbed by a different catch block, quietly starting a fresh retry cycle) — fixed by mocking logger and asserting the specific log message; independently re-verified via my own mutation test (removed the try/catch, confirmed the test fails, restored).
  3. Minor missing assertion on the "no channel resolves" no-op path — added.

Final state: plan-reviewer SHIP (native + GPT), code-reviewer SHIP (native + GPT + GLM), verification-gate SHIP with full command evidence.

Note on merge order

This PR and #367 (auto-compact concurrency fix, separate PR) both touch message-orchestrator.test.ts's makeQueue() test helper (different new keys) and the top of createMessageOrchestrator. Whichever merges second will need a small rebase to reconcile both.

Test plan

  • npx tsc --noEmit && npx tsc — clean
  • npx vitest run — 109 files, 1943 tests, 0 failures
  • New tests: group-queue.test.ts (terminal-failure fires once, only after final retry; throwing callback doesn't propagate — mutation-tested), message-orchestrator.test.ts (wiring resolves channel + sends notice; no-op + logs warning when unresolvable)

🤖 Generated with Claude Code

@github-actions github-actions Bot added the core label Jul 3, 2026
sliamh11 added a commit that referenced this pull request Jul 3, 2026
…e.enqueueTask (LIA-367) (#985)

## Summary
- The `/compact` self-dispatch in `message-orchestrator.ts` fired
`runAgent(group, '/compact', ...)` un-awaited, mid-callback, guarded
only by a closure-scoped `Set<string>`. This bypassed `GroupQueue`'s
one-container-per-group serialization entirely — a second container
could spawn and race the primary turn's container on the same host IPC
directory (neither call site sets `ipcRunKey`, so both resolve to the
identical unscoped path).
- Reroutes the dispatch through the existing `GroupQueue.enqueueTask`
primitive — the same pattern already used identically in
`odysseus-server.ts` and `task-scheduler.ts` for "serialize a
self-dispatch against the group's normal turn." The compact dispatch now
only fires after the primary `runAgent` call has fully resolved, and
because `GroupQueue.state.active` is still true at that point,
`enqueueTask` always queues rather than running concurrently —
`drainGroup` picks it up under full serialization once the primary
container's state is cleanly reset.

## Important context: this specific race is currently unreachable in
production
Traced the full data path for
`result.contextStats`/`result.compactionEvent` (the fields that gate the
dispatch): `ContainerRuntime.runTurn`'s `onOutput` only forwards
`output_text/activity/session/error/turn_complete`; `RuntimeEvent` has
no variant carrying `contextStats`/`compactionEvent` at all;
`runAgent`'s `eventSink` only synthesizes `{status,result}`-shaped
callback objects. So `result.contextStats?.autoCompact` is always
`undefined` today — confirmed independently by an existing code comment
at `src/config.ts:184-191` ("the runtime eventSink does not forward
contextStats... tracked follow-up") and by `git log -S contextStats`
showing zero commits ever wiring that forwarding path since the
auto-compact feature was introduced (`a08d0e71`, LIA-94).

This means DEUS's own proactive auto-compact trigger has been silently
inert since introduction — the SDK's own internal auto-compaction (a
separate mechanism) still works, so nothing is on fire. This PR ships as
cheap, precedented preventive hardening for a landmine that would
activate the moment the forwarding gap closes (a follow-up ticket will
track that separately) rather than as a fix for an actively-reproducing
bug — flagging so reviewers don't expect an observable behavior change
or an E2E repro.

## Test plan
- [x] `npx tsc --noEmit && npx tsc` — clean
- [x] `npx vitest run` — 109 files, 1942 tests, 0 failures
- [x] New tests: `message-orchestrator.test.ts` (regression-pin: real
event pipeline never calls `queue.enqueueTask`, documenting current
unreachability), `group-queue.test.ts` (pending-task dedup branch,
previously only covered for running tasks)

## Note on merge order
This PR and #984 (silent agent-run failure notice, separate PR) both
touch `message-orchestrator.test.ts`'s `makeQueue()` test helper
(different new keys) and the top of `createMessageOrchestrator`.
Whichever merges second will need a small rebase to reconcile both.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
sliamh11 and others added 2 commits July 3, 2026 22:09
…er (LIA-371)

Persistent backend errors (auth expiry, container crash, dead session)
previously failed silently: the user saw the typing indicator flick on and
off through ~2.5 minutes of hidden exponential-backoff retries, then total
silence, with no way to distinguish "thinking" from "broken". Add a
GroupQueue.setOnTerminalFailure callback (mirroring the existing
processMessagesFn single-slot pattern) fired once retries are exhausted,
wired in message-orchestrator.ts to send a terse in-chat notice via the
existing findChannel/trySend pattern — matching the precedent already set
by session-commands.ts for slash-command failures.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@sliamh11 sliamh11 force-pushed the fix-silent-agent-errors branch from cfd6255 to 2d030b4 Compare July 3, 2026 19:09
@sliamh11 sliamh11 merged commit a5ac51d into main Jul 3, 2026
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant