fix(orchestrator): surface terminal agent-run failures to the chat user (LIA-371)#984
Merged
Conversation
Merged
3 tasks
sliamh11
added a commit
that referenced
this pull request
Jul 3, 2026
…e.enqueueTask (LIA-367) (#985) ## Summary - The `/compact` self-dispatch in `message-orchestrator.ts` fired `runAgent(group, '/compact', ...)` un-awaited, mid-callback, guarded only by a closure-scoped `Set<string>`. This bypassed `GroupQueue`'s one-container-per-group serialization entirely — a second container could spawn and race the primary turn's container on the same host IPC directory (neither call site sets `ipcRunKey`, so both resolve to the identical unscoped path). - Reroutes the dispatch through the existing `GroupQueue.enqueueTask` primitive — the same pattern already used identically in `odysseus-server.ts` and `task-scheduler.ts` for "serialize a self-dispatch against the group's normal turn." The compact dispatch now only fires after the primary `runAgent` call has fully resolved, and because `GroupQueue.state.active` is still true at that point, `enqueueTask` always queues rather than running concurrently — `drainGroup` picks it up under full serialization once the primary container's state is cleanly reset. ## Important context: this specific race is currently unreachable in production Traced the full data path for `result.contextStats`/`result.compactionEvent` (the fields that gate the dispatch): `ContainerRuntime.runTurn`'s `onOutput` only forwards `output_text/activity/session/error/turn_complete`; `RuntimeEvent` has no variant carrying `contextStats`/`compactionEvent` at all; `runAgent`'s `eventSink` only synthesizes `{status,result}`-shaped callback objects. So `result.contextStats?.autoCompact` is always `undefined` today — confirmed independently by an existing code comment at `src/config.ts:184-191` ("the runtime eventSink does not forward contextStats... tracked follow-up") and by `git log -S contextStats` showing zero commits ever wiring that forwarding path since the auto-compact feature was introduced (`a08d0e71`, LIA-94). This means DEUS's own proactive auto-compact trigger has been silently inert since introduction — the SDK's own internal auto-compaction (a separate mechanism) still works, so nothing is on fire. This PR ships as cheap, precedented preventive hardening for a landmine that would activate the moment the forwarding gap closes (a follow-up ticket will track that separately) rather than as a fix for an actively-reproducing bug — flagging so reviewers don't expect an observable behavior change or an E2E repro. ## Test plan - [x] `npx tsc --noEmit && npx tsc` — clean - [x] `npx vitest run` — 109 files, 1942 tests, 0 failures - [x] New tests: `message-orchestrator.test.ts` (regression-pin: real event pipeline never calls `queue.enqueueTask`, documenting current unreachability), `group-queue.test.ts` (pending-task dedup branch, previously only covered for running tasks) ## Note on merge order This PR and #984 (silent agent-run failure notice, separate PR) both touch `message-orchestrator.test.ts`'s `makeQueue()` test helper (different new keys) and the top of `createMessageOrchestrator`. Whichever merges second will need a small rebase to reconcile both. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
…er (LIA-371) Persistent backend errors (auth expiry, container crash, dead session) previously failed silently: the user saw the typing indicator flick on and off through ~2.5 minutes of hidden exponential-backoff retries, then total silence, with no way to distinguish "thinking" from "broken". Add a GroupQueue.setOnTerminalFailure callback (mirroring the existing processMessagesFn single-slot pattern) fired once retries are exhausted, wired in message-orchestrator.ts to send a terse in-chat notice via the existing findChannel/trySend pattern — matching the precedent already set by session-commands.ts for slash-command failures. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
cfd6255 to
2d030b4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
GroupQueue.setOnTerminalFailure(mirrors the existingprocessMessagesFnsingle-slot callback pattern), fired once fromscheduleRetry's terminal-retry branch afterMAX_RETRIES(5) is exhausted, wrapped in try/catch so a misbehaving callback can't destabilize the retry state machine.message-orchestrator.tsto resolve the channel via the existingfindChannelhelper and send a terse notice viachannel.sendMessage(...).catch(...)— matching the precedent already set bysession-commands.tsfor slash-command failures, and using.catch(not bare await) sincechannel.sendMessagenow throws on delivery failure per the just-merged fix(channels): throw on send failure instead of swallowing it (LIA-359, LIA-360) #983.Review process
Went through 3 code-reviewer rounds during development, each fixing a real finding:
.resolves.not.toThrow()matcher in a test — fixed to a plain await.loggerand asserting the specific log message; independently re-verified via my own mutation test (removed the try/catch, confirmed the test fails, restored).Final state: plan-reviewer SHIP (native + GPT), code-reviewer SHIP (native + GPT + GLM), verification-gate SHIP with full command evidence.
Note on merge order
This PR and #367 (auto-compact concurrency fix, separate PR) both touch
message-orchestrator.test.ts'smakeQueue()test helper (different new keys) and the top ofcreateMessageOrchestrator. Whichever merges second will need a small rebase to reconcile both.Test plan
npx tsc --noEmit && npx tsc— cleannpx vitest run— 109 files, 1943 tests, 0 failuresgroup-queue.test.ts(terminal-failure fires once, only after final retry; throwing callback doesn't propagate — mutation-tested),message-orchestrator.test.ts(wiring resolves channel + sends notice; no-op + logs warning when unresolvable)🤖 Generated with Claude Code