fix(runtime): keep architect alive when a session's terminal output errors#322
Merged
Conversation
…rrors Issue: Sending messages to codex would cause architect to exit silently, with no crash dialog, no confirmation, and no persistence. Logs showed ghostty-vt returning HyperlinkSetOutOfMemory from stream.nextSlice once codex had emitted enough OSC 8 hyperlinks to exhaust the per-page hyperlink set. The error propagated through processOutput, up through the runtime main loop, and out of main(), where Zig's default !void handler printed a stack trace and called exit(1) before any teardown ran. Solution: Catch per-session processOutput and flushPendingWrites errors in the main loop. Log them, mark the session dead via a new failSession helper, and continue iterating instead of unwinding the whole app. The session lands in the same state as a normal child exit, so the user can restart it or quit cleanly with persistence intact.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e8e71763f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR changes the runtime loop so per-session terminal output/write failures no longer terminate the entire Architect process, allowing other sessions and persistence to continue working.
Changes:
- Adds
failSessionto mark failed sessions dead and dirty. - Handles
processOutputandflushPendingWriteserrors per session instead of returning fromrun. - Adds a unit test for the
failSessiondead/render epoch behavior.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ession Addresses unresolved PR review threads (codex P1; copilot x2) on #322. Before this change, the runtime's failSession helper only set dead=true. Three problems followed: the still-spawned child kept running invisibly behind a [Process completed] UI and could go on touching files; the normal teardown SIGTERM was gated on spawned && !dead so app quit no longer killed the child either; and pending_write was left populated, so flushPendingWrites retried and re-logged every frame on a permanent PTY error. Move the failure logic to SessionState.failAndTerminate. It now mirrors teardown's kill: SIGTERM the child while spawned && !dead, clear pending_write so flushPendingWrites short-circuits on its empty-buffer guard, then flip dead and bump the render epoch. The shell, terminal, and stream wrappers stay alive so scrollback remains visible and the existing restart button path keeps working.
Adds two pieces of diagnostic context next to the existing process-output error path so that future investigations of similar failures can move faster: 1. The existing log.err line now includes the detected agent name (codex / claude / gemini / none) and the session cwd. The original incident took a while to attribute to codex because the line only carried the session id and error name. 2. A structured session_failed event is emitted via the existing writeRuntimeEvent helper, with key=value fields session, agent, error, and source. Grepping event=session_failed in architect.log now lists every session that hit an unrecoverable runtime error and which call site (process_output or flush_pending_writes) tripped it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Solution
Sending messages to codex was causing architect to exit silently, with no crash dialog, no confirmation, and no session persistence. Investigation traced the cause:
~/Library/Logs/Architect/architect.logshowedsession N: process output failed: error.HyperlinkSetOutOfMemoryimmediately followed by theshutdownevent marker.launchdconfirmed the process exited viaexit(1), not a signal (which is why there were no.ipscrash reports).stream.nextSliceonce enough OSC 8 hyperlinks (codex emits many clickable file paths) exhaust the per-page hyperlink set, then propagates throughSessionState.processOutput, up through the runtime main loop'sreturn err, out ofmain(), and into Zig's default!voiderror handler, which callsexit(1)before any teardown can run.The fix catches per-session errors from
processOutputandflushPendingWritesinside the main loop. They are logged, the offending session is marked dead via a newfailSessionhelper (which also bumpsrender_epochso the UI updates), and the loop continues to the next session. The session ends up in the same state as a normal child exit, so the user can restart it from the UI or quit cleanly with persistence intact. Other recoverable session-resource errors covered byProcessOutputError(StyleSetOutOfMemory,GraphemeMapOutOfMemory, and so on) are handled the same way, since they share the same root cause: per-session terminal-buffer exhaustion that should not take the whole app down.failSessionis a tiny helper, but it has a unit test so that the dead/dirty contract stays explicit.Test plan
~/Library/Logs/Architect/architect.lognow contains amarking session deadlog line in that scenario rather than an immediateshutdownevent.~/.config/architect/persistence.tomlis written.