fix(runtime): keep architect alive when a session's terminal output errors#322

Merged: forketyfork merged 4 commits into main from fix/session-error-no-app-exit on May 29, 2026

@forketyfork
Owner

Solution

Sending messages to codex was causing architect to exit silently, with no crash dialog, no confirmation, and no session persistence. Investigation traced the cause:

  • ~/Library/Logs/Architect/architect.log showed "session N: process output failed: error.HyperlinkSetOutOfMemory", immediately followed by the shutdown event marker.
  • launchd confirmed the process exited via exit(1), not a signal (which is why there were no .ips crash reports).
  • The error originates in ghostty-vt's stream.nextSlice once enough OSC 8 hyperlinks (codex emits many clickable file paths) exhaust the per-page hyperlink set. It then propagates through SessionState.processOutput, up through the runtime main loop's return err, out of main(), and into Zig's default !void error handler, which calls exit(1) before any teardown can run.

The fix catches per-session errors from processOutput and flushPendingWrites inside the main loop. They are logged, the offending session is marked dead via a new failSession helper (which also bumps render_epoch so the UI updates), and the loop continues to the next session. The session ends up in the same state as a normal child exit, so the user can restart it from the UI or quit cleanly with persistence intact. Other recoverable session-resource errors covered by ProcessOutputError (StyleSetOutOfMemory, GraphemeMapOutOfMemory, and so on) are handled the same way, since they share the same root cause: per-session terminal-buffer exhaustion that should not take the whole app down.
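
The control flow described above can be sketched roughly as follows. This is a Python illustration of the pattern, not the Zig code in src/app/runtime.zig. The Session class, the SessionResourceError exception, the fail_on_output test hook, and run_frame are hypothetical stand-ins; only the names processOutput, flushPendingWrites, failSession, dead, and render_epoch come from the description, and the log wording is an assumption.

```python
from dataclasses import dataclass


class SessionResourceError(Exception):
    """Hypothetical stand-in for errors such as HyperlinkSetOutOfMemory."""


@dataclass
class Session:
    session_id: int
    fail_on_output: bool = False  # test hook: simulate terminal-buffer exhaustion
    dead: bool = False
    render_epoch: int = 0
    frames_processed: int = 0

    def process_output(self) -> None:
        if self.fail_on_output:
            raise SessionResourceError("HyperlinkSetOutOfMemory")
        self.frames_processed += 1

    def flush_pending_writes(self) -> None:
        pass  # nothing to flush in this sketch


def fail_session(session: Session) -> None:
    """Mark the session dead and bump render_epoch so the UI redraws it."""
    session.dead = True
    session.render_epoch += 1


def run_frame(sessions: list[Session], log: list[str]) -> None:
    """One main-loop iteration: a failing session is isolated, not propagated."""
    for session in sessions:
        if session.dead:
            continue
        try:
            session.process_output()
            session.flush_pending_writes()
        except SessionResourceError as err:
            # Assumed log wording; the real line is written by the Zig runtime.
            log.append(f"session {session.session_id}: marking session dead: {err}")
            fail_session(session)
            continue  # move on to the next session instead of unwinding


sessions = [Session(0), Session(1, fail_on_output=True), Session(2)]
log: list[str] = []
run_frame(sessions, log)
run_frame(sessions, log)  # the dead session is skipped on later frames
```

In this sketch the failing session is logged once and skipped afterwards, while the other two sessions keep processing output on every frame, which is the behavior the fix aims for.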

failSession is a tiny helper, but it has a unit test so that the dead/dirty contract stays explicit.

Test plan

  • Run architect, start a codex session, and use it for an extended period (or any workload that emits many OSC 8 hyperlinks). The app should stay running; if the per-session limit is hit, the affected session should show as dead while other sessions and persistence keep working.
  • Confirm that ~/Library/Logs/Architect/architect.log now contains a "marking session dead" log line in that scenario rather than an immediate shutdown event.
  • Quit architect via the normal close path and verify ~/.config/architect/persistence.toml is written.

…rrors

Issue: Sending messages to codex would cause architect to exit silently, with no crash dialog, no confirmation, and no persistence. Logs showed ghostty-vt returning HyperlinkSetOutOfMemory from stream.nextSlice once codex had emitted enough OSC 8 hyperlinks to exhaust the per-page hyperlink set. The error propagated through processOutput, up through the runtime main loop, and out of main(), where Zig's default !void handler printed a stack trace and called exit(1) before any teardown ran.
Solution: Catch per-session processOutput and flushPendingWrites errors in the main loop. Log them, mark the session dead via a new failSession helper, and continue iterating instead of unwinding the whole app. The session lands in the same state as a normal child exit, so the user can restart it or quit cleanly with persistence intact.
@forketyfork requested a review from Copilot on May 28, 2026 20:46
@forketyfork marked this pull request as ready for review on May 28, 2026 20:46

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e8e71763f

Comment thread src/app/runtime.zig Outdated

Copilot AI left a comment

Pull request overview

This PR changes the runtime loop so per-session terminal output/write failures no longer terminate the entire Architect process, allowing other sessions and persistence to continue working.

Changes:

  • Adds failSession to mark failed sessions dead and dirty.
  • Handles processOutput and flushPendingWrites errors per session instead of returning from run.
  • Adds a unit test for the failSession dead/render epoch behavior.

Comment thread src/app/runtime.zig Outdated
Comment thread src/app/runtime.zig
…ession

Addresses unresolved PR review threads (codex P1; copilot x2) on #322.

Before this change, the runtime's failSession helper only set dead=true.
Three problems followed: the still-spawned child kept running invisibly
behind a [Process completed] UI and could go on touching files; the normal
teardown SIGTERM was gated on spawned && !dead so app quit no longer killed
the child either; and pending_write was left populated, so flushPendingWrites
retried and re-logged every frame on a permanent PTY error.

Move the failure logic to SessionState.failAndTerminate. It now mirrors
teardown's kill: SIGTERM the child while spawned && !dead, clear pending_write
so flushPendingWrites short-circuits on its empty-buffer guard, then flip
dead and bump the render epoch. The shell, terminal, and stream wrappers
stay alive so scrollback remains visible and the existing restart button
path keeps working.
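
The ordering of steps in failAndTerminate can be sketched as below. This is a Python approximation under stated assumptions, not the Zig implementation: the field names child_pid and spawned, the injected kill and write callables, and the sample PID are hypothetical, chosen so the sketch runs without a real child process. In real use, os.kill could be passed as kill.

```python
import signal
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SessionState:
    child_pid: Optional[int] = None
    spawned: bool = False
    dead: bool = False
    render_epoch: int = 0
    pending_write: bytearray = field(default_factory=bytearray)

    def fail_and_terminate(self, kill: Callable[[int, int], None]) -> None:
        # Mirror the teardown kill: only signal a child that is spawned and not yet dead.
        if self.spawned and not self.dead and self.child_pid is not None:
            kill(self.child_pid, signal.SIGTERM)
        # Drop queued input so later flushes have nothing left to retry.
        self.pending_write.clear()
        self.dead = True
        self.render_epoch += 1

    def flush_pending_writes(self, write: Callable[[bytes], int]) -> None:
        if not self.pending_write:
            return  # empty-buffer guard: no write attempt, no repeated error log
        written = write(bytes(self.pending_write))
        del self.pending_write[:written]


signals_sent: list[tuple[int, int]] = []
write_calls: list[bytes] = []


def record_kill(pid: int, sig: int) -> None:
    """Fake kill that records the signal instead of sending it."""
    signals_sent.append((pid, sig))


def record_write(data: bytes) -> int:
    """Fake PTY write that records the payload and reports it fully written."""
    write_calls.append(data)
    return len(data)


state = SessionState(child_pid=4242, spawned=True, pending_write=bytearray(b"ls\n"))
state.fail_and_terminate(record_kill)
state.flush_pending_writes(record_write)  # short-circuits: buffer was cleared
state.fail_and_terminate(record_kill)     # already dead: no second SIGTERM
```

Sending the signal before flipping dead matters in this sketch because the kill is gated on the session not being dead yet, the same gate the commit message describes for the normal teardown path.
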
Adds two pieces of diagnostic context next to the existing process-output
error path so that future investigations of similar failures can move
faster:

1. The existing log.err line now includes the detected agent name
   (codex / claude / gemini / none) and the session cwd. The original
   incident took a while to attribute to codex because the line only
   carried the session id and error name.
2. A structured session_failed event is emitted via the existing
   writeRuntimeEvent helper, with key=value fields session, agent, error,
   and source. Grepping event=session_failed in architect.log now lists
   every session that hit an unrecoverable runtime error and which call
   site (process_output or flush_pending_writes) tripped it.
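
As a rough illustration of the event shape and of the grep described above, the following Python sketch formats and filters key=value lines. The exact line layout produced by writeRuntimeEvent is not shown on this page, so the field order, the space separator, and the sample lines (including the session numbers and the shutdown line) are assumptions.

```python
def format_session_failed(session: int, agent: str, error: str, source: str) -> str:
    """Build a key=value event line (assumed layout, space-separated)."""
    fields = {
        "event": "session_failed",
        "session": session,
        "agent": agent,
        "error": error,
        "source": source,
    }
    return " ".join(f"{key}={value}" for key, value in fields.items())


def parse_fields(line: str) -> dict[str, str]:
    """Split a key=value line into a dict, ignoring tokens without '='."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def failed_sessions(log_lines: list[str]) -> list[dict[str, str]]:
    """Equivalent of grepping for event=session_failed, then parsing each hit."""
    return [parse_fields(line) for line in log_lines if "event=session_failed" in line]


# Hypothetical log content for illustration only.
sample_log = [
    format_session_failed(3, "codex", "HyperlinkSetOutOfMemory", "process_output"),
    "event=shutdown",
    format_session_failed(5, "none", "BrokenPipe", "flush_pending_writes"),
]

failed = failed_sessions(sample_log)
```

Each parsed entry carries the session, agent, error, and source fields, so a single pass over the log answers both which session failed and which call site tripped it.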
@forketyfork merged commit ccd7dc5 into main on May 29, 2026
4 checks passed
@forketyfork deleted the fix/session-error-no-app-exit branch on May 29, 2026 10:59