fix(codex): recover idle session resume when rollout file is missing#1043
fix(codex): recover idle session resume when rollout file is missing#1043pedramamini wants to merge 1 commit into
Conversation
Resuming an idle Codex session ran 'codex exec resume <id>', which exits 1 with stderr 'no rollout found for thread id <uuid>' when the rollout file backing the thread is gone (pruned or written by a different/older Codex version). That string matched none of Maestro's error patterns, so it fell through to a dead-end 'Agent exited with code 1' crash. Classify it as session_not_found so the existing in-place recovery kicks in: the stale agentSessionId is cleared and the prior conversation is re-seeded from the tab transcript into a fresh session, preserving the tab in place instead of forcing the user to open a new one. closes #1042
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR adds a new error pattern to handle Codex resume failures when rollout files are missing. The implementation detects "no rollout found" and "rollout not found" error messages and returns a dedicated recoverable message, with comprehensive test coverage confirming correct pattern matching and preventing regression to agent crash classification. ChangesCodex resume failure error pattern
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR adds a
Confidence Score: 5/5Safe to merge — the change is a single regex addition in a well-isolated pattern table with targeted regression tests. The fix touches only the error-pattern lookup table and adds no new logic paths. The regex is intentionally narrow, session_not_found is already checked before agent_crashed in the ordered loop so routing is correct, and the real-world stderr string from codex-cli 0.130.0 is directly used as a regression anchor in the tests. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["codex exec resume exits with code 1"] --> B["detectErrorFromExit reads stderr"]
B --> C{"matchErrorPattern loop"}
C --> D["auth_expired? No"]
D --> E["token_exhaustion / rate_limited / network_error / permission_denied? No"]
E --> I{"session_not_found patterns"}
I --> J{"NEW: no rollout found OR rollout not found"}
J -->|match| K["type: session_not_found, recoverable: true"]
J -->|no match| L{"session not found"}
L -->|match| K
L -->|no match| M{"invalid session"}
M -->|match| K
M -->|no match| N["agent_crashed"]
K --> O["useAgentErrorListener clears stale agentSessionId"]
O --> P["useSessionRecovery re-seeds from tab transcript"]
P --> Q["Fresh session: tab + name + sidebar preserved"]
N --> R["Dead-end: Agent exited with code 1"]
Reviews (1): Last reviewed commit: "fix(codex): recover idle session resume ..." | Re-trigger Greptile |
chr1syy
left a comment
There was a problem hiding this comment.
Verified pattern ordering routes this before agent_crashed (matchErrorPattern iterates session_not_found first), walked every earlier-priority Codex pattern against the exact stderr to rule out false positives, and confirmed useAgentErrorListener (renderer) clears agentSessionId and downgrades to a system-source log on session_not_found. Applied the PR locally and ran npx vitest run src/__tests__/main/parsers/error-patterns.test.ts — 119 passed. Narrow, well-tested, correct. LGTM.
closes #1042
Problem
Resuming a Codex session after an idle period failed with a dead-end "Agent exited with code 1", leaving the session unusable. The only workaround was opening a new tab, which lost all prior context.
Root cause
When Maestro resumes a Codex session it runs
codex exec resume <session_id>. If the rollout file backing that thread is gone — pruned, or written by a different/older Codex version after the idle gap — the CLI exits 1 and prints to stderr:Reproduced locally with
codex-cli 0.130.0. That"no rollout found"string matched none of Maestro'sCODEX_ERROR_PATTERNS, sodetectErrorFromExit()fell through to the genericagent_crashedbranch → the user-facing "Agent exited with code 1" crash.Fix
Add a
session_not_foundpattern toCODEX_ERROR_PATTERNSmatchingno rollout found/rollout not found. This routes the failure into Maestro's existing in-place recovery flow (useAgentErrorListener→useSessionRecovery): the staleagentSessionIdis cleared and the prior conversation is re-seeded from the tab transcript into a fresh session — preserving the tab, its name, and the sidebar entry instead of forcing the user to open a new one.The pattern is intentionally narrow (only the definitive "rollout not found" phrasing, not any
thread/resume failed) to avoid wrongly discarding a session on a transient resume error.Testing
error-patterns.test.tsusing the realcodex-cli 0.130.0stderr string; asserts it classifies assession_not_found(notagent_crashed).vitest run src/__tests__/main/parsers/error-patterns.test.ts→ 119 passed.npm run lintclean on a normal checkout; ESLint clean on changed files.Note for the reporter
This makes resume fail gracefully (recover in place with prior context) rather than dead-ending. It does not restore Codex's own server-side rollout when that file is genuinely gone — Maestro re-seeds context from its own transcript instead.
Summary by CodeRabbit
Bug Fixes