
fix(codex): recover idle session resume when rollout file is missing #1043

Open
pedramamini wants to merge 1 commit into main from fix/1042-codex-resume-missing-rollout


@pedramamini (Collaborator) commented May 24, 2026

closes #1042

Problem

Resuming a Codex session after an idle period failed with a dead-end "Agent exited with code 1", leaving the session unusable. The only workaround was opening a new tab, which lost all prior context.

Root cause

When Maestro resumes a Codex session it runs codex exec resume <session_id>. If the rollout file backing that thread is gone — pruned, or written by a different/older Codex version after the idle gap — the CLI exits 1 and prints to stderr:

Error: thread/resume: thread/resume failed: no rollout found for thread id <uuid> (code -32600)

Reproduced locally with codex-cli 0.130.0. That "no rollout found" string matched none of Maestro's CODEX_ERROR_PATTERNS, so detectErrorFromExit() fell through to the generic agent_crashed branch, which surfaces as the user-facing "Agent exited with code 1" crash.
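The fallthrough can be illustrated with a minimal TypeScript sketch of an ordered, first-match-wins pattern table that falls back to a generic crash. Everything here is a hypothetical simplification: the `ErrorPattern` type, the `classifyExit` helper and the sample regexes are assumptions, not Maestro's actual `detectErrorFromExit()` or `CODEX_ERROR_PATTERNS`. Only the stderr string comes from the reproduction above.

```typescript
// Hypothetical, simplified model of an ordered error-pattern table (pre-fix).
type AgentErrorType =
  | "auth_expired"
  | "rate_limited"
  | "session_not_found"
  | "agent_crashed";

interface ErrorPattern {
  type: AgentErrorType;
  pattern: RegExp;
}

// Placeholder regexes: the session_not_found bucket only knows the generic
// phrasings, so "no rollout found" matches nothing.
const PRE_FIX_PATTERNS: ErrorPattern[] = [
  { type: "auth_expired", pattern: /not logged in|authentication failed/i },
  { type: "rate_limited", pattern: /rate limit/i },
  { type: "session_not_found", pattern: /session not found/i },
  { type: "session_not_found", pattern: /invalid session/i },
];

// First match wins; any unmatched non-zero exit is reported as a generic crash.
function classifyExit(
  exitCode: number,
  stderr: string,
  patterns: ErrorPattern[],
): AgentErrorType | null {
  if (exitCode === 0) return null;
  for (const entry of patterns) {
    if (entry.pattern.test(stderr)) return entry.type;
  }
  return "agent_crashed";
}

const STDERR =
  "Error: thread/resume: thread/resume failed: no rollout found for thread id <uuid> (code -32600)";

console.log(classifyExit(1, STDERR, PRE_FIX_PATTERNS)); // agent_crashed
```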

Fix

Add a session_not_found pattern to CODEX_ERROR_PATTERNS matching "no rollout found" / "rollout not found". This routes the failure into Maestro's existing in-place recovery flow (useAgentErrorListener → useSessionRecovery): the stale agentSessionId is cleared and the prior conversation is re-seeded from the tab transcript into a fresh session, preserving the tab, its name, and the sidebar entry instead of forcing the user to open a new one.
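Under a similar hypothetical model, the fix amounts to one more table entry ahead of the generic session phrasing. The regex and the user-facing message below are the ones quoted elsewhere on this PR; the `AgentError` shape, the `classify` helper, the placeholder message on the generic entry and the crash fallback's `recoverable: false` are assumptions for illustration only.

```typescript
// Hypothetical, simplified model of CODEX_ERROR_PATTERNS after this PR.
type AgentErrorType = "session_not_found" | "agent_crashed";

interface ErrorPattern {
  type: AgentErrorType;
  pattern: RegExp;
  message: string;
  recoverable: boolean;
}

interface AgentError {
  type: AgentErrorType;
  message: string;
  recoverable: boolean;
}

const PATCHED_PATTERNS: ErrorPattern[] = [
  // NEW: the definitive "rollout is gone" phrasing, first in the bucket.
  {
    type: "session_not_found",
    pattern: /no rollout found|rollout not found/i,
    message: "Previous Codex session could not be found. Starting fresh conversation.",
    recoverable: true,
  },
  // Pre-existing generic phrasing; this message text is a placeholder.
  {
    type: "session_not_found",
    pattern: /session not found/i,
    message: "Session not found.",
    recoverable: true,
  },
];

function classify(exitCode: number, stderr: string): AgentError | null {
  if (exitCode === 0) return null;
  for (const { type, pattern, message, recoverable } of PATCHED_PATTERNS) {
    if (pattern.test(stderr)) return { type, message, recoverable };
  }
  // No pattern matched: the dead-end crash path (recoverable: false is assumed).
  return { type: "agent_crashed", message: `Agent exited with code ${exitCode}`, recoverable: false };
}

const STDERR =
  "Error: thread/resume: thread/resume failed: no rollout found for thread id <uuid> (code -32600)";

const result = classify(1, STDERR);
console.log(result?.type, result?.recoverable); // session_not_found true
```

Because the result is flagged recoverable, a listener can take the recovery branch instead of the crash branch.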

The pattern is intentionally narrow (it matches only the definitive "rollout not found" phrasing, not every thread/resume failed error) to avoid wrongly discarding a session on a transient resume error.
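The difference in reach shows up when both candidate regexes run over two stderr lines. The first line is the real codex-cli 0.130.0 output quoted above; the second is a hypothetical transient failure invented for this example, and `BROAD` is an illustrative alternative, not something in the codebase.

```typescript
// The narrow pattern added by this PR.
const NARROW = /no rollout found|rollout not found/i;
// A broader alternative (hypothetical) that the PR deliberately avoids.
const BROAD = /thread\/resume failed/i;

const samples = {
  // Real codex-cli 0.130.0 stderr from the reproduction.
  missingRollout:
    "Error: thread/resume: thread/resume failed: no rollout found for thread id <uuid> (code -32600)",
  // Hypothetical transient failure, invented for illustration only.
  transient: "Error: thread/resume: thread/resume failed: connection reset by peer",
};

for (const [name, stderr] of Object.entries(samples)) {
  console.log(`${name}: narrow=${NARROW.test(stderr)} broad=${BROAD.test(stderr)}`);
}
// missingRollout: narrow=true broad=true
// transient: narrow=false broad=true
```

Only the narrow pattern leaves the transient case on the normal error path, so a session that might still be resumable is not cleared.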

Testing

  • Added regression tests in error-patterns.test.ts using the real codex-cli 0.130.0 stderr string; asserts it classifies as session_not_found (not agent_crashed).
  • vitest run src/__tests__/main/parsers/error-patterns.test.ts → 119 passed.
  • npm run lint clean on a normal checkout; ESLint clean on changed files.

Note for the reporter

This makes resume fail gracefully (it recovers in place with prior context) instead of dead-ending. It does not restore Codex's own rollout file when that file is genuinely gone; Maestro re-seeds context from its own transcript instead.
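Re-seeding from the transcript can be pictured with the sketch below. It assumes a hypothetical `Tab` shape and hypothetical `buildReseedPrompt` / `recoverInPlace` helpers; the PR does not show how `useSessionRecovery` actually formats the replayed context, so the prompt format is a placeholder. The point is that the tab's identity and transcript survive while only the stale `agentSessionId` is dropped.

```typescript
// Hypothetical tab model; Maestro's real state shape is not shown in this PR.
interface TranscriptEntry {
  role: "user" | "assistant";
  text: string;
}

interface Tab {
  id: string;
  name: string;
  agentSessionId: string | null;
  transcript: TranscriptEntry[];
}

// Replay the prior conversation as one prompt for a brand-new session
// (placeholder format).
function buildReseedPrompt(transcript: TranscriptEntry[]): string {
  const history = transcript
    .map((entry) => `${entry.role === "user" ? "User" : "Assistant"}: ${entry.text}`)
    .join("\n");
  return `Continue this earlier conversation:\n${history}`;
}

// On session_not_found: keep the same tab (id, name, transcript) but clear the
// stale session id so the next run starts a fresh Codex session.
function recoverInPlace(tab: Tab): { tab: Tab; reseedPrompt: string } {
  return {
    tab: { ...tab, agentSessionId: null },
    reseedPrompt: buildReseedPrompt(tab.transcript),
  };
}

const stale: Tab = {
  id: "tab-1",
  name: "Refactor parser",
  agentSessionId: "stale-session-id",
  transcript: [
    { role: "user", text: "Split the parser into modules." },
    { role: "assistant", text: "Done: parser/lexer.ts and parser/ast.ts." },
  ],
};

const { tab, reseedPrompt } = recoverInPlace(stale);
console.log(tab.id, tab.name, tab.agentSessionId); // tab-1 Refactor parser null
console.log(reseedPrompt);
```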

Summary by CodeRabbit

Bug Fixes

  • Enhanced error handling for missing Codex sessions. The system now correctly detects when a previous session cannot be found and displays a user-friendly message: "Previous Codex session could not be found. Starting fresh conversation." This prevents the error from being misclassified, allowing you to seamlessly continue with a new conversation.


Resuming an idle Codex session ran 'codex exec resume <id>', which exits
1 with stderr 'no rollout found for thread id <uuid>' when the rollout
file backing the thread is gone (pruned or written by a different/older
Codex version). That string matched none of Maestro's error patterns, so
it fell through to a dead-end 'Agent exited with code 1' crash.

Classify it as session_not_found so the existing in-place recovery kicks
in: the stale agentSessionId is cleared and the prior conversation is
re-seeded from the tab transcript into a fresh session, preserving the
tab in place instead of forcing the user to open a new one.

closes #1042
@coderabbitai (bot) commented May 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ac4b6d64-f183-4690-a351-2506d0955ad6

📥 Commits

Reviewing files that changed from the base of the PR and between 1006e3b and b9e131f.

📒 Files selected for processing (2)
  • src/__tests__/main/parsers/error-patterns.test.ts
  • src/main/parsers/error-patterns.ts

📝 Walkthrough

This PR adds a new error pattern to handle Codex resume failures when rollout files are missing. The implementation detects "no rollout found" and "rollout not found" error messages and returns a dedicated recoverable message, with comprehensive test coverage confirming correct pattern matching and preventing regression to agent crash classification.

Changes

Codex resume failure error pattern

| Layer / File(s) | Summary |
| --- | --- |
| Error pattern definition for rollout not found (`src/main/parsers/error-patterns.ts`) | New session_not_found pattern in CODEX_ERROR_PATTERNS detects "no rollout found" / "rollout not found" messages from a failed codex exec resume and returns a recoverable error message announcing a fresh conversation. |
| Pattern matching tests and regression guard (`src/__tests__/main/parsers/error-patterns.test.ts`) | New Codex test group verifies that the pattern matches three scenarios (detailed stderr, rollout-not-found, session-not-found) with the correct error type and recoverable status, and includes regression-guard comments for #1042. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Poem

🐰 A rollout lost? No fear, dear friend,
This pattern marks where sessions end,
Fresh talks bloom from missing files,
Recovery wrapped in gentle smiles. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: adding error pattern recovery for Codex session resume when rollout file is missing, matching the core fix in both test and parser files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@greptile-apps (bot) commented May 24, 2026

Greptile Summary

This PR adds a session_not_found error pattern to CODEX_ERROR_PATTERNS so that Codex's "no rollout found" resume failure routes into Maestro's existing graceful recovery flow instead of dead-ending as a generic "Agent exited with code 1" crash.

  • src/main/parsers/error-patterns.ts: Inserts a new { pattern: /no rollout found|rollout not found/i, ... } entry at the top of the session_not_found bucket in CODEX_ERROR_PATTERNS, with recoverable: true and a clear user-facing message.
  • src/__tests__/main/parsers/error-patterns.test.ts: Adds three regression tests including one anchored to the exact real-world stderr string from codex-cli 0.130.0, asserting the error is classified as session_not_found and not agent_crashed.

Confidence Score: 5/5

Safe to merge — the change is a single regex addition in a well-isolated pattern table with targeted regression tests.

The fix touches only the error-pattern lookup table and adds no new logic paths. The regex is intentionally narrow, session_not_found is already checked before agent_crashed in the ordered loop so routing is correct, and the real-world stderr string from codex-cli 0.130.0 is directly used as a regression anchor in the tests.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `src/main/parsers/error-patterns.ts` | Adds one new pattern entry to the existing session_not_found bucket; placement before agent_crashed ensures correct routing via the ordered matchErrorPattern loop. |
| `src/__tests__/main/parsers/error-patterns.test.ts` | Three new tests cover the new pattern: one uses the exact real-world stderr string, one tests the alternative phrasing, and one guards the pre-existing generic session pattern. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["codex exec resume exits with code 1"] --> B["detectErrorFromExit reads stderr"]
    B --> C{"matchErrorPattern loop"}
    C --> D["auth_expired? No"]
    D --> E["token_exhaustion / rate_limited / network_error / permission_denied? No"]
    E --> I{"session_not_found patterns"}
    I --> J{"NEW: no rollout found OR rollout not found"}
    J -->|match| K["type: session_not_found, recoverable: true"]
    J -->|no match| L{"session not found"}
    L -->|match| K
    L -->|no match| M{"invalid session"}
    M -->|match| K
    M -->|no match| N["agent_crashed"]
    K --> O["useAgentErrorListener clears stale agentSessionId"]
    O --> P["useSessionRecovery re-seeds from tab transcript"]
    P --> Q["Fresh session: tab + name + sidebar preserved"]
    N --> R["Dead-end: Agent exited with code 1"]

Reviews (1): Last reviewed commit: "fix(codex): recover idle session resume ..."

@chr1syy (Contributor) left a comment


Verified that pattern ordering routes this before agent_crashed (matchErrorPattern checks the session_not_found bucket ahead of it), walked every earlier-priority Codex pattern against the exact stderr to rule out false positives, and confirmed useAgentErrorListener (renderer) clears agentSessionId and downgrades to a system-source log on session_not_found. Applied the PR locally and ran npx vitest run src/__tests__/main/parsers/error-patterns.test.ts: 119 passed. Narrow, well-tested, correct. LGTM.
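The false-positive check described in this review can be sketched as a loop that runs the exact stderr through every higher-priority bucket and reports any hit. The bucket names follow the Greptile flowchart, but the regexes are placeholders assumed for illustration; the real ones live in CODEX_ERROR_PATTERNS and are not shown on this PR.

```typescript
// Hypothetical earlier-priority buckets, in match order (placeholder regexes).
const EARLIER_BUCKETS: Array<[string, RegExp[]]> = [
  ["auth_expired", [/unauthorized|not logged in/i]],
  ["token_exhaustion", [/context length|too many tokens/i]],
  ["rate_limited", [/rate limit/i]],
  ["network_error", [/ECONNRESET|network error|timed out/i]],
  ["permission_denied", [/permission denied|EACCES/i]],
];

// Return every earlier-priority pattern that would claim this stderr first.
function findEarlierMatches(stderr: string): string[] {
  const hits: string[] = [];
  for (const [bucket, patterns] of EARLIER_BUCKETS) {
    for (const pattern of patterns) {
      if (pattern.test(stderr)) hits.push(`${bucket}: ${pattern.source}`);
    }
  }
  return hits;
}

const STDERR =
  "Error: thread/resume: thread/resume failed: no rollout found for thread id <uuid> (code -32600)";

const hits = findEarlierMatches(STDERR);
console.log(hits.length === 0 ? "no earlier-priority match" : hits.join("\n"));
```

An empty result means the new session_not_found entry is the first pattern that can claim this stderr. With real data it is worth repeating the check on an actual thread UUID rather than the `<uuid>` placeholder, since a loose numeric or hex pattern could match inside the id.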

@chr1syy added the "ready to merge" label (This PR is ready to merge) on May 28, 2026


Development

Successfully merging this pull request may close these issues.

Bug: Codex session fails with exit code 1 when resumed after idle period

2 participants