release: 0.7.9 — orchestrator crash-gap fix#26
Conversation
…rror BaseTmuxBackend._run deliberately propagates raw TimeoutExpired/OSError so callers' try/except fires, but several contract methods had no such guard — a 30s tmux hang in list_window_ids/window_alive (the engine's liveness probe) escaped the MultiplexerError seam and would later crash the engine. Add localized guards in the inherited contract methods (above _run, so a psmux overriding only _run is still covered): - _tmux wraps once → covers new_session/set_session_option/new_window/ new_parked_window/send_text, and pipe_pane via its existing except TmuxError. - list_window_ids raises TmuxError on transport failure (a sentinel [] would falsely read as "window dead"); keeps returncode != 0 → [] for a real death. - Group-3 best-effort ops degrade to their documented sentinel on failure (list_windows→[], show_window_option→"", switch_client→False, the rest→no-op). multiplexer.py: docstrings note the raise-on-unknowable / sentinel-on-failure contract. Tests lock in the raisers, the sentinel returners, the already-safe swallowers, and the psmux-style _run-override case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A transient transport hang (the liveness probe raising MultiplexerError, e.g. a 30s tmux hang) was read as "window dead -> crashed", rolling back a possibly-working session. Honor the 0.7.7 stall-hardening rule: skip the crash check on a probe transport error and let spec.timeout_s bound a persistent failure to an honest "timeout". - generic.wait_for_completion: guard self._window_alive with try/except MultiplexerError -> count an internal probe_failures and continue; genuine death (dead window -> list_window_ids returns [], no exception) still returns "crashed". - probe.py: same guard around launcher.window_alive -> retry next tick. - tests: transient-recovers -> completed; persistent -> timeout (never crashed); genuine False -> crashed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add crashed/crash_error to RunState — persistable, backward-compatible flags the engine will set (Phase 4) and the TUI will read (Phase 5). to_dict emits both; from_dict uses the same .get(default) idiom so old state.json loads unchanged. clear_pause now also resets them so resume re-arms a crashed run like a stopped one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ne (Layer B core) Engine.run() now catches any unexpected exception escaping the loop: it persists the traceback to crash.txt, tears down the orphaned agent session, sets state.crashed/crash_error, and appends a run-crash journal line — then falls through to a crashed summary rather than letting the traceback die to the lossy parked control window. Mirrors the RunStopped branch, including the nested-engine re-raise (the owner records; nested still persists + tears down). RunSummary gains crashed/crash_error with a CRASHED render line. cli.main() gets a final except-Exception backstop for the residual surface outside engine.run() (config load, construction, render/notify). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er B visibility) A run that recorded an unexpected engine crash (state.crashed) now surfaces as a distinct CRASHED status with a "see crash.txt" hint and the crash error, checked before liveness so its dead pid does not read as a generic INTERRUPTED. Pre-feature crashes lack the flag and stay INTERRUPTED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
Next review available in: 1 minute Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughVersion 0.7.9 introduces crash recording in ChangesCrash Recording & Tmux Transport Hardening
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🤖 Augment PR SummarySummary: Releases Changes:
🤖 Was this summary useful? React with 👍 or 👎 |
| Exception | ||
| ): # noqa: BLE001 # nosec B110 - best-effort teardown; a crashing run must still record | ||
| pass | ||
| if not self._owns_signals: |
There was a problem hiding this comment.
In Engine.run() (src/automator/engine.py:+227), the if not self._owns_signals: raise gate means an unexpected exception will re-raise if stop handlers weren’t installed (e.g., running off the main thread), even when there’s no outer engine to record the crash. That seems like it could re-open the “orchestrator dies without recording” gap in non-CLI usage paths.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Fixed in 188b716. Good catch — _owns_signals conflated "I installed handlers" with "I am nested". The re-raise now gates on a dedicated _is_nested flag, captured at _install_stop_signals time (Engine._stop_signals_owner is not None). A top-level engine that could not install handlers (off the main thread) has _is_nested=False and now records the crash instead of re-raising; a genuinely nested auto-sweep still re-raises for the owner to record. Covered by test_top_level_crash_without_signal_handlers_still_records.
| try: | ||
| message = str(exc) | ||
| except Exception: | ||
| message = repr(type(exc).__name__) |
There was a problem hiding this comment.
In the crash handler (src/automator/engine.py:+232), the fallback message = repr(type(exc).__name__) will include quotes and drop details if str(exc) itself fails, producing odd crash_error/journal messages like RuntimeError: 'RuntimeError'. Consider whether a non-quoted type name (or other safe formatting) would better preserve intent here.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Fixed in 188b716. The str(exc)-failed fallback now uses the bare type(exc).__name__ (no repr()), so crash_error reads BadStr: BadStr rather than BadStr: 'BadStr'. Covered by test_crash_message_fallback_when_str_raises.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/automator/engine.py`:
- Around line 211-245: Clear the finished flag when the exception handler
records a crash in the engine loop. In the exception block that sets
self.state.crashed and self.state.crash_error, also reset self.state.finished so
downstream status classification in Engine/state handling cannot report a
post-run failure as FINISHED; make this change in the same crash path that
appends the "run-crash" journal entry.
In `@src/automator/tui/widgets.py`:
- Around line 119-125: Suppress the attach decision footer for crashed runs by
updating the status check in the widget rendering path so `CRASHED` is excluded
alongside `INTERRUPTED`. In `src/automator/tui/widgets.py`, adjust the logic
around the `state.crash_error`/status text rendering so the later “press a to
attach and answer” hint does not appear when `status == data.CRASHED`, since
`Engine.run()` has already torn down the tmux session and there is no live
session to attach to.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c4704799-036d-42bb-8de2-48767db19cd0
⛔ Files ignored due to path filters (11)
docs/images/dashboard.pngis excluded by!**/*.pngdocs/images/dashboard.svgis excluded by!**/*.svgdocs/images/demo.gifis excluded by!**/*.gifdocs/images/settings-scm.pngis excluded by!**/*.pngdocs/images/settings-scm.svgis excluded by!**/*.svgdocs/images/settings.svgis excluded by!**/*.svgdocs/images/start-run-modal.pngis excluded by!**/*.pngdocs/images/start-run-modal.svgis excluded by!**/*.svgdocs/images/sweep-decision.pngis excluded by!**/*.pngdocs/images/sweep-decision.svgis excluded by!**/*.svguv.lockis excluded by!**/*.lock
📒 Files selected for processing (20)
.claude-plugin/marketplace.jsonCHANGELOG.mdmodule.yamlpyproject.tomlsrc/automator/__init__.pysrc/automator/adapters/generic.pysrc/automator/adapters/multiplexer.pysrc/automator/adapters/tmux_base.pysrc/automator/cli.pysrc/automator/data/skills/bmad-auto-setup/assets/module.yamlsrc/automator/engine.pysrc/automator/model.pysrc/automator/probe.pysrc/automator/tui/data.pysrc/automator/tui/widgets.pytests/test_engine.pytests/test_generic_tmux.pytests/test_model.pytests/test_multiplexer.pytests/test_tui_data.py
…ding Four findings from the automated reviewers on the crash-gap fix, all valid: - (major) Reset state.finished when recording a crash. The loop sets finished=True before _gc_run_worktrees/post_run/run-complete, so a throw in those was recorded as crashed but left finished=True — and classification checks finished first, masking the crash as FINISHED. - (medium) Distinguish "nested" from "no handlers installed". _owns_signals is False both for a nested auto-sweep and for a top-level engine that couldn't install handlers (off the main thread); the latter re-raised and lost the crash. A dedicated _is_nested flag (captured at install time) now gates the re-raise, so a top-level engine always records. - (low) Drop repr() from the str(exc)-failed fallback so crash_error reads "BadStr: BadStr", not "BadStr: 'BadStr'". - (minor) Suppress the "press a to attach and answer" footer for CRASHED runs — the tmux session is already torn down, so the hint points at a dead session. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the reviews — all four findings validated and addressed in 188b716:
Tests added for each (post-loop crash → CRASHED, off-main-thread top-level crash records, |
|
@coderabbitai resolve |
✅ Action performedComments resolved. Approval is disabled; enable |
|
@CodeRabbit review |
|
✅ Action performedReview finished.
|
0.7.9 — close the orchestrator crash-gap
Resilience hardening of the shared multiplexer seam + engine, surfaced while working on the psmux (Windows) backend but living entirely in shared core. Two layers:
Layer A — no false crash from a transient transport hang
tmux_base): no multiplexer contract method leaks a rawsubprocess.TimeoutExpired/OSError; each raisesMultiplexerError/TmuxErroror returns its documented sentinel. A 30 s tmux hang in the liveness probe (list_window_ids/window_alive) no longer escapes the seam.timeoutbounded byspec.timeout_s; genuine window death still crashes as before.Layer B — an unexpected exception is recorded, not lost
Engine.run()is now recorded —run-crashjournal line + persistedcrash.txttraceback +state.crashed/crash_error— the orphaned agent session is torn down, and a crashed summary is returned instead of the orchestrator dying to the detached, prunedbmad-auto-ctlwindow. Nested auto-sweep engines re-raise so the parent records it; a CLI backstop covers the residual surface.CRASHEDstatus (not genericINTERRUPTED); pre-feature crashes stayINTERRUPTED.resumere-arms a crashed run viaclear_pause.Validation
trunk checkclean.BaseTmuxBackend, overriding only_run; the seam guards live above_run).Once merged to
main, the Release workflow auto-creates thev0.7.9tag + GitHub release from the CHANGELOG.🤖 Generated with Claude Code
Summary by CodeRabbit