release: 0.7.9 — orchestrator crash-gap fix#26

Merged
pbean merged 8 commits into main from release/0.7.9
Jun 30, 2026

@pbean pbean commented Jun 30, 2026

Collaborator

0.7.9 — close the orchestrator crash-gap

Resilience hardening of the shared multiplexer seam + engine, surfaced while working on the psmux (Windows) backend but living entirely in shared core. Two layers:

Layer A — no false crash from a transient transport hang

  • Seam honesty (tmux_base): no multiplexer contract method leaks a raw subprocess.TimeoutExpired/OSError; each raises MultiplexerError/TmuxError or returns its documented sentinel. A 30 s tmux hang in the liveness probe (list_window_ids/window_alive) no longer escapes the seam.
  • Wait-loop tolerance: an unknowable liveness probe is no longer read as "window dead → crashed". A persistent transport failure degrades to an honest timeout bounded by spec.timeout_s; genuine window death still crashes as before.

Layer B — an unexpected exception is recorded, not lost

  • Engine crash handler: any unexpected exception out of Engine.run() is now recorded — run-crash journal line + persisted crash.txt traceback + state.crashed/crash_error — the orphaned agent session is torn down, and a crashed summary is returned instead of the orchestrator dying to the detached, pruned bmad-auto-ctl window. Nested auto-sweep engines re-raise so the parent records it; a CLI backstop covers the residual surface.
  • TUI visibility: a recorded crash classifies as a distinct CRASHED status (not generic INTERRUPTED); pre-feature crashes stay INTERRUPTED. resume re-arms a crashed run via clear_pause.

Validation

  • Full suite green (1217 passed); full trunk check clean.
  • Seam-honesty REPL check + end-to-end crash safety-net (crash.txt + run-crash + state.crashed + session teardown + TUI CRASHED + resume re-arm).
  • The psmux PR (Windows terminal-multiplexer backend, #19) rebases cleanly onto this work and inherits the closed gap with zero psmux-side edits: it subclasses BaseTmuxBackend and overrides only _run, while the seam guards live above _run.

Once merged to main, the Release workflow auto-creates the v0.7.9 tag + GitHub release from the CHANGELOG.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of tmux connection and window-check issues so transient transport problems no longer look like crashes.
    • Runs now record crash details more reliably, including a traceback and crash status when unexpected failures occur.
    • The app’s UI now shows a distinct “crashed” state with clearer status messaging.
    • CLI errors now return a cleaner message instead of exposing raw tracebacks for unexpected failures.

pbean and others added 7 commits June 29, 2026 18:39
…rror

BaseTmuxBackend._run deliberately propagates raw TimeoutExpired/OSError so
callers' try/except fires, but several contract methods had no such guard —
a 30s tmux hang in list_window_ids/window_alive (the engine's liveness probe)
escaped the MultiplexerError seam and would later crash the engine.

Add localized guards in the inherited contract methods (above _run, so a psmux
overriding only _run is still covered):
- _tmux wraps once → covers new_session/set_session_option/new_window/
  new_parked_window/send_text, and pipe_pane via its existing except TmuxError.
- list_window_ids raises TmuxError on transport failure (a sentinel [] would
  falsely read as "window dead"); keeps returncode != 0 → [] for a real death.
- Group-3 best-effort ops degrade to their documented sentinel on failure
  (list_windows→[], show_window_option→"", switch_client→False, the rest→no-op).

multiplexer.py: docstrings note the raise-on-unknowable / sentinel-on-failure
contract. Tests lock in the raisers, the sentinel returners, the already-safe
swallowers, and the psmux-style _run-override case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
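
For readers skimming the thread, the raise-on-unknowable / sentinel-on-failure split described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in tmux_base.py: SketchBackend, HangingBackend, the error class bodies, and the exact tmux arguments are assumptions made for the example. Only the placement of the guards (above _run) and the two outcomes (raise vs. sentinel) follow the commit message.

```python
import subprocess


class MultiplexerError(Exception):
    """Stand-in for the seam's base error (name from the PR, body assumed)."""


class TmuxError(MultiplexerError):
    """Stand-in for the tmux-specific seam error."""


class SketchBackend:
    """Hypothetical backend; only the guard placement mirrors the commit message."""

    def _run(self, args):
        # Deliberately lets raw TimeoutExpired/OSError propagate to the caller.
        return subprocess.run(
            ["tmux", *args], capture_output=True, text=True, timeout=30
        )

    def list_window_ids(self, session):
        try:
            proc = self._run(["list-windows", "-t", session, "-F", "#{window_id}"])
        except (subprocess.TimeoutExpired, OSError) as exc:
            # Liveness is unknowable: raise, because [] would read as "window dead".
            raise TmuxError(f"list-windows failed: {exc}") from exc
        if proc.returncode != 0:
            return []  # tmux answered: the session or window is really gone
        return proc.stdout.split()

    def list_windows(self, session):
        try:
            proc = self._run(["list-windows", "-t", session, "-F", "#{window_name}"])
        except (subprocess.TimeoutExpired, OSError):
            return []  # best-effort op: degrade to the documented sentinel
        return proc.stdout.splitlines() if proc.returncode == 0 else []


class HangingBackend(SketchBackend):
    """psmux-style subclass overriding only _run; here it simulates a 30 s hang."""

    def _run(self, args):
        raise subprocess.TimeoutExpired(cmd=["tmux", *args], timeout=30)
```

Because the guards sit in the inherited contract methods, HangingBackend gets them without any code of its own, which is the property the psmux rebase relies on.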
A transient transport hang (the liveness probe raising MultiplexerError,
e.g. a 30s tmux hang) was read as "window dead -> crashed", rolling back a
possibly-working session. Honor the 0.7.7 stall-hardening rule: skip the
crash check on a probe transport error and let spec.timeout_s bound a
persistent failure to an honest "timeout".

- generic.wait_for_completion: guard self._window_alive with try/except
  MultiplexerError -> count an internal probe_failures and continue;
  genuine death (dead window -> list_window_ids returns [], no exception)
  still returns "crashed".
- probe.py: same guard around launcher.window_alive -> retry next tick.
- tests: transient-recovers -> completed; persistent -> timeout (never
  crashed); genuine False -> crashed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
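
The three outcomes listed in this commit (transient recovers, persistent times out, genuine death crashes) can be illustrated with a reduced wait loop. This is a sketch under assumptions: the function signature, the is_done callback, and the polling interval are hypothetical, and the real generic.wait_for_completion does more than this. The try/except around the probe and the probe_failures counter follow the description above.

```python
import time


class MultiplexerError(Exception):
    """Stand-in for the seam error raised when a probe result is unknowable."""


def wait_for_completion(window_alive, is_done, timeout_s, poll_s=0.01):
    """Hypothetical loop returning "completed", "crashed", or "timeout"."""
    deadline = time.monotonic() + timeout_s
    probe_failures = 0
    while time.monotonic() < deadline:
        if is_done():
            return "completed"
        try:
            alive = window_alive()
        except MultiplexerError:
            # Unknowable this tick: count it and skip the crash check.
            probe_failures += 1
            time.sleep(poll_s)
            continue
        if not alive:
            return "crashed"  # the probe answered, and the answer was "dead"
        time.sleep(poll_s)
    # A persistent transport failure ends here, bounded by timeout_s.
    return "timeout"
```

The design point is that only a probe that actually answers can produce "crashed"; a probe that cannot answer can at worst run out the clock.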
Add crashed/crash_error to RunState — persistable, backward-compatible
flags the engine will set (Phase 4) and the TUI will read (Phase 5).
to_dict emits both; from_dict uses the same .get(default) idiom so old
state.json loads unchanged. clear_pause now also resets them so resume
re-arms a crashed run like a stopped one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
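
A reduced picture of the backward-compatible round-trip, assuming a dataclass and an empty-string default for crash_error (neither is confirmed by this thread; the real RunState has many more fields, and the paused field here is only a placeholder for whatever clear_pause already reset):

```python
from dataclasses import dataclass


@dataclass
class RunState:
    """Reduced stand-in for the real RunState; field defaults are assumptions."""

    paused: bool = False  # placeholder for the existing pause/stop flags
    crashed: bool = False
    crash_error: str = ""

    def to_dict(self):
        return {
            "paused": self.paused,
            "crashed": self.crashed,
            "crash_error": self.crash_error,
        }

    @classmethod
    def from_dict(cls, data):
        # .get(default) keeps an older state.json (no crash keys) loadable.
        return cls(
            paused=data.get("paused", False),
            crashed=data.get("crashed", False),
            crash_error=data.get("crash_error", ""),
        )

    def clear_pause(self):
        # Resume re-arms a crashed run the same way it re-arms a stopped one.
        self.paused = False
        self.crashed = False
        self.crash_error = ""
```

Loading a dict without the new keys yields crashed=False, which is what lets pre-feature runs keep their old classification.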
…ne (Layer B core)

Engine.run() now catches any unexpected exception escaping the loop: it
persists the traceback to crash.txt, tears down the orphaned agent session,
sets state.crashed/crash_error, and appends a run-crash journal line — then
falls through to a crashed summary rather than letting the traceback die to
the lossy parked control window. Mirrors the RunStopped branch, including the
nested-engine re-raise (the owner records; nested still persists + tears down).

RunSummary gains crashed/crash_error with a CRASHED render line. cli.main()
gets a final except-Exception backstop for the residual surface outside
engine.run() (config load, construction, render/notify).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
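
The order of operations in the crash path can be sketched as follows. Everything here is illustrative: SketchEngine, the dict-based state, the journal.jsonl filename, and the session_alive flag are hypothetical stand-ins, and the real Engine.run() is considerably larger. What the sketch tries to preserve is the sequence described above: persist the traceback, tear down, re-raise if nested, otherwise record and fall through to a summary.

```python
import json
import traceback
from pathlib import Path


class SketchEngine:
    """Hypothetical engine; only the ordering of the crash steps follows the PR."""

    def __init__(self, run_dir, loop, is_nested=False):
        self.run_dir = Path(run_dir)
        self.loop = loop  # callable standing in for the real run loop
        self.is_nested = is_nested
        self.state = {"crashed": False, "crash_error": ""}
        self.session_alive = True

    def run(self):
        try:
            self.loop()
        except Exception as exc:  # catch-all by design: the crash must be recorded
            # 1. Persist the traceback before anything else can fail.
            (self.run_dir / "crash.txt").write_text(traceback.format_exc())
            # 2. Tear down the orphaned agent session (simulated by a flag here).
            self.session_alive = False
            # 3. A nested engine hands the exception to its owner to record.
            if self.is_nested:
                raise
            # 4. Record the crash in state and in the journal.
            self.state["crashed"] = True
            self.state["crash_error"] = f"{type(exc).__name__}: {exc}"
            with (self.run_dir / "journal.jsonl").open("a") as fh:
                entry = {"event": "run-crash", "error": self.state["crash_error"]}
                fh.write(json.dumps(entry) + "\n")
        # 5. Fall through to a summary instead of dying with a bare traceback.
        return dict(self.state)
```

A top-level run with a failing loop returns a summary with crashed set; the same loop inside a nested engine still writes crash.txt and tears down, then re-raises.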
…er B visibility)

A run that recorded an unexpected engine crash (state.crashed) now surfaces
as a distinct CRASHED status with a "see crash.txt" hint and the crash error,
checked before liveness so its dead pid does not read as a generic INTERRUPTED.
Pre-feature crashes lack the flag and stay INTERRUPTED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
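
The check ordering matters more than the individual checks, so here is a small sketch of it. The classify function, its parameters, and the RUNNING status are assumptions for illustration; FINISHED, CRASHED, and INTERRUPTED are the status names used in this thread, and the real _classify in tui/data.py takes more inputs.

```python
FINISHED = "finished"
CRASHED = "crashed"
RUNNING = "running"  # assumed name for the live-pid status
INTERRUPTED = "interrupted"


def classify(finished, crashed, pid_alive):
    """Hypothetical reduction of the status decision to three inputs."""
    if finished:
        return FINISHED
    if crashed:
        # Checked before liveness, so the dead pid of a recorded crash
        # does not read as a generic INTERRUPTED.
        return CRASHED
    return RUNNING if pid_alive else INTERRUPTED
```

A run whose state lacks the crashed flag (written before this release) falls through to the liveness check and stays INTERRUPTED. Since finished is tested first, a crash record is only visible if the crash path also leaves finished unset.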
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented Jun 30, 2026

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af42e27e-ee91-4eb8-9860-0a77f2cd6766

📥 Commits

Reviewing files that changed from the base of the PR and between 184b380 and 188b716.

📒 Files selected for processing (4)
  • src/automator/engine.py
  • src/automator/tui/widgets.py
  • tests/test_engine.py
  • tests/test_tui_app.py

Walkthrough

Version 0.7.9 introduces crash recording in Engine.run() and RunState, hardens tmux transport methods to follow strict raise/sentinel error contracts, tolerates transient MultiplexerError in liveness probes (GenericAdapter, probe.py), surfaces crashed run status in the TUI, and broadens CLI exception handling.

Changes

Crash Recording & Tmux Transport Hardening

| Layer | File(s) | Summary |
| --- | --- | --- |
| Tmux transport error contracts | src/automator/adapters/multiplexer.py, src/automator/adapters/tmux_base.py, tests/test_multiplexer.py | _tmux wraps TimeoutExpired/OSError as TmuxError; list_window_ids raises on transport failure; kill_window, list_windows, window option, and client methods follow documented best-effort/sentinel (no-op or empty return) contracts. Interface docstrings codify all raise/sentinel distinctions. Tests assert no raw subprocess errors leak through. |
| Transient MultiplexerError tolerance in liveness probes | src/automator/adapters/generic.py, src/automator/probe.py, tests/test_generic_tmux.py | wait_for_completion wraps _window_alive in try/except MultiplexerError, tracking a probe_failures counter and skipping crash classification on transient errors. probe.py's polling loop similarly continues on MultiplexerError. Tests cover transient recovery, persistent timeout, and genuine dead-window crash detection. |
| RunState crash fields and serialization | src/automator/model.py, tests/test_model.py | RunState gains crashed and crash_error fields with to_dict/from_dict round-trip and clear_pause() reset. Tests cover round-trip, legacy defaults, and clear_pause behavior. |
| Engine crash recording and RunSummary | src/automator/engine.py, src/automator/cli.py, tests/test_engine.py | Engine.run() adds a catch-all except Exception block that persists crash.txt, kills the session, marks state.crashed, records a run-crash journal entry, and re-raises for nested engines. RunSummary gains crashed/crash_error fields and renders CRASHED in output. CLI broadens exception catch to prevent bare tracebacks. |
| TUI crashed status discovery and rendering | src/automator/tui/data.py, src/automator/tui/widgets.py, tests/test_tui_data.py | data.py adds CRASHED constant; _classify accepts crashed flag and returns CRASHED before liveness checks; discover_runs reads crashed from state.json. widgets.py adds crash glyph (), bold-red style, and a RunHeader branch showing crash_error. Tests cover classification, watcher/discovery paths, and legacy backward compatibility. |
| Version bump to 0.7.9 | pyproject.toml, module.yaml, src/automator/__init__.py, .claude-plugin/marketplace.json, src/automator/data/skills/bmad-auto-setup/assets/module.yaml, CHANGELOG.md | Version bumped from 0.7.8 to 0.7.9 across all manifests and changelog. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • bmad-code-org/bmad-auto#11: Earlier multiplexer seam refactor that introduced the MultiplexerError / GenericAdapter / probe liveness-check integration that this PR builds crash-safety handling on top of.

Poem

🐰 Hoppity-hop through the terminal tree,
A crash once was lost — now it's saved, you'll see!
The tmux may hang, the transport may fail,
But crash.txt remembers each panicked detail.
✖ engine crashed glows bold red in the TUI —
Resume with e, little bunny, we believe in you! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.80% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the PR’s main release bump and crash-gap hardening work. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

augmentcode Bot commented Jun 30, 2026

🤖 Augment PR Summary

Summary: Releases 0.7.9 and closes a “crash-gap” where tmux transport hangs or unexpected engine exceptions could incorrectly crash runs or be lost to the parked control pane.

Changes:

  • Bumps version metadata to 0.7.9 and records the release in CHANGELOG.md.
  • Hardens the tmux multiplexer seam so raw subprocess.TimeoutExpired/OSError never leak; critical liveness paths raise MultiplexerError/TmuxError and non-critical ops return documented sentinels.
  • Makes wait loops tolerate “unknowable” liveness (transport errors) so transient hangs don’t get misread as window death; persistent failures degrade to an honest timeout.
  • Adds an Engine.run() crash safety-net: persist crash.txt, append run-crash, mark state.crashed/crash_error, and tear down the orphaned session instead of dying with a bare traceback.
  • Distinguishes nested engines via a dedicated _is_nested flag so only true nested runs re-raise for the outer owner to record.
  • Extends run state persistence (RunState) with crashed/crash_error and clears them on clear_pause() to re-arm resume.
  • Updates TUI classification/rendering to surface a distinct CRASHED status and suppress attach hints when the tmux session is already torn down.
  • Adds a CLI-level exception backstop to emit a clean error instead of leaking tracebacks.
  • Adds targeted tests covering seam honesty, crash recording, nested behavior, classification, and TUI rendering.

@augmentcode augmentcode Bot left a comment

Review completed. 2 suggestions posted.

Comment thread src/automator/engine.py Outdated
Exception
): # noqa: BLE001 # nosec B110 - best-effort teardown; a crashing run must still record
pass
if not self._owns_signals:

@augmentcode augmentcode Bot Jun 30, 2026

In Engine.run() (src/automator/engine.py:+227), the if not self._owns_signals: raise gate means an unexpected exception will re-raise if stop handlers weren’t installed (e.g., running off the main thread), even when there’s no outer engine to record the crash. That seems like it could re-open the “orchestrator dies without recording” gap in non-CLI usage paths.

Severity: medium

Collaborator Author

Fixed in 188b716. Good catch — _owns_signals conflated "I installed handlers" with "I am nested". The re-raise now gates on a dedicated _is_nested flag, captured at _install_stop_signals time (Engine._stop_signals_owner is not None). A top-level engine that could not install handlers (off the main thread) has _is_nested=False and now records the crash instead of re-raising; a genuinely nested auto-sweep still re-raises for the owner to record. Covered by test_top_level_crash_without_signal_handlers_still_records.
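
To make the distinction concrete, here is a small sketch of the two flags. It is a reduction under assumptions: SketchEngine and should_reraise are hypothetical, no real signal handlers are installed, and the can_install parameter is an invented stand-in for the main-thread check. Only the idea of capturing _is_nested from Engine._stop_signals_owner at install time comes from the reply above.

```python
class SketchEngine:
    """Hypothetical reduction of the flag logic; installs no real handlers."""

    # Class-level: the engine that currently owns the stop handlers, if any.
    _stop_signals_owner = None

    def _install_stop_signals(self, can_install=True):
        # can_install is a hypothetical stand-in for "on the main thread".
        cls = type(self)
        # Nested means an owner already existed when this engine started.
        self._is_nested = cls._stop_signals_owner is not None
        self._owns_signals = False
        if not self._is_nested and can_install:
            cls._stop_signals_owner = self
            self._owns_signals = True

    def should_reraise(self):
        # Gate on nesting, not on handler ownership.
        return self._is_nested
```

Under this sketch, a top-level engine that could not install handlers has _owns_signals False and _is_nested False, so it records the crash itself; gating on _owns_signals would have made it re-raise with nobody above to catch it.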

Comment thread src/automator/engine.py Outdated
try:
message = str(exc)
except Exception:
message = repr(type(exc).__name__)

@augmentcode augmentcode Bot Jun 30, 2026

In the crash handler (src/automator/engine.py:+232), the fallback message = repr(type(exc).__name__) will include quotes and drop details if str(exc) itself fails, producing odd crash_error/journal messages like RuntimeError: 'RuntimeError'. Consider whether a non-quoted type name (or other safe formatting) would better preserve intent here.

Severity: low

Collaborator Author

Fixed in 188b716. The str(exc)-failed fallback now uses the bare type(exc).__name__ (no repr()), so crash_error reads BadStr: BadStr rather than BadStr: 'BadStr'. Covered by test_crash_message_fallback_when_str_raises.
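
A minimal reproduction of the fallback, assuming the message has the shape "TypeName: detail" as the examples in this thread suggest. The crash_message helper is hypothetical; BadStr mirrors the exception name used in the reply.

```python
def crash_message(exc):
    """Hypothetical helper building the crash_error string."""
    name = type(exc).__name__
    try:
        detail = str(exc)
    except Exception:
        # Bare type name, no repr(), so the fallback carries no stray quotes.
        detail = name
    return f"{name}: {detail}"


class BadStr(Exception):
    """Exception whose __str__ itself raises, to exercise the fallback."""

    def __str__(self):
        raise ValueError("cannot stringify")


print(crash_message(BadStr()))  # BadStr: BadStr
print(crash_message(RuntimeError("kaput")))  # RuntimeError: kaput
```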

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/automator/engine.py`:
- Around line 211-245: Clear the finished flag when the exception handler
records a crash in the engine loop. In the exception block that sets
self.state.crashed and self.state.crash_error, also reset self.state.finished so
downstream status classification in Engine/state handling cannot report a
post-run failure as FINISHED; make this change in the same crash path that
appends the "run-crash" journal entry.

In `@src/automator/tui/widgets.py`:
- Around line 119-125: Suppress the attach decision footer for crashed runs by
updating the status check in the widget rendering path so `CRASHED` is excluded
alongside `INTERRUPTED`. In `src/automator/tui/widgets.py`, adjust the logic
around the `state.crash_error`/status text rendering so the later “press a to
attach and answer” hint does not appear when `status == data.CRASHED`, since
`Engine.run()` has already torn down the tmux session and there is no live
session to attach to.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c4704799-036d-42bb-8de2-48767db19cd0

📥 Commits

Reviewing files that changed from the base of the PR and between ef7f77b and 184b380.

⛔ Files ignored due to path filters (11)
  • docs/images/dashboard.png is excluded by !**/*.png
  • docs/images/dashboard.svg is excluded by !**/*.svg
  • docs/images/demo.gif is excluded by !**/*.gif
  • docs/images/settings-scm.png is excluded by !**/*.png
  • docs/images/settings-scm.svg is excluded by !**/*.svg
  • docs/images/settings.svg is excluded by !**/*.svg
  • docs/images/start-run-modal.png is excluded by !**/*.png
  • docs/images/start-run-modal.svg is excluded by !**/*.svg
  • docs/images/sweep-decision.png is excluded by !**/*.png
  • docs/images/sweep-decision.svg is excluded by !**/*.svg
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .claude-plugin/marketplace.json
  • CHANGELOG.md
  • module.yaml
  • pyproject.toml
  • src/automator/__init__.py
  • src/automator/adapters/generic.py
  • src/automator/adapters/multiplexer.py
  • src/automator/adapters/tmux_base.py
  • src/automator/cli.py
  • src/automator/data/skills/bmad-auto-setup/assets/module.yaml
  • src/automator/engine.py
  • src/automator/model.py
  • src/automator/probe.py
  • src/automator/tui/data.py
  • src/automator/tui/widgets.py
  • tests/test_engine.py
  • tests/test_generic_tmux.py
  • tests/test_model.py
  • tests/test_multiplexer.py
  • tests/test_tui_data.py

Comment thread src/automator/engine.py
Comment thread src/automator/tui/widgets.py
…ding

Four findings from the automated reviewers on the crash-gap fix, all valid:

- (major) Reset state.finished when recording a crash. The loop sets
  finished=True before _gc_run_worktrees/post_run/run-complete, so a throw in
  those was recorded as crashed but left finished=True — and classification
  checks finished first, masking the crash as FINISHED.
- (medium) Distinguish "nested" from "no handlers installed". _owns_signals is
  False both for a nested auto-sweep and for a top-level engine that couldn't
  install handlers (off the main thread); the latter re-raised and lost the
  crash. A dedicated _is_nested flag (captured at install time) now gates the
  re-raise, so a top-level engine always records.
- (low) Drop repr() from the str(exc)-failed fallback so crash_error reads
  "BadStr: BadStr", not "BadStr: 'BadStr'".
- (minor) Suppress the "press a to attach and answer" footer for CRASHED runs —
  the tmux session is already torn down, so the hint points at a dead session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pbean

pbean commented Jun 30, 2026

Collaborator Author

Thanks for the reviews — all four findings validated and addressed in 188b716:

  • (CodeRabbit, major) Crash recorded after a post-loop failure now clears state.finished, so it classifies as CRASHED instead of FINISHED (classification checks finished first).
  • (Augment, medium) Split the re-raise predicate off _owns_signals onto a dedicated _is_nested flag, so a top-level engine that couldn't install signal handlers (off the main thread) still records the crash instead of re-raising and losing it.
  • (Augment, low) Dropped repr() from the str(exc)-failed fallback → crash_error reads BadStr: BadStr, not BadStr: 'BadStr'.
  • (CodeRabbit, minor) CRASHED now suppresses the "press a to attach and answer" footer (the tmux session is already torn down on crash).

Tests added for each (post-loop crash → CRASHED, off-main-thread top-level crash records, __str__-raises fallback, crashed-run footer suppression). Full suite green (1221), trunk clean.

@pbean

pbean commented Jun 30, 2026

Collaborator Author

@coderabbitai resolve

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

✅ Action performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.

@pbean

pbean commented Jun 30, 2026

Collaborator Author

@CodeRabbit review
augment review

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

@pbean Re-reviewing the PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@augmentcode augmentcode Bot left a comment

Review completed. No suggestions at this time.

@pbean pbean merged commit bfe7562 into main Jun 30, 2026
7 checks passed
@pbean pbean deleted the release/0.7.9 branch June 30, 2026 04:19