release: 0.7.9 — orchestrator crash-gap fix by pbean · Pull Request #26 · bmad-code-org/bmad-auto

pbean · 2026-06-30T02:42:49Z

0.7.9 — close the orchestrator crash-gap

Resilience hardening of the shared multiplexer seam + engine, surfaced while working on the psmux (Windows) backend but living entirely in shared core. Two layers:

Layer A — no false crash from a transient transport hang

Seam honesty (tmux_base): no multiplexer contract method leaks a raw subprocess.TimeoutExpired/OSError; each raises MultiplexerError/TmuxError or returns its documented sentinel. A 30 s tmux hang in the liveness probe (list_window_ids/window_alive) no longer escapes the seam.
Wait-loop tolerance: an unknowable liveness probe is no longer read as "window dead → crashed". A persistent transport failure degrades to an honest timeout bounded by spec.timeout_s; genuine window death still crashes as before.

Layer B — an unexpected exception is recorded, not lost

Engine crash handler: any unexpected exception out of Engine.run() is now recorded — run-crash journal line + persisted crash.txt traceback + state.crashed/crash_error — the orphaned agent session is torn down, and a crashed summary is returned instead of the orchestrator dying to the detached, pruned bmad-auto-ctl window. Nested auto-sweep engines re-raise so the parent records it; a CLI backstop covers the residual surface.
TUI visibility: a recorded crash classifies as a distinct CRASHED status (not generic INTERRUPTED); pre-feature crashes stay INTERRUPTED. resume re-arms a crashed run via clear_pause.

Validation

Full suite green (1217 passed); full trunk check clean.
Seam-honesty REPL check + end-to-end crash safety-net (crash.txt + run-crash + state.crashed + session teardown + TUI CRASHED + resume re-arm).
psmux PR #19 (Windows terminal-multiplexer backend) rebases cleanly onto this work and inherits the closed gap with zero psmux-side edits (it subclasses BaseTmuxBackend, overriding only _run; the seam guards live above _run).

Once merged to main, the Release workflow auto-creates the v0.7.9 tag + GitHub release from the CHANGELOG.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved handling of tmux connection and window-check issues so transient transport problems no longer look like crashes.
- Runs now record crash details more reliably, including a traceback and crash status when unexpected failures occur.
- The app’s UI now shows a distinct “crashed” state with clearer status messaging.
- CLI errors now return a cleaner message instead of exposing raw tracebacks for unexpected failures.

…rror

BaseTmuxBackend._run deliberately propagates raw TimeoutExpired/OSError so callers' try/except fires, but several contract methods had no such guard: a 30s tmux hang in list_window_ids/window_alive (the engine's liveness probe) escaped the MultiplexerError seam and would later crash the engine.

Add localized guards in the inherited contract methods (above _run, so a psmux overriding only _run is still covered):
- _tmux wraps once → covers new_session/set_session_option/new_window/new_parked_window/send_text, and pipe_pane via its existing except TmuxError.
- list_window_ids raises TmuxError on transport failure (a sentinel [] would falsely read as "window dead"); keeps returncode != 0 → [] for a real death.
- Group-3 best-effort ops degrade to their documented sentinel on failure (list_windows→[], show_window_option→"", switch_client→False, the rest→no-op).

multiplexer.py: docstrings note the raise-on-unknowable / sentinel-on-failure contract. Tests lock in the raisers, the sentinel returners, the already-safe swallowers, and the psmux-style _run-override case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
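The raise-versus-sentinel split described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: `SketchBackend`, the `_run` signature, the tmux arguments, and the assumption that `TmuxError` subclasses `MultiplexerError` are all stand-ins chosen for the example.

```python
import subprocess


class MultiplexerError(Exception):
    """Seam-level error; the name is from the PR, the body is a stand-in."""


class TmuxError(MultiplexerError):
    """tmux failure; assumed here to subclass MultiplexerError."""


class SketchBackend:
    """Hypothetical reduction of BaseTmuxBackend, for illustration only."""

    def _run(self, args):
        # Lets TimeoutExpired/OSError propagate, as the commit describes.
        return subprocess.run(
            ["tmux", *args], capture_output=True, text=True, timeout=30
        )

    def _tmux(self, args):
        # Single guard above _run: transport failures become TmuxError,
        # so a subclass that overrides only _run is still covered.
        try:
            return self._run(args)
        except (subprocess.TimeoutExpired, OSError) as exc:
            raise TmuxError(f"tmux transport failure: {exc}") from exc

    def list_window_ids(self, session):
        # Unknowable liveness raises; a non-zero exit is a real death -> [].
        proc = self._tmux(["list-windows", "-t", session, "-F", "#{window_id}"])
        if proc.returncode != 0:
            return []
        return proc.stdout.split()

    def list_windows(self, session):
        # Best-effort op: degrade to the sentinel [] instead of raising.
        try:
            proc = self._tmux(["list-windows", "-t", session])
        except TmuxError:
            return []
        return proc.stdout.splitlines() if proc.returncode == 0 else []
```

The design point is that `[]` from `list_window_ids` always means "the window is gone", never "the probe could not tell", which is what lets the wait loop treat the two cases differently.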

A transient transport hang (the liveness probe raising MultiplexerError, e.g. a 30s tmux hang) was read as "window dead -> crashed", rolling back a possibly-working session. Honor the 0.7.7 stall-hardening rule: skip the crash check on a probe transport error and let spec.timeout_s bound a persistent failure to an honest "timeout".
- generic.wait_for_completion: guard self._window_alive with try/except MultiplexerError -> count an internal probe_failures and continue; genuine death (dead window -> list_window_ids returns [], no exception) still returns "crashed".
- probe.py: same guard around launcher.window_alive -> retry next tick.
- tests: transient-recovers -> completed; persistent -> timeout (never crashed); genuine False -> crashed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
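A rough sketch of the wait-loop tolerance described above, under stated assumptions: the function signature, the `is_done` callback, and the polling interval are hypothetical, and only the three outcomes ("completed", "crashed", "timeout") and the `probe_failures` counter come from the commit message.

```python
import time


class MultiplexerError(Exception):
    """Stand-in for the project's seam error."""


def wait_for_completion(window_alive, is_done, timeout_s, poll_s=0.01):
    """Return 'completed', 'crashed', or 'timeout' (hypothetical signature)."""
    deadline = time.monotonic() + timeout_s
    probe_failures = 0  # counted for diagnostics; never triggers a crash
    while time.monotonic() < deadline:
        if is_done():
            return "completed"
        try:
            alive = window_alive()
        except MultiplexerError:
            # Liveness is unknowable this tick: skip the crash check.
            probe_failures += 1
            time.sleep(poll_s)
            continue
        if not alive:
            # The probe answered and the window is gone: a genuine crash.
            return "crashed"
        time.sleep(poll_s)
    # A persistent transport failure ends here, bounded by timeout_s.
    return "timeout"
```

With this shape, a probe that always raises can only ever produce "timeout", while a probe that returns False still produces "crashed" on the first tick.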

Add crashed/crash_error to RunState — persistable, backward-compatible flags the engine will set (Phase 4) and the TUI will read (Phase 5). to_dict emits both; from_dict uses the same .get(default) idiom so old state.json loads unchanged. clear_pause now also resets them so resume re-arms a crashed run like a stopped one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
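The backward-compatible persistence described here might look roughly like the sketch below. The `paused` field and the empty-string default for `crash_error` are assumptions made for the example; the real RunState carries more fields and may use different defaults.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    # Reduced to the fields relevant here; "paused" is a hypothetical stand-in.
    paused: bool = False
    crashed: bool = False
    crash_error: str = ""

    def to_dict(self):
        return {
            "paused": self.paused,
            "crashed": self.crashed,
            "crash_error": self.crash_error,
        }

    @classmethod
    def from_dict(cls, data):
        # .get(default) lets an older state.json without the keys load unchanged.
        return cls(
            paused=data.get("paused", False),
            crashed=data.get("crashed", False),
            crash_error=data.get("crash_error", ""),
        )

    def clear_pause(self):
        # Resume re-arms a crashed run the same way as a stopped one.
        self.paused = False
        self.crashed = False
        self.crash_error = ""
```

Loading `{}` through `from_dict` yields a state with `crashed=False`, which is the property that keeps pre-feature runs classified as before.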

…ne (Layer B core) Engine.run() now catches any unexpected exception escaping the loop: it persists the traceback to crash.txt, tears down the orphaned agent session, sets state.crashed/crash_error, and appends a run-crash journal line — then falls through to a crashed summary rather than letting the traceback die to the lossy parked control window. Mirrors the RunStopped branch, including the nested-engine re-raise (the owner records; nested still persists + tears down). RunSummary gains crashed/crash_error with a CRASHED render line. cli.main() gets a final except-Exception backstop for the residual surface outside engine.run() (config load, construction, render/notify). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
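The ordering in this handler (persist, tear down, then either re-raise or record) can be sketched as below. `SketchEngine`, its constructor arguments, and the list-based journal are invented for illustration; only crash.txt, the run-crash entry, the crashed/crash_error fields, and the nested re-raise come from the commit message.

```python
import traceback
from pathlib import Path


class SketchEngine:
    """Hypothetical reduction of Engine.run(); not the project's code."""

    def __init__(self, run_dir, loop, teardown, nested=False):
        self.run_dir = Path(run_dir)
        self._loop = loop
        self._teardown = teardown
        self._nested = nested
        self.crashed = False
        self.crash_error = ""
        self.journal = []

    def run(self):
        try:
            self._loop()
        except Exception as exc:
            # Persist the traceback first so it survives later failures.
            (self.run_dir / "crash.txt").write_text(traceback.format_exc())
            try:
                self._teardown()
            except Exception:
                pass  # best-effort teardown; the crash must still be recorded
            if self._nested:
                raise  # the owning engine records the crash
            self.crashed = True
            self.crash_error = f"{type(exc).__name__}: {exc}"
            self.journal.append(("run-crash", self.crash_error))
        # Falls through to a summary instead of dying with a bare traceback.
        return {"crashed": self.crashed, "crash_error": self.crash_error}
```

A nested engine in this sketch still writes crash.txt and tears down its session before re-raising, matching the "nested still persists + tears down" note.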

…er B visibility) A run that recorded an unexpected engine crash (state.crashed) now surfaces as a distinct CRASHED status with a "see crash.txt" hint and the crash error, checked before liveness so its dead pid does not read as a generic INTERRUPTED. Pre-feature crashes lack the flag and stay INTERRUPTED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
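The check ordering matters here, so a small sketch may help. The function name, its parameters, and the RUNNING status are assumptions for the example; the thread only confirms that finished is checked first, that crashed is checked before liveness, and that a dead pid without the flag reads as INTERRUPTED.

```python
FINISHED = "FINISHED"
CRASHED = "CRASHED"
RUNNING = "RUNNING"  # hypothetical name for the live-process status
INTERRUPTED = "INTERRUPTED"


def classify(finished, crashed, pid_alive):
    """Classify a run from its persisted flags and process liveness."""
    if finished:
        return FINISHED
    if crashed:
        # Checked before liveness: a crashed run's pid is dead by definition.
        return CRASHED
    return RUNNING if pid_alive else INTERRUPTED
```

A pre-feature state.json has no crashed flag, so it arrives here as `crashed=False` and a dead pid still classifies as INTERRUPTED.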

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rocess timeouts…

coderabbitai · 2026-06-30T02:43:07Z

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af42e27e-ee91-4eb8-9860-0a77f2cd6766

📥 Commits

Reviewing files that changed from the base of the PR and between 184b380 and 188b716.

📒 Files selected for processing (4)

src/automator/engine.py
src/automator/tui/widgets.py
tests/test_engine.py
tests/test_tui_app.py

Walkthrough

Version 0.7.9 introduces crash recording in Engine.run() and RunState, hardens tmux transport methods to follow strict raise/sentinel error contracts, tolerates transient MultiplexerError in liveness probes (GenericAdapter, probe.py), surfaces crashed run status in the TUI, and broadens CLI exception handling.

Changes

Crash Recording & Tmux Transport Hardening

| Layer | File(s) | Summary |
| --- | --- | --- |
| Tmux transport error contracts | `src/automator/adapters/multiplexer.py`, `src/automator/adapters/tmux_base.py`, `tests/test_multiplexer.py` | `_tmux` wraps `TimeoutExpired`/`OSError` as `TmuxError`; `list_window_ids` raises on transport failure; `kill_window`, `list_windows`, window option, and client methods follow documented best-effort/sentinel (no-op or empty return) contracts. Interface docstrings codify all raise/sentinel distinctions. Tests assert no raw subprocess errors leak through. |
| Transient `MultiplexerError` tolerance in liveness probes | `src/automator/adapters/generic.py`, `src/automator/probe.py`, `tests/test_generic_tmux.py` | `wait_for_completion` wraps `_window_alive` in `try/except MultiplexerError`, tracking a `probe_failures` counter and skipping crash classification on transient errors. `probe.py`'s polling loop similarly continues on `MultiplexerError`. Tests cover transient recovery, persistent timeout, and genuine dead-window crash detection. |
| `RunState` crash fields and serialization | `src/automator/model.py`, `tests/test_model.py` | `RunState` gains `crashed` and `crash_error` fields with `to_dict`/`from_dict` round-trip and `clear_pause()` reset. Tests cover round-trip, legacy defaults, and `clear_pause` behavior. |
| Engine crash recording and `RunSummary` | `src/automator/engine.py`, `src/automator/cli.py`, `tests/test_engine.py` | `Engine.run()` adds a catch-all `except Exception` block that persists `crash.txt`, kills the session, marks `state.crashed`, records a `run-crash` journal entry, and re-raises for nested engines. `RunSummary` gains `crashed`/`crash_error` fields and renders `CRASHED` in output. CLI broadens exception catch to prevent bare tracebacks. |
| TUI crashed status discovery and rendering | `src/automator/tui/data.py`, `src/automator/tui/widgets.py`, `tests/test_tui_data.py` | `data.py` adds `CRASHED` constant; `_classify` accepts `crashed` flag and returns `CRASHED` before liveness checks; `discover_runs` reads `crashed` from `state.json`. `widgets.py` adds crash glyph (`✖`), bold-red style, and a `RunHeader` branch showing `crash_error`. Tests cover classification, watcher/discovery paths, and legacy backward compatibility. |
| Version bump to 0.7.9 | `pyproject.toml`, `module.yaml`, `src/automator/__init__.py`, `.claude-plugin/marketplace.json`, `src/automator/data/skills/bmad-auto-setup/assets/module.yaml`, `CHANGELOG.md` | Version bumped from `0.7.8` to `0.7.9` across all manifests and changelog. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

bmad-code-org/bmad-auto#11: Earlier multiplexer seam refactor that introduced the MultiplexerError / GenericAdapter / probe liveness-check integration that this PR builds crash-safety handling on top of.

Poem

🐰 Hoppity-hop through the terminal tree,
A crash once was lost — now it's saved, you'll see!
The tmux may hang, the transport may fail,
But crash.txt remembers each panicked detail.
✖ engine crashed glows bold red in the TUI —
Resume with e, little bunny, we believe in you! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.80% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the PR’s main release bump and crash-gap hardening work.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

augmentcode · 2026-06-30T02:46:09Z

🤖 Augment PR Summary

Summary: Releases 0.7.9 and closes a “crash-gap” where tmux transport hangs or unexpected engine exceptions could incorrectly crash runs or be lost to the parked control pane.

Changes:

Bumps version metadata to 0.7.9 and records the release in CHANGELOG.md.
Hardens the tmux multiplexer seam so raw subprocess.TimeoutExpired/OSError never leak; critical liveness paths raise MultiplexerError/TmuxError and non-critical ops return documented sentinels.
Makes wait loops tolerate “unknowable” liveness (transport errors) so transient hangs don’t get misread as window death; persistent failures degrade to an honest timeout.
Adds an Engine.run() crash safety-net: persist crash.txt, append run-crash, mark state.crashed/crash_error, and tear down the orphaned session instead of dying with a bare traceback.
Distinguishes nested engines via a dedicated _is_nested flag so only true nested runs re-raise for the outer owner to record.
Extends run state persistence (RunState) with crashed/crash_error and clears them on clear_pause() to re-arm resume.
Updates TUI classification/rendering to surface a distinct CRASHED status and suppress attach hints when the tmux session is already torn down.
Adds a CLI-level exception backstop to emit a clean error instead of leaking tracebacks.
Adds targeted tests covering seam honesty, crash recording, nested behavior, classification, and TUI rendering.

augmentcode

Review completed. 2 suggestions posted.

augmentcode · 2026-06-30T02:46:10Z

+                    Exception
+                ):  # noqa: BLE001  # nosec B110 - best-effort teardown; a crashing run must still record
+                    pass
+                if not self._owns_signals:


In Engine.run() (src/automator/engine.py:+227), the if not self._owns_signals: raise gate means an unexpected exception will re-raise if stop handlers weren’t installed (e.g., running off the main thread), even when there’s no outer engine to record the crash. That seems like it could re-open the “orchestrator dies without recording” gap in non-CLI usage paths.

Severity: medium

Fixed in 188b716. Good catch — _owns_signals conflated "I installed handlers" with "I am nested". The re-raise now gates on a dedicated _is_nested flag, captured at _install_stop_signals time (Engine._stop_signals_owner is not None). A top-level engine that could not install handlers (off the main thread) has _is_nested=False and now records the crash instead of re-raising; a genuinely nested auto-sweep still re-raises for the owner to record. Covered by test_top_level_crash_without_signal_handlers_still_records.
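To make the distinction in this reply concrete, here is a minimal sketch of the two flags. It does not install real signal handlers, and everything apart from the names `_owns_signals`, `_is_nested`, `_install_stop_signals`, and `_stop_signals_owner` is an assumption for illustration.

```python
import threading


class SketchEngine:
    """Hypothetical reduction; the real method installs actual handlers."""

    # Class-level: the engine that installed stop handlers, if any.
    _stop_signals_owner = None

    def __init__(self):
        self._owns_signals = False
        self._is_nested = False

    def _install_stop_signals(self):
        # Captured at install time: an existing owner means we are nested.
        self._is_nested = SketchEngine._stop_signals_owner is not None
        if self._is_nested:
            return
        if threading.current_thread() is not threading.main_thread():
            # Handlers cannot be installed off the main thread, but this
            # engine is still top-level and must record its own crash.
            return
        SketchEngine._stop_signals_owner = self
        self._owns_signals = True

    def should_reraise(self):
        # Gate on nesting, not on handler ownership.
        return self._is_nested
```

An engine started off the main thread with no owner ends up with both flags False, so it records the crash; gating on `not self._owns_signals` would have re-raised in that case.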

augmentcode · 2026-06-30T02:46:11Z

+                try:
+                    message = str(exc)
+                except Exception:
+                    message = repr(type(exc).__name__)


In the crash handler (src/automator/engine.py:+232), the fallback message = repr(type(exc).__name__) will include quotes and drop details if str(exc) itself fails, producing odd crash_error/journal messages like RuntimeError: 'RuntimeError'. Consider whether a non-quoted type name (or other safe formatting) would better preserve intent here.

Severity: low

Fixed in 188b716. The str(exc)-failed fallback now uses the bare type(exc).__name__ (no repr()), so crash_error reads BadStr: BadStr rather than BadStr: 'BadStr'. Covered by test_crash_message_fallback_when_str_raises.
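The fallback described in this reply can be illustrated with a short sketch. The helper name `crash_message` is hypothetical, and the "TypeName: message" layout is inferred from the `BadStr: BadStr` example above.

```python
def crash_message(exc):
    """Format 'TypeName: message', surviving a broken __str__."""
    name = type(exc).__name__
    try:
        message = str(exc)
    except Exception:
        # Bare type name, no repr(): avoids the quoted 'BadStr' form.
        message = name
    return f"{name}: {message}"


class BadStr(Exception):
    """Exception whose __str__ itself raises, to exercise the fallback."""

    def __str__(self):
        raise RuntimeError("cannot render")
```

Here `crash_message(BadStr())` gives `BadStr: BadStr`, while an ordinary exception such as `ValueError("x")` gives `ValueError: x`.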

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/automator/engine.py`:
- Around line 211-245: Clear the finished flag when the exception handler
records a crash in the engine loop. In the exception block that sets
self.state.crashed and self.state.crash_error, also reset self.state.finished so
downstream status classification in Engine/state handling cannot report a
post-run failure as FINISHED; make this change in the same crash path that
appends the "run-crash" journal entry.

In `@src/automator/tui/widgets.py`:
- Around line 119-125: Suppress the attach decision footer for crashed runs by
updating the status check in the widget rendering path so `CRASHED` is excluded
alongside `INTERRUPTED`. In `src/automator/tui/widgets.py`, adjust the logic
around the `state.crash_error`/status text rendering so the later “press a to
attach and answer” hint does not appear when `status == data.CRASHED`, since
`Engine.run()` has already torn down the tmux session and there is no live
session to attach to.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c4704799-036d-42bb-8de2-48767db19cd0

📥 Commits

Reviewing files that changed from the base of the PR and between ef7f77b and 184b380.

⛔ Files ignored due to path filters (11)

docs/images/dashboard.png is excluded by !**/*.png
docs/images/dashboard.svg is excluded by !**/*.svg
docs/images/demo.gif is excluded by !**/*.gif
docs/images/settings-scm.png is excluded by !**/*.png
docs/images/settings-scm.svg is excluded by !**/*.svg
docs/images/settings.svg is excluded by !**/*.svg
docs/images/start-run-modal.png is excluded by !**/*.png
docs/images/start-run-modal.svg is excluded by !**/*.svg
docs/images/sweep-decision.png is excluded by !**/*.png
docs/images/sweep-decision.svg is excluded by !**/*.svg
uv.lock is excluded by !**/*.lock

📒 Files selected for processing (20)

.claude-plugin/marketplace.json
CHANGELOG.md
module.yaml
pyproject.toml
src/automator/__init__.py
src/automator/adapters/generic.py
src/automator/adapters/multiplexer.py
src/automator/adapters/tmux_base.py
src/automator/cli.py
src/automator/data/skills/bmad-auto-setup/assets/module.yaml
src/automator/engine.py
src/automator/model.py
src/automator/probe.py
src/automator/tui/data.py
src/automator/tui/widgets.py
tests/test_engine.py
tests/test_generic_tmux.py
tests/test_model.py
tests/test_multiplexer.py
tests/test_tui_data.py

…ding

Four findings from the automated reviewers on the crash-gap fix, all valid:
- (major) Reset state.finished when recording a crash. The loop sets finished=True before _gc_run_worktrees/post_run/run-complete, so a throw in those was recorded as crashed but left finished=True, and classification checks finished first, masking the crash as FINISHED.
- (medium) Distinguish "nested" from "no handlers installed". _owns_signals is False both for a nested auto-sweep and for a top-level engine that couldn't install handlers (off the main thread); the latter re-raised and lost the crash. A dedicated _is_nested flag (captured at install time) now gates the re-raise, so a top-level engine always records.
- (low) Drop repr() from the str(exc)-failed fallback so crash_error reads "BadStr: BadStr", not "BadStr: 'BadStr'".
- (minor) Suppress the "press a to attach and answer" footer for CRASHED runs: the tmux session is already torn down, so the hint points at a dead session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pbean · 2026-06-30T03:07:09Z

Thanks for the reviews — all four findings validated and addressed in 188b716:

(CodeRabbit, major) Crash recorded after a post-loop failure now clears state.finished, so it classifies as CRASHED instead of FINISHED (classification checks finished first).
(Augment, medium) Split the re-raise predicate off _owns_signals onto a dedicated _is_nested flag, so a top-level engine that couldn't install signal handlers (off the main thread) still records the crash instead of re-raising and losing it.
(Augment, low) Dropped repr() from the str(exc)-failed fallback → crash_error reads BadStr: BadStr, not BadStr: 'BadStr'.
(CodeRabbit, minor) CRASHED now suppresses the "press a to attach and answer" footer (the tmux session is already torn down on crash).

Tests added for each (post-loop crash → CRASHED, off-main-thread top-level crash records, __str__-raises fallback, crashed-run footer suppression). Full suite green (1221), trunk clean.

pbean · 2026-06-30T03:20:07Z

@coderabbitai resolve

coderabbitai · 2026-06-30T03:20:34Z

✅ Action performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.

pbean · 2026-06-30T03:41:22Z

@CodeRabbit review
augment review

coderabbitai · 2026-06-30T03:41:34Z

@pbean Re-reviewing the PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

augmentcode

Review completed. No suggestions at this time.

pbean and others added 7 commits June 29, 2026 18:39

docs(changelog): record the orchestrator crash-gap fix

9528ccd

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(release): 0.7.9 — The multiplexer seam no longer leaks raw subp…

184b380

…rocess timeouts…

augmentcode Bot reviewed Jun 30, 2026

coderabbitai Bot reviewed Jun 30, 2026

augmentcode Bot reviewed Jun 30, 2026

pbean merged commit bfe7562 into main Jun 30, 2026
7 checks passed

pbean deleted the release/0.7.9 branch June 30, 2026 04:19
