
fix: revive pane-watcher / no-handoff watchdog (zellij 0.45 dump-screen) #579

Merged: brickfrog merged 11 commits into main from feature/pane-watcher-revival on Jun 13, 2026.

brickfrog (Owner) commented on Jun 13, 2026

Summary

Revives the no-handoff watchdog, which had been silently dead. The watchdog is supposed to detect an agent that finished (or stalled) without ever calling notify_parent, and escalate its pane tail to the parent. It never fired because the pane-content feed relied on zellij subscribe pane-viewport, an invocation that is broken on zellij 0.45.0: the pane-viewport positional was removed, and --pane-id requires a real pane-id, not a tab name. As a result, the supervisor died on spawn, the /tmp/choir-viewport-* snapshot was never written, observe_idle always returned None, and the watchdog looped over nothing, so silent workers never produced an escalation.
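To make the failure mode concrete, here is a minimal Python sketch (not the project's MoonBit code) of an idle check that compares successive pane snapshots. It assumes idle time is measured from the last change in pane text; IdleTracker, should_escalate and the threshold parameter are hypothetical, and only the name observe_idle comes from this PR. When no snapshot ever arrives, observe_idle returns None on every tick and the escalation condition can never be true, which matches the silent watchdog described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleTracker:
    """Hypothetical per-agent state: the last pane text seen and when it changed."""
    last_text: Optional[str] = None
    last_change: float = 0.0

    def observe_idle(self, snapshot: Optional[str], now: float) -> Optional[float]:
        """Return seconds since the pane text last changed, or None without a snapshot."""
        if snapshot is None:
            # The 0.45 failure mode: no snapshot was ever written, so there is
            # nothing to compare and the watchdog gets no idle signal at all.
            return None
        if snapshot != self.last_text:
            self.last_text = snapshot
            self.last_change = now
        return now - self.last_change


def should_escalate(idle: Optional[float], handed_off: bool, threshold: float) -> bool:
    """Escalate only when the pane stayed idle past the threshold with no handoff."""
    return idle is not None and not handed_off and idle >= threshold
```

Because a missing snapshot is indistinguishable from "not idle yet" in this shape, the watchdog fails closed and silently, which is why a live host check is the only way to notice the breakage.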

Tracked as beads choir-8unx / choir-jckm.

What changed

  • Read panes via zellij action dump-screen --pane-id <id> --full on the existing ~30s poller tick. This is focus-free (verified on host) and needs no long-running supervisor and no /tmp snapshot files.
  • Resolve each agent's pane-id at watch time by reusing the existing @tools.resolve_zellij_pane_id against a cached list-panes --json --tab snapshot (injected list_panes_snapshot_capture capability), with no duplicate resolver, no sidecar, and no spawn-flow changes.
  • Re-watch recovered agents after a server reload (apply_recovered_agentwatch_pane, guarded by alive && terminal_target), so the watchdog survives the exec-in-place reload.
  • Dropped the in-pane "prod" nag entirely (worker and pre-TDD-leaf). The watchdog no longer types LocalTerminalInput into agent panes to demand a handoff, since that text triggered codex apology loops; it now escalates the pane tail straight to the parent once a single worker_no_handoff_idle_sec threshold is crossed. The Prod enum branch and prodded_at state are deleted.
  • Deleted the broken streaming-subscribe subsystem: the C choir_spawn_zellij_pane_viewport_subscribe and SIGTERM helpers, the FFI binding, io_stub, and the snapshot-file path.
  • Kept src/runtime I/O-free: the native dump/list-panes adapters are injected at the server/bin boundary; PaneWatcher::new is hermetic.
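The watch-time resolution, the injected adapters and the recovery guard above can be sketched together in Python. This is an illustrative stand-in named after PaneWatcher, not the MoonBit implementation: the snapshot schema (dicts with "id" and "tab_name"), resolve_pane_id, rewatch_recovered, tick and the refresh-on-miss behavior are all assumptions. Only the dump-screen argv is taken verbatim from the PR text, and the sketch never executes it.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical snapshot schema: one dict per pane with "id" and "tab_name".
# The real list-panes JSON layout is not described in the PR.
Snapshot = List[dict]


def dump_screen_argv(pane_id: str) -> List[str]:
    """argv for the invocation named in the PR (zellij 0.45)."""
    return ["zellij", "action", "dump-screen", "--pane-id", pane_id, "--full"]


def resolve_pane_id(snapshot: Snapshot, tab_name: str) -> Optional[str]:
    """Map a tab name to a real pane id, since --pane-id rejects tab names."""
    for pane in snapshot:
        if pane.get("tab_name") == tab_name:
            return str(pane["id"])
    return None


class PaneWatcher:
    """Hermetic watcher: every side effect arrives through an injected callable."""

    def __init__(
        self,
        list_panes: Callable[[], Snapshot],
        dump: Callable[[str], str],
        tail_lines: int = 20,
    ) -> None:
        self._list_panes = list_panes  # stand-in for list_panes_snapshot_capture
        self._dump = dump              # stand-in for the native dump-screen adapter
        self._tail_lines = tail_lines
        self._snapshot: Optional[Snapshot] = None
        self.watched: Dict[str, str] = {}  # agent name -> pane id

    def watch(self, agent: str, terminal_target: str) -> bool:
        """Resolve the pane id at watch time against a cached snapshot."""
        if self._snapshot is None:
            self._snapshot = self._list_panes()
        pane_id = resolve_pane_id(self._snapshot, terminal_target)
        if pane_id is None:
            # Assumption: refresh the cache once on a miss (e.g. a newer tab).
            self._snapshot = self._list_panes()
            pane_id = resolve_pane_id(self._snapshot, terminal_target)
        if pane_id is None:
            return False
        self.watched[agent] = pane_id
        return True

    def rewatch_recovered(
        self, agent: str, terminal_target: Optional[str], alive: bool
    ) -> bool:
        """After a reload, re-watch only when alive && terminal_target."""
        if not (alive and terminal_target):
            return False
        return self.watch(agent, terminal_target)

    def tick(self, agent: str) -> Optional[str]:
        """One poller tick: dump the agent's pane and return its last lines."""
        pane_id = self.watched.get(agent)
        if pane_id is None:
            return None
        lines = self._dump(pane_id).splitlines()
        return "\n".join(lines[-self._tail_lines:])
```

Passing the adapters to the constructor is one way to keep construction hermetic in the sense the PR describes: tests hand in fakes, while the server binary would hand in the native list-panes and dump-screen adapters.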

Review trail

This branch took four leaf PRs and one audit round, disclosed in full:

  • #576 (fix: revive pane watcher no-handoff watchdog) revived the watcher but (a) raced past a mid-flight redirect and landed a heavier design, and (b) reinvented resolve_zellij_pane_id instead of reusing it.
  • #577 (refactor: reuse pane resolver for watcher) switched to the existing resolver (net −13 lines), dropping the sidecar, the spawn capture, and the duplicate resolver.
  • Audit (Sarcasmotron, receipt f2358fc8, 5 findings): recovery didn't re-watch; the prod was only half-removed; src/runtime called @exec directly; the spec-required live-zellij host check was missing.
  • #578 (Fix pane watcher audit regressions) fixed all five findings. It was TL-gated (automerge=false) and merged via merge_pr after CI went green.

The spawn_diagnostics subsystem that briefly appeared on this branch is pre-existing infrastructure on main; #577 reverted #576's edits to it, so this PR carries zero lines of it.

Verification

  • moon test --target native: 2009/2009 green on the integrated branch.
  • Hermeticity lint (choir_lint): clean.
  • Grep gates: streaming-subscribe, prod machinery, duplicate resolver/sidecar all absent; src/runtime free of @exec.
  • New self-skipping host check (live_zellij_dump_screen_test.mbt): when a zellij session exists, it creates a scratch tab, asserts that dump-screen --pane-id returns non-empty text and leaves focus unchanged, then closes the tab. This is the exact observable check that would have caught the original 0.45 breakage. In CI, where no session exists, it self-skips (and passes).
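The host check's structure can be sketched in Python with every zellij interaction injected as a callable, so the skip path, both assertions and the guaranteed tab cleanup can be exercised without a live session. This is an illustrative stand-in, not the contents of live_zellij_dump_screen_test.mbt; live_dump_screen_check and all of its parameters are hypothetical.

```python
from typing import Callable


def live_dump_screen_check(
    session_exists: Callable[[], bool],
    new_scratch_tab: Callable[[], str],  # hypothetical: returns the scratch pane's id
    focused_pane: Callable[[], str],     # hypothetical: returns the focused pane's id
    dump: Callable[[str], str],          # stand-in for dump-screen --pane-id <id> --full
    close_tab: Callable[[str], None],    # hypothetical: closes the scratch tab
) -> str:
    """Return "skipped" without a session, "passed" on success; raise on failure."""
    if not session_exists():
        # CI path: no zellij session, so the check self-skips and counts as a pass.
        return "skipped"
    pane_id = new_scratch_tab()
    try:
        focus_before = focused_pane()
        text = dump(pane_id)
        focus_after = focused_pane()
        if not text.strip():
            raise AssertionError("dump-screen returned empty text")
        if focus_before != focus_after:
            raise AssertionError("dump-screen changed focus")
    finally:
        # Always remove the scratch tab, even when an assertion fails.
        close_tab(pane_id)
    return "passed"
```

Closing the tab in a finally block keeps a failing run from leaving scratch tabs behind in the developer's session, and comparing focus immediately around the dump call isolates the "focus-free" property from the focus change caused by creating the tab.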

Follow-ups (filed, not in this PR)

  • Auto-synthesize a structured notify_parent from pane text on done-without-handoff (this PR forwards the raw pane tail).
  • Auto-delete merged leaf branches after merge_pr/automerge (today's 220-branch cleanup was manual).
  • PID/pane-id-reuse hardening.

brickfrog merged commit ee3985b into main on Jun 13, 2026 (1 check passed).
brickfrog deleted the feature/pane-watcher-revival branch on Jun 13, 2026 at 13:08.