Follow-up to #29 (feat/observe-reaper). Both gaps were flagged in review and accepted as non-blocking for that PR; tracking them here so they don't get lost.
Gap 1 — ProcessState is stamped equal to RuntimeState
backend/internal/observe/reaper/reaper.go:191-197 only invokes Runtime.IsAlive and copies the boolean into both axes:
    case alive:
        facts.RuntimeState = ports.RuntimeProbeAlive
        facts.ProcessState = ports.ProcessProbeAlive
    default:
        facts.RuntimeState = ports.RuntimeProbeDead
        facts.ProcessState = ports.ProcessProbeDead
Failure mode: Claude Code panics / OOMs / segfaults inside a still-running tmux (or zellij) pane. tmux has-session returns success, so the reaper reports Alive + Alive. The LCM's probe decider can never see the disagreement (runtime alive, process dead) it was designed to resolve, so the session sits at working forever and only a human notices.
Why it merged anyway: the LCM/SM design hands process-exit detection to activity ingest (Claude Code hooks + .ao/activity.jsonl FS watcher), which is a separate follow-up. Between #29 landing and activity ingest landing, this gap is real.
Fix shape: once ports.Agent.ProbeProcess is wired (per internal/ports/outbound.go:104), the reaper should resolve the Agent for each session alongside the Runtime and stamp the two axes independently. The decide truth table already expresses the four combinations.
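A minimal sketch of the independent stamping described above. The constant names follow this issue, but the types, the boolean-only signatures for IsAlive and ProbeProcess, and the helper names (probeAxes, fakeRuntime, fakeAgent, crashedAgentFacts) are assumptions made for illustration; the real definitions in internal/ports may differ.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the ports package. The constant names follow the
// issue text; the types and method signatures are assumptions.
type RuntimeProbe int
type ProcessProbe int

const (
	RuntimeProbeDead RuntimeProbe = iota
	RuntimeProbeAlive
)

const (
	ProcessProbeDead ProcessProbe = iota
	ProcessProbeAlive
)

// Facts carries the two axes the probe decider compares.
type Facts struct {
	RuntimeState RuntimeProbe
	ProcessState ProcessProbe
}

// Runtime answers "is the tmux/zellij session still there?" (signature assumed).
type Runtime interface {
	IsAlive(ctx context.Context, sessionID string) bool
}

// Agent answers "is the agent process inside the pane still running?" (signature assumed).
type Agent interface {
	ProbeProcess(ctx context.Context, sessionID string) bool
}

// probeAxes stamps each axis from its own probe instead of copying one
// boolean into both.
func probeAxes(ctx context.Context, rt Runtime, ag Agent, sessionID string) Facts {
	facts := Facts{RuntimeState: RuntimeProbeDead, ProcessState: ProcessProbeDead}
	if rt.IsAlive(ctx, sessionID) {
		facts.RuntimeState = RuntimeProbeAlive
	}
	if ag.ProbeProcess(ctx, sessionID) {
		facts.ProcessState = ProcessProbeAlive
	}
	return facts
}

// fakeRuntime and fakeAgent are test doubles that return a fixed answer.
type fakeRuntime bool

func (f fakeRuntime) IsAlive(context.Context, string) bool { return bool(f) }

type fakeAgent bool

func (f fakeAgent) ProbeProcess(context.Context, string) bool { return bool(f) }

// crashedAgentFacts models the failure mode: pane still up, agent process gone.
func crashedAgentFacts() Facts {
	return probeAxes(context.Background(), fakeRuntime(true), fakeAgent(false), "session-1")
}

func main() {
	f := crashedAgentFacts()
	fmt.Println("runtime alive:", f.RuntimeState == RuntimeProbeAlive) // runtime alive: true
	fmt.Println("process alive:", f.ProcessState == ProcessProbeAlive) // process alive: false
}
```

With the two probes separated, the crashed-agent case yields the (runtime alive, process dead) combination that the current code can never produce.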
Gap 2 — No per-tick / per-probe timeout
backend/internal/observe/reaper/reaper.go:133-153 passes the loop's parent ctx straight through to IsAlive. tmux/zellij adapters honor ctx via exec.CommandContext (internal/adapters/runtime/tmux/tmux.go:50), but that ctx only cancels on daemon shutdown. There is no per-tick deadline.
Failure mode: a single hung IsAlive (NFS-backed tmux socket, runaway subprocess, kernel deadlock on the runtime binary) blocks the current tick indefinitely. Because the loop is a single goroutine with a time.Ticker, the next tick can't fire while the previous one is stuck — so TickEscalations stops firing for every session, not just the hung one. That defeats the whole point of the reaper heartbeat (invariant #6: a non-polling LCM relies on TickEscalations to wake up 30m escalations).
Low probability on a healthy box; blast radius is "no escalations fire until daemon restart."
Fix shape options (pick one):
- Per-probe context.WithTimeout(ctx, probeBudget) inside probeOne (simplest; bounds each call).
- Per-tick context.WithDeadline(ctx, now.Add(tickBudget)) in Tick (bounds the whole cycle).
- Run probes in bounded-concurrency goroutines with a wait group + per-probe timeout (also closes the "sequential probes" polish item).
Either of the first two is enough to prevent a stalled heartbeat; the third is a bigger refactor and only worth doing if the running set ever grows past tens of sessions.
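The per-probe option could look roughly like the sketch below. It is not the reaper's actual code: probeFunc, probeWithBudget, tick, hungProbe, healthyProbe and exampleTick are hypothetical names, and the 20ms budget is an arbitrary value chosen so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeFunc stands in for a Runtime.IsAlive call on one session; the real
// signature is an assumption here.
type probeFunc func(ctx context.Context) (bool, error)

// probeWithBudget gives a single probe its own deadline derived from the
// loop's parent ctx, so daemon shutdown still cancels it as before.
func probeWithBudget(parent context.Context, budget time.Duration, probe probeFunc) (bool, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	return probe(ctx)
}

// tick probes every session in order and returns how many probes ran out of
// budget.
func tick(parent context.Context, budget time.Duration, probes []probeFunc) int {
	timedOut := 0
	for _, probe := range probes {
		_, err := probeWithBudget(parent, budget, probe)
		if errors.Is(err, context.DeadlineExceeded) {
			timedOut++
		}
	}
	return timedOut
}

// hungProbe never answers on its own; it returns only once ctx is done,
// standing in for an adapter that honors ctx.
func hungProbe(ctx context.Context) (bool, error) {
	<-ctx.Done()
	return false, ctx.Err()
}

// healthyProbe answers immediately.
func healthyProbe(ctx context.Context) (bool, error) { return true, nil }

// exampleTick runs one tick over a hung and a healthy session with a 20ms
// per-probe budget.
func exampleTick() int {
	return tick(context.Background(), 20*time.Millisecond, []probeFunc{hungProbe, healthyProbe})
}

func main() {
	start := time.Now()
	n := exampleTick()
	fmt.Printf("timed-out probes: %d, tick returned within 1s: %v\n", n, time.Since(start) < time.Second)
}
```

The bound only holds if the adapter actually returns once its ctx is done. A probe that ignores cancellation would still stall the tick, which is where the goroutine-based option becomes relevant.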
Scope
(RuntimeProbeAlive, ProcessProbeDead) fact reaches the LCM; (feat(backend): Lifecycle Manager + Session Manager lane #2) a probe that blocks past the budget is cancelled and the next tick still fires.