reaper: two known gaps from PR #29 (process-axis collapsed; no per-probe timeout)

Follow-up to #29 (feat/observe-reaper). Both gaps were flagged in review and accepted as non-blocking for that PR; tracking them here so they don't get lost.

## Gap 1 — `ProcessState` is stamped equal to `RuntimeState`

`backend/internal/observe/reaper/reaper.go:191-197` only invokes `Runtime.IsAlive` and copies the boolean into both axes:

```go
case alive:
    facts.RuntimeState = ports.RuntimeProbeAlive
    facts.ProcessState = ports.ProcessProbeAlive
default:
    facts.RuntimeState = ports.RuntimeProbeDead
    facts.ProcessState = ports.ProcessProbeDead
```

**Failure mode:** Claude Code panics / OOMs / segfaults *inside* a still-running tmux (or zellij) pane. `tmux has-session` returns success, so the reaper reports `Alive + Alive`. The LCM's probe decider can never see the disagreement (runtime alive, process dead) it was designed to resolve, so the session sits at `working` forever and only a human notices.

**Why it merged anyway:** the LCM/SM design hands process-exit detection to activity ingest (Claude Code hooks + `.ao/activity.jsonl` FS watcher), which is a separate follow-up. Between #29 landing and activity ingest landing, this gap is real.

**Fix shape:** once `ports.Agent.ProbeProcess` is wired (per `internal/ports/outbound.go:104`), the reaper should resolve the `Agent` for each session alongside the `Runtime` and stamp the two axes independently. The `decide` truth table already expresses the four combinations.

## Gap 2 — No per-tick / per-probe timeout

`backend/internal/observe/reaper/reaper.go:133-153` passes the loop's parent ctx straight through to `IsAlive`. tmux/zellij adapters honor ctx via `exec.CommandContext` (`internal/adapters/runtime/tmux/tmux.go:50`), but that ctx only cancels on daemon shutdown. There is no per-tick deadline.

**Failure mode:** a single hung `IsAlive` (NFS-backed tmux socket, runaway subprocess, kernel deadlock on the runtime binary) blocks the current tick indefinitely. Because the loop is a single goroutine with a `time.Ticker`, the next tick can't fire while the previous one is stuck — so **`TickEscalations` stops firing for every session, not just the hung one**. That defeats the whole point of the reaper heartbeat (invariant #6: a non-polling LCM relies on `TickEscalations` to wake up `30m` escalations).

Low probability on a healthy box; blast radius is "no escalations fire until daemon restart."

**Fix shape options (pick one):**
- Per-probe `context.WithTimeout(ctx, probeBudget)` inside `probeOne` (simplest; bounds each call).
- Per-tick `context.WithDeadline(ctx, now.Add(tickBudget))` in `Tick` (bounds the whole cycle).
- Run probes in bounded-concurrency goroutines with a wait group + per-probe timeout (also closes the "sequential probes" polish item).

Either of the first two is enough to prevent a stalled heartbeat; the third is a bigger refactor and only worth doing if the running set ever grows past tens of sessions.

## Scope
- One PR is fine for both — they're both in the reaper package and share the same review surface.
- Tests should cover: (#1) a `(RuntimeProbeAlive, ProcessProbeDead)` fact reaches the LCM; (#2) a probe that blocks past the budget is cancelled and the next tick still fires.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reaper: two known gaps from PR #29 (process-axis collapsed; no per-probe timeout) #32

Gap 1 — `ProcessState` is stamped equal to `RuntimeState`

Gap 2 — No per-tick / per-probe timeout

Scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

reaper: two known gaps from PR #29 (process-axis collapsed; no per-probe timeout) #32

Description

Gap 1 — ProcessState is stamped equal to RuntimeState

Gap 2 — No per-tick / per-probe timeout

Scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Gap 1 — `ProcessState` is stamped equal to `RuntimeState`