reaper: two known gaps from PR #29 (process-axis collapsed; no per-probe timeout) #32

@harshitsinghbhandari

Follow-up to #29 (feat/observe-reaper). Both gaps were flagged in review and accepted as non-blocking for that PR; tracking them here so they don't get lost.

Gap 1 — ProcessState is stamped equal to RuntimeState

backend/internal/observe/reaper/reaper.go:191-197 only invokes Runtime.IsAlive and copies the boolean into both axes:

case alive:
    facts.RuntimeState = ports.RuntimeProbeAlive
    facts.ProcessState = ports.ProcessProbeAlive
default:
    facts.RuntimeState = ports.RuntimeProbeDead
    facts.ProcessState = ports.ProcessProbeDead

Failure mode: Claude Code panics, OOMs, or segfaults inside a still-running tmux (or zellij) pane. tmux has-session still returns success, so the reaper reports Alive + Alive. The LCM's probe decider never sees the disagreement it was designed to resolve (runtime alive, process dead), so the session stays in the working state indefinitely and only a human notices.

Why it merged anyway: the LCM/SM design hands process-exit detection to activity ingest (Claude Code hooks + .ao/activity.jsonl FS watcher), which is a separate follow-up. Between #29 landing and activity ingest landing, this gap is real.

Fix shape: once ports.Agent.ProbeProcess is wired (per internal/ports/outbound.go:104), the reaper should resolve the Agent for each session alongside the Runtime and stamp the two axes independently. The decide truth table already expresses the four combinations.

Gap 2 — No per-tick / per-probe timeout

backend/internal/observe/reaper/reaper.go:133-153 passes the loop's parent ctx straight through to IsAlive. tmux/zellij adapters honor ctx via exec.CommandContext (internal/adapters/runtime/tmux/tmux.go:50), but that ctx only cancels on daemon shutdown. There is no per-tick deadline.

Failure mode: a single hung IsAlive (NFS-backed tmux socket, runaway subprocess, kernel deadlock on the runtime binary) blocks the current tick indefinitely. Because the loop is a single goroutine with a time.Ticker, the next tick can't fire while the previous one is stuck — so TickEscalations stops firing for every session, not just the hung one. That defeats the whole point of the reaper heartbeat (invariant #6: a non-polling LCM relies on TickEscalations to wake up 30m escalations).

Low probability on a healthy box; blast radius is "no escalations fire until daemon restart."

Fix shape options (pick one):

  • Per-probe context.WithTimeout(ctx, probeBudget) inside probeOne (simplest; bounds each call).
  • Per-tick context.WithDeadline(ctx, now.Add(tickBudget)) in Tick (bounds the whole cycle).
  • Run probes in bounded-concurrency goroutines with a wait group + per-probe timeout (also closes the "sequential probes" polish item).

Either of the first two is enough to prevent a stalled heartbeat; the third is a bigger refactor and only worth doing if the running set ever grows past tens of sessions.

Scope

Labels: enhancement (New feature or request), lcm-sm (Lifecycle + Session Manager lane)