Bug
A worker that finishes its turn and reports waiting_input is correct to show needs_input. But if its runtime pane then dies (daemon/app restart, zellij exit, crash), the session is never reconciled to exited. It stays permanently:
- status: needs_input / activity.state: waiting_input
- activity.lastActivityAt frozen at the last hook signal
- a dead terminal pane that shows nothing new (the agent process is gone)
This was observed live on agent-orchestrator-2: the worker committed and opened PR #278, went waiting_input at 08:12Z, its zellij pane later exited, and the session sat frozen at needs_input with lastActivityAt=08:12Z while the reaper probed it dead every 5s.
Source: local observation (session agent-orchestrator-2) | Reported by: @aditi1178 | Analyzed against: 7c9ae53 (lifecycle code identical on main)
Confidence: High
Reproduction
- Spawn a worker; let it finish a task so it reports waiting_input (needs_input).
- Kill its runtime pane (restart the daemon/app, or zellij pane exits/crashes).
- The reaper probes the pane dead every 5s, but the session never flips to exited.
- UI keeps showing needs_input with a stale lastActivityAt and a dead/blank terminal.
Root Cause
The OBSERVE-layer reaper correctly reports the pane as dead, but the LCM's death-arbitration gate suppresses it for sticky states.
backend/internal/observe/reaper/reaper.go — every 5s, IsAlive returns false → reports ports.ProbeDead via ApplyRuntimeObservation. Working.
backend/internal/lifecycle/manager.go:94 ApplyRuntimeObservation only marks IsTerminated + ActivityExited when runtimeClearlyDead(...) is true.
backend/internal/lifecycle/runtime.go:25-28:
func runtimeClearlyDead(f ports.RuntimeFacts, activity domain.Activity, now time.Time, window time.Duration) bool {
	observedAt := timeOr(f.ObservedAt, now)
	return f.Probe == ports.ProbeDead && !hasRecentActivity(activity, observedAt, window)
}
hasRecentActivity (runtime.go:12-23) returns true unconditionally for any sticky state:
case a.State.IsSticky(): // ActivityWaitingInput
	return true
IsSticky() (backend/internal/domain/activity.go:19-21) is true for waiting_input.
So for a waiting_input session, hasRecentActivity is always true → !hasRecentActivity is always false → runtimeClearlyDead is always false → the dead-runtime fact is discarded forever.
The sticky exemption is right for aging by passage of time ("a paused agent is still paused until a new signal says so"), but it is wrong to apply it against an authoritative dead-runtime probe: if the process is gone, the agent cannot be "waiting for input" — there is nothing to receive it. hasRecentActivity has exactly one caller (runtimeClearlyDead), so the sticky branch exists solely to (incorrectly) influence death arbitration.
Maps to all three reported symptoms
- Commits don't show in the terminal: the worker's commits are real (in the worktree git log), but its pane is dead; because the session is never marked exited, the UI still presents it as a live needs_input session with an attachable terminal that has no live PTY behind it.
- Stuck in needs_input: legit at first, then frozen by this bug.
- Activity not updated: no new hook signals (process dead) AND the reaper-driven exited transition (which would stamp lastActivityAt to the observed-dead time) is suppressed.
Fix
In runtimeClearlyDead, let an authoritative ProbeDead override the sticky exemption while still honoring the recent-activity time window (so we don't race a just-reported signal during a relaunch):
func runtimeClearlyDead(f ports.RuntimeFacts, activity domain.Activity, now time.Time, window time.Duration) bool {
	if f.Probe != ports.ProbeDead || activity.State == domain.ActivityExited {
		return false
	}
	observedAt := timeOr(f.ObservedAt, now)
	if activity.LastActivityAt.IsZero() {
		return true
	}
	// A dead runtime is authoritative; a sticky state no longer means "still
	// waiting" once the process is gone. Only protect against racing a *recent*
	// live signal via the time window; do NOT treat sticky as infinitely recent.
	return observedAt.Sub(activity.LastActivityAt) > window
}
This keeps sticky semantics for time-based aging while letting a provably-dead pane be reaped after the window. Needs a unit test: waiting_input + ProbeDead + LastActivityAt older than window ⇒ exited.
Impact
- Workers/orchestrators that pause for input and then lose their pane are stuck forever as needs_input: they pollute the session list, mislead the orchestrator/UI, and present dead terminals.
- Notification correctness: a stuck needs_input keeps signaling "needs input" for a session that can never receive it.
Related
- #241 — MarkSpawned clobbers a racing activity signal / resurrects exited sessions (same activity-vs-runtime arbitration surface).
- #265 — hung zellij pane blocks PR nudges (dead-pane handling).
- #32 — reaper known gaps.