Bug
sendOnce in the lifecycle manager acquires m.react.mu at the top of the function (defer m.react.mu.Unlock()) and holds it across:
- m.loadPRSignaturesLocked (DB read)
- m.messenger.Send (shells out to zellij action write-chars, a blocking external process)
- m.persistPRSignaturesLocked (DB write)
If the zellij action write-chars command hangs (e.g., the zellij session has exited but the handle is stale, or the zellij socket is unresponsive), all PR nudge reactions for all sessions are blocked indefinitely. The SCM observer calls ApplyPRObservation → sendOnce for CI failure, review feedback, and merge conflict nudges — all of these go through the same react.mu lock.
Analyzed against: 96d1649 (current main)
Confidence: High — traced lock acquisition through the I/O call path.
Root Cause
backend/internal/lifecycle/reactions.go:348-386:
func (m *Manager) sendOnce(ctx context.Context, id domain.SessionID, prURL, key, sig, msg string, maxAttempts int) error {
	if m.messenger == nil {
		return nil
	}
	m.react.mu.Lock()
	defer m.react.mu.Unlock() // <-- held across everything below
	// ... loadPRSignaturesLocked (DB read)
	// ... dedup check
	if err := m.messenger.Send(ctx, id, msg); err != nil { // <-- zellij shell-out, can hang
		return err
	}
	// ... update in-memory state
	// ... persistPRSignaturesLocked (DB write)
}
m.messenger.Send ultimately calls zellij.Runtime.SendMessage, which executes zellij action write-chars as an external process. This can hang if:
- The zellij session has exited but the runtime handle is not yet cleaned up
- The zellij socket is temporarily unresponsive
- The system is under load and process creation is slow
Reproduction
- Start the daemon, spawn 2+ worker sessions with PR observations enabled
- Kill one worker's zellij session externally (e.g., zellij kill-session <name>)
- Wait for the SCM observer to detect a CI failure or review on any session
- Observe that the nudge for the healthy session also hangs, because m.react.mu is held by the stuck Send to the dead zellij pane
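The contention can also be shown in isolation, without zellij or the daemon. The sketch below is a simplified simulation, not the project's code: manager, its send field, and simulate are hypothetical stand-ins for the lifecycle manager, m.messenger.Send, and the reproduction steps above. It only demonstrates that one stuck send made under a shared mutex stalls an unrelated session.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// manager is a hypothetical, stripped-down stand-in for the lifecycle manager.
type manager struct {
	mu   sync.Mutex           // plays the role of m.react.mu
	send func(session string) // plays the role of m.messenger.Send
}

// sendOnce mirrors the buggy shape: the lock is held across the blocking send.
func (m *manager) sendOnce(session string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.send(session)
}

// simulate reports whether the healthy session's nudge was blocked while the
// dead session's send was stuck, and whether it was delivered once unstuck.
func simulate() (blocked, deliveredAfter bool) {
	stuck := make(chan struct{})   // closed once the dead send is in progress
	release := make(chan struct{}) // closed to let the dead send return
	m := &manager{send: func(session string) {
		if session == "dead" {
			close(stuck)
			<-release // simulates a hung external process
		}
	}}

	go m.sendOnce("dead")
	<-stuck // from here on, the lock is held by the stuck send

	healthyDone := make(chan struct{})
	go func() {
		m.sendOnce("healthy")
		close(healthyDone)
	}()

	select {
	case <-healthyDone:
	case <-time.After(100 * time.Millisecond):
		blocked = true // healthy nudge could not acquire the lock
	}

	close(release)
	select {
	case <-healthyDone:
		deliveredAfter = true
	case <-time.After(time.Second):
	}
	return blocked, deliveredAfter
}

func main() {
	blocked, deliveredAfter := simulate()
	fmt.Println(blocked, deliveredAfter) // prints "true true"
}
```

In the real daemon nothing closes the equivalent of release, so the healthy session's nudge waits for as long as the external command hangs.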
Impact
- All lifecycle PR nudges blocked: A single unresponsive zellij pane prevents CI-failure, review-feedback, and merge-conflict nudges from being delivered to any session.
- Silent degradation: No timeout on the messenger.Send call means the lock can be held indefinitely.
- Blast radius scales with sessions: The more sessions under observation, the more nudges queue behind the stuck lock.
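As a possible complementary mitigation for the missing timeout, the wait on a send could be bounded. The following is a minimal sketch under stated assumptions: sendFunc and sendWithTimeout are hypothetical names, and it is not known from this analysis whether the real Send honors context cancellation, so the wrapper races the call against the deadline instead of relying on that.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendFunc is a hypothetical stand-in for a bound m.messenger.Send call;
// the real method also takes a session ID and a message.
type sendFunc func(ctx context.Context) error

// sendWithTimeout bounds how long the caller waits for send to finish.
// If send ignores ctx, its goroutine stays parked until send returns on
// its own; only the caller is unblocked.
func sendWithTimeout(ctx context.Context, send sendFunc, d time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()

	done := make(chan error, 1) // buffered so a late send never blocks on the write
	go func() { done <- send(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// timedOut runs a send that never returns and reports whether the wrapper
// gave up with a deadline error.
func timedOut() bool {
	never := make(chan struct{})
	hang := func(ctx context.Context) error {
		<-never // simulates a hung zellij action write-chars
		return nil
	}
	err := sendWithTimeout(context.Background(), hang, 50*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut()) // prints "true"
}
```

A timeout alone only caps how long the lock is held per stuck send; other sessions would still wait for up to the full timeout each time, so it does not replace moving the send out of the critical section.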
Suggested Fix
Move m.messenger.Send outside the lock. The pattern:
- Under lock: check dedup, tentatively mark as sent, copy needed state
- Release lock
- Send message (blocking I/O, no lock held)
- Re-acquire lock: confirm the send (or roll back the tentative mark on failure)
This ensures the dedup check is still atomic while the blocking I/O doesn't hold the global reaction lock.
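The steps above could look roughly like the sketch below. It is illustrative only: reactions, mark, the inFlight map, fakeMessenger, and demo are hypothetical, the Messenger interface is simplified relative to the real one, and the DB load/persist calls and maxAttempts handling are left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Messenger is a simplified stand-in for m.messenger.
type Messenger interface {
	Send(ctx context.Context, sessionID, msg string) error
}

// mark identifies one nudge for dedup purposes (hypothetical).
type mark struct{ key, sig string }

// reactions is a simplified stand-in for m.react.
type reactions struct {
	mu       sync.Mutex
	sent     map[mark]bool // nudges confirmed as delivered
	inFlight map[mark]bool // nudges tentatively marked while Send runs
}

func newReactions() *reactions {
	return &reactions{sent: map[mark]bool{}, inFlight: map[mark]bool{}}
}

// sendOnce delivers msg at most once per (key, sig) without holding r.mu
// during the blocking Send.
func (r *reactions) sendOnce(ctx context.Context, m Messenger, sessionID, key, sig, msg string) error {
	k := mark{key, sig}

	// Under lock: dedup check and tentative mark.
	r.mu.Lock()
	if r.sent[k] || r.inFlight[k] {
		r.mu.Unlock()
		return nil
	}
	r.inFlight[k] = true
	r.mu.Unlock()

	// No lock held: blocking I/O.
	err := m.Send(ctx, sessionID, msg)

	// Re-acquire: confirm on success, roll back on failure.
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.inFlight, k)
	if err != nil {
		return err // mark rolled back, so a later observation can retry
	}
	r.sent[k] = true
	return nil
}

// fakeMessenger counts calls and optionally fails the first one.
type fakeMessenger struct {
	calls     int
	failFirst bool
}

func (f *fakeMessenger) Send(ctx context.Context, sessionID, msg string) error {
	f.calls++
	if f.failFirst && f.calls == 1 {
		return errors.New("simulated send failure")
	}
	return nil
}

// demo requests the same nudge twice and returns how many times Send ran.
func demo(failFirst bool) int {
	r := newReactions()
	f := &fakeMessenger{failFirst: failFirst}
	ctx := context.Background()
	_ = r.sendOnce(ctx, f, "s1", "ci-failure", "sig1", "CI failed")
	_ = r.sendOnce(ctx, f, "s1", "ci-failure", "sig1", "CI failed")
	return f.calls
}

func main() {
	fmt.Println(demo(false), demo(true)) // prints "1 2"
}
```

The separate inFlight set keeps a concurrent caller from sending a duplicate while the first send is still running, and rolling it back on failure preserves the ability to retry. A real change would also need to decide where the persisted signatures are written relative to the confirm step.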