bug(lifecycle): sendOnce holds react.mu across blocking I/O — hung zellij pane blocks all PR nudges #265

@fireddd

Bug

sendOnce in the lifecycle manager acquires m.react.mu at the top of the function (defer m.react.mu.Unlock()) and holds it across:

  1. m.loadPRSignaturesLocked — DB read
  2. m.messenger.Send — shells out to zellij action write-chars (blocking external process)
  3. m.persistPRSignaturesLocked — DB write

If the zellij action write-chars command hangs (e.g., the zellij session has exited but the handle is stale, or the zellij socket is unresponsive), all PR nudge reactions for all sessions are blocked indefinitely. The SCM observer calls ApplyPRObservation → sendOnce for CI failure, review feedback, and merge conflict nudges, and all of these go through the same react.mu lock.

Analyzed against: 96d1649 (current main)
Confidence: High — traced lock acquisition through the I/O call path.

Root Cause

backend/internal/lifecycle/reactions.go:348-386:

func (m *Manager) sendOnce(ctx context.Context, id domain.SessionID, prURL, key, sig, msg string, maxAttempts int) error {
    if m.messenger == nil {
        return nil
    }
    m.react.mu.Lock()
    defer m.react.mu.Unlock()  // <-- held across everything below

    // ... loadPRSignaturesLocked (DB read)
    // ... dedup check
    if err := m.messenger.Send(ctx, id, msg); err != nil {  // <-- zellij shell-out, can hang
        return err
    }
    // ... update in-memory state
    // ... persistPRSignaturesLocked (DB write)
}

m.messenger.Send ultimately calls zellij.Runtime.SendMessage, which executes zellij action write-chars as an external process. This can hang if:

  • The zellij session has exited but the runtime handle is not yet cleaned up
  • The zellij socket is temporarily unresponsive
  • The system is under load and process creation is slow

Reproduction

  1. Start the daemon, spawn 2+ worker sessions with PR observations enabled
  2. Kill one worker's zellij session externally (e.g., zellij kill-session <name>)
  3. Wait for the SCM observer to detect a CI failure or review on any session
  4. Observe that the nudge for the healthy session also hangs — m.react.mu is held by the stuck Send to the dead zellij pane

Impact

  • All lifecycle PR nudges blocked: A single unresponsive zellij pane prevents CI-failure, review-feedback, and merge-conflict nudges from being delivered to any session.
  • Silent degradation: No timeout on the messenger.Send call means the lock can be held indefinitely.
  • Blast radius scales with sessions: The more sessions under observation, the more nudges queue behind the stuck lock.

Suggested Fix

Move m.messenger.Send outside the lock. The pattern:

  1. Under lock: check dedup, tentatively mark as sent, copy needed state
  2. Release lock
  3. Send message (blocking I/O, no lock held)
  4. Re-acquire lock: confirm the send (or roll back the tentative mark on failure)

This ensures the dedup check is still atomic while the blocking I/O doesn't hold the global reaction lock.
