Skip to content

fix(ipc): non-blocking broadcast + 4 KiB per-event size cap#40

Closed
artyomsv wants to merge 2 commits into
masterfrom
fix/broadcast-non-blocking
Closed

fix(ipc): non-blocking broadcast + 4 KiB per-event size cap#40
artyomsv wants to merge 2 commits into
masterfrom
fix/broadcast-non-blocking

Conversation

@artyomsv

@artyomsv artyomsv commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Summary

Standalone wedge defense — the prerequisite Phase A from the hook-driven notifications plan. Closes the class of incident where one stuck IPC client (slow TUI render, crashed MCP bridge, socket buffer congestion) stalled the daemon's broadcast for every other client.

Two coordinated changes:

  1. Per-conn buffered send queue (internal/ipc/server.go). Each Conn owns a 64-slot send channel and a dedicated send goroutine. Send is non-blocking — if the queue overflows the offending conn is scheduled for async close and ErrSendOverflow is returned; healthy peers keep receiving. Broadcast also marshals the wire frame once and shares the bytes across the per-conn queues instead of re-serializing per client.

  2. 4 KiB per-event size cap (internal/daemon/event.go). toPaneEventPayload caps Message at 4 KiB and each Data value at 128 bytes. Truncation preserves the tail (most recent / terminal-visible content) and sets Data["truncated"]="1". The earlier opencode-splash incident produced a >1 KiB box-drawing excerpt that flooded the loop; this is the size-side defense.

Why land this standalone

  • Independently valuable for the existing daemon — fixes the wedge we hit yesterday (SIGTERM didn't deliver because the daemon's main loop was blocked on a socket write).
  • Phase B–D of the hook-events plan add a new high-frequency event source; this PR makes the broadcast loop safe before that source lands.
  • Small, focused diff — 2 files changed + 2 test files. Reviewable in one sitting.

Test plan

  • go vet ./... clean
  • go test ./... — all 17 packages pass
  • go test -race ./... — all packages pass under race detector
  • New tests:
    • TestBroadcast_SlowConnDoesNotBlockFastConn — 200 broadcasts with one peer refusing to read; asserts broadcast loop returns in < 1 s and the fast peer drains ≥ 50 messages within 3 s.
    • TestBroadcast_ContinuesAfterSlowConnDisconnects — after the slow conn is overflowed + closed, new broadcasts still reach the surviving conn.
    • TestToPaneEventPayload_MessageOverCapTruncatesAndMarks — 8 KiB Message → 4 KiB Message + truncation marker prefix + Data["truncated"]="1".
    • TestToPaneEventPayload_DataValueOverCapTruncates — 1 KiB value capped to 128 bytes, other values pass through.
    • TestToPaneEventPayload_WithinCapPassesThrough — happy path unchanged.
    • TestToPaneEventPayload_NilDataNoTruncationMarker — no allocation when no truncation needed.
  • Manual smoke: build dev binaries, attach a TUI + an MCP bridge, simulate a slow MCP client (e.g. paused via kill -STOP), verify the daemon stays responsive and the TUI continues receiving state updates.

artyomsv added 2 commits June 8, 2026 14:12
Closes the wedge incident class. One stuck IPC client (slow TUI render,
crashed MCP bridge, kernel socket buffer congestion) can no longer
stall the daemon's broadcast fan-out for every other client.

internal/ipc/server.go — replace the synchronous Broadcast with a
per-Conn 64-slot buffered send queue and a dedicated send goroutine
per conn. Send is non-blocking: if the queue overflows the offending
conn is scheduled for async close and ErrSendOverflow surfaces in the
broadcast log, but other conns keep receiving. Broadcast also marshals
the wire frame once and shares the bytes across the per-conn queues
instead of re-serializing per client. Close is sync.Once-guarded so
the read and write halves can race to it safely.

internal/daemon/event.go — enforce a 4 KiB cap on PaneEvent.Message
and 128-byte cap per Data value at the IPC boundary (toPaneEventPayload).
Truncation keeps the tail (most recent / terminal-visible content) and
sets Data["truncated"]="1" so consumers can flag the event. The earlier
opencode-splash wedge produced a > 1 KiB box-drawing excerpt that
flooded the broadcast loop; this is the size-side defense.

Tests:
- TestBroadcast_SlowConnDoesNotBlockFastConn — 200 broadcasts with one
  conn refusing to read; assert the Broadcast loop returns within 1 s
  and the fast conn drains.
- TestBroadcast_ContinuesAfterSlowConnDisconnects — after the slow conn
  is torn down, new broadcasts still reach the surviving conn.
- TestToPaneEventPayload_* — truncation marker placement, tail
  preservation, under-cap pass-through, nil-Data no-allocation path.

All green under -race.
Apply all Critical / High / Medium / Low items surfaced by the code
review.

Source changes:

- internal/ipc/server.go
  * (C1) Broadcast now clones the marshaled frame via slices.Clone
    before fan-out. Today no callsite mutates it, but a future
    sync.Pool of bytes.Buffer would silently corrupt frames still
    being read by per-conn sendLoops; the clone makes that contract
    bullet-proof.
  * (L4 + M3) Overflow path is CAS-guarded — the warning log fires
    exactly once per slow conn and only one Close goroutine spawns,
    instead of N redundant log lines and N redundant goroutines while
    the conn is being reaped.
  * (H1 belt-and-suspenders) sendLoop sets a 30 s SetWriteDeadline per
    frame so a kernel-buffer wedge that doesn't error has a
    deterministic cleanup ceiling.
  * (M4) log.Printf migrated to logger.Warn / logger.Error so the
    project's leveled logger controls visibility.
  * (M2) Document why Send and sendFrame both check closed/overflow —
    outer check is the marshal-cost fast path, inner check is the
    race-safe gate next to the channel send.
  * (H2) Close() doc comment clarifies pending frames are intentionally
    discarded.
  * (L6) Guard comment on the conns snapshot copy in Broadcast warns
    against future "optimizations" that would reintroduce the wedge.
  * (M6) Server.Stop comment clarifies Daemon shutdown does not rely
    on a final IPC broadcast (durability lives in the on-disk
    workspace.json path).
  * New: Server.ConnCount() for test-friendly connect/disconnect sync.

- internal/daemon/event.go
  * (H4) Reserved Data key renamed to "_quil_truncated" — the new
    _quil_ namespace prevents accidental collision with caller-supplied
    keys (including a literal "truncated" key that emitters may use).
  * (H3) Compile-time invariant block asserts both caps strictly exceed
    len(truncationMarker); a future contributor lowering either cap
    below the marker length will fail to build instead of panic at
    runtime.
  * (Rules #1) const block reflowed; gofmt-clean.
  * (L1) Marker length explained (… is 3-byte UTF-8 + "[truncated]" =
    14 bytes, not 12).
  * (L2) Truncation strategy doc comment names the tail-keep rationale
    and points future emitters at TruncationStrategy if they need head.
  * (L3) TODO comment on sync.Pool trade-off (defer until profiling
    demands it; clone-then-pool is the correct sequence).

Test changes:

- internal/ipc/conn_internal_test.go (NEW, package ipc white-box)
  * TestSendLoop_ExitsOnWriteError — uses net.Pipe to force a write
    error mid-frame; asserts no goroutine leak (QA #1 finding).
  * TestConn_CloseIdempotent — 16 concurrent Close calls, none panic,
    closed flag set exactly once.
  * TestConn_SendFrameAfterCloseShortCircuits — sendFrame and Send
    after Close return ErrSendOverflow without touching the socket.

- internal/ipc/broadcast_resilience_test.go
  * (M5) Replaced time.Sleep-after-connect with waitForConnCount
    polling on the new ConnCount() accessor. No more CI flake risk
    from racing the connect-registration window.
  * (Rules #4) Added t.Parallel() to both tests.
  * (L7) TestBroadcast_MarshalErrorLogsAndReturns covers the
    previously-untested marshal-error path.

- internal/daemon/event_sizecap_test.go
  * Updated existing 4 tests for the renamed _quil_truncated key.
  * Added t.Parallel() to all 7 tests.
  * (QA #2) TestToPaneEventPayload_ExactCapBoundary verifies the > cap
    condition rejects an off-by-one refactor to >=.
  * (L8) TestToPaneEventPayload_BothOverCapSetsFlagOnce — when both
    Message and Data value exceed cap, the flag is set exactly once.
  * NEW TestToPaneEventPayload_ReservedKeyDoesNotClobberCaller proves
    the _quil_ namespace fix: a caller's literal "truncated" key
    survives untouched alongside the daemon's _quil_truncated flag.

CLAUDE.md: `internal/ipc/` bullet expanded with the resilience design
(sendCh, sendLoop, CAS overflow, ConnCount, size caps) per
documentation-maintenance rule.

NewClient audit (cmd/quil/main.go, mcp.go, version_gate.go): all
callsites either defer Close or explicitly close. SetWriteDeadline in
sendLoop is the runtime safety net for future regressions.

All green under -race.
@artyomsv

artyomsv commented Jun 8, 2026

Copy link
Copy Markdown
Owner Author

Superseded by #41 — combined with PR #39 (notifications informativeness). All review findings here are preserved in the combined branch.

@artyomsv artyomsv closed this Jun 8, 2026
artyomsv added a commit that referenced this pull request Jun 8, 2026
 + #40) (#41)

## Summary

Combines PR #39 (notifications informativeness) and PR #40 (broadcast
hardening) into a single PR with clean linear history. Broadcast wedge
defense lands first as the foundation; notification UX work builds on
top.

Closes #39, closes #40.

## What's in here

### Phase A — broadcast hardening (foundation)
- Per-conn 64-slot buffered send goroutine; `Broadcast` is non-blocking
- Marshal once, share frame across N conns with `slices.Clone` defensive
copy
- CAS-guarded overflow: one log, one async close per slow conn, zero
broadcast stall
- 4 KiB per-event Message cap + **1 KiB per-Data-value cap** (raised
from 128 B to fit multi-line `data.excerpt`)
- Reserved `_quil_truncated` namespace flag with compile-time invariant
guard
- 30 s `SetWriteDeadline` belt-and-suspenders
- `Server.ConnCount()` test accessor
- `logger.Warn` / `logger.Error` for new code (project leveled logger)

### Notifications informativeness
- Events carry excerpts (`Message` + `data.excerpt`); sidebar renders
excerpt under title
- Per-pane mute (`Alt+M`, persisted in workspace.json)
- Active-pane idle suppression (TUI-side filter)
- `(PaneID, Title)` aggregation with `×N` badge; AddEvent updates in
place + moves to front
- `\r`-reset cleanup in `lastNLines` (handles OSC 0 window-title leak)
- Prompt-only excerpt suppression (recognises `%`, `$`, `❯`, hostname
patterns)
- ANSI boundary scan also seeks ESC byte (handles claude-code TUI's
no-newline regions)
- Default `notification.max_events` raised 50 → 200
- Dead `[[notification_handlers]]` emits one-shot deprecation warning on
plugin load

### MCP surface
- New `dismiss_notifications` tool — agents can ack events
- `watch_notifications` gains `since_timestamp` — closes the race
between agent action and watcher registration
- All events carry `data.excerpt` with full multi-line content (fits
within 1 KiB Data cap)

## Commit list (8 commits, in foundation-first order)

1. `54125eb` fix(ipc): non-blocking broadcast + 4 KiB per-event size cap
2. `8b177af` fix(ipc): address review findings on PR #40
3. `af85f36` feat(notifications): excerpts, mute, MCP dismiss +
since_timestamp
4. `8711c14` fix(notifications): ANSI boundary, prompt-only suppress,
aggregation
5. `01d2d70` fix(notifications): seek ESC byte in ANSI trim, log idle
decisions
6. `5340412` fix(notifications): CR-reset cleanup + OSC hostname leak as
prompt
7. `2e38d1a` fix(daemon): raise Data value cap to 1 KiB for multi-line
excerpts

## Test plan

- [x] `go vet ./...` clean
- [x] `go test ./...` — all 17 packages pass
- [x] `go test -race ./...` — all packages pass under race detector
- [x] Comprehensive code review (security-officer, code-reviewer,
rules-compliance, qa) on the broadcast hardening portion; all
Critical/High/Medium findings addressed in commit `8b177af`
- [x] 14 new tests in scope (broadcast resilience, conn internal, size
cap boundary tests, excerpt rendering, mute, aggregation, find-since,
active-pane suppression)
- [ ] Manual smoke (when reviewed): launch dev TUI, exercise `Alt+M`
mute, verify sidebar excerpts render under titles, verify `×N` badges
aggregate, kick off slow MCP client and confirm daemon stays responsive

## What's left for PR #41

- Phase B + C + D from the hook-events plan
(`~/.claude/plans/velvety-bubbling-twilight.md`): the JSONL spool
watcher, Claude/OpenCode hook script extensions registering the
opinionated event tier, config knobs. Independent of this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant