fix(ipc): non-blocking broadcast + 4 KiB per-event size cap#40
Closed
artyomsv wants to merge 2 commits into
Closed
fix(ipc): non-blocking broadcast + 4 KiB per-event size cap#40artyomsv wants to merge 2 commits into
artyomsv wants to merge 2 commits into
Conversation
Closes the wedge incident class. One stuck IPC client (slow TUI render, crashed MCP bridge, kernel socket buffer congestion) can no longer stall the daemon's broadcast fan-out for every other client. internal/ipc/server.go — replace the synchronous Broadcast with a per-Conn 64-slot buffered send queue and a dedicated send goroutine per conn. Send is non-blocking: if the queue overflows the offending conn is scheduled for async close and ErrSendOverflow surfaces in the broadcast log, but other conns keep receiving. Broadcast also marshals the wire frame once and shares the bytes across the per-conn queues instead of re-serializing per client. Close is sync.Once-guarded so the read and write halves can race to it safely. internal/daemon/event.go — enforce a 4 KiB cap on PaneEvent.Message and 128-byte cap per Data value at the IPC boundary (toPaneEventPayload). Truncation keeps the tail (most recent / terminal-visible content) and sets Data["truncated"]="1" so consumers can flag the event. The earlier opencode-splash wedge produced a > 1 KiB box-drawing excerpt that flooded the broadcast loop; this is the size-side defense. Tests: - TestBroadcast_SlowConnDoesNotBlockFastConn — 200 broadcasts with one conn refusing to read; assert the Broadcast loop returns within 1 s and the fast conn drains. - TestBroadcast_ContinuesAfterSlowConnDisconnects — after the slow conn is torn down, new broadcasts still reach the surviving conn. - TestToPaneEventPayload_* — truncation marker placement, tail preservation, under-cap pass-through, nil-Data no-allocation path. All green under -race.
Apply all Critical / High / Medium / Low items surfaced by the code
review.
Source changes:
- internal/ipc/server.go
* (C1) Broadcast now clones the marshaled frame via slices.Clone
before fan-out. Today no callsite mutates it, but a future
sync.Pool of bytes.Buffer would silently corrupt frames still
being read by per-conn sendLoops; the clone makes that contract
bullet-proof.
* (L4 + M3) Overflow path is CAS-guarded — the warning log fires
exactly once per slow conn and only one Close goroutine spawns,
instead of N redundant log lines and N redundant goroutines while
the conn is being reaped.
* (H1 belt-and-suspenders) sendLoop sets a 30 s SetWriteDeadline per
frame so a kernel-buffer wedge that doesn't error has a
deterministic cleanup ceiling.
* (M4) log.Printf migrated to logger.Warn / logger.Error so the
project's leveled logger controls visibility.
* (M2) Document why Send and sendFrame both check closed/overflow —
outer check is the marshal-cost fast path, inner check is the
race-safe gate next to the channel send.
* (H2) Close() doc comment clarifies pending frames are intentionally
discarded.
* (L6) Guard comment on the conns snapshot copy in Broadcast warns
against future "optimizations" that would reintroduce the wedge.
* (M6) Server.Stop comment clarifies Daemon shutdown does not rely
on a final IPC broadcast (durability lives in the on-disk
workspace.json path).
* New: Server.ConnCount() for test-friendly connect/disconnect sync.
- internal/daemon/event.go
* (H4) Reserved Data key renamed to "_quil_truncated" — the new
_quil_ namespace prevents accidental collision with caller-supplied
keys (including a literal "truncated" key that emitters may use).
* (H3) Compile-time invariant block asserts both caps strictly exceed
len(truncationMarker); a future contributor lowering either cap
below the marker length will fail to build instead of panic at
runtime.
* (Rules #1) const block reflowed; gofmt-clean.
* (L1) Marker length explained (… is 3-byte UTF-8 + "[truncated]" =
14 bytes, not 12).
* (L2) Truncation strategy doc comment names the tail-keep rationale
and points future emitters at TruncationStrategy if they need head.
* (L3) TODO comment on sync.Pool trade-off (defer until profiling
demands it; clone-then-pool is the correct sequence).
Test changes:
- internal/ipc/conn_internal_test.go (NEW, package ipc white-box)
* TestSendLoop_ExitsOnWriteError — uses net.Pipe to force a write
error mid-frame; asserts no goroutine leak (QA #1 finding).
* TestConn_CloseIdempotent — 16 concurrent Close calls, none panic,
closed flag set exactly once.
* TestConn_SendFrameAfterCloseShortCircuits — sendFrame and Send
after Close return ErrSendOverflow without touching the socket.
- internal/ipc/broadcast_resilience_test.go
* (M5) Replaced time.Sleep-after-connect with waitForConnCount
polling on the new ConnCount() accessor. No more CI flake risk
from racing the connect-registration window.
* (Rules #4) Added t.Parallel() to both tests.
* (L7) TestBroadcast_MarshalErrorLogsAndReturns covers the
previously-untested marshal-error path.
- internal/daemon/event_sizecap_test.go
* Updated existing 4 tests for the renamed _quil_truncated key.
* Added t.Parallel() to all 7 tests.
* (QA #2) TestToPaneEventPayload_ExactCapBoundary verifies the > cap
condition rejects an off-by-one refactor to >=.
* (L8) TestToPaneEventPayload_BothOverCapSetsFlagOnce — when both
Message and Data value exceed cap, the flag is set exactly once.
* NEW TestToPaneEventPayload_ReservedKeyDoesNotClobberCaller proves
the _quil_ namespace fix: a caller's literal "truncated" key
survives untouched alongside the daemon's _quil_truncated flag.
CLAUDE.md: `internal/ipc/` bullet expanded with the resilience design
(sendCh, sendLoop, CAS overflow, ConnCount, size caps) per
documentation-maintenance rule.
NewClient audit (cmd/quil/main.go, mcp.go, version_gate.go): all
callsites either defer Close or explicitly close. SetWriteDeadline in
sendLoop is the runtime safety net for future regressions.
All green under -race.
This was referenced Jun 8, 2026
Owner
Author
artyomsv
added a commit
that referenced
this pull request
Jun 8, 2026
+ #40) (#41) ## Summary Combines PR #39 (notifications informativeness) and PR #40 (broadcast hardening) into a single PR with clean linear history. Broadcast wedge defense lands first as the foundation; notification UX work builds on top. Closes #39, closes #40. ## What's in here ### Phase A — broadcast hardening (foundation) - Per-conn 64-slot buffered send goroutine; `Broadcast` is non-blocking - Marshal once, share frame across N conns with `slices.Clone` defensive copy - CAS-guarded overflow: one log, one async close per slow conn, zero broadcast stall - 4 KiB per-event Message cap + **1 KiB per-Data-value cap** (raised from 128 B to fit multi-line `data.excerpt`) - Reserved `_quil_truncated` namespace flag with compile-time invariant guard - 30 s `SetWriteDeadline` belt-and-suspenders - `Server.ConnCount()` test accessor - `logger.Warn` / `logger.Error` for new code (project leveled logger) ### Notifications informativeness - Events carry excerpts (`Message` + `data.excerpt`); sidebar renders excerpt under title - Per-pane mute (`Alt+M`, persisted in workspace.json) - Active-pane idle suppression (TUI-side filter) - `(PaneID, Title)` aggregation with `×N` badge; AddEvent updates in place + moves to front - `\r`-reset cleanup in `lastNLines` (handles OSC 0 window-title leak) - Prompt-only excerpt suppression (recognises `%`, `$`, `❯`, hostname patterns) - ANSI boundary scan also seeks ESC byte (handles claude-code TUI's no-newline regions) - Default `notification.max_events` raised 50 → 200 - Dead `[[notification_handlers]]` emits one-shot deprecation warning on plugin load ### MCP surface - New `dismiss_notifications` tool — agents can ack events - `watch_notifications` gains `since_timestamp` — closes the race between agent action and watcher registration - All events carry `data.excerpt` with full multi-line content (fits within 1 KiB Data cap) ## Commit list (8 commits, in foundation-first order) 1. `54125eb` fix(ipc): non-blocking broadcast + 4 KiB per-event size cap 2. `8b177af` fix(ipc): address review findings on PR #40 3. `af85f36` feat(notifications): excerpts, mute, MCP dismiss + since_timestamp 4. `8711c14` fix(notifications): ANSI boundary, prompt-only suppress, aggregation 5. `01d2d70` fix(notifications): seek ESC byte in ANSI trim, log idle decisions 6. `5340412` fix(notifications): CR-reset cleanup + OSC hostname leak as prompt 7. `2e38d1a` fix(daemon): raise Data value cap to 1 KiB for multi-line excerpts ## Test plan - [x] `go vet ./...` clean - [x] `go test ./...` — all 17 packages pass - [x] `go test -race ./...` — all packages pass under race detector - [x] Comprehensive code review (security-officer, code-reviewer, rules-compliance, qa) on the broadcast hardening portion; all Critical/High/Medium findings addressed in commit `8b177af` - [x] 14 new tests in scope (broadcast resilience, conn internal, size cap boundary tests, excerpt rendering, mute, aggregation, find-since, active-pane suppression) - [ ] Manual smoke (when reviewed): launch dev TUI, exercise `Alt+M` mute, verify sidebar excerpts render under titles, verify `×N` badges aggregate, kick off slow MCP client and confirm daemon stays responsive ## What's left for PR #41 - Phase B + C + D from the hook-events plan (`~/.claude/plans/velvety-bubbling-twilight.md`): the JSONL spool watcher, Claude/OpenCode hook script extensions registering the opinionated event tier, config knobs. Independent of this PR.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Standalone wedge defense — the prerequisite Phase A from the hook-driven notifications plan. Closes the class of incident where one stuck IPC client (slow TUI render, crashed MCP bridge, socket buffer congestion) stalled the daemon's broadcast for every other client.
Two coordinated changes:
Per-conn buffered send queue (
internal/ipc/server.go). EachConnowns a 64-slot send channel and a dedicated send goroutine.Sendis non-blocking — if the queue overflows the offending conn is scheduled for async close andErrSendOverflowis returned; healthy peers keep receiving.Broadcastalso marshals the wire frame once and shares the bytes across the per-conn queues instead of re-serializing per client.4 KiB per-event size cap (
internal/daemon/event.go).toPaneEventPayloadcapsMessageat 4 KiB and eachDatavalue at 128 bytes. Truncation preserves the tail (most recent / terminal-visible content) and setsData["truncated"]="1". The earlier opencode-splash incident produced a >1 KiB box-drawing excerpt that flooded the loop; this is the size-side defense.Why land this standalone
Test plan
go vet ./...cleango test ./...— all 17 packages passgo test -race ./...— all packages pass under race detectorTestBroadcast_SlowConnDoesNotBlockFastConn— 200 broadcasts with one peer refusing to read; asserts broadcast loop returns in < 1 s and the fast peer drains ≥ 50 messages within 3 s.TestBroadcast_ContinuesAfterSlowConnDisconnects— after the slow conn is overflowed + closed, new broadcasts still reach the surviving conn.TestToPaneEventPayload_MessageOverCapTruncatesAndMarks— 8 KiB Message → 4 KiB Message + truncation marker prefix +Data["truncated"]="1".TestToPaneEventPayload_DataValueOverCapTruncates— 1 KiB value capped to 128 bytes, other values pass through.TestToPaneEventPayload_WithinCapPassesThrough— happy path unchanged.TestToPaneEventPayload_NilDataNoTruncationMarker— no allocation when no truncation needed.kill -STOP), verify the daemon stays responsive and the TUI continues receiving state updates.