Skip to content

fix(ws): re-arm write deadline on reactive control replies#104

Merged
Augustas11 merged 1 commit into
mainfrom
fix/ws-stale-write-deadline
Jun 18, 2026
Merged

fix(ws): re-arm write deadline on reactive control replies#104
Augustas11 merged 1 commit into
mainfrom
fix/ws-stale-write-deadline

Conversation

@Augustas11

Copy link
Copy Markdown
Owner

Summary

  • runWriter sets an absolute write deadline only at the top of each writeCh send, and socket write deadlines are never cleared after a successful write. Once a provider goes idle, that deadline lapses. The next keepalive PING the provider sends drives gobwas's HandlePing to write a PONG synchronously from the read goroutine (readClientData) directly to the conn, inheriting the stale, already-expired deadline. The PONG write fails with i/o timeout, which readProviderLoop surfaces as a read failure and tears the session down — disconnecting an idle-but-healthy provider on its keepalive cadence. HandleClose has the same direct-write pattern, so coordinator-initiated Close frames during an idle gap fail identically.
  • Fix: route the control handler's reply destination through a small deadlineRefreshingWriter so every reactive PONG/Close write re-arms the write deadline first. The Reader still reads from the raw conn; only reply writes are wrapped, so this covers both the explicit control-frame branch and the OnIntermediate path, and works uniformly for the handshake-phase readClientData callers that have no writeCh.
  • New regression test TestProviderControlPongSurvivesStaleWriteDeadline opens a provider session with a 50ms write limit, drives a real runWriter send to set the deadline exactly as production does, lets it lapse, sends a client PING, and asserts the session survives (PONG returns, no read-loop teardown), then re-arms the stale deadline twice to confirm sustained idle-keepalive survival.

Test plan

  • go build ./... clean
  • go vet ./internal/ws/ clean
  • go test ./internal/ws/ -race -count=1 green (full suite, 11.3s under race)
  • New regression test verified red→green: passes with fix, fails with read pipe: i/o timeout ("session torn down by stale write deadline?") when the wrapped Dst is temporarily reverted
  • Stress run: regression test x10 under -race — non-flaky

Scope notes

  • No public API, wire-protocol, or config-key change (ws.write_timeout_s semantics unchanged).
  • Closes a divergence between implementation and the documented FR-P11 / FR-P19 contracts in specs/SPEC-002-coordinator.md; no spec edits needed.
  • Same pattern as previous sibling WS bug-fix commits (591a0e2 receive-loop drain, fbae02b sendPing continuation) — code-only.
  • Latent follow-up flagged separately: serialize all provider WS writes through runWriter (current fix re-arms the deadline; it does not eliminate the concurrent-writer hazard between runWriter and reactive control replies — the race detector cannot see WS-frame-level interleaving because net.Conn.Write is internally locked).

runWriter sets an absolute write deadline only at the top of each writeCh
send, and socket write deadlines are never cleared after a successful write.
Once a provider goes idle, that deadline lapses. The next keepalive PING the
provider sends drives gobwas's HandlePing to write a PONG synchronously from
the read goroutine (readClientData) directly to the conn, inheriting the
stale, already-expired deadline. The PONG write fails with `i/o timeout`,
which readProviderLoop surfaces as a read failure and tears the session down,
disconnecting an idle-but-healthy provider on its keepalive cadence.
HandleClose has the same direct-write pattern, so coordinator-initiated Close
frames during an idle gap fail identically.

Route the control handler's reply destination through deadlineRefreshingWriter
so every reactive PONG/Close write re-arms the write deadline first. The Reader
still reads from the raw conn; only reply writes are wrapped, so this covers
both the explicit control-frame branch and the OnIntermediate path, and works
uniformly for the handshake-phase readClientData callers that have no writeCh.

Add a regression test that opens a provider session, lets the writer's deadline
lapse with no server-initiated write, sends a client PING, and asserts the
session survives (PONG returns, no read-loop teardown).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Augustas11 Augustas11 merged commit 688652e into main Jun 18, 2026
5 checks passed
@Augustas11 Augustas11 deleted the fix/ws-stale-write-deadline branch June 18, 2026 06:02
Augustas11 added a commit that referenced this pull request Jun 18, 2026
…106)

PR #104 closed the stale-write-deadline bug on reactive PONG/Close replies,
but the underlying concurrent-writer hazard remains: runWriter (relay.go)
and the readProviderLoop goroutine (via gobwas's ControlHandler) both
write WS frames directly to the same provider conn. Each wsutil-level
frame write issues TWO underlying conn.Write calls (header, then payload).
A PONG reply arriving between the header and payload of an in-flight
inference_request lands its own header+payload mid-frame and corrupts the
WS framing for both. The Go race detector cannot see this because
net.TCPConn.Write is internally locked; the hazard lives at the WS framing
boundary, one layer up.

Additional racing paths: the admin-reject Close in admin_endpoints.go
(direct s.close + conn.Close inside a 200ms AfterFunc) and the two
ack/auth_response json.Marshal failure paths in handleV1Conn /
handleV2Conn -- all post-registerProviderSession, all racing runWriter
on the wire.

Fix (Option 2 from the original report -- single-writer ownership):

* writeCh now carries providerFrame{raw, payload}. runWriter dispatches:
  text -> wsutil.WriteServerText (header + payload, both from the single
  writer goroutine -- no inter-write window for another goroutine to
  squeeze in); raw -> one conn.Write of pre-baked bytes. New enqueueRaw
  routes reactive control replies and server-initiated Close frames
  through the same channel so they serialize behind any in-flight text
  frame instead of racing it.
* readClientData(conn, controlReply) takes a reply callback. The
  control-frame handler captures gobwas's PONG / Close-echo header+body
  into a bytes.Buffer (DisableSrcCiphering: true to match the
  wsutil.ControlFrameHandler contract with wsutil.Reader -- otherwise
  the payload gets XOR'd twice). The assembled frame is handed to
  controlReply for delivery.
* Two reply paths: handshake-phase callers (handleConn, handleV2Conn
  proof read) pass s.directControlReply(conn) -- writes straight to conn
  after re-arming the write deadline, exactly as deadlineRefreshingWriter
  did, because no providerSession / runWriter exists yet. The post-
  handshake call from readProviderLoop passes session.enqueueRaw so
  every reactive reply funnels through the single runWriter.
* New s.closeSession(session, code, reason) for post-runWriter Close
  paths: enqueues a Close frame via enqueueRaw, schedules conn.Close
  after a 100ms drain so runWriter can actually flush. Switches the
  three post-runWriter direct-Close sites: admin reject, and the two
  json.Marshal failure paths after registerProviderSession in
  handleV1Conn / handleV2Conn.
* deadlineRefreshingWriter removed -- its role is now the
  directControlReply closure for pre-handshake reads.

Regression test (relay_writer_serialize_test.go) drives 200 concurrent
text sends + 200 inbound client PINGs over net.Pipe, parses every wire
frame with gobwas.ReadFrame, and asserts: no parse errors, expected
frame counts, send-order preservation for text, exactly-once PONG echo
per PING. A temporary regression shim that forced the old direct-write
path failed the test 5/5 with "unexpected opcode 8 (OpClose)" parsed
out of garbage bytes -- the smoking gun for framing corruption -- so the
test has teeth.

Verified: go test ./internal/ws/ -race -count=1 passes (11s); new test
stress -race -count=30 passes; go vet ./... and go build ./... clean
module-wide. TestProviderControlPongSurvivesStaleWriteDeadline (the
PR #104 regression test) still passes -- directControlReply preserves
its deadline-refresh semantics for the pre-handshake path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant