Skip to content

fix(ws): give runWriter exclusive ownership of provider conn writes#106

Merged
Augustas11 merged 1 commit into
mainfrom
fix/ws-writer-exclusive-ownership
Jun 18, 2026
Merged

fix(ws): give runWriter exclusive ownership of provider conn writes#106
Augustas11 merged 1 commit into
mainfrom
fix/ws-writer-exclusive-ownership

Conversation

@Augustas11

Copy link
Copy Markdown
Owner

Summary

Closes the concurrent-writer hazard on provider WS conns that PR #104's deadline-refresh fix surfaced but didn't fully resolve. runWriter is now the sole goroutine writing WS frames to a provider conn after handshake; reactive PONG/Close replies and server-initiated Close frames funnel through writeCh as pre-baked raw frames instead of racing it on the wire.

  • Hazard. wsutil.WriteServerText / gobwas.WriteFrame each issue two underlying conn.Write calls (header, then payload). Before this PR, runWriter and the read goroutine (via gobwas's ControlHandler.HandlePing/HandleClose) both wrote directly to the same conn. A PONG arriving between runWriter's header and payload writes interleaves at the WS framing boundary — invisible to -race (because net.TCPConn.Write is internally locked) and surfaces as parser failures on the receiving side. Additional racing paths: admin-reject Close (admin_endpoints.go), and the two json.Marshal failure paths in handleV1Conn/handleV2Conn after registerProviderSession.
  • Fix (Option 2 — single-writer ownership). writeCh now carries providerFrame{raw, payload}; runWriter dispatches text → wsutil.WriteServerText, raw → one conn.Write of pre-baked bytes. readClientData takes a controlReply func([]byte) error; the post-handshake path passes session.enqueueRaw, pre-handshake passes directControlReply(conn) (replaces deadlineRefreshingWriter and preserves its deadline-refresh semantics — TestProviderControlPongSurvivesStaleWriteDeadline still passes). The control handler captures gobwas's PONG/Close-echo bytes into a bytes.Buffer (with DisableSrcCiphering: true, matching the contract wsutil.ControlFrameHandler relies on with wsutil.Reader) and hands the assembled frame to controlReply. New s.closeSession(session, code, reason) enqueues a Close frame + schedules conn.Close after a 100ms drain; replaces the three post-runWriter direct-Close sites (admin reject; v1/v2 ack-marshal failure).
  • Regression test (relay_writer_serialize_test.go). 200 concurrent text sends + 200 inbound client PINGs over net.Pipe. Parses every wire frame with gobwas.ReadFrame; asserts no parse errors, expected counts, send-order preservation for text, exactly-once PONG echo per PING. Has teeth: a temporary regression shim that forced the old direct-write path failed the test 5/5 with unexpected opcode 8 (OpClose) parsed out of garbage bytes (the framing-corruption signature).

Test plan

  • go test ./internal/ws/ -race -count=1 -timeout 300s → ok 11.0s (entire ws suite, including PR fix(ws): re-arm write deadline on reactive control replies #104's TestProviderControlPongSurvivesStaleWriteDeadline)
  • New regression test under -race -count=30 → ok
  • go vet ./... / go build ./... clean module-wide
  • Teeth-check: regression shim forcing the old direct-write path → 5/5 failures with unexpected opcode 8 (framing corruption)
  • Stage soak / live-fire — gated on operator heads-up per the live-fire convention

PR #104 closed the stale-write-deadline bug on reactive PONG/Close replies,
but the underlying concurrent-writer hazard remains: runWriter (relay.go)
and the readProviderLoop goroutine (via gobwas's ControlHandler) both
write WS frames directly to the same provider conn. Each wsutil-level
frame write issues TWO underlying conn.Write calls (header, then payload).
A PONG reply arriving between the header and payload of an in-flight
inference_request lands its own header+payload mid-frame and corrupts the
WS framing for both. The Go race detector cannot see this because
net.TCPConn.Write is internally locked; the hazard lives at the WS framing
boundary, one layer up.

Additional racing paths: the admin-reject Close in admin_endpoints.go
(direct s.close + conn.Close inside a 200ms AfterFunc) and the two
ack/auth_response json.Marshal failure paths in handleV1Conn /
handleV2Conn -- all post-registerProviderSession, all racing runWriter
on the wire.

Fix (Option 2 from the original report -- single-writer ownership):

* writeCh now carries providerFrame{raw, payload}. runWriter dispatches:
  text -> wsutil.WriteServerText (header + payload, both from the single
  writer goroutine -- no inter-write window for another goroutine to
  squeeze in); raw -> one conn.Write of pre-baked bytes. New enqueueRaw
  routes reactive control replies and server-initiated Close frames
  through the same channel so they serialize behind any in-flight text
  frame instead of racing it.
* readClientData(conn, controlReply) takes a reply callback. The
  control-frame handler captures gobwas's PONG / Close-echo header+body
  into a bytes.Buffer (DisableSrcCiphering: true to match the
  wsutil.ControlFrameHandler contract with wsutil.Reader -- otherwise
  the payload gets XOR'd twice). The assembled frame is handed to
  controlReply for delivery.
* Two reply paths: handshake-phase callers (handleConn, handleV2Conn
  proof read) pass s.directControlReply(conn) -- writes straight to conn
  after re-arming the write deadline, exactly as deadlineRefreshingWriter
  did, because no providerSession / runWriter exists yet. The post-
  handshake call from readProviderLoop passes session.enqueueRaw so
  every reactive reply funnels through the single runWriter.
* New s.closeSession(session, code, reason) for post-runWriter Close
  paths: enqueues a Close frame via enqueueRaw, schedules conn.Close
  after a 100ms drain so runWriter can actually flush. Switches the
  three post-runWriter direct-Close sites: admin reject, and the two
  json.Marshal failure paths after registerProviderSession in
  handleV1Conn / handleV2Conn.
* deadlineRefreshingWriter removed -- its role is now the
  directControlReply closure for pre-handshake reads.

Regression test (relay_writer_serialize_test.go) drives 200 concurrent
text sends + 200 inbound client PINGs over net.Pipe, parses every wire
frame with gobwas.ReadFrame, and asserts: no parse errors, expected
frame counts, send-order preservation for text, exactly-once PONG echo
per PING. A temporary regression shim that forced the old direct-write
path failed the test 5/5 with "unexpected opcode 8 (OpClose)" parsed
out of garbage bytes -- the smoking gun for framing corruption -- so the
test has teeth.

Verified: go test ./internal/ws/ -race -count=1 passes (11s); new test
stress -race -count=30 passes; go vet ./... and go build ./... clean
module-wide. TestProviderControlPongSurvivesStaleWriteDeadline (the
PR #104 regression test) still passes -- directControlReply preserves
its deadline-refresh semantics for the pre-handshake path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11 Augustas11 merged commit 5e63414 into main Jun 18, 2026
5 checks passed
@Augustas11 Augustas11 deleted the fix/ws-writer-exclusive-ownership branch June 18, 2026 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant