fix(ws): re-arm write deadline on reactive control replies#104
Merged
Conversation
runWriter sets an absolute write deadline only at the top of each writeCh send, and socket write deadlines are never cleared after a successful write. Once a provider goes idle, that deadline lapses. The next keepalive PING the provider sends drives gobwas's HandlePing to write a PONG synchronously from the read goroutine (readClientData) directly to the conn, inheriting the stale, already-expired deadline. The PONG write fails with `i/o timeout`, which readProviderLoop surfaces as a read failure and tears the session down, disconnecting an idle-but-healthy provider on its keepalive cadence. HandleClose has the same direct-write pattern, so coordinator-initiated Close frames during an idle gap fail identically. Route the control handler's reply destination through deadlineRefreshingWriter so every reactive PONG/Close write re-arms the write deadline first. The Reader still reads from the raw conn; only reply writes are wrapped, so this covers both the explicit control-frame branch and the OnIntermediate path, and works uniformly for the handshake-phase readClientData callers that have no writeCh. Add a regression test that opens a provider session, lets the writer's deadline lapse with no server-initiated write, sends a client PING, and asserts the session survives (PONG returns, no read-loop teardown). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5 tasks
Augustas11
added a commit
that referenced
this pull request
Jun 18, 2026
…106) PR #104 closed the stale-write-deadline bug on reactive PONG/Close replies, but the underlying concurrent-writer hazard remains: runWriter (relay.go) and the readProviderLoop goroutine (via gobwas's ControlHandler) both write WS frames directly to the same provider conn. Each wsutil-level frame write issues TWO underlying conn.Write calls (header, then payload). A PONG reply arriving between the header and payload of an in-flight inference_request lands its own header+payload mid-frame and corrupts the WS framing for both. The Go race detector cannot see this because net.TCPConn.Write is internally locked; the hazard lives at the WS framing boundary, one layer up. Additional racing paths: the admin-reject Close in admin_endpoints.go (direct s.close + conn.Close inside a 200ms AfterFunc) and the two ack/auth_response json.Marshal failure paths in handleV1Conn / handleV2Conn -- all post-registerProviderSession, all racing runWriter on the wire. Fix (Option 2 from the original report -- single-writer ownership): * writeCh now carries providerFrame{raw, payload}. runWriter dispatches: text -> wsutil.WriteServerText (header + payload, both from the single writer goroutine -- no inter-write window for another goroutine to squeeze in); raw -> one conn.Write of pre-baked bytes. New enqueueRaw routes reactive control replies and server-initiated Close frames through the same channel so they serialize behind any in-flight text frame instead of racing it. * readClientData(conn, controlReply) takes a reply callback. The control-frame handler captures gobwas's PONG / Close-echo header+body into a bytes.Buffer (DisableSrcCiphering: true to match the wsutil.ControlFrameHandler contract with wsutil.Reader -- otherwise the payload gets XOR'd twice). The assembled frame is handed to controlReply for delivery. * Two reply paths: handshake-phase callers (handleConn, handleV2Conn proof read) pass s.directControlReply(conn) -- writes straight to conn after re-arming the write deadline, exactly as deadlineRefreshingWriter did, because no providerSession / runWriter exists yet. The post- handshake call from readProviderLoop passes session.enqueueRaw so every reactive reply funnels through the single runWriter. * New s.closeSession(session, code, reason) for post-runWriter Close paths: enqueues a Close frame via enqueueRaw, schedules conn.Close after a 100ms drain so runWriter can actually flush. Switches the three post-runWriter direct-Close sites: admin reject, and the two json.Marshal failure paths after registerProviderSession in handleV1Conn / handleV2Conn. * deadlineRefreshingWriter removed -- its role is now the directControlReply closure for pre-handshake reads. Regression test (relay_writer_serialize_test.go) drives 200 concurrent text sends + 200 inbound client PINGs over net.Pipe, parses every wire frame with gobwas.ReadFrame, and asserts: no parse errors, expected frame counts, send-order preservation for text, exactly-once PONG echo per PING. A temporary regression shim that forced the old direct-write path failed the test 5/5 with "unexpected opcode 8 (OpClose)" parsed out of garbage bytes -- the smoking gun for framing corruption -- so the test has teeth. Verified: go test ./internal/ws/ -race -count=1 passes (11s); new test stress -race -count=30 passes; go vet ./... and go build ./... clean module-wide. TestProviderControlPongSurvivesStaleWriteDeadline (the PR #104 regression test) still passes -- directControlReply preserves its deadline-refresh semantics for the pre-handshake path. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
runWritersets an absolute write deadline only at the top of eachwriteChsend, and socket write deadlines are never cleared after a successful write. Once a provider goes idle, that deadline lapses. The next keepalive PING the provider sends drives gobwas'sHandlePingto write a PONG synchronously from the read goroutine (readClientData) directly to the conn, inheriting the stale, already-expired deadline. The PONG write fails withi/o timeout, whichreadProviderLoopsurfaces as a read failure and tears the session down — disconnecting an idle-but-healthy provider on its keepalive cadence.HandleClosehas the same direct-write pattern, so coordinator-initiated Close frames during an idle gap fail identically.deadlineRefreshingWriterso every reactive PONG/Close write re-arms the write deadline first. TheReaderstill reads from the raw conn; only reply writes are wrapped, so this covers both the explicit control-frame branch and theOnIntermediatepath, and works uniformly for the handshake-phasereadClientDatacallers that have nowriteCh.TestProviderControlPongSurvivesStaleWriteDeadlineopens a provider session with a 50ms write limit, drives a realrunWritersend to set the deadline exactly as production does, lets it lapse, sends a client PING, and asserts the session survives (PONG returns, no read-loop teardown), then re-arms the stale deadline twice to confirm sustained idle-keepalive survival.Test plan
go build ./...cleango vet ./internal/ws/cleango test ./internal/ws/ -race -count=1green (full suite, 11.3s under race)read pipe: i/o timeout("session torn down by stale write deadline?") when the wrappedDstis temporarily reverted-race— non-flakyScope notes
ws.write_timeout_ssemantics unchanged).specs/SPEC-002-coordinator.md; no spec edits needed.runWriter(current fix re-arms the deadline; it does not eliminate the concurrent-writer hazard betweenrunWriterand reactive control replies — the race detector cannot see WS-frame-level interleaving becausenet.Conn.Writeis internally locked).