Related to #75 — that issue is the read-side / handshake variant of the same underlying problem ("the Slipstream.Connection GenServer wedges on a blocking I/O op with no timeout and never self-heals"). This one is the write side, on an already-established connection.
What happens
Slipstream.Connection sends synchronously with no send timeout: Impl.push_message/2 → Mint.WebSocket.stream_request_body/3 → :ssl.send/2. The heartbeat takes the exact same path (push_heartbeat/1 → push_message/2).
On a half-open connection (peer stops ACKing — a brief network blip, a dead load-balancer path, an overloaded server), the kernel TCP send buffer fills and :ssl.send/2 blocks indefinitely. The Connection GenServer is now stuck inside that send and can't process its next message. Two consequences:
- The heartbeat self-heal never fires. The "previous heartbeat still unacked → close and reconnect" check runs on the next SendHeartbeat tick (pipeline.ex#L229-L238), but the blocked process can't reach the next tick. The one mechanism designed to recover a dead connection is starved by the very condition it exists to detect.
- The process heap grows without bound. Every message delivered to the blocked Connection (PubSub, pushes, timers) is copied on-heap: a blocked GenServer never reaches its receive, so nothing is moved off-heap. On an idle-ish multiplexed socket we watched a node climb from a flat baseline to ~4.6 GB in ~20 s and get OOM-killed, with the Connection process's own heap at ~2.6 GB, blocked in :ssl.send/2 → :tls_sender.call/2. The stalled send was the 30 s heartbeat, not an application push, so no app traffic is needed to trigger it.
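The starvation described in these two points is not specific to Slipstream; any GenServer stuck inside a callback behaves the same way. A minimal sketch, assuming a hypothetical BlockedServer module and using Process.sleep/1 as a stand-in for a send with no timeout (none of these names come from Slipstream):

```elixir
defmodule BlockedServer do
  # Hypothetical illustration only; not Slipstream code.
  use GenServer

  def start_link(block_ms), do: GenServer.start_link(__MODULE__, block_ms)

  @impl true
  def init(block_ms) do
    # Queue the "send" first, then schedule the check that is meant to detect a stall.
    send(self(), :blocking_send)
    Process.send_after(self(), :heartbeat_check, 10)
    {:ok, %{block_ms: block_ms, checks: 0}}
  end

  @impl true
  def handle_info(:blocking_send, state) do
    # Stand-in for a transport send that does not return.
    Process.sleep(state.block_ms)
    {:noreply, state}
  end

  def handle_info(:heartbeat_check, state) do
    # Never reached while the callback above is still running.
    {:noreply, %{state | checks: state.checks + 1}}
  end

  def handle_info(_other, state), do: {:noreply, state}
end

{:ok, pid} = BlockedServer.start_link(2_000)
Enum.each(1..1_000, &send(pid, {:app_message, &1}))
Process.sleep(100)

# The 10 ms timer has fired by now, but its message sits in the mailbox
# behind the blocked callback, together with the application messages.
IO.inspect(Process.info(pid, :message_queue_len))
```

The reported queue length should be at least 1_000 while the callback is blocked: the check message is delivered on time but cannot be handled until the "send" returns.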
Reproduce
Same spirit as #75, write-side: connect and complete the WebSocket upgrade against a peer that then stops reading (so the send buffer fills), and keep sending (the heartbeat alone is enough). The Connection process blocks in stream_request_body/3, its heap and mailbox grow, and it never reconnects. An nc-style listener that accepts, completes the upgrade, then stops draining reproduces it with test_mode off.
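For a transport-level view of the same stall without Slipstream, TLS, or a WebSocket upgrade, here is a rough sketch using plain :gen_tcp on loopback. The StalledSendDemo module and its function names are hypothetical, and how many bytes fit before the stall depends on OS buffer sizes. The peer accepts and then never reads; the client keeps sending with a bounded send_timeout so the demo returns instead of hanging:

```elixir
defmodule StalledSendDemo do
  # Hypothetical illustration only; plain TCP stands in for :ssl here.
  @chunk :binary.copy(<<0>>, 64 * 1024)

  def run(send_timeout_ms \\ 500) do
    {:ok, listen} =
      :gen_tcp.listen(0, [:binary, active: false, ip: {127, 0, 0, 1}, recbuf: 4096])

    {:ok, port} = :inet.port(listen)

    {:ok, client} =
      :gen_tcp.connect({127, 0, 0, 1}, port, [
        :binary,
        active: false,
        sndbuf: 4096,
        # Without these two options the send below would block indefinitely.
        send_timeout: send_timeout_ms,
        send_timeout_close: true
      ])

    # Accept the connection but never read from it (passive mode, no recv).
    {:ok, _peer} = :gen_tcp.accept(listen)

    result = fill_until_error(client, 0)
    :gen_tcp.close(listen)
    result
  end

  # Keep sending until the buffers are full and the send reports an error.
  defp fill_until_error(sock, sent) do
    case :gen_tcp.send(sock, @chunk) do
      :ok -> fill_until_error(sock, sent + byte_size(@chunk))
      {:error, reason} -> {:error, reason, sent}
    end
  end
end

IO.inspect(StalledSendDemo.run(500))
```

With the two send_timeout options removed, the last send would be expected to stay blocked, which is the condition the Connection process ends up in.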
Workaround today
Slipstream already forwards mint_opts to Mint.HTTP.connect, so a bounded send timeout can be passed straight through to the transport:
    connect(socket,
      uri: uri,
      mint_opts: [transport_opts: [send_timeout: 15_000, send_timeout_close: true]]
    )
send_timeout bounds the :ssl.send; on expiry it returns {:error, :timeout} → push_message returns {:error, _, reason} → Slipstream routes %Events.ChannelClosed{reason: {:send_failure, reason}} → handle_disconnect/2 → reconnect/1. The self-heal is restored and the heap can't run away.
Suggested fix
Either:
- (a) Document the mint_opts: [transport_opts: [send_timeout: …, send_timeout_close: true]] hardening prominently, and note that without it the heartbeat-timeout self-heal does not protect against a stalled send (a blocked send starves the detector). Cheapest, zero compatibility risk; or
- (b) Default a sane send_timeout on the transport (perhaps derived from the heartbeat interval) so connections self-heal on a half-open socket out of the box.
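To make option (b) concrete, here is one possible shape for the defaulting logic. It is a sketch under assumptions, not a proposed patch: SendTimeoutDefaults and put_default_send_timeout/1 are hypothetical names, and using half the heartbeat interval is only one plausible derivation (it matches the 15 s used in the workaround for a 30 s heartbeat). Values the caller sets explicitly are left alone:

```elixir
defmodule SendTimeoutDefaults do
  # Hypothetical helper; not part of Slipstream.
  @default_heartbeat_ms 30_000

  # Takes connect options as a keyword list and fills in send_timeout and
  # send_timeout_close under mint_opts -> transport_opts when they are absent.
  # Assumes transport_opts, if given, is a plain keyword list.
  def put_default_send_timeout(opts) when is_list(opts) do
    heartbeat_ms = Keyword.get(opts, :heartbeat_interval_ms, @default_heartbeat_ms)

    # Assumption: half the heartbeat interval is a reasonable bound.
    defaults = [send_timeout: div(heartbeat_ms, 2), send_timeout_close: true]

    # This sketch does not reproduce any library default for :mint_opts.
    mint_opts = Keyword.get(opts, :mint_opts, [])
    transport_opts = Keyword.get(mint_opts, :transport_opts, [])

    # Keyword.merge/2 prefers the second list, so caller-supplied values win.
    merged = Keyword.merge(defaults, transport_opts)

    Keyword.put(opts, :mint_opts, Keyword.put(mint_opts, :transport_opts, merged))
  end
end

opts =
  SendTimeoutDefaults.put_default_send_timeout(
    uri: "wss://example.test/socket/websocket",
    mint_opts: [transport_opts: [send_timeout: 5_000]]
  )

# The explicit 5_000 is kept; only send_timeout_close is filled in.
IO.inspect(opts[:mint_opts][:transport_opts])
```

Deriving the bound from the heartbeat interval keeps a stalled heartbeat send from outliving the interval that is supposed to detect it.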
Related: #75 (read-side / handshake twin) and #74 (send buffer) share the "bound the Connection's blocking I/O" theme.
We're happy to open a PR — would you prefer the docs change (a) or the defaulted send_timeout (b)? Let us know and we'll send it.
Env: slipstream 1.2.2, mint_web_socket 1.0.5, Erlang/OTP 29.0.2, Elixir 1.20.2-otp-29.