Connection can block indefinitely on a stalled send (no send_timeout), defeating heartbeat self-heal and growing the process heap until OOM #85

Description

@oliver-kriska

Related to #75 — that issue is the read-side / handshake variant of the same underlying problem ("the Slipstream.Connection GenServer wedges on a blocking I/O op with no timeout and never self-heals"). This one is the write side, on an already-established connection.

What happens

Slipstream.Connection sends synchronously with no send timeout: Impl.push_message/2 → Mint.WebSocket.stream_request_body/3 → :ssl.send/2. The heartbeat takes the exact same path (push_heartbeat/1 → push_message/2).

On a half-open connection (peer stops ACKing — a brief network blip, a dead load-balancer path, an overloaded server), the kernel TCP send buffer fills and :ssl.send/2 blocks indefinitely. The Connection GenServer is now stuck inside that send and can't process its next message. Two consequences:

  1. The heartbeat self-heal never fires. The "previous heartbeat still unacked → close and reconnect" check runs on the next SendHeartbeat tick (pipeline.ex#L229-L238), but the blocked process can't reach the next tick. The one mechanism designed to recover a dead connection is starved by the very condition it exists to detect.

  2. The process heap grows without bound. Every message delivered to the blocked Connection (PubSub, pushes, timers) is copied on-heap — a blocked GenServer never reaches its receive, so nothing is moved off-heap. On an idle-ish multiplexed socket we watched a node climb from a flat baseline to ~4.6 GB in ~20 s and get OOM-killed, with the Connection process's own heap at ~2.6 GB, blocked in :ssl.send/2 → :tls_sender.call/2. The stalled send was the 30 s heartbeat, not an application push — so no app traffic is needed to trigger it.
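Both symptoms can be checked from a remote shell on the affected node. The following is a minimal sketch, assuming you already hold the pid of the Connection process; `ConnectionSnapshot` is a hypothetical helper for illustration and not part of Slipstream. It uses only `Process.info/2`.

```elixir
defmodule ConnectionSnapshot do
  # Hypothetical helper for illustration; not part of Slipstream.
  @items [:status, :message_queue_len, :total_heap_size, :current_stacktrace]

  # Returns a map of the selected process_info items, or nil if the process
  # is no longer alive. Note that :total_heap_size is reported in words.
  def take(pid) when is_pid(pid) do
    case Process.info(pid, @items) do
      nil -> nil
      info -> Map.new(info)
    end
  end

  # True when any frame of the captured stack trace is in the given module,
  # e.g. :tls_sender for a process stuck underneath :ssl.send/2.
  def blocked_in?(%{current_stacktrace: frames}, module) do
    Enum.any?(frames, fn {mod, _fun, _arity, _location} -> mod == module end)
  end
end
```

Taking two snapshots a few seconds apart and comparing `message_queue_len` and `total_heap_size` should show whether the process is still making progress; `ConnectionSnapshot.blocked_in?(snapshot, :tls_sender)` is one way to confirm it is parked in the send path described above.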

Reproduce

Same spirit as #75, write-side: connect and complete the WebSocket upgrade against a peer that then stops reading (so the send buffer fills), and keep sending (the heartbeat alone is enough). The Connection process blocks in stream_request_body/3, its heap and mailbox grow, and it never reconnects. An nc-style listener that accepts, completes the upgrade, then stops draining reproduces it with test_mode off.
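A stalled peer along those lines might look like the sketch below. It is an assumption-laden illustration, not the listener used for this report: `StalledPeer` is a hypothetical module, it speaks plain TCP (so the client would connect with ws:// and block in the `:gen_tcp` send rather than `:ssl.send/2`), and how quickly the send buffer fills depends on kernel buffer sizes and how much the client sends.

```elixir
defmodule StalledPeer do
  # Hypothetical test helper; not part of Slipstream or Mint.
  @ws_guid "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

  # Sec-WebSocket-Accept value for a client key (RFC 6455, section 4.2.2).
  def accept_key(client_key) do
    Base.encode64(:crypto.hash(:sha, client_key <> @ws_guid))
  end

  # Pulls the Sec-WebSocket-Key header value out of a raw HTTP request.
  def client_key(request) do
    [_, key] = Regex.run(~r/^sec-websocket-key:\s*(\S+)/im, request)
    key
  end

  # Accepts one connection, answers the upgrade, then never reads again.
  def run(port) do
    opts = [:binary, packet: :raw, active: false, reuseaddr: true, recbuf: 4096]
    {:ok, listener} = :gen_tcp.listen(port, opts)
    {:ok, socket} = :gen_tcp.accept(listener)
    request = read_headers(socket, "")

    :ok =
      :gen_tcp.send(socket, [
        "HTTP/1.1 101 Switching Protocols\r\n",
        "Upgrade: websocket\r\n",
        "Connection: Upgrade\r\n",
        "Sec-WebSocket-Accept: ", accept_key(client_key(request)), "\r\n\r\n"
      ])

    # Keep the socket open but stop draining it, so the client's send
    # buffer can fill up.
    Process.sleep(:infinity)
  end

  # Reads from the passive socket until the end of the request headers.
  defp read_headers(socket, acc) do
    if String.contains?(acc, "\r\n\r\n") do
      acc
    else
      {:ok, chunk} = :gen_tcp.recv(socket, 0)
      read_headers(socket, acc <> chunk)
    end
  end
end
```

Running `StalledPeer.run(4000)` in its own process and pointing a client at that port should give a peer that completes the upgrade and then goes silent; the port number is arbitrary.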

Workaround today

Slipstream already forwards mint_opts to Mint.HTTP.connect, so a bounded send timeout can be passed straight through to the transport:

connect(socket,
  uri: uri,
  mint_opts: [transport_opts: [send_timeout: 15_000, send_timeout_close: true]]
)

send_timeout bounds the :ssl.send; on expiry it returns {:error, :timeout} → push_message returns {:error, _, reason} → Slipstream routes %Events.ChannelClosed{reason: {:send_failure, reason}} → handle_disconnect/2 → reconnect/1. The self-heal is restored and the heap can't run away.

Suggested fix

Either:

  • (a) Document the mint_opts: [transport_opts: [send_timeout: …, send_timeout_close: true]] hardening prominently, and note that without it the heartbeat-timeout self-heal does not protect against a stalled send (a blocked send starves the detector). Cheapest, zero compatibility risk; or
  • (b) Default a sane send_timeout on the transport (perhaps derived from the heartbeat interval) so connections self-heal on a half-open socket out of the box.
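To make option (b) concrete, here is a rough sketch of how a default could be merged into the connect options without overriding anything the caller set. `SendTimeoutDefaults` and `with_send_timeout/2` are hypothetical names, not Slipstream API; using half the heartbeat interval is an assumption chosen only because it yields the 15_000 ms used in the workaround for a 30 s heartbeat; and the sketch assumes transport_opts is a plain keyword list.

```elixir
defmodule SendTimeoutDefaults do
  # Hypothetical helper for illustration; not part of Slipstream.

  # Adds send_timeout / send_timeout_close to mint_opts[:transport_opts]
  # unless the caller already provided them. All other options are kept.
  def with_send_timeout(connect_opts, heartbeat_interval_msec) do
    # Assumption: half the heartbeat interval is a reasonable bound.
    defaults = [
      send_timeout: div(heartbeat_interval_msec, 2),
      send_timeout_close: true
    ]

    mint_opts = Keyword.get(connect_opts, :mint_opts, [])
    transport_opts = Keyword.get(mint_opts, :transport_opts, [])

    # Keyword.merge/2 lets values from the second list win, so anything
    # the caller set takes precedence over the defaults.
    merged = Keyword.merge(defaults, transport_opts)

    Keyword.put(connect_opts, :mint_opts, Keyword.put(mint_opts, :transport_opts, merged))
  end
end
```

With a 30_000 ms heartbeat this produces `send_timeout: 15_000, send_timeout_close: true` when the caller passes no transport options, and leaves a caller-supplied `send_timeout` untouched.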

Related: #75 (read-side / handshake twin) and #74 (send buffer) share the "bound the Connection's blocking I/O" theme.

We're happy to open a PR — would you prefer the docs change (a) or the defaulted send_timeout (b)? Let us know and we'll send it.


Env: slipstream 1.2.2, mint_web_socket 1.0.5, Erlang/OTP 29.0.2, Elixir 1.20.2-otp-29.
