fix(slack): gate gateway self-heal on socket liveness, not event recency#65
fix(slack): gate gateway self-heal on socket liveness, not event recency#65kwliang1 wants to merge 1 commit into
Conversation
The Socket Mode self-heal watchdog reconnected whenever no business event (message/mention/button) arrived within STALE_THRESHOLD_MS (180s). On an idle bot that fired every ~3-4 min, needlessly reconnecting a healthy connection -- and since reconnect stops then restarts the Bolt app, it briefly stacked concurrent Socket Mode WebSockets, which Slack round-robins events across, starving each one and sustaining the loop. Gate the watchdog on the SocketModeClient's actual connection state: hook its connected/reconnecting/disconnected lifecycle (the client self-pings every 5s, times out the server at 30s, and auto-reconnects), and only force a reconnect once the socket has been continuously disconnected past the threshold. An idle-but-connected socket is left alone. - Generation-guard the transport listeners so a superseded client (Bolt's app.stop() tears the socket down async and never removes our listeners) can't write the shared flags for a connection we already replaced. - Track disconnectedSince (continuous-down duration) instead of time-since-last-event, so a self-healing reconnect blip never trips it. - forceReconnect now holds the same reconnecting mutex as reconnect(), so the health tick can't run a second concurrent start(). - Withhold the heartbeat-file write once stale and drop reconnect()'s catch-path write, so a wedged/failing-reconnect daemon lets the file age and watchdog.sh restarts it (defense-in-depth preserved); keep the network-down write to avoid a restart-storm during a real outage. - Fall back to event-time staleness if Bolt internals change and the client can't be hooked. Verified live: 0 stale-reconnects over multiple idle windows (old code logged ~1-2 reconnect cycles per ~4 min). Idle data-frame cadence measured at ~1 frame / 12+ min, confirming a frame/event-based staleness backstop would false-fire -- so transport liveness is delegated to socket-mode's ping keepalive, with watchdog.sh and exit-after-MAX as process-level backstops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sf8193
left a comment
There was a problem hiding this comment.
Reviewed by sp-reviewer (architecture) and typescript-reviewer (correctness) in parallel.
Core change is sound — keying staleness off socket lifecycle events instead of business-event recency eliminates the false-positive reconnect-storm on idle-but-healthy connections. Generation guard is a clean solution to stale-listener stomping. Heartbeat withholding / refresh policy changes are well-reasoned.
4 inline comments below (0 blockers, 3 should-fix, 1 nit).
| private socketConnected = false | ||
| private transportHooked = false | ||
| private socketGeneration = 0 | ||
| private disconnectedSince: number | null = null |
There was a problem hiding this comment.
should-fix (sp-reviewer): discord-gateway.ts delegates heartbeat/disconnect/reconnect-gap tracking to the shared GatewayHealth class (daemon/gateway-health.ts), which was designed for multi-gateway adoption (line 10: "SlackGateway can adopt it later"). This PR introduces a parallel socketConnected / disconnectedSince / touchHeartbeat / writeHeartbeat state machine that is semantically identical to GatewayHealth.markAlive() / markDisconnected() / markReconnected(). The transport-hook approach is correct, but the lifecycle flags should flow through GatewayHealth so heartbeat-write, outage-gap, and disconnect-tracking logic is single-sourced rather than drifting across two implementations.
| this.attachTransportHeartbeat() | ||
|
|
||
| await this.app.start() | ||
| // socketConnected/disconnectedSince are set by the generation-guarded |
There was a problem hiding this comment.
should-fix (ts-reviewer): Post-start() correctness depends on a Bolt behavioral contract: that the connected event fires before start() resolves. If a Bolt upgrade changes this ordering, socketConnected stays false and disconnectedSince points to a prior disconnect, causing immediate false-positive stale detection on the next health tick.
Consider adding a defensive this.socketConnected = true; this.disconnectedSince = null after start() resolves (or at minimum an assertion), so the invariant doesn't silently break on a dependency bump. The generation guard makes this safe — a stale connected event from a dead socket would be filtered, but a missing event from the live socket would leave the flags wrong.
| // few minutes, briefly stacking concurrent WebSockets. | ||
| private attachTransportHeartbeat(): void { | ||
| try { | ||
| const client = (this.app as any)?.receiver?.client |
There was a problem hiding this comment.
should-fix (flagged by both): (this.app as any)?.receiver?.client — the fallback path handles degradation correctly, but each start() → attachTransportHeartbeat() cycle accumulates generation-guarded-but-never-removed listeners on the prior client object. The old App is stop()'d and replaced so the listeners should be GC'd with it, but if stop() doesn't fully tear down (e.g., async teardown lag during a reconnect storm), unbounded listener growth could trigger Node/Bun MaxListenersExceededWarning. Consider tracking the listeners and removing them in stop(), or calling client.removeAllListeners() before attaching new ones.
| this.setHealthCheckInterval(backoffMs) | ||
| process.stderr.write(`slack gateway: reconnect attempt ${this.reconnectAttempts}/${MAX_RECONNECT_ATTEMPTS} failed, next check in ${Math.round(backoffMs / 1000)}s: ${err}\n`) | ||
| this.writeHeartbeat() | ||
| // Deliberately do NOT refresh the heartbeat file here: the network is up but |
There was a problem hiding this comment.
nit (ts-reviewer): Removing writeHeartbeat() here creates two competing restart mechanisms: in-process exit(1) after MAX_RECONNECT_ATTEMPTS and the external watchdog. The comment explains the rationale well. Just confirming the interaction: watchdog's STALE_SECONDS (300s in watchdog.sh:25) matches RECONNECT_BACKOFF_CAP_MS (5 × 60000 = 300s), so after ~2 failed reconnects the watchdog should intervene. This seems intentional — but documenting which mechanism is expected to fire first (and that the race is benign) would help future readers.
Problem
The Slack gateway's Socket Mode connection cycled
connection stale (~239s) → reconnectevery ~3–4 minutes on an idle bot. Root cause: the self-heal watchdog (startHealthCheck) judged liveness by business-event recency (lastEventAt, bumped only on message/mention/button), not transport health. WithSTALE_THRESHOLD_MS=180s, a normal quiet period was misread as a dead connection and force-reconnected. Each reconnect stops+restarts the Bolt app, briefly stacking concurrent Socket Mode WebSockets (observed up to 10 ESTABLISHED TLS conns, settling to 5); Slack round-robins events across all live sockets, starving each one and sustaining the loop.Fix
Gate the watchdog on the
SocketModeClient's real connection state:connected/reconnecting/disconnectedlifecycle (the client self-pings every 5s, times out the server at 30s, and auto-reconnects).disconnectedSince); a self-healing blip clears it before it trips.app.stop()tears the socket down async and never removes listeners) can't mutate the shared flags for a connection we already replaced.forceReconnectholds thereconnectingmutex so the health tick can't run a second concurrentstart().reconnect()'s catch-path write) so a wedged/failing-reconnect daemon lets the file age andwatchdog.shrestarts it; keep the network-down write (restart-storm avoidance, symmetric withwatchdog.sh).Verification
forceReconnect/tick-reconnect race that could strandsocketConnected=true, heartbeat-backstop holes, the catch-path file write) was fixed and re-verified against the installed@slack/bolt4.7.3 /@slack/socket-mode2.0.7 source.Accepted tradeoff (documented)
The old "blind 180s reconnect" incidentally recovered a silently-dead socket that socket-mode's own ping detection failed to notice. The new design relies on socket-mode's 5s/30s ping keepalive for transport liveness, with
watchdog.sh(process wedge) andexit()-after-MAX_RECONNECT_ATTEMPTS(a truly dead socket makesstart()throw on the reconnect path) as process-level backstops. A frame/event-based staleness backstop is not viable: measured idle data-frame cadence is ~1ws_messageper 12+ min (socket-mode keepalive is WebSocket control-frame ping/pong, never surfaced asws_message), so any practical ceiling would reintroduce the idle false-reconnect. The residual gap requires a socket-mode library-internal failure and is accepted.🤖 Generated with Claude Code