fix(slack): gate gateway self-heal on socket liveness, not event recency by kwliang1 · Pull Request #65 · sf8193/hydra

kwliang1 · 2026-06-30T05:14:00Z

Problem

The Slack gateway's Socket Mode connection cycled connection stale (~239s) → reconnect every ~3–4 minutes on an idle bot. Root cause: the self-heal watchdog (startHealthCheck) judged liveness by business-event recency (lastEventAt, bumped only on message/mention/button), not transport health. With STALE_THRESHOLD_MS=180s, a normal quiet period was misread as a dead connection and force-reconnected. Each reconnect stops+restarts the Bolt app, briefly stacking concurrent Socket Mode WebSockets (observed up to 10 ESTABLISHED TLS conns, settling to 5); Slack round-robins events across all live sockets, starving each one and sustaining the loop.

Fix

Gate the watchdog on the SocketModeClient's real connection state:

Hook its connected/reconnecting/disconnected lifecycle (the client self-pings every 5s, times out the server at 30s, and auto-reconnects).
Reconnect only after the socket has been continuously disconnected past the threshold (disconnectedSince); a self-healing blip clears it before it trips.
Generation-guard the listeners so a superseded client (Bolt's app.stop() tears the socket down async and never removes listeners) can't mutate the shared flags for a connection we already replaced.
forceReconnect holds the reconnecting mutex so the health tick can't run a second concurrent start().
Heartbeat-file backstop preserved: withhold the file write once stale (and drop reconnect()'s catch-path write) so a wedged/failing-reconnect daemon lets the file age and watchdog.sh restarts it; keep the network-down write (restart-storm avoidance, symmetric with watchdog.sh).
Fall back to the prior event-time staleness if Bolt internals change and the client can't be hooked.

Verification

Deployed to the live daemon: 0 stale-reconnects over multiple 5–14 min idle windows (old code logged ~1–2 reconnect cycles per ~4 min); transport hook attaches, connection stays up.
Reviewed across correctness + resilience lenses over several independent rounds; every real issue found (a forceReconnect/tick-reconnect race that could strand socketConnected=true, heartbeat-backstop holes, the catch-path file write) was fixed and re-verified against the installed @slack/bolt 4.7.3 / @slack/socket-mode 2.0.7 source.

Accepted tradeoff (documented)

The old "blind 180s reconnect" incidentally recovered a silently-dead socket that socket-mode's own ping detection failed to notice. The new design relies on socket-mode's 5s/30s ping keepalive for transport liveness, with watchdog.sh (process wedge) and exit()-after-MAX_RECONNECT_ATTEMPTS (a truly dead socket makes start() throw on the reconnect path) as process-level backstops. A frame/event-based staleness backstop is not viable: measured idle data-frame cadence is ~1 ws_message per 12+ min (socket-mode keepalive is WebSocket control-frame ping/pong, never surfaced as ws_message), so any practical ceiling would reintroduce the idle false-reconnect. The residual gap requires a socket-mode library-internal failure and is accepted.

🤖 Generated with Claude Code

The Socket Mode self-heal watchdog reconnected whenever no business event (message/mention/button) arrived within STALE_THRESHOLD_MS (180s). On an idle bot that fired every ~3-4 min, needlessly reconnecting a healthy connection -- and since reconnect stops then restarts the Bolt app, it briefly stacked concurrent Socket Mode WebSockets, which Slack round-robins events across, starving each one and sustaining the loop. Gate the watchdog on the SocketModeClient's actual connection state: hook its connected/reconnecting/disconnected lifecycle (the client self-pings every 5s, times out the server at 30s, and auto-reconnects), and only force a reconnect once the socket has been continuously disconnected past the threshold. An idle-but-connected socket is left alone. - Generation-guard the transport listeners so a superseded client (Bolt's app.stop() tears the socket down async and never removes our listeners) can't write the shared flags for a connection we already replaced. - Track disconnectedSince (continuous-down duration) instead of time-since-last-event, so a self-healing reconnect blip never trips it. - forceReconnect now holds the same reconnecting mutex as reconnect(), so the health tick can't run a second concurrent start(). - Withhold the heartbeat-file write once stale and drop reconnect()'s catch-path write, so a wedged/failing-reconnect daemon lets the file age and watchdog.sh restarts it (defense-in-depth preserved); keep the network-down write to avoid a restart-storm during a real outage. - Fall back to event-time staleness if Bolt internals change and the client can't be hooked. Verified live: 0 stale-reconnects over multiple idle windows (old code logged ~1-2 reconnect cycles per ~4 min). Idle data-frame cadence measured at ~1 frame / 12+ min, confirming a frame/event-based staleness backstop would false-fire -- so transport liveness is delegated to socket-mode's ping keepalive, with watchdog.sh and exit-after-MAX as process-level backstops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sf8193

Reviewed by sp-reviewer (architecture) and typescript-reviewer (correctness) in parallel.

Core change is sound — keying staleness off socket lifecycle events instead of business-event recency eliminates the false-positive reconnect-storm on idle-but-healthy connections. Generation guard is a clean solution to stale-listener stomping. Heartbeat withholding / refresh policy changes are well-reasoned.

4 inline comments below (0 blockers, 3 should-fix, 1 nit).

sf8193 · 2026-07-04T19:21:15Z

+  private socketConnected = false
+  private transportHooked = false
+  private socketGeneration = 0
+  private disconnectedSince: number | null = null


should-fix (sp-reviewer): discord-gateway.ts delegates heartbeat/disconnect/reconnect-gap tracking to the shared GatewayHealth class (daemon/gateway-health.ts), which was designed for multi-gateway adoption (line 10: "SlackGateway can adopt it later"). This PR introduces a parallel socketConnected / disconnectedSince / touchHeartbeat / writeHeartbeat state machine that is semantically identical to GatewayHealth.markAlive() / markDisconnected() / markReconnected(). The transport-hook approach is correct, but the lifecycle flags should flow through GatewayHealth so heartbeat-write, outage-gap, and disconnect-tracking logic is single-sourced rather than drifting across two implementations.

sf8193 · 2026-07-04T19:21:15Z

+    this.attachTransportHeartbeat()
+
    await this.app.start()
+    // socketConnected/disconnectedSince are set by the generation-guarded


should-fix (ts-reviewer): Post-start() correctness depends on a Bolt behavioral contract: that the connected event fires before start() resolves. If a Bolt upgrade changes this ordering, socketConnected stays false and disconnectedSince points to a prior disconnect, causing immediate false-positive stale detection on the next health tick.

Consider adding a defensive this.socketConnected = true; this.disconnectedSince = null after start() resolves (or at minimum an assertion), so the invariant doesn't silently break on a dependency bump. The generation guard makes this safe — a stale connected event from a dead socket would be filtered, but a missing event from the live socket would leave the flags wrong.

sf8193 · 2026-07-04T19:21:15Z

+  // few minutes, briefly stacking concurrent WebSockets.
+  private attachTransportHeartbeat(): void {
+    try {
+      const client = (this.app as any)?.receiver?.client


should-fix (flagged by both): (this.app as any)?.receiver?.client — the fallback path handles degradation correctly, but each start() → attachTransportHeartbeat() cycle accumulates generation-guarded-but-never-removed listeners on the prior client object. The old App is stop()'d and replaced so the listeners should be GC'd with it, but if stop() doesn't fully tear down (e.g., async teardown lag during a reconnect storm), unbounded listener growth could trigger Node/Bun MaxListenersExceededWarning. Consider tracking the listeners and removing them in stop(), or calling client.removeAllListeners() before attaching new ones.

sf8193 · 2026-07-04T19:21:15Z

      this.setHealthCheckInterval(backoffMs)
      process.stderr.write(`slack gateway: reconnect attempt ${this.reconnectAttempts}/${MAX_RECONNECT_ATTEMPTS} failed, next check in ${Math.round(backoffMs / 1000)}s: ${err}\n`)
-      this.writeHeartbeat()
+      // Deliberately do NOT refresh the heartbeat file here: the network is up but


nit (ts-reviewer): Removing writeHeartbeat() here creates two competing restart mechanisms: in-process exit(1) after MAX_RECONNECT_ATTEMPTS and the external watchdog. The comment explains the rationale well. Just confirming the interaction: watchdog's STALE_SECONDS (300s in watchdog.sh:25) matches RECONNECT_BACKOFF_CAP_MS (5 × 60000 = 300s), so after ~2 failed reconnects the watchdog should intervene. This seems intentional — but documenting which mechanism is expected to fire first (and that the race is benign) would help future readers.

sf8193 reviewed Jul 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(slack): gate gateway self-heal on socket liveness, not event recency#65

fix(slack): gate gateway self-heal on socket liveness, not event recency#65
kwliang1 wants to merge 1 commit into
mainfrom
kevinliang/slack-gateway-idle-reconnect

kwliang1 commented Jun 30, 2026

Uh oh!

sf8193 left a comment

Uh oh!

sf8193 Jul 4, 2026

Uh oh!

sf8193 Jul 4, 2026

Uh oh!

sf8193 Jul 4, 2026

Uh oh!

sf8193 Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kwliang1 commented Jun 30, 2026

Problem

Fix

Verification

Accepted tradeoff (documented)

Uh oh!

sf8193 left a comment

Choose a reason for hiding this comment

Uh oh!

sf8193 Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

sf8193 Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

sf8193 Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

sf8193 Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants