Skip to content

fix(slack): gate gateway self-heal on socket liveness, not event recency#65

Open
kwliang1 wants to merge 1 commit into
mainfrom
kevinliang/slack-gateway-idle-reconnect
Open

fix(slack): gate gateway self-heal on socket liveness, not event recency#65
kwliang1 wants to merge 1 commit into
mainfrom
kevinliang/slack-gateway-idle-reconnect

Conversation

@kwliang1

Copy link
Copy Markdown
Collaborator

Problem

The Slack gateway's Socket Mode connection cycled connection stale (~239s) → reconnect every ~3–4 minutes on an idle bot. Root cause: the self-heal watchdog (startHealthCheck) judged liveness by business-event recency (lastEventAt, bumped only on message/mention/button), not transport health. With STALE_THRESHOLD_MS=180s, a normal quiet period was misread as a dead connection and force-reconnected. Each reconnect stops+restarts the Bolt app, briefly stacking concurrent Socket Mode WebSockets (observed up to 10 ESTABLISHED TLS conns, settling to 5); Slack round-robins events across all live sockets, starving each one and sustaining the loop.

Fix

Gate the watchdog on the SocketModeClient's real connection state:

  • Hook its connected/reconnecting/disconnected lifecycle (the client self-pings every 5s, times out the server at 30s, and auto-reconnects).
  • Reconnect only after the socket has been continuously disconnected past the threshold (disconnectedSince); a self-healing blip clears it before it trips.
  • Generation-guard the listeners so a superseded client (Bolt's app.stop() tears the socket down async and never removes listeners) can't mutate the shared flags for a connection we already replaced.
  • forceReconnect holds the reconnecting mutex so the health tick can't run a second concurrent start().
  • Heartbeat-file backstop preserved: withhold the file write once stale (and drop reconnect()'s catch-path write) so a wedged/failing-reconnect daemon lets the file age and watchdog.sh restarts it; keep the network-down write (restart-storm avoidance, symmetric with watchdog.sh).
  • Fall back to the prior event-time staleness if Bolt internals change and the client can't be hooked.

Verification

  • Deployed to the live daemon: 0 stale-reconnects over multiple 5–14 min idle windows (old code logged ~1–2 reconnect cycles per ~4 min); transport hook attaches, connection stays up.
  • Reviewed across correctness + resilience lenses over several independent rounds; every real issue found (a forceReconnect/tick-reconnect race that could strand socketConnected=true, heartbeat-backstop holes, the catch-path file write) was fixed and re-verified against the installed @slack/bolt 4.7.3 / @slack/socket-mode 2.0.7 source.

Accepted tradeoff (documented)

The old "blind 180s reconnect" incidentally recovered a silently-dead socket that socket-mode's own ping detection failed to notice. The new design relies on socket-mode's 5s/30s ping keepalive for transport liveness, with watchdog.sh (process wedge) and exit()-after-MAX_RECONNECT_ATTEMPTS (a truly dead socket makes start() throw on the reconnect path) as process-level backstops. A frame/event-based staleness backstop is not viable: measured idle data-frame cadence is ~1 ws_message per 12+ min (socket-mode keepalive is WebSocket control-frame ping/pong, never surfaced as ws_message), so any practical ceiling would reintroduce the idle false-reconnect. The residual gap requires a socket-mode library-internal failure and is accepted.

🤖 Generated with Claude Code

The Socket Mode self-heal watchdog reconnected whenever no business event
(message/mention/button) arrived within STALE_THRESHOLD_MS (180s). On an
idle bot that fired every ~3-4 min, needlessly reconnecting a healthy
connection -- and since reconnect stops then restarts the Bolt app, it
briefly stacked concurrent Socket Mode WebSockets, which Slack round-robins
events across, starving each one and sustaining the loop.

Gate the watchdog on the SocketModeClient's actual connection state: hook
its connected/reconnecting/disconnected lifecycle (the client self-pings
every 5s, times out the server at 30s, and auto-reconnects), and only force
a reconnect once the socket has been continuously disconnected past the
threshold. An idle-but-connected socket is left alone.

- Generation-guard the transport listeners so a superseded client (Bolt's
  app.stop() tears the socket down async and never removes our listeners)
  can't write the shared flags for a connection we already replaced.
- Track disconnectedSince (continuous-down duration) instead of
  time-since-last-event, so a self-healing reconnect blip never trips it.
- forceReconnect now holds the same reconnecting mutex as reconnect(), so
  the health tick can't run a second concurrent start().
- Withhold the heartbeat-file write once stale and drop reconnect()'s
  catch-path write, so a wedged/failing-reconnect daemon lets the file age
  and watchdog.sh restarts it (defense-in-depth preserved); keep the
  network-down write to avoid a restart-storm during a real outage.
- Fall back to event-time staleness if Bolt internals change and the client
  can't be hooked.

Verified live: 0 stale-reconnects over multiple idle windows (old code
logged ~1-2 reconnect cycles per ~4 min). Idle data-frame cadence measured
at ~1 frame / 12+ min, confirming a frame/event-based staleness backstop
would false-fire -- so transport liveness is delegated to socket-mode's ping
keepalive, with watchdog.sh and exit-after-MAX as process-level backstops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@sf8193 sf8193 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed by sp-reviewer (architecture) and typescript-reviewer (correctness) in parallel.

Core change is sound — keying staleness off socket lifecycle events instead of business-event recency eliminates the false-positive reconnect-storm on idle-but-healthy connections. Generation guard is a clean solution to stale-listener stomping. Heartbeat withholding / refresh policy changes are well-reasoned.

4 inline comments below (0 blockers, 3 should-fix, 1 nit).

Comment thread slack-gateway.ts
private socketConnected = false
private transportHooked = false
private socketGeneration = 0
private disconnectedSince: number | null = null

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should-fix (sp-reviewer): discord-gateway.ts delegates heartbeat/disconnect/reconnect-gap tracking to the shared GatewayHealth class (daemon/gateway-health.ts), which was designed for multi-gateway adoption (line 10: "SlackGateway can adopt it later"). This PR introduces a parallel socketConnected / disconnectedSince / touchHeartbeat / writeHeartbeat state machine that is semantically identical to GatewayHealth.markAlive() / markDisconnected() / markReconnected(). The transport-hook approach is correct, but the lifecycle flags should flow through GatewayHealth so heartbeat-write, outage-gap, and disconnect-tracking logic is single-sourced rather than drifting across two implementations.

Comment thread slack-gateway.ts
this.attachTransportHeartbeat()

await this.app.start()
// socketConnected/disconnectedSince are set by the generation-guarded

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should-fix (ts-reviewer): Post-start() correctness depends on a Bolt behavioral contract: that the connected event fires before start() resolves. If a Bolt upgrade changes this ordering, socketConnected stays false and disconnectedSince points to a prior disconnect, causing immediate false-positive stale detection on the next health tick.

Consider adding a defensive this.socketConnected = true; this.disconnectedSince = null after start() resolves (or at minimum an assertion), so the invariant doesn't silently break on a dependency bump. The generation guard makes this safe — a stale connected event from a dead socket would be filtered, but a missing event from the live socket would leave the flags wrong.

Comment thread slack-gateway.ts
// few minutes, briefly stacking concurrent WebSockets.
private attachTransportHeartbeat(): void {
try {
const client = (this.app as any)?.receiver?.client

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should-fix (flagged by both): (this.app as any)?.receiver?.client — the fallback path handles degradation correctly, but each start()attachTransportHeartbeat() cycle accumulates generation-guarded-but-never-removed listeners on the prior client object. The old App is stop()'d and replaced so the listeners should be GC'd with it, but if stop() doesn't fully tear down (e.g., async teardown lag during a reconnect storm), unbounded listener growth could trigger Node/Bun MaxListenersExceededWarning. Consider tracking the listeners and removing them in stop(), or calling client.removeAllListeners() before attaching new ones.

Comment thread slack-gateway.ts
this.setHealthCheckInterval(backoffMs)
process.stderr.write(`slack gateway: reconnect attempt ${this.reconnectAttempts}/${MAX_RECONNECT_ATTEMPTS} failed, next check in ${Math.round(backoffMs / 1000)}s: ${err}\n`)
this.writeHeartbeat()
// Deliberately do NOT refresh the heartbeat file here: the network is up but

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (ts-reviewer): Removing writeHeartbeat() here creates two competing restart mechanisms: in-process exit(1) after MAX_RECONNECT_ATTEMPTS and the external watchdog. The comment explains the rationale well. Just confirming the interaction: watchdog's STALE_SECONDS (300s in watchdog.sh:25) matches RECONNECT_BACKOFF_CAP_MS (5 × 60000 = 300s), so after ~2 failed reconnects the watchdog should intervene. This seems intentional — but documenting which mechanism is expected to fire first (and that the race is benign) would help future readers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants