Skip to content

fix(ui): add retry logic to SSE pong to improve connection resilience#519

Open
JDis03 wants to merge 3 commits into
NeuralNomadsAI:devfrom
JDis03:fix/sse-pong-retry
Open

fix(ui): add retry logic to SSE pong to improve connection resilience#519
JDis03 wants to merge 3 commits into
NeuralNomadsAI:devfrom
JDis03:fix/sse-pong-retry

Conversation

@JDis03
Copy link
Copy Markdown
Contributor

@JDis03 JDis03 commented Jun 2, 2026

Problem

When the client receives a ping from the server, it responds with a pong via HTTP POST. On unstable networks (mobile, WiFi with poor signal, network switches), this POST can fail in multiple ways:

  • Hung requests: fetch() never rejects, blocking retries indefinitely
  • Network disconnection: Failed to fetch
  • Brief disconnections: transient network errors

Previously, a single missed pong would cause the server to close the SSE connection after 45s, leaving message responses stuck in queue until the next message triggered a reconnection.

Solution

Three improvements to make the pong POST resilient:

1. Request timeout (10s)

Each pong POST is now bounded with a 10s AbortSignal timeout. Hung requests fail fast instead of blocking indefinitely, allowing retries to start before the server's stale connection sweep.

2. Selective retry with isRetryableError()

Only retries transient failures where recovery is possible:

  • AbortError / TimeoutError (hung or timed-out requests)
  • Failed to fetch (network disconnected)
  • NetworkError (browser network errors)

Non-retryable errors like 404 Client connection not found (permanently closed connection) fail immediately instead of wasting retry attempts.

3. Exponential backoff (3 attempts, 100ms → 2000ms)

Handles burst failures gracefully without hammering the server.

Changes

  • packages/ui/src/lib/retry-utils.ts (new): Reusable retry utility with timeoutMs and shouldRetry predicate support
  • packages/ui/src/lib/server-events.ts: Updated pong handler to use bounded timeout + selective retry

Verification

  • Build passes: npm run build:ui
  • Manually verified in production logs: retries now visible as Pong failed after retries instead of single immediate failure
  • SSE monitor log at ~/.codenomad/logs/sse-monitor.log shows PONG_OK / PONG_FAIL / STALE events for ongoing monitoring

When the client receives a ping from the server, it responds with a
pong via HTTP POST. On unstable networks (mobile, WiFi with poor signal,
network switches), this POST can fail due to transient errors:
- Network timeouts
- Brief disconnections
- Server temporarily unavailable

Previously, a single pong failure would cause the server to close the
SSE connection after 45s timeout, leaving message responses stuck in
queue until the next message triggered a reconnection.

This fix wraps the pong POST in retryWithBackoff with exponential
backoff:
- 3 attempts maximum
- 100ms initial delay, doubling each retry
- 2000ms maximum delay between retries

This handles transient network issues gracefully without requiring
reconnection, improving message delivery reliability for all users
on imperfect networks (not just mobile).

Retry logic is extracted to a reusable retry-utils.ts module that can
be used for other fragile HTTP operations in the future.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26818019424

Artifacts expire in 7 days.
Artifacts:

  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-linux
  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-macos
  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-windows
  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-macos-arm64
  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-electron-macos
  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-electron-windows
  • pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-electron-linux

@shantur
Copy link
Copy Markdown
Collaborator

shantur commented Jun 4, 2026

@CodeNomadBot

I want you to perform the gatekeeper checks for latest PR state

check for regressions
check if the PR could be better implemented in any other way
Be the PR gatekeeper
Be ruthless about code quality

@codenomadbot
Copy link
Copy Markdown
Contributor

codenomadbot Bot commented Jun 4, 2026

Gatekeeper review for latest PR state.

I interpreted the request as a ruthless quality/regression review, not a request to modify the PR directly.

Findings

Blocking: retry does not cover the timeout failure mode the PR claims to fix

packages/ui/src/lib/server-events.ts:46-52 wraps serverApi.sendClientConnectionPong(...) in retryWithBackoff, but sendClientConnectionPong goes through the shared request() helper in packages/ui/src/lib/api-client.ts:137-164, and that helper does not apply a timeout or AbortSignal for this call.

That means retries only start if fetch() rejects. For one of the headline failure modes in the PR description, “Network timeouts”, browser fetch can remain pending well beyond the server’s STALE_CONNECTION_TIMEOUT_MS = 45000 in packages/server/src/clients/connection-manager.ts:3. In that case the first pong attempt can hang, no retry happens in time, and the server can still close the SSE connection as stale. The implementation improves rejected transient failures, but it does not reliably fix timeout/hung POSTs.

Recommended fix: make the pong POST bounded with an abort timeout shorter than the stale window, then retry aborted attempts. Ideally keep this scoped to the pong path rather than adding global request timeout behavior unless the project wants that semantic everywhere.

Quality: retries should not blindly retry non-transient server responses

packages/ui/src/lib/retry-utils.ts:22-32 retries every thrown error, including deterministic HTTP errors emitted by request() such as 404 Client connection not found from packages/server/src/server/routes/events.ts:83-85.

For stale/replaced connections, retrying 404 cannot recover and adds extra requests/noise. This is not catastrophic, but as a reusable helper it is too blunt for future call sites. Either keep the retry helper private/specific to pong, or add a shouldRetry(error) predicate and only retry network/timeout/5xx-class failures where the API client exposes enough information to tell.

Missing coverage

There are no tests for retryWithBackoff or the pong behavior. For a connection-liveness path, I would expect at least helper-level coverage for:

  • success on first attempt
  • retry after rejection
  • final failure after max attempts
  • delay capping/backoff behavior
  • non-retryable error behavior if a predicate is added
  • abort/timeout behavior once implemented

Verification

Attempted local verification:

  • npm run build:ui failed before compiling the PR because vite is not installed/resolvable in this worktree.
  • npm run typecheck --workspace @codenomad/ui failed with missing dependency/type resolution errors for solid-js, @opencode-ai/sdk, and related packages.

I am treating these verification failures as environment/dependency-installation issues, not as regressions from this two-file PR.

Gatekeeper verdict

Do not merge as-is if the intended fix includes timeout resilience. The PR is directionally useful for rejected transient POST failures, but it overstates the protection and leaves the most important “hung timeout” class uncovered. A tighter implementation would add a bounded pong request timeout plus selective retry behavior, with small focused tests around the retry helper or pong path.

--
Yours,
CodeNomadBot

Address gatekeeper review findings on PR NeuralNomadsAI#519:

1. Add 10s AbortSignal timeout to pong POST so hung requests fail fast
   instead of blocking indefinitely. Previously a hung fetch would never
   reject, preventing retries from starting before the server's stale
   connection sweep closed the SSE connection.

2. Add isRetryableError() predicate to only retry transient failures:
   - AbortError / TimeoutError (hung or timed out requests)
   - 'Failed to fetch' (network disconnected)
   - 'NetworkError' (browser network errors)
   - 'timeout' in message (explicit timeouts)
   Non-retryable errors like '404 Client connection not found' now
   fail immediately instead of wasting retry attempts on a permanently
   closed connection.

3. Extract retryWithBackoff options to support timeoutMs and shouldRetry
   as first-class options, keeping the utility reusable for future use.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26957422077

Artifacts expire in 7 days.
Artifacts: (none found on this run)

1 similar comment
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26957422077

Artifacts expire in 7 days.
Artifacts: (none found on this run)

These files should not be part of this PR which is scoped to
the SSE pong retry logic only.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26961299996

Artifacts expire in 7 days.
Artifacts:

  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-macos
  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-linux
  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-windows
  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-electron-macos
  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-macos-arm64
  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-electron-windows
  • pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-electron-linux

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants