fix(ui): add retry logic to SSE pong to improve connection resilience by JDis03 · Pull Request #519 · NeuralNomadsAI/CodeNomad

JDis03 · 2026-06-02T11:54:34Z

Problem

When the client receives a ping from the server, it responds with a pong via HTTP POST. On unstable networks (mobile, WiFi with poor signal, network switches), this POST can fail in multiple ways:

Hung requests: fetch() never rejects, blocking retries indefinitely
Network disconnection: Failed to fetch
Brief disconnections: transient network errors

Previously, a single missed pong would cause the server to close the SSE connection after 45s, leaving message responses stuck in queue until the next message triggered a reconnection.

Solution

Three improvements to make the pong POST resilient:

1. Request timeout (10s)

Each pong POST is now bounded with a 10s AbortSignal timeout. Hung requests fail fast instead of blocking indefinitely, allowing retries to start before the server's stale connection sweep.

2. Selective retry with `isRetryableError()`

Only retries transient failures where recovery is possible:

AbortError / TimeoutError (hung or timed-out requests)
Failed to fetch (network disconnected)
NetworkError (browser network errors)

Non-retryable errors like 404 Client connection not found (permanently closed connection) fail immediately instead of wasting retry attempts.

3. Exponential backoff (3 attempts, 100ms → 2000ms)

Handles burst failures gracefully without hammering the server.

Changes

packages/ui/src/lib/retry-utils.ts (new): Reusable retry utility with timeoutMs and shouldRetry predicate support
packages/ui/src/lib/server-events.ts: Updated pong handler to use bounded timeout + selective retry

Verification

Build passes: npm run build:ui ✅
Manually verified in production logs: retries now visible as Pong failed after retries instead of single immediate failure
SSE monitor log at ~/.codenomad/logs/sse-monitor.log shows PONG_OK / PONG_FAIL / STALE events for ongoing monitoring

When the client receives a ping from the server, it responds with a pong via HTTP POST. On unstable networks (mobile, WiFi with poor signal, network switches), this POST can fail due to transient errors: - Network timeouts - Brief disconnections - Server temporarily unavailable Previously, a single pong failure would cause the server to close the SSE connection after 45s timeout, leaving message responses stuck in queue until the next message triggered a reconnection. This fix wraps the pong POST in retryWithBackoff with exponential backoff: - 3 attempts maximum - 100ms initial delay, doubling each retry - 2000ms maximum delay between retries This handles transient network issues gracefully without requiring reconnection, improving message delivery reliability for all users on imperfect networks (not just mobile). Retry logic is extracted to a reusable retry-utils.ts module that can be used for other fragile HTTP operations in the future.

github-actions · 2026-06-02T12:11:49Z

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26818019424

Artifacts expire in 7 days.
Artifacts:

pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-linux
pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-macos
pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-windows
pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-tauri-macos-arm64
pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-electron-macos
pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-electron-windows
pr-519-4184d0f094a123e0c6fb1a66c248a540d415d83f-electron-linux

shantur · 2026-06-04T11:42:15Z

@CodeNomadBot

I want you to perform the gatekeeper checks for latest PR state

check for regressions
check if the PR could be better implemented in any other way
Be the PR gatekeeper
Be ruthless about code quality

codenomadbot · 2026-06-04T11:44:03Z

Gatekeeper review for latest PR state.

I interpreted the request as a ruthless quality/regression review, not a request to modify the PR directly.

Findings

Blocking: retry does not cover the timeout failure mode the PR claims to fix

packages/ui/src/lib/server-events.ts:46-52 wraps serverApi.sendClientConnectionPong(...) in retryWithBackoff, but sendClientConnectionPong goes through the shared request() helper in packages/ui/src/lib/api-client.ts:137-164, and that helper does not apply a timeout or AbortSignal for this call.

That means retries only start if fetch() rejects. For one of the headline failure modes in the PR description, “Network timeouts”, browser fetch can remain pending well beyond the server’s STALE_CONNECTION_TIMEOUT_MS = 45000 in packages/server/src/clients/connection-manager.ts:3. In that case the first pong attempt can hang, no retry happens in time, and the server can still close the SSE connection as stale. The implementation improves rejected transient failures, but it does not reliably fix timeout/hung POSTs.

Recommended fix: make the pong POST bounded with an abort timeout shorter than the stale window, then retry aborted attempts. Ideally keep this scoped to the pong path rather than adding global request timeout behavior unless the project wants that semantic everywhere.

Quality: retries should not blindly retry non-transient server responses

packages/ui/src/lib/retry-utils.ts:22-32 retries every thrown error, including deterministic HTTP errors emitted by request() such as 404 Client connection not found from packages/server/src/server/routes/events.ts:83-85.

For stale/replaced connections, retrying 404 cannot recover and adds extra requests/noise. This is not catastrophic, but as a reusable helper it is too blunt for future call sites. Either keep the retry helper private/specific to pong, or add a shouldRetry(error) predicate and only retry network/timeout/5xx-class failures where the API client exposes enough information to tell.

Missing coverage

There are no tests for retryWithBackoff or the pong behavior. For a connection-liveness path, I would expect at least helper-level coverage for:

success on first attempt
retry after rejection
final failure after max attempts
delay capping/backoff behavior
non-retryable error behavior if a predicate is added
abort/timeout behavior once implemented

Verification

Attempted local verification:

npm run build:ui failed before compiling the PR because vite is not installed/resolvable in this worktree.
npm run typecheck --workspace @codenomad/ui failed with missing dependency/type resolution errors for solid-js, @opencode-ai/sdk, and related packages.

I am treating these verification failures as environment/dependency-installation issues, not as regressions from this two-file PR.

Gatekeeper verdict

Do not merge as-is if the intended fix includes timeout resilience. The PR is directionally useful for rejected transient POST failures, but it overstates the protection and leaves the most important “hung timeout” class uncovered. A tighter implementation would add a bounded pong request timeout plus selective retry behavior, with small focused tests around the retry helper or pong path.

--
Yours,
CodeNomadBot

Address gatekeeper review findings on PR NeuralNomadsAI#519: 1. Add 10s AbortSignal timeout to pong POST so hung requests fail fast instead of blocking indefinitely. Previously a hung fetch would never reject, preventing retries from starting before the server's stale connection sweep closed the SSE connection. 2. Add isRetryableError() predicate to only retry transient failures: - AbortError / TimeoutError (hung or timed out requests) - 'Failed to fetch' (network disconnected) - 'NetworkError' (browser network errors) - 'timeout' in message (explicit timeouts) Non-retryable errors like '404 Client connection not found' now fail immediately instead of wasting retry attempts on a permanently closed connection. 3. Extract retryWithBackoff options to support timeoutMs and shouldRetry as first-class options, keeping the utility reusable for future use.

github-actions · 2026-06-04T14:19:02Z

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26957422077

Artifacts expire in 7 days.
Artifacts: (none found on this run)

github-actions · 2026-06-04T14:19:05Z

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26957422077

Artifacts expire in 7 days.
Artifacts: (none found on this run)

These files should not be part of this PR which is scoped to the SSE pong retry logic only.

github-actions · 2026-06-04T15:40:53Z

PR builds are available as GitHub Actions artifacts:

https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26961299996

Artifacts expire in 7 days.
Artifacts:

pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-macos
pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-linux
pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-windows
pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-electron-macos
pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-tauri-macos-arm64
pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-electron-windows
pr-519-86daa2609f76d2b1e7f22c8ba4ce9e9e9a0cdd83-electron-linux

chore: revert AGENTS.md and connection-manager to upstream state

86daa26

These files should not be part of this PR which is scoped to the SSE pong retry logic only.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ui): add retry logic to SSE pong to improve connection resilience#519

fix(ui): add retry logic to SSE pong to improve connection resilience#519
JDis03 wants to merge 3 commits into
NeuralNomadsAI:devfrom
JDis03:fix/sse-pong-retry

JDis03 commented Jun 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

shantur commented Jun 4, 2026

Uh oh!

codenomadbot Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JDis03 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

1. Request timeout (10s)

2. Selective retry with isRetryableError()

3. Exponential backoff (3 attempts, 100ms → 2000ms)

Changes

Verification

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

shantur commented Jun 4, 2026

Uh oh!

codenomadbot Bot commented Jun 4, 2026

Findings

Blocking: retry does not cover the timeout failure mode the PR claims to fix

Quality: retries should not blindly retry non-transient server responses

Missing coverage

Verification

Gatekeeper verdict

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JDis03 commented Jun 2, 2026 •

edited

Loading

2. Selective retry with `isRetryableError()`