fix(ui): add retry logic to SSE pong to improve connection resilience#519
JDis03 wants to merge 3 commits into
When the client receives a ping from the server, it responds with a pong via HTTP POST. On unstable networks (mobile, WiFi with poor signal, network switches), this POST can fail due to transient errors:
- Network timeouts
- Brief disconnections
- Server temporarily unavailable
Previously, a single pong failure would cause the server to close the SSE connection after 45s timeout, leaving message responses stuck in queue until the next message triggered a reconnection.
This fix wraps the pong POST in retryWithBackoff with exponential backoff:
- 3 attempts maximum
- 100ms initial delay, doubling each retry
- 2000ms maximum delay between retries
This handles transient network issues gracefully without requiring reconnection, improving message delivery reliability for all users on imperfect networks (not just mobile).
Retry logic is extracted to a reusable retry-utils.ts module that can be used for other fragile HTTP operations in the future.
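The backoff parameters described above can be sketched as follows. This is a minimal illustration, not the contents of the PR's retry-utils.ts: the option names (`maxAttempts`, `initialDelayMs`, `maxDelayMs`) and the helper `delayAfterAttempt` are assumptions made for the example.

```typescript
// Hypothetical sketch; names and option shapes are assumptions, not the PR's actual module.
interface BackoffOptions {
  maxAttempts: number;    // total attempts, including the first one
  initialDelayMs: number; // wait before the first retry
  maxDelayMs: number;     // upper bound on any single wait
}

// Values taken from the PR description: 3 attempts, 100ms initial delay, 2000ms cap.
const PONG_BACKOFF: BackoffOptions = { maxAttempts: 3, initialDelayMs: 100, maxDelayMs: 2000 };

// Wait that follows a failed attempt (1-based), doubling each time up to the cap.
function delayAfterAttempt(attempt: number, opts: BackoffOptions): number {
  return Math.min(opts.initialDelayMs * 2 ** (attempt - 1), opts.maxDelayMs);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation` until it succeeds or the attempt budget is exhausted,
// then rethrows the last error.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  opts: BackoffOptions = PONG_BACKOFF,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < opts.maxAttempts) {
        await sleep(delayAfterAttempt(attempt, opts));
      }
    }
  }
  throw lastError;
}
```

Under these assumptions, three attempts produce only two waits (100ms, then 200ms), so the 2000ms cap would come into play only if the attempt count were raised.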
PR builds are available as GitHub Actions artifacts: https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26818019424 Artifacts expire in 7 days.
I want you to perform the gatekeeper checks for the latest PR state and check for regressions.
Gatekeeper review for latest PR state. I interpreted the request as a ruthless quality/regression review, not a request to modify the PR directly.
Findings
Blocking: retry does not cover the timeout failure mode the PR claims to fix
That means retries only start if
Recommended fix: make the pong POST bounded with an abort timeout shorter than the stale window, then retry aborted attempts. Ideally keep this scoped to the pong path rather than adding global request timeout behavior unless the project wants that semantic everywhere.
Quality: retries should not blindly retry non-transient server responses
For stale/replaced connections, retrying
Missing coverage
There are no tests for
Verification
Attempted local verification:
I am treating these verification failures as environment/dependency-installation issues, not as regressions from this two-file PR.
Gatekeeper verdict
Do not merge as-is if the intended fix includes timeout resilience. The PR is directionally useful for rejected transient POST failures, but it overstates the protection and leaves the most important “hung timeout” class uncovered. A tighter implementation would add a bounded pong request timeout plus selective retry behavior, with small focused tests around the retry helper or pong path.
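To make the recommended fix concrete, here is one way a single request could be bounded so that a hung attempt rejects and a retry can begin. It is a sketch only: `withTimeout` and its parameters are hypothetical names for this example, and it uses `AbortController` with a timer rather than whatever mechanism the PR ends up using.

```typescript
// Hypothetical helper; not taken from the PR.
// Runs `run` with an AbortSignal that fires after `timeoutMs`, so an operation
// that honors the signal rejects instead of hanging indefinitely.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    // Always clear the timer so a fast response does not leave it pending.
    clearTimeout(timer);
  }
}
```

A pong call could then be written along the lines of `withTimeout((signal) => fetch(pongUrl, { method: "POST", signal }), 10_000)`, where `pongUrl` stands in for the real endpoint. When the signal aborts, `fetch` rejects with an `AbortError`, which gives a retry wrapper something to catch.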
Address gatekeeper review findings on PR NeuralNomadsAI#519:
1. Add 10s AbortSignal timeout to pong POST so hung requests fail fast instead of blocking indefinitely. Previously a hung fetch would never reject, preventing retries from starting before the server's stale connection sweep closed the SSE connection.
2. Add isRetryableError() predicate to only retry transient failures:
   - AbortError / TimeoutError (hung or timed out requests)
   - 'Failed to fetch' (network disconnected)
   - 'NetworkError' (browser network errors)
   - 'timeout' in message (explicit timeouts)
   Non-retryable errors like '404 Client connection not found' now fail immediately instead of wasting retry attempts on a permanently closed connection.
3. Extract retryWithBackoff options to support timeoutMs and shouldRetry as first-class options, keeping the utility reusable for future use.
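The predicate described in item 2 might look roughly like the sketch below. The matching rules follow the commit message; the implementation details (duck-typing on `name` and `message` rather than `instanceof`, case-insensitive matching of "timeout") are assumptions, not the PR's code.

```typescript
// Hypothetical sketch of the retry predicate; rules from the commit message, code assumed.
function isRetryableError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { name, message } = error as { name?: unknown; message?: unknown };

  // Aborted or timed-out requests are worth another attempt.
  if (name === "AbortError" || name === "TimeoutError") return true;

  if (typeof message !== "string") return false;
  return (
    message.includes("Failed to fetch") ||      // network disconnected
    message.includes("NetworkError") ||         // browser network errors
    message.toLowerCase().includes("timeout")   // explicit timeouts
  );
}
```

Anything that does not match, such as an error carrying '404 Client connection not found', returns false, so a retry wrapper that consults this predicate would stop after the first failure.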
PR builds are available as GitHub Actions artifacts: https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26957422077 Artifacts expire in 7 days.
These files should not be part of this PR which is scoped to the SSE pong retry logic only.
PR builds are available as GitHub Actions artifacts: https://github.com/NeuralNomadsAI/CodeNomad/actions/runs/26961299996 Artifacts expire in 7 days.
Problem
When the client receives a ping from the server, it responds with a pong via HTTP POST. On unstable networks (mobile, WiFi with poor signal, network switches), this POST can fail in multiple ways:
- fetch() never rejects, blocking retries indefinitely
- Failed to fetch
Previously, a single missed pong would cause the server to close the SSE connection after 45s, leaving message responses stuck in queue until the next message triggered a reconnection.
Solution
Three improvements to make the pong POST resilient:
1. Request timeout (10s)
Each pong POST is now bounded with a 10s AbortSignal timeout. Hung requests fail fast instead of blocking indefinitely, allowing retries to start before the server's stale connection sweep.
2. Selective retry with isRetryableError()
Only retries transient failures where recovery is possible:
- AbortError / TimeoutError (hung or timed-out requests)
- Failed to fetch (network disconnected)
- NetworkError (browser network errors)
Non-retryable errors like 404 Client connection not found (permanently closed connection) fail immediately instead of wasting retry attempts.
3. Exponential backoff (3 attempts, 100ms → 2000ms)
Handles burst failures gracefully without hammering the server.
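As a rough check that the three settings fit together, the worst case (every attempt runs to its full timeout) can be added up and compared with the 45s stale window. The function name `worstCaseMs` is hypothetical, and the comparison assumes the 45s window is measured from roughly when the ping was sent; the server's actual sweep timing is not described in this PR.

```typescript
// Hypothetical helper: total time if every attempt hangs until its timeout fires.
function worstCaseMs(
  attempts: number,
  timeoutMs: number,
  initialDelayMs: number,
  maxDelayMs: number,
): number {
  let total = attempts * timeoutMs;
  // One backoff wait between each pair of consecutive attempts.
  for (let retry = 0; retry < attempts - 1; retry++) {
    total += Math.min(initialDelayMs * 2 ** retry, maxDelayMs);
  }
  return total;
}

// 3 attempts x 10s timeout, plus waits of 100ms and 200ms.
const pongBudgetMs = worstCaseMs(3, 10_000, 100, 2_000);

// Assumed stale window from the PR description (45s).
const STALE_WINDOW_MS = 45_000;
const fitsInWindow = pongBudgetMs < STALE_WINDOW_MS;
```

With these numbers the budget is 10 + 0.1 + 10 + 0.2 + 10 = 30.3s, which leaves some margin under 45s. Raising the timeout or the attempt count would shrink that margin quickly.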
Changes
- packages/ui/src/lib/retry-utils.ts (new): Reusable retry utility with timeoutMs and shouldRetry predicate support
- packages/ui/src/lib/server-events.ts: Updated pong handler to use bounded timeout + selective retry
Verification
- npm run build:ui ✅
- Pong failed after retries instead of single immediate failure
- ~/.codenomad/logs/sse-monitor.log shows PONG_OK / PONG_FAIL / STALE events for ongoing monitoring