Skip to content

Add single-flight polling and request timeouts for bridge API#56

Open
altantutar wants to merge 4 commits into
mainfrom
codex/issue-44-polling-timeouts
Open

Add single-flight polling and request timeouts for bridge API#56
altantutar wants to merge 4 commits into
mainfrom
codex/issue-44-polling-timeouts

Conversation

@altantutar

@altantutar altantutar commented Feb 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • add request timeouts to bridge API calls with retry/backoff for transient GET failures
  • switch OpenClaw polling from interval-based async calls to single-flight scheduled polling
  • add failure backoff for polling loop so service recovers cleanly after transient outages
  • add tests for API retry/timeout behavior and non-overlapping polling guarantees

Linked Issue

Closes #44

Validation

  • bun test
  • bunx tsc --noEmit

Summary by CodeRabbit

  • New Features

    • Retry logic with exponential backoff for transient API failures
    • Configurable timeout management for API requests
    • Polling revamped to use backoff and single-flight protection to avoid overlapping requests
  • Tests

    • Tests for retry behavior, timeout handling, polling behavior, and recovery from transient failures
  • Chores

    • Added .dmux/ to .gitignore

@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello @altantutar, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the stability and efficiency of the Bridge API and the OpenClaw polling service. By introducing request timeouts and intelligent retry mechanisms, the system is now more resilient to network issues and temporary API unavailability. The polling service has been optimized to prevent redundant requests and to recover smoothly from failures, ensuring a more robust and responsive user experience.

Highlights

  • API Request Robustness: Implemented request timeouts for all Bridge API calls and added retry logic with exponential backoff for transient GET failures, enhancing the reliability of API interactions.
  • OpenClaw Polling Mechanism: Refactored the OpenClaw polling service to use a single-flight scheduled polling approach, preventing overlapping requests and incorporating failure backoff to gracefully handle transient outages.
  • New Test Coverage: Added comprehensive tests to validate the new API retry/timeout behavior and to ensure the non-overlapping guarantees of the OpenClaw polling mechanism.
Changelog
  • src/bridge/tests/api-client.test.ts
    • Imported apiGet and apiPost for testing.
    • Added setup and teardown logic to manage SYNAPSE_BRIDGE_API_TIMEOUT_MS environment variable.
    • Introduced a test case to verify transient GET failures are retried with backoff.
    • Added a test case to confirm request timeouts are applied to POST calls.
  • src/bridge/tests/openclaw-service.test.ts
    • Added a new test file for the OpenClaw service.
    • Included tests to ensure polling jobs do not overlap when API responses are slow.
    • Added tests to verify the service recovers from transient API failures and continues polling.
  • src/bridge/api-client.ts
    • Defined constants for default request timeout, GET retries, and retryable HTTP status codes.
    • Implemented helper functions getRequestTimeoutMs, isRetryableFetchError, isTimeoutError, backoffDelayMs, and sleep.
    • Introduced a new requestJson function to centralize API call logic, including timeouts, retries, and exponential backoff.
    • Refactored apiGet to utilize requestJson with default GET retry settings.
    • Refactored apiPost to utilize requestJson with timeouts and no retries by default for non-idempotent operations.
  • src/bridge/openclaw-service.ts
    • Updated the ServiceState interface to include timerHandle, pollInFlight, and consecutiveFailures.
    • Replaced intervalHandle with timerHandle for scheduled polling.
    • Added MAX_POLL_BACKOFF_MS constant for limiting exponential backoff.
    • Implemented computeNextPollDelayMs to calculate delay with exponential backoff based on consecutive failures.
    • Created scheduleNextPoll to manage the scheduling of the next poll cycle.
    • Introduced runPollCycle to enforce single-flight polling, handle errors, and manage failure backoff.
    • Modified the start function to initiate polling using scheduleNextPoll instead of setInterval.
    • Updated the stop function to clear the timerHandle and reset new state variables.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai Bot commented Feb 20, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@serrrfirat has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 9 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Implements single-flight, timer-based polling with backoff in the OpenClaw service and adds request timeouts, retry/backoff, and centralized request logic in the bridge API client; tests added for client retry/timeout behavior and service polling correctness.

Changes

Cohort / File(s) Summary
API Client Core
src/bridge/api-client.ts
Added requestJson helper with AbortSignal timeouts, retry/backoff (exponential, capped), retryable-status classification, utilities (sleep, backoffDelayMs), and exported apiGet/apiPost delegating to it; introduces configurable timeout and default retry counts.
Service Polling
src/bridge/openclaw-service.ts
Replaced interval-based polling with single-shot scheduling and epoch-based lifecycle. Added pollInFlight, consecutiveFailures, epoch, adaptive backoff (jittered, capped), scheduleNextPoll, and runPollCycle. getServiceState() now exposes consecutiveFailures.
Tests
src/bridge/__tests__/api-client.test.ts, src/bridge/__tests__/openclaw-service.test.ts
New tests: API client retry and timeout behaviors; OpenClaw service tests for preventing overlapping polls, backoff/recovery from transient failures, and notification emission. Preserve/restore env and config in test lifecycle; add helper waitUntil.
Repository Ignore
.gitignore
Added .dmux/ to ignore rules.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant apiGet
    participant requestJson as RequestJson
    participant Server
    participant Backoff as BackoffLogic

    Client->>apiGet: apiGet(url)
    apiGet->>requestJson: requestJson(url, maxRetries=2, timeout)
    
    rect rgba(100,150,200,0.5)
    Note over RequestJson,Server: Attempt 1 (503)
    RequestJson->>Server: fetch (with timeout)
    Server-->>RequestJson: 503
    RequestJson->>Backoff: backoffDelayMs(0)
    Backoff-->>RequestJson: delay
    end

    rect rgba(150,100,200,0.5)
    Note over RequestJson,Server: Attempt 2 (retry)
    RequestJson->>RequestJson: sleep(delay)
    RequestJson->>Server: fetch (with timeout)
    Server-->>RequestJson: 200 + data
    RequestJson-->>apiGet: data
    apiGet-->>Client: data
    end
Loading
sequenceDiagram
    participant Service
    participant Scheduler as Scheduler
    participant PollGuard as pollInFlight
    participant API
    participant Backoff as BackoffScheduler

    Service->>Service: start()
    Service->>Scheduler: scheduleNextPoll(delay=0)

    rect rgba(100,150,200,0.5)
    Note over Service,API: Poll `#1` (executes)
    Scheduler->>PollGuard: check/set pollInFlight
    PollGuard-->>Service: in-flight set
    Service->>API: pollForNewInsights()
    API-->>Service: success
    Service->>PollGuard: clear pollInFlight
    Service->>Backoff: computeNextPollDelayMs(0)
    Backoff-->>Scheduler: schedule(delay)
    end

    rect rgba(150,100,200,0.5)
    Note over Service,API: Overlap avoided
    Scheduler->>PollGuard: check pollInFlight
    PollGuard-->>Scheduler: true (skip)
    Scheduler->>Scheduler: reschedule with backoff
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

Hops through timeouts, backoff in my paws, 🐇
One poll at a time, no overlapping flaws,
Retries that bounce until servers agree,
Epochs and guards keep the schedule free,
I celebrate with carrot-shaped glee! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding single-flight polling and request timeouts to the bridge API, which are the primary objectives of this PR.
Linked Issues check ✅ Passed All acceptance criteria from issue #44 are met: single-flight polling implemented via epoch mechanism and runPollCycle [openclaw-service.ts], request timeouts via AbortSignal.timeout in requestJson [api-client.ts], and retry/backoff for transient failures [api-client.ts, openclaw-service.ts with computeNextPollDelayMs].
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objectives: API client timeout/retry logic, single-flight polling mechanism, service backoff recovery, comprehensive tests for both behaviors, and a minor .gitignore update for local data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch codex/issue-44-polling-timeouts

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements to the API client and the OpenClaw polling service. The API client now includes robust request timeouts and retry mechanisms with exponential backoff for GET requests, enhancing resilience against transient network issues. For POST requests, retries are correctly disabled by default to prevent unintended side effects. The OpenClaw service has been refactored to use a single-flight polling mechanism, which prevents overlapping poll jobs and incorporates exponential backoff for polling intervals after failures, ensuring graceful recovery from outages. Comprehensive tests have been added to validate these new behaviors, demonstrating a strong focus on correctness and reliability. Overall, these changes greatly improve the stability and efficiency of the bridge API interactions and the background polling service.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/bridge/api-client.ts (1)

40-75: Verify AbortSignal.timeout support in your target runtime.
If Node/Bun versions vary, this can throw at runtime. Consider a small fallback for older runtimes.

💡 Optional fallback example
-      const res = await fetch(url, {
-        ...init,
-        signal: AbortSignal.timeout(opts.timeoutMs),
-      });
+      const signal =
+        typeof AbortSignal !== 'undefined' && 'timeout' in AbortSignal
+          ? AbortSignal.timeout(opts.timeoutMs)
+          : createTimeoutSignal(opts.timeoutMs);
+      const res = await fetch(url, {
+        ...init,
+        signal,
+      });
function createTimeoutSignal(timeoutMs: number): AbortSignal {
  const controller = new AbortController();
  setTimeout(() => controller.abort(), timeoutMs);
  return controller.signal;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/api-client.ts` around lines 40 - 75, The code uses
AbortSignal.timeout inside requestJson which may not exist in older Node/Bun
runtimes; replace direct AbortSignal.timeout usage with a small helper (e.g.,
createTimeoutSignal) that returns an AbortSignal using AbortController and
setTimeout fallback when AbortSignal.timeout is unavailable, and use that helper
in requestJson when building the fetch options; implement the helper near
requestJson and reference it in the fetch call, ensuring the fallback signal
semantics match the original timeout behavior.
src/bridge/__tests__/openclaw-service.test.ts (1)

29-32: Consider resetting the notify callback between tests.
This avoids cross-test leakage if other suites rely on a null/no-op notify handler.

🔧 Optional isolation tweak
 afterEach(() => {
   stop();
   resetCounter();
+  setNotifyFn(() => {});
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/__tests__/openclaw-service.test.ts` around lines 29 - 32, The
tests call stop() and resetCounter() in afterEach but don't reset the notify
callback, which can leak state across tests; update the afterEach hook to also
clear or restore the notify handler — e.g., call resetNotify() or
setNotify(null)/setNotify(() => {}) (or the module's restoreDefaultNotify
function) alongside stop() and resetCounter() so the notify callback is returned
to a no-op/default between tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/bridge/__tests__/openclaw-service.test.ts`:
- Around line 29-32: The tests call stop() and resetCounter() in afterEach but
don't reset the notify callback, which can leak state across tests; update the
afterEach hook to also clear or restore the notify handler — e.g., call
resetNotify() or setNotify(null)/setNotify(() => {}) (or the module's
restoreDefaultNotify function) alongside stop() and resetCounter() so the notify
callback is returned to a no-op/default between tests.

In `@src/bridge/api-client.ts`:
- Around line 40-75: The code uses AbortSignal.timeout inside requestJson which
may not exist in older Node/Bun runtimes; replace direct AbortSignal.timeout
usage with a small helper (e.g., createTimeoutSignal) that returns an
AbortSignal using AbortController and setTimeout fallback when
AbortSignal.timeout is unavailable, and use that helper in requestJson when
building the fetch options; implement the helper near requestJson and reference
it in the fetch call, ensuring the fallback signal semantics match the original
timeout behavior.

@serrrfirat serrrfirat left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: REQUEST CHANGES

Blocking Issues

  1. MEDIUM: stop() + start() race conditionstop() resets pollInFlight = false immediately, but in-flight HTTP request still runs. Its finally block calls scheduleNextPoll() which, if start() was called again, sees state.running = true and creates a duplicate scheduling chain.

  2. MEDIUM: TypeError too broad a retryable error classificationTypeError from fetch() is retryable (DNS/network), but TypeError from programming bugs (property access on undefined) gets retried too, masking bugs and wasting time.

  3. MEDIUM: Timing-sensitive test assertions — Tests rely on await wait(260) and await wait(900) to observe poll cycles. Flaky under CI load. Use event-driven synchronization instead.

Non-Blocking Issues

  • AbortSignal.timeout() overrides caller-supplied signal (use AbortSignal.any())
  • AbortError classified as both retryable AND timeout — correct for now but fragile
  • getServiceState() doesn't expose consecutiveFailures for debugging
  • No jitter in backoff delays (thundering herd risk)
  • Test files mutate global ~/.synapse/bridge.json — should use temp directory
  • Missing Content-Type validation on response before res.json()
  • No test for exhausted retries, non-retryable status codes, or stop() during in-flight poll

state.pollInFlight = true;
try {
await pollForNewInsights();
state.consecutiveFailures = 0;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — Race condition on rapid stop()/start() cycle

If stop() is called while pollForNewInsights() is in-flight, then start() is called immediately, the OLD runPollCycle() still holds a reference to the shared state object. When the old poll resolves:

  1. Its finally { state.pollInFlight = false } clobbers the NEW cycle's in-flight flag
  2. Its if (!state.running) return check passes (because the new start() set running = true)
  3. It calls scheduleNextPoll() — scheduling a rogue extra poll

This can cause concurrent overlapping polls — the exact thing the single-flight guard is meant to prevent.

Trace:

1. start() → running=true, schedules poll
2. runPollCycle() → pollInFlight=true, awaits pollForNewInsights()
3. stop()  → running=false, pollInFlight=false, timerHandle cleared
4. start() → running=true, pollInFlight=false, schedules NEW poll
5. New runPollCycle() → pollInFlight=true, awaits pollForNewInsights()
6. OLD pollForNewInsights() from step 2 resolves
7. OLD finally: pollInFlight=false  ← CLOBBERS step 5's flag
8. OLD check: if(!state.running) — TRUE now, so CONTINUES
9. OLD scheduleNextPoll() → rogue 3rd poll

Suggested fix: Add a generation counter — increment in start(), capture in runPollCycle(), bail if it changed after each await:

let pollGeneration = 0;

export async function start(): Promise<void> {
  // ... existing checks ...
  pollGeneration += 1;
  // ... rest of start ...
}

async function runPollCycle(): Promise<void> {
  const myGeneration = pollGeneration;
  // ... poll logic ...
  if (pollGeneration !== myGeneration) return;
  scheduleNextPoll(nextDelay);
}

Comment thread src/bridge/api-client.ts
...init,
signal: AbortSignal.timeout(opts.timeoutMs),
});

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — 429 retries ignore Retry-After header

When the server returns 429 (Too Many Requests), the code retries with its own exponential backoff but ignores the server's Retry-After header directive. This can cause the client to retry more aggressively than the server allows, worsening rate-limit pressure and potentially causing the server to escalate enforcement.

Suggested fix: Parse Retry-After and use it as a floor:

if (attempt < opts.maxRetries && RETRYABLE_STATUS_CODES.has(res.status)) {
  let delay = backoffDelayMs(attempt + 1);
  
  // Respect server's Retry-After for 429s
  const retryAfter = res.headers.get('Retry-After');
  if (retryAfter && res.status === 429) {
    const serverDelay = Number(retryAfter);
    if (Number.isFinite(serverDelay) && serverDelay > 0) {
      delay = Math.max(delay, serverDelay * 1000);
    }
  }
  
  await sleep(delay);
  continue;
}

Comment thread src/bridge/api-client.ts

throw error;
} catch (error) {
if (attempt < opts.maxRetries && isRetryableFetchError(error)) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low Severity — Silent retries with no logging

When requestJson retries after transient failures (network errors, 5xx, timeouts), there is no log output. Operators debugging intermittent issues won't know retries are happening or how many attempts were needed before success/failure.

Suggested fix: Add a warning log before each retry:

if (attempt < opts.maxRetries && RETRYABLE_STATUS_CODES.has(res.status)) {
  console.warn(`[synapse] Retrying ${init.method ?? 'GET'} ${path} (attempt ${attempt + 1}/${opts.maxRetries}): HTTP ${res.status}`);
  await sleep(backoffDelayMs(attempt + 1));
  continue;
}

Same for the catch block handling fetch errors.


if (url.pathname === '/api/insights') {
return Response.json([{ id: 0 }]);
}

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low Severity — Timing-dependent tests may be flaky

Both tests rely on fixed await wait(260) and await wait(900) for polls to complete. On slow CI runners or under load, these could fail if polls haven't completed within the window.

Suggested fix: Consider polling a condition in a tight loop with a timeout ceiling instead of fixed sleeps:

async function waitUntil(predicate: () => boolean, timeoutMs = 2000, intervalMs = 10): Promise<void> {
  const start = Date.now();
  while (!predicate()) {
    if (Date.now() - start > timeoutMs) throw new Error('waitUntil timed out');
    await wait(intervalMs);
  }
}

// Usage:
await waitUntil(() => pollCalls >= 3, 2000);

This makes tests faster when things work and more robust when the system is slow.

Comment thread src/bridge/api-client.ts
return error.name === 'AbortError' || error.name === 'TimeoutError' || error.name === 'TypeError';
}

function isTimeoutError(error: unknown): boolean {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — backoffDelayMs base of 200ms may be too aggressive for 429

For server errors like 503 (temporarily overloaded), retrying after 200ms is reasonable. But for 429 (rate-limited), the first retry at 200ms is likely to immediately re-hit the rate limit.

This is partially addressed by respecting the Retry-After header (see other comment). If that fix is adopted, this becomes moot since the server-provided delay will override the 200ms floor.

serrrfirat and others added 2 commits February 21, 2026 08:15
…ests

- Add epoch counter to openclaw-service to prevent duplicate scheduling
  chains when stop()+start() is called while a poll is in-flight
- Narrow TypeError retry classification in api-client to only retry
  fetch-related TypeErrors (DNS/network), not programming bugs
- Replace timing-sensitive test waits with condition-based polling
  (waitUntil) to eliminate CI flakiness
- Use AbortSignal.any() to combine caller-supplied and timeout signals
- Expose consecutiveFailures in getServiceState()
- Add jitter (up to 10%) to backoff delays to prevent thundering herd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
src/bridge/__tests__/openclaw-service.test.ts (2)

171-173: Consider clearing the notify function in afterEach to avoid cross-test pollution.

If tests are added that don't set their own notifyFn, they could inadvertently use the one set by this test. While stop() doesn't clear state.notify, it might be worth resetting in cleanup.

💡 Optional: Clear notifyFn in afterEach
 afterEach(() => {
   stop();
   resetCounter();
+  setNotifyFn(() => {}); // Clear any test-specific notify function
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/__tests__/openclaw-service.test.ts` around lines 171 - 173, The
test sets a global notify function via setNotifyFn which can leak into other
tests; add cleanup in the test file's afterEach to reset state.notify (or call
setNotifyFn(undefined)) so notifyFn is cleared between tests—locate where
setNotifyFn is used in this spec and ensure the afterEach resets state.notify
(or invokes setNotifyFn with a noop/undefined) to avoid cross-test pollution
without changing stop().

115-188: Consider adding assertion for consecutiveFailures state.

The test verifies recovery behavior but doesn't assert that consecutiveFailures resets to 0 after successful recovery. Since getServiceState() now exposes this field, it would strengthen the test.

💡 Optional: Assert consecutiveFailures resets after recovery
       expect(pollCalls).toBeGreaterThanOrEqual(3);
       expect(getServiceState().lastSeenInsightId).toBe(1);
       expect(notifications.length).toBeGreaterThan(0);
+      expect(getServiceState().consecutiveFailures).toBe(0);
     } finally {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/__tests__/openclaw-service.test.ts` around lines 115 - 188, The
test "recovers from transient API failures and continues polling" should also
assert that the service cleared its failure counter after recovery: call
getServiceState() after the wait and include an assertion that
getServiceState().consecutiveFailures === 0 (e.g., add
expect(getServiceState().consecutiveFailures).toBe(0)) to ensure
consecutiveFailures is reset when the /api/insights/new request finally
succeeds; place this assertion alongside the existing expects after stop() and
before server.stop() in the test.
src/bridge/openclaw-service.ts (1)

63-94: Potential edge case: pollInFlight guard reschedules with base interval instead of backoff.

When pollInFlight is true (line 67-70), the function schedules the next poll using getBridgeConfig().openclawPollingIntervalMs (base interval) rather than applying any backoff based on consecutiveFailures. This could lead to rapid retry scheduling if the in-flight poll is taking a long time due to failures.

However, this edge case is unlikely in practice since the in-flight poll would need to be genuinely stuck for another timer to fire. The current behavior is acceptable as a defensive fallback.

💡 Optional: Use backoff delay even when deferring due to in-flight poll
   if (state.pollInFlight) {
-    scheduleNextPoll(getBridgeConfig().openclawPollingIntervalMs);
+    const config = getBridgeConfig();
+    const deferDelay = computeNextPollDelayMs(config.openclawPollingIntervalMs, state.consecutiveFailures);
+    scheduleNextPoll(deferDelay);
     return;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/openclaw-service.ts` around lines 63 - 94, The pollInFlight branch
in runPollCycle currently calls
scheduleNextPoll(getBridgeConfig().openclawPollingIntervalMs) which ignores
exponential backoff based on state.consecutiveFailures; change the branch to
compute the nextDelay using
computeNextPollDelayMs(getBridgeConfig().openclawPollingIntervalMs,
state.consecutiveFailures) and pass that to scheduleNextPoll so deferred
scheduling due to state.pollInFlight respects backoff; keep the single-flight
guard and state checks otherwise unchanged.
src/bridge/api-client.ts (1)

59-108: Consider handling body consumption on non-retryable errors to avoid resource leaks.

When a non-retryable error status is received (line 92), the response body has been read via res.text(). However, if the request is retried and eventually fails after all retries (reaching line 107), the error thrown is generic and loses context from the last attempt.

More importantly, if res.json() throws (e.g., invalid JSON on a 200 response), the error flows to the catch block and may be incorrectly classified. Consider wrapping res.json() in its own try-catch to distinguish JSON parse errors from fetch errors.

💡 Optional: Distinguish JSON parse errors
       if (res.ok) {
-        return res.json() as Promise<T>;
+        try {
+          return await res.json() as T;
+        } catch (parseError) {
+          throw new Error(`API returned invalid JSON: ${parseError}`);
+        }
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/api-client.ts` around lines 59 - 108, The requestJson function
currently lets res.json() errors bubble into the outer catch and be
misclassified as fetch/timeout errors and loses the last-response context; wrap
the successful-200 parsing in its own try-catch around res.json() inside
requestJson so JSON parse errors are converted into a distinct, non-retryable
Error (e.g., "JSON parse error" including the response body or raw text) and
rethrown, and when handling non-2xx responses keep the last response body
attached to the thrown Error (include status and body from res.text()) so the
final "API request failed" path preserves the last attempt's details; ensure
these parse errors are not treated as retryable by isRetryableFetchError so they
won't trigger retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/bridge/api-client.ts`:
- Around line 94-97: The retry logic erroneously retries on any AbortError
because isRetryableFetchError treats all aborts the same; update the catch block
around the fetch retry (where AbortSignal.any() is used) to check the specific
signals so you only retry when the timeoutSignal fired (e.g.
timeoutSignal.aborted) and not when the caller's init.signal was explicitly
aborted (e.g. init.signal?.aborted), keeping the existing checks for attempt <
opts.maxRetries and isRetryableFetchError(error) and using
backoffDelayMs(attempt + 1) to sleep before continue.

---

Nitpick comments:
In `@src/bridge/__tests__/openclaw-service.test.ts`:
- Around line 171-173: The test sets a global notify function via setNotifyFn
which can leak into other tests; add cleanup in the test file's afterEach to
reset state.notify (or call setNotifyFn(undefined)) so notifyFn is cleared
between tests—locate where setNotifyFn is used in this spec and ensure the
afterEach resets state.notify (or invokes setNotifyFn with a noop/undefined) to
avoid cross-test pollution without changing stop().
- Around line 115-188: The test "recovers from transient API failures and
continues polling" should also assert that the service cleared its failure
counter after recovery: call getServiceState() after the wait and include an
assertion that getServiceState().consecutiveFailures === 0 (e.g., add
expect(getServiceState().consecutiveFailures).toBe(0)) to ensure
consecutiveFailures is reset when the /api/insights/new request finally
succeeds; place this assertion alongside the existing expects after stop() and
before server.stop() in the test.

In `@src/bridge/api-client.ts`:
- Around line 59-108: The requestJson function currently lets res.json() errors
bubble into the outer catch and be misclassified as fetch/timeout errors and
loses the last-response context; wrap the successful-200 parsing in its own
try-catch around res.json() inside requestJson so JSON parse errors are
converted into a distinct, non-retryable Error (e.g., "JSON parse error"
including the response body or raw text) and rethrown, and when handling non-2xx
responses keep the last response body attached to the thrown Error (include
status and body from res.text()) so the final "API request failed" path
preserves the last attempt's details; ensure these parse errors are not treated
as retryable by isRetryableFetchError so they won't trigger retries.

In `@src/bridge/openclaw-service.ts`:
- Around line 63-94: The pollInFlight branch in runPollCycle currently calls
scheduleNextPoll(getBridgeConfig().openclawPollingIntervalMs) which ignores
exponential backoff based on state.consecutiveFailures; change the branch to
compute the nextDelay using
computeNextPollDelayMs(getBridgeConfig().openclawPollingIntervalMs,
state.consecutiveFailures) and pass that to scheduleNextPoll so deferred
scheduling due to state.pollInFlight respects backoff; keep the single-flight
guard and state checks otherwise unchanged.

Comment thread src/bridge/api-client.ts
Comment on lines +94 to +97
if (attempt < opts.maxRetries && isRetryableFetchError(error)) {
await sleep(backoffDelayMs(attempt + 1));
continue;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the api-client.ts file to understand the implementation
cat -n src/bridge/api-client.ts | head -100

Repository: serrrfirat/synapse-daemon

Length of output: 3861


🌐 Web query:

How to distinguish AbortSignal.timeout from user-initiated AbortSignal in JavaScript?

💡 Result:

AbortSignal.timeout(ms) is specified to abort with signal.reason set to a DOMException named "TimeoutError". (dom.spec.whatwg.org)

So the robust way to distinguish “timed out” vs “user/manual abort” is: check the abort reason, not just signal.aborted.

1) If you have access to the signal

if (signal.aborted) {
  if (signal.reason?.name === "TimeoutError") {
    // timeout
  } else {
    // user/manual abort (or other reason)
  }
}

AbortController.abort(reason) stores your reason on signal.reason; if you pass nothing, the reason becomes an "AbortError" DOMException. (dom.spec.whatwg.org)

2) When using fetch(...) (common gotcha)

Depending on the environment, fetch may reject with an "AbortError" even when the signal’s reason is a "TimeoutError". The reliable test is still the signal’s reason (keep a reference to the signal you passed to fetch). (github.com)

const timeoutSignal = AbortSignal.timeout(5000);

try {
  await fetch(url, { signal: timeoutSignal });
} catch (e) {
  if (timeoutSignal.reason?.name === "TimeoutError") {
    // timed out
  } else {
    // aborted for some other reason
  }
}

3) If you combine signals (user + timeout)

Track which input signal fired:

const controller = new AbortController();
const timeoutSignal = AbortSignal.timeout(5000);
const signal = AbortSignal.any([controller.signal, timeoutSignal]);

// later, in catch / abort handler:
if (timeoutSignal.aborted) { /* timeout */ }
else if (controller.signal.aborted) { /* user/manual */ }

AbortSignal.any([...]) adopts the reason of whichever source signal aborted first. (dom.spec.whatwg.org)

Citations:


Retry behavior incorrectly handles caller-initiated AbortSignal.

The code combines caller and timeout signals via AbortSignal.any() (lines 71–73), but isRetryableFetchError treats all AbortError exceptions identically—including those from explicit caller cancellation. This will retry requests the caller explicitly aborted.

To fix, distinguish which signal fired using the references available in the catch block:

} catch (error) {
  if (attempt < opts.maxRetries && isRetryableFetchError(error)) {
    // Only retry if the timeout signal aborted, not the caller's signal
    if (timeoutSignal.aborted && !init.signal?.aborted) {
      await sleep(backoffDelayMs(attempt + 1));
      continue;
    }
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bridge/api-client.ts` around lines 94 - 97, The retry logic erroneously
retries on any AbortError because isRetryableFetchError treats all aborts the
same; update the catch block around the fetch retry (where AbortSignal.any() is
used) to check the specific signals so you only retry when the timeoutSignal
fired (e.g. timeoutSignal.aborted) and not when the caller's init.signal was
explicitly aborted (e.g. init.signal?.aborted), keeping the existing checks for
attempt < opts.maxRetries and isRetryableFetchError(error) and using
backoffDelayMs(attempt + 1) to sleep before continue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add single-flight polling and request timeouts in bridge API client/service

2 participants