perf(retry): add jitter to all retry/backoff delays#1354
perf(retry): add jitter to all retry/backoff delays#1354ariel-formance wants to merge 5 commits into
Conversation
Replace deterministic fixed waits with exponential+jitter (base + up to 50% random) across TooManyClient retries, replication pipeline push and pull error retries, manager sync period, and driver facade startup retries to desynchronize concurrent instances and prevent thundering herd.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds randomized jitter (about up to half the base interval) to retry and periodic timers in the ledger controller and replication subsystem; several period setters are clamped to a minimum of 2. No public APIs or signatures changed. ChangesRetry and Timer Jitter Additions
Sequence Diagram — DriverFacade startup retrysequenceDiagram
participant DriverFacade
participant Timer
participant Driver
DriverFacade->>Driver: Start()
Driver-->>DriverFacade: error (startup failure)
DriverFacade->>Timer: wait retryInterval + jitter
Timer-->>DriverFacade: timer fires
DriverFacade->>Driver: Start() (retry)
Sequence Diagram — Pipeline fetch and push-retrysequenceDiagram
participant Manager
participant Storage
participant PipelineHandler
participant TimerPull
participant Exporter
participant TimerPushRetry
Manager->>Storage: ListEnabledPipelines()
Storage-->>Manager: pipelines
Manager->>PipelineHandler: start fetch loop
PipelineHandler->>TimerPull: wait PullInterval + jitter
TimerPull-->>PipelineHandler: timer fires -> ListLogs
PipelineHandler->>Exporter: Accept(log)
Exporter-->>PipelineHandler: error
PipelineHandler->>TimerPushRetry: wait PushRetryPeriod + jitter
TimerPushRetry-->>PipelineHandler: retry Accept
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/ledger/controller_with_too_many_client_handling.go (1)
232-239:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard jitter bounds in
handleRetryto preventrand.Int63n(0/negative)panics
internal/controller/ledger/controller_with_too_many_client_handling.gochecks onlydelay == 0, butdelay == 1nsmakesint64(delay/2)==0, sorand.Int63n(0)panics (same for any negativedelay). The retry jitter needs adelay/2 > 0guard.💡 Proposed fix
delay := delayCalculator.Next(count) - if delay == 0 { + if delay <= 0 { return err } + wait := delay + if delay > 1 { + wait += time.Duration(rand.Int63n(int64(delay / 2))) + } select { case <-ctx.Done(): return ctx.Err() - case <-time.After(delay + time.Duration(rand.Int63n(int64(delay/2)))): + case <-time.After(wait): count++ span.SetAttributes(attribute.Int("retry", count)) continueSimilar unguarded
rand.Int63n(interval/2)jitter patterns exist ininternal/replication/pipeline.go,internal/replication/driver_facade.go, andinternal/replication/manager.go.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/ledger/controller_with_too_many_client_handling.go` around lines 232 - 239, The retry jitter uses rand.Int63n(delay/2) and can panic when delay <= 1ns (or negative); update handleRetry to compute a safe jitterBound = int64(delay/2) and only call rand.Int63n(jitterBound) when jitterBound > 0 (otherwise use zero jitter or a fixed minimal sleep), and replace the current time.After(delay + time.Duration(rand.Int63n(int64(delay/2)))) with time.After(delay + time.Duration(jitter)) where jitter is guarded. Apply the same safe-guarded pattern to the similar jitter uses in internal/replication/pipeline.go, internal/replication/driver_facade.go, and internal/replication/manager.go (look for rand.Int63n(interval/2) occurrences).
🧹 Nitpick comments (1)
internal/replication/driver_facade.go (1)
45-45: ⚡ Quick winRetry jitter
Int63npanics only ifretryInterval <= 0; current code path is safe
rand.Int63n(int64(c.retryInterval/2))would panic whenc.retryInterval/2 <= 0(atinternal/replication/driver_facade.go:45), butDriverFacadeis only constructed ininternal/replication/manager.gowith2*time.Second, so this should not occur today. Add a small validation/guard innewDriverFacadeto prevent future regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/replication/driver_facade.go` at line 45, The retry jitter expression can panic if retryInterval/2 <= 0; add a guard in newDriverFacade to validate and normalize the provided retryInterval (e.g., if c.retryInterval <= 0 then set it to a sensible default like 2*time.Second or return an error) before constructing the DriverFacade so that the field used by the select case (retryInterval) is guaranteed > 0; update newDriverFacade to enforce this invariant for DriverFacade.retryInterval to prevent rand.Int63n panics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/replication/manager.go`:
- Line 278: The sync timer jitter can panic when m.syncPeriod/2 <= 0; update the
code that sets m.syncPeriod (the WithSyncPeriod method and any place using
cfg.SyncPeriod) to validate and clamp the provided period to a safe positive
minimum (e.g., enforce m.syncPeriod > 0), and ensure the jitter calculation
before time.After uses a non‑zero positive value (compute jitter using a
max(minJitter, m.syncPeriod/2) or otherwise ensure the argument to rand.Int63n
is > 0) so the time.After expression with rand.Int63n cannot panic.
---
Outside diff comments:
In `@internal/controller/ledger/controller_with_too_many_client_handling.go`:
- Around line 232-239: The retry jitter uses rand.Int63n(delay/2) and can panic
when delay <= 1ns (or negative); update handleRetry to compute a safe
jitterBound = int64(delay/2) and only call rand.Int63n(jitterBound) when
jitterBound > 0 (otherwise use zero jitter or a fixed minimal sleep), and
replace the current time.After(delay +
time.Duration(rand.Int63n(int64(delay/2)))) with time.After(delay +
time.Duration(jitter)) where jitter is guarded. Apply the same safe-guarded
pattern to the similar jitter uses in internal/replication/pipeline.go,
internal/replication/driver_facade.go, and internal/replication/manager.go (look
for rand.Int63n(interval/2) occurrences).
---
Nitpick comments:
In `@internal/replication/driver_facade.go`:
- Line 45: The retry jitter expression can panic if retryInterval/2 <= 0; add a
guard in newDriverFacade to validate and normalize the provided retryInterval
(e.g., if c.retryInterval <= 0 then set it to a sensible default like
2*time.Second or return an error) before constructing the DriverFacade so that
the field used by the select case (retryInterval) is guaranteed > 0; update
newDriverFacade to enforce this invariant for DriverFacade.retryInterval to
prevent rand.Int63n panics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1988a191-aab6-4038-90dd-bf55e452eb2e
📒 Files selected for processing (4)
internal/controller/ledger/controller_with_too_many_client_handling.gointernal/replication/driver_facade.gointernal/replication/manager.gointernal/replication/pipeline.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1354 +/- ##
==========================================
+ Coverage 80.36% 81.20% +0.83%
==========================================
Files 205 205
Lines 11104 11109 +5
==========================================
+ Hits 8924 9021 +97
+ Misses 1585 1550 -35
+ Partials 595 538 -57 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…gative periods Guard all four jitter sites against rand.Int63n panicking when the base duration divided by two rounds to zero or is negative. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gfyrag
left a comment
There was a problem hiding this comment.
the jitter is legit on the retry following too many clients, but the orh
… retry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/replication/manager_test.go`:
- Around line 50-58: The test TestNewDriverFacadeMinimumRetryInterval asserts
the wrong time unit for facade.retryInterval; change the expectation from
time.Duration(2) (2 nanoseconds) to 2 * time.Second so it matches the intended
minimum retry interval used by newDriverFacade and the retryInterval field.
- Around line 42-48: The test TestWithSyncPeriodMinimum asserts the wrong unit
(2 nanoseconds) for the minimum sync period; update the assertion to expect 2
seconds instead. Locate the test function TestWithSyncPeriodMinimum, which
constructs m := &Manager{} and calls WithSyncPeriod(1)(m), and change the
require.Equal call to compare m.syncPeriod with time.Duration(2)*time.Second (or
simply 2*time.Second) so the minimum syncPeriod is asserted in seconds rather
than nanoseconds.
In `@internal/replication/pipeline_test.go`:
- Around line 48-54: The test TestWithPushRetryPeriodMinimum asserts the wrong
time unit (2 nanoseconds) for the minimum push retry period; update the
expectation to use seconds by asserting config.PushRetryPeriod equals 2 *
time.Second after calling WithPushRetryPeriod(1) on PipelineHandlerConfig so the
test checks a 2-second minimum instead of time.Duration(2); locate the assertion
in TestWithPushRetryPeriodMinimum and replace time.Duration(2) with 2 *
time.Second referencing PipelineHandlerConfig, WithPushRetryPeriod and
PushRetryPeriod.
- Around line 40-46: The test TestWithPullPeriodMinimum is asserting
time.Duration(2) (2 nanoseconds) instead of a 2-second minimum; update the
assertion to compare config.PullInterval against 2 * time.Second so the
PipelineHandlerConfig minimum pull interval set by WithPullPeriod is validated
in seconds (e.g., require.Equal(t, 2 * time.Second, config.PullInterval)),
keeping references to TestWithPullPeriodMinimum, PipelineHandlerConfig,
WithPullPeriod and config.PullInterval to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a1ddceb2-e7be-4327-933b-64421aace7f4
📒 Files selected for processing (3)
internal/controller/ledger/controller_with_too_many_client_handling_test.gointernal/replication/manager_test.gointernal/replication/pipeline_test.go
| func TestWithPullPeriodMinimum(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| config := PipelineHandlerConfig{} | ||
| WithPullPeriod(1)(&config) | ||
| require.Equal(t, time.Duration(2), config.PullInterval) | ||
| } |
There was a problem hiding this comment.
Incorrect time unit in assertion.
The test expects time.Duration(2), which equals 2 nanoseconds. For a pull period minimum, this should be 2 * time.Second. A 2-nanosecond minimum makes no practical sense in this polling context.
🐛 Proposed fix
func TestWithPullPeriodMinimum(t *testing.T) {
t.Parallel()
config := PipelineHandlerConfig{}
WithPullPeriod(1)(&config)
- require.Equal(t, time.Duration(2), config.PullInterval)
+ require.Equal(t, 2*time.Second, config.PullInterval)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func TestWithPullPeriodMinimum(t *testing.T) { | |
| t.Parallel() | |
| config := PipelineHandlerConfig{} | |
| WithPullPeriod(1)(&config) | |
| require.Equal(t, time.Duration(2), config.PullInterval) | |
| } | |
| func TestWithPullPeriodMinimum(t *testing.T) { | |
| t.Parallel() | |
| config := PipelineHandlerConfig{} | |
| WithPullPeriod(1)(&config) | |
| require.Equal(t, 2*time.Second, config.PullInterval) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/replication/pipeline_test.go` around lines 40 - 46, The test
TestWithPullPeriodMinimum is asserting time.Duration(2) (2 nanoseconds) instead
of a 2-second minimum; update the assertion to compare config.PullInterval
against 2 * time.Second so the PipelineHandlerConfig minimum pull interval set
by WithPullPeriod is validated in seconds (e.g., require.Equal(t, 2 *
time.Second, config.PullInterval)), keeping references to
TestWithPullPeriodMinimum, PipelineHandlerConfig, WithPullPeriod and
config.PullInterval to locate the change.
| func TestWithPushRetryPeriodMinimum(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| config := PipelineHandlerConfig{} | ||
| WithPushRetryPeriod(1)(&config) | ||
| require.Equal(t, time.Duration(2), config.PushRetryPeriod) | ||
| } |
There was a problem hiding this comment.
Incorrect time unit in assertion.
The test expects time.Duration(2), which equals 2 nanoseconds. For a push retry period minimum, this should be 2 * time.Second. A 2-nanosecond minimum makes no practical sense in this retry context.
🐛 Proposed fix
func TestWithPushRetryPeriodMinimum(t *testing.T) {
t.Parallel()
config := PipelineHandlerConfig{}
WithPushRetryPeriod(1)(&config)
- require.Equal(t, time.Duration(2), config.PushRetryPeriod)
+ require.Equal(t, 2*time.Second, config.PushRetryPeriod)
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/replication/pipeline_test.go` around lines 48 - 54, The test
TestWithPushRetryPeriodMinimum asserts the wrong time unit (2 nanoseconds) for
the minimum push retry period; update the expectation to use seconds by
asserting config.PushRetryPeriod equals 2 * time.Second after calling
WithPushRetryPeriod(1) on PipelineHandlerConfig so the test checks a 2-second
minimum instead of time.Duration(2); locate the assertion in
TestWithPushRetryPeriodMinimum and replace time.Duration(2) with 2 * time.Second
referencing PipelineHandlerConfig, WithPushRetryPeriod and PushRetryPeriod.
What is orh? |
Add tests for four previously uncovered paths: - pipeline.go 104-106: stop signal received during fetch-retry wait - pipeline.go 137-139: stop signal received during push-retry wait - pipeline.go 140-141: push-retry timer fires, Accept succeeds on second attempt - manager.go 260-262: synchronizePipelines error logged, manager continues running
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/replication/manager_test.go`:
- Around line 73-90: The test is racing because Stop may run before Start is
actually entered; modify the mock Start expectation to signal when Start begins
(e.g., create a startedChan and in
mockDriver.EXPECT().Start(...).DoAndReturn(func(ctx context.Context) error {
startedChan <- struct{}{}; <-blockStart; return ctx.Err() }) ) then wait on that
startedChan after calling facade.Run() and before cancelling stopCtx/calling
facade.Stop so the test deterministically exercises stopping while the driver is
starting; keep the existing blockStart, facade.Run, and facade.Stop usage but
add the startedChan synchronization.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2a1a9470-9f66-4ed7-9c17-ae77090feb99
📒 Files selected for processing (2)
internal/replication/manager_test.gointernal/replication/pipeline_test.go
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 1 file(s) based on 1 unresolved review comment. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 1 file(s) based on 1 unresolved review comment. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Summary
base + rand(0, base/2)jitter across all retry and periodic-wait sitesTooManyClientDB retries, replication pipeline push-error and pull-error retries, replication manager sync period, driver facade startup retriesChanged files
internal/controller/ledger/controller_with_too_many_client_handling.gointernal/replication/pipeline.gointernal/replication/manager.gointernal/replication/driver_facade.goSkipped: cron-scheduled workers (naturally staggered), deadlock retry (intentionally immediate), client SDK (already has ±25% jitter).
Test plan
go test ./internal/...)