feat(sync): request-driven pre_confirmed polling #3694

Open
RafaelGranza wants to merge 6 commits into thiagodeev/feeder-preconfirmed-optimization from granza/preconfirmed-request-driven-polling

Conversation

@RafaelGranza
Contributor

@RafaelGranza RafaelGranza commented Jun 2, 2026

Summary

  • Each call to PreConfirmed() now triggers a fresh pre_confirmed fetch instead of waiting for the next ticker.
  • If a fetch is already running, extra refresh requests are dropped. A burst of RPC calls produces a single fetch, not a storm.
  • Each fetch is capped at 2s so a slow or hung feeder cannot block future fetches.
  • The --preconfirmed-poll-interval ticker is now a fallback when there is no RPC traffic to drive fetches.
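The four points above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's code: the `refresher` type, its fields, and `fetchOnce` are hypothetical names, and the real logic lives in the synchronizer.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// refresher is a hypothetical stand-in for the synchronizer's polling state.
type refresher struct {
	trigger  chan struct{} // buffer of 1: at most one pending wakeup
	fetching atomic.Bool   // running guard: true while a fetch is in flight
	fetches  atomic.Uint32 // number of fetches actually started
	timeout  time.Duration // per-fetch cap
	fetch    func(ctx context.Context) error
}

// requestRefresh asks for a fresh fetch without ever blocking the caller.
func (r *refresher) requestRefresh() {
	if r.fetching.Load() {
		return // a fetch is already running: drop the request
	}
	select {
	case r.trigger <- struct{}{}:
	default: // a wakeup is already pending: drop the request
	}
}

// run fetches on every trigger, and on every tick as a fallback.
func (r *refresher) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-r.trigger:
		case <-ticker.C:
		}
		r.fetchOnce(ctx)
	}
}

// fetchOnce holds the guard for the duration of one capped fetch.
func (r *refresher) fetchOnce(ctx context.Context) {
	r.fetching.Store(true)
	defer r.fetching.Store(false)
	r.fetches.Add(1)
	fetchCtx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	_ = r.fetch(fetchCtx) // errors are ignored in this sketch
}

func main() {
	started := make(chan struct{})
	release := make(chan struct{})
	r := &refresher{
		trigger: make(chan struct{}, 1),
		timeout: 2 * time.Second,
		fetch: func(ctx context.Context) error {
			started <- struct{}{} // signal that the fetch is in flight
			select {
			case <-release:
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		},
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go r.run(ctx, time.Hour) // long interval: the ticker never fires here

	r.requestRefresh()
	<-started
	for i := 0; i < 100; i++ {
		r.requestRefresh() // all dropped by the running guard
	}
	close(release)
	fmt.Println("fetches started:", r.fetches.Load()) // prints "fetches started: 1"
}
```

In this sketch a burst of 100 requests during an in-flight fetch starts no additional fetch, which is the behavior the second bullet describes.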

Benchmark

Numbers come from BenchmarkPreConfirmedUpdateFrequency: a 5-second simulation where one RPC call hits PreConfirmed() every 50ms (mimicking a client, or more than one, dictating a higher request frequency).

Run: go test -bench=PreConfirmedUpdateFrequency -run=^$ ./sync/

| feeder latency | baseline (ticker only) | request-driven | improvement |
| --- | --- | --- | --- |
| 50ms | 1.8 upd/s, mean 500ms | 10.8 upd/s, mean 91ms | 6x |
| 100ms | 1.8 upd/s, mean 500ms | 7.2 upd/s, mean 134ms | 4x |
| 200ms | 1.8 upd/s, mean 500ms | 4.2 upd/s, mean 228ms | 2.3x |

The running guard dropped 44-78% of triggers as duplicates, so request-driven polling did not hammer the feeder.

@RafaelGranza RafaelGranza self-assigned this Jun 2, 2026
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Claude finished @RafaelGranza's task in 5m 2s


PR Review

Reviewed against origin/thiagodeev/feeder-preconfirmed-optimization. Overall the design is sound — the fetch-once-per-trigger pattern with a running guard and per-fetch timeout is a reasonable shape for request-driven polling, and the new tests cover the trigger + storm cases well. Findings are mostly clarifications and trade-off questions, with one nit on test robustness.

Important / worth a closer look

  • sync/pending_polling.go:22 — Is the 2s per-fetch cap correctly sized vs the feeder's growing HTTP timeout + backoff? On a slow feeder, the cap can cancel a single attempt before any retry, which may be intentional but isn't documented. Consider scaling with preConfirmedPollInterval.
  • sync/pending_polling.go:319-325 — The drain-then-clear order in fetchPreConfirmed can silently swallow a freshness request that arrives in the window between the trigger read and Store(true). Self-heals on the next RPC, but worth documenting (or flipping the order).
  • sync/sync.go:611-614 — The requestPreConfirmedRefresh() call in PreConfirmed() is async — the current caller gets the stored value, not the fresh one. The PR description reads as if the caller gets the fresh fetch. Tighten the inline comment. Also: gate with if s.preConfirmedPollInterval > 0 (harmless today, but cleaner).

Nits

  • sync/pending_polling.go:279 — Ticker tick buffered during a fetch causes an immediate "free" second fetch when the loop re-enters the select. Pre-existing condition; could be drained in the same defer.
  • sync/pending_polling_test.go:476-478 — The time.Sleep(50ms) + strict equality is potentially flaky on slow CI. Prefer require.Never or synctest.

Looks good

  • Race fix in the test mock counters (atomic.Uint32) — nice catch.
  • Trigger storm test correctly waits for the in-flight fetch via Eventually before dispatching the storm.
  • pollPreConfirmed correctly handles targetBlockNum == 0 / not-at-tip by continuing after consuming the trigger; the channel buffer of 1 keeps wakeups bounded.
  • fetchPreConfirmed correctly uses context.WithTimeout rooted at the parent ctx, so shutdown propagates.
  • Usage string for --preconfirmed-poll-interval now matches the new behavior.

Inline comments posted on the specific lines.

Comment thread sync/pending_polling.go Outdated
preLatestCacheSize = 10
// Per-fetch cap so a hung feeder can't hold the running guard for the
// feeder client's full retry budget (~20s). TODO: consider exposing as flag.
preConfirmedFetchTimeout = 2 * time.Second
Contributor Author

Hardcoded for now. Worth exposing as a flag, or fine to keep at 2s?

Comment on lines +34 to +35
numCallsPreConfirmed atomic.Uint32
numCallsPending atomic.Uint32
Contributor Author

This is to avoid race conditions during tests

Comment thread sync/pending_polling.go Outdated
preLatestCacheSize = 10
// Per-fetch cap so a hung feeder can't hold the running guard for the
// feeder client's full retry budget (~20s). TODO: consider exposing as flag.
preConfirmedFetchTimeout = 2 * time.Second
Contributor

Question — 2s timeout vs single feeder attempt.

The feeder client (clients/feeder/feeder.go) sets HTTP timeouts that grow under load (mediumGrowThreshold / IncreaseTimeout). With preConfirmedPollInterval defaulting to 500ms, the cap of 2s is fine for the steady-state case, but a single HTTP attempt that's already paying an increased timeout (say >1s) plus the backoff before the retry will not complete inside 2s. The comment frames this as "can't hold the running guard for the feeder client's full retry budget (~20s)" — but in practice the cap also cancels any retry path, which means under a slow feeder we'll silently never recover within a single fetchPreConfirmed call. The next trigger/tick will restart the attempt from scratch.

That may be the intent (you'd rather retry-from-scratch on the next 500ms tick than wait), but worth being explicit in the comment, and worth confirming this is acceptable on a saturated/sluggish feeder. Two thoughts:

  • Consider tying the timeout to max(preConfirmedPollInterval*N, floor) so the cap scales with the configured cadence — a node configured with --preconfirmed-poll-interval=5s probably doesn't want a 2s cap.
  • Optionally surface as a flag (the existing TODO hints at this).

Contributor Author

IMO, this is NOT a problem, since the ticker will keep trying if the request failed.

Comment thread sync/pending_polling.go
Comment thread sync/sync.go Outdated
Comment thread sync/pending_polling_test.go Outdated
Comment thread sync/pending_polling.go
@codecov

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 95.12195% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.51%. Comparing base (a5e3520) to head (3049974).

Files with missing lines Patch % Lines
sync/pending_polling.go 94.87% 2 Missing ⚠️
Additional details and impacted files
@@                             Coverage Diff                              @@
##           thiagodeev/feeder-preconfirmed-optimization    #3694   +/-   ##
============================================================================
  Coverage                                        76.51%   76.51%           
============================================================================
  Files                                              402      402           
  Lines                                            36952    36940   -12     
============================================================================
- Hits                                             28273    28264    -9     
+ Misses                                            6697     6685   -12     
- Partials                                          1982     1991    +9     

@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Claude finished @RafaelGranza's task in 3m 13s


PR Review (incremental — only the new commit since prior review)

  • Check whether new commits have been pushed since the previous review
  • Re-read modified files
  • Identify any new findings
  • Post inline comment on new benchmark
  • Post final summary

Only one new commit since the prior review: 0dad90b — test(sync): benchmark the new pooling. It adds TestPreConfirmedUpdateFrequency in sync/pending_polling_test.go (a synctest-based instrumented benchmark comparing ticker-only vs request-driven polling across three fetch latencies). The production files (sync/pending_polling.go, sync/sync.go, cmd/juno/juno.go) are unchanged from the previous review — so the prior findings still apply, modulo the author's responses below.

New finding (this push)

Nit — sync/pending_polling_test.go:1119 — benchmark probe asserts nothing. Posted inline. Cheap to upgrade into a regression guard by bounding fetches for the request-driven scenario (when fetchLatency >= triggerInterval, the running guard must drop the bulk of triggers). Also a minor metrics-accuracy nit on emitted.Load() racing the trigger goroutine post-cancel — synctest.Wait() after cancel() would pin it.

Re: author responses on prior findings

  • 2s per-fetch timeout (sync/pending_polling.go:22) — author: "NOT a problem, the ticker will keep trying if the request failed." That's a defensible call as long as the default --preconfirmed-poll-interval=500ms stays the operational default. Two foot-guns to be aware of: (a) if an operator sets --preconfirmed-poll-interval=5s (legitimate for cost-conscious deployments), the 2s cap aborts a single attempt well before that 5s cadence wants to retry; (b) the new benchmark you just added would happily detect this — wire triggerInterval larger than preConfirmedFetchTimeout and you'll see throughput collapse. Either scale with the interval (max(interval, 2s)) or document the implicit min-cadence requirement in the flag's usage string.
  • Async requestPreConfirmedRefresh() in PreConfirmed() (sync/sync.go:614) — author: "if the client is dictating frequency, the call will re-trigger again." Agreed for steady-state RPC traffic. The trade-off only bites on the first call after a long idle period (e.g., a websocket subscriber waking up after the connection was quiet) — that first caller gets stale data. Minor; the inline comment could just say so.
  • Ticker tick buffered during a fetch (sync/pending_polling.go:279) — author: "inclined not to care." Fine — pre-existing behavior, doesn't regress, and the benchmark's fetches count will surface it if it ever matters.
  • time.Sleep(50ms) flakiness in trigger-storm test (sync/pending_polling_test.go:479) — still worth swapping to require.Never; the change is one line and removes a CI flake class. Up to you.

Looks good

  • Benchmark structure is sound — synctest.Test per-scenario, proper buffering on out, correct teardown via <-recorderDone.
  • No new files written; no production code touched.

Comment thread sync/pending_polling_test.go Outdated
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Claude finished @RafaelGranza's task in 3m 3s


PR Review (incremental — only the new commit since prior review)

  • Check what commits are new since the previous review
  • Re-read changed files
  • Identify any new findings
  • Post inline comments (none required this round)
  • Update this comment with final summary

Only one new commit since the prior review: b1bb551 — test(sync): Make tests deterministic. It converts the Trigger-storm subtest in sync/pending_polling_test.go to use testing/synctest, replacing the require.Eventually + time.Sleep(50ms) + strict require.Equal combination with synctest.Wait() checkpoints. No production-code changes in this push.

New findings

None. The synctest rewrite is sound and directly addresses the flake nit from the prior review. Tracing the scenario through the bubble:

  • requestPreConfirmedRefresh() → channel send buffers (cap 1) → polling goroutine reads trigger, enters fetchPreConfirmed, sets preConfirmedFetching=true, calls mock, parks on <-release. synctest.Wait() returns because the polling goroutine is durably blocked. numCallsPreConfirmed == 1 ✓.
  • Storm of 100 requestPreConfirmedRefresh() calls all bail at if s.preConfirmedFetching.Load() { return } since the mock is still blocked. Trigger channel stays empty. ✓.
  • close(release) → mock returns → pollPreConfirmed sends to out (buf 32, non-blocking) → loops back → blocks on the select (no trigger, ticker = time.Hour). synctest.Wait() returns and <-out succeeds. ✓.

Two minor things to note (not new — context for the rewrite):

  • The release close ordering doesn't exercise the drain-before-release window flagged earlier — the storm happens entirely while fetching=true, so no trigger ever lands in preConfirmedTrigger. That's fine for what this test is checking (the running guard), but the drain-vs-release race noted in the earlier review is not covered by this test. If you ever want to assert "drain runs cleanly", you'd need a separate scenario where a storm request does land in the channel (e.g. release a fetch then immediately storm in a tight loop without synctest.Wait between).
  • Ticker is time.Hour — correct for synctest, since synctest.Wait() itself doesn't advance virtual time (only when all goroutines including the test goroutine are blocked on timers). Worth keeping the comment you added so future readers know why.

Re: prior-review items still open

  • Drain-then-clear order (sync/pending_polling.go:319-325) — author replied "I'll apply changes" in the previous round, but the order is unchanged in this push. Flagging in case it was just deferred to a follow-up commit and not dropped.
  • 2s per-fetch timeout, async requestPreConfirmedRefresh() in PreConfirmed(), ticker-tick-buffered nit — author NACK'd previously; no changes expected here.
  • TestPreConfirmedUpdateFrequency asserts nothing (from the previous incremental review) — unchanged in this push; the benchmark probe still has no require.* bounds.

Looks good

  • synctest.Test rewrite is clean: t.Context() instead of a hardcoded 2s deadline, explicit checkpoints replacing brittle sleeps, comment explaining why the ticker has to be set to time.Hour under synctest.
  • No production code touched — risk surface is purely test-only.

@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Claude finished @RafaelGranza's task in 2m 48s


PR Review (incremental — only the new commits since prior review)

  • Check what commits are new since the previous review
  • Re-read changed files
  • Identify any new findings
  • Post inline comment on the stale doc block
  • Update this comment with final summary

Two new commits since the last incremental review (which covered through b1bb551):

  • b7415b1b — refactor(sync): scale pre_confirmed fetch timeout
  • d0d0d55e — test(sync): rename test to benchmark

New findings

Nit — sync/pending_polling_test.go:1115-1118: stale TestPreConfirmedUpdateFrequency doc block left behind by the rename. Posted inline. The old four-line comment still references the previous Test* function and go test -v -run invocation; the new BenchmarkPreConfirmedUpdateFrequency comment is immediately below it. Looks like the rename added the new block without removing the old one.

That's the only new code-level finding this push. The changes in b7415b1b and the rename are otherwise clean.

Re: prior-review items

  • 2s per-fetch timeout (sync/pending_polling.go:22-28): addressed. Now max(preConfirmedPollInterval*4, 500ms). Defaults (500ms interval) preserve the prior 2s cap; a 5s operator-configured interval now gets a 20s cap, which matches the feeder retry budget. The floor of 500ms only binds for preConfirmedPollInterval < 125ms, which is a debug-only configuration. Good fix.
  • Async PreConfirmed() comment (sync/sync.go:611-612): addressed. New comment explicitly says "Fire-and-forget: triggers an async refresh whose result is seen by the next caller and by preConfirmedDataFeed subscribers, not by this call." Accurate.
  • Gate requestPreConfirmedRefresh() on preConfirmedPollInterval == 0 (sync/pending_polling.go:253): addressed. Early-returns cleanly when polling is disabled.
  • Drain-then-clear order (sync/pending_polling.go:327-333): documented, not flipped. The previous response was "I'll apply changes" but the structural order is unchanged in this push. Instead, the comment now honestly admits the trade-off: "that request is silently swallowed, but the next RPC call or tick re-triggers if needed." Acceptable as a documented choice — the self-healing argument is sound for steady-state RPC traffic. Flagging only because the original commitment was to change the code.
  • Ticker-tick buffered during a fetch / 2s cap on slow feeder / async-call benchmark asserts nothing — author previously NACK'd; unchanged.

Minor observations on the rename

  • Dropping synctest was unavoidable (it requires *testing.T, not *testing.B) and is the right call for a benchmark that measures real wall-clock rates. The trade-off is ~30s wall-clock per -bench invocation (6 sub-benchmarks × 5s runDuration), which is fine for a diagnostic.
  • context.WithCancel(context.Background()) at line 1191 — consider b.Context() so goroutines auto-cancel if the benchmark panics or hits -timeout before the explicit cancel(). Tiny safety nit.

Looks good

  • preConfirmedFetchTimeout() is a clean refactor; the floor + multiplier constants are well-named and the comment justifies both.
  • The new comment on fetchPreConfirmed is more honest about the drain semantics than the prior one.
  • Function-signature changes from t.Context() to b.Context() are consistently applied.
  • No production changes in d0d0d55e; risk surface is test-only.

Comment thread sync/pending_polling_test.go Outdated
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Claude finished @RafaelGranza's task in 32s


PR Review (incremental — only the new commit since prior review)

  • Check what commits are new since the previous review
  • Re-read changed files
  • Identify any new findings
  • Update this comment with final summary

Only one new commit since the last incremental review (which covered through d0d0d55e):

  • 3049974a — docs(sync): Update benchmark description

New findings

None. This commit does exactly one thing: it removes the stale four-line TestPreConfirmedUpdateFrequency doc block that the previous review flagged at sync/pending_polling_test.go:1115-1118 and keeps only the corrected BenchmarkPreConfirmedUpdateFrequency comment. The leftover reference to the old Test* name and the go test -v -run invocation is gone, and the surviving comment accurately describes the benchmark (baseline vs request-driven subtests, runDuration in real time, b.N ignored, go test -bench=... invocation). Test-only, doc-only change — no production code touched.

Resolves prior nit: stale rename doc comment.

Status of prior-review items (for reference)

  • Stale benchmark doc comment — ✅ fixed by this commit.
  • 2s → scaled per-fetch timeout (max(preConfirmedPollInterval*4, 500ms)) — addressed in b7415b1b.
  • Async PreConfirmed() doc + gate on preConfirmedPollInterval == 0 — addressed.
  • Drain-then-clear order in fetchPreConfirmed — documented as a deliberate, self-healing trade-off rather than flipped. Acceptable for steady-state RPC traffic.
  • Ticker-tick-buffered "free" fetch / benchmark asserts nothing — author NACK'd / left as a diagnostic; unchanged and not regressing.

Nothing blocks here. The PR is in good shape.

Comment thread sync/pending_polling.go
Comment on lines +248 to +250
if s.preConfirmedFetching.Load() {
return
}
Contributor

Mmm, I don't like this very much. You're trying to avoid multiple requests but I suspect there might be a better way.

Comment thread cmd/juno/juno.go
Comment on lines -202 to +203
- preConfirmedPollIntervalUsage = "Sets how frequently pre_confirmed block will be updated" +
-     "(0s will disable fetching of pre_confirmed block)."
+ preConfirmedPollIntervalUsage = "Fallback ticker interval for pre_confirmed polling when no " +
+     "RPC traffic is driving refreshes. (0s disables polling.)"
Contributor

The original description is better, please switch it back

Comment thread sync/pending_polling.go
preLatestCacheSize = 10
// Per-fetch cap scales with poll interval so slower configs get
// proportionally longer budgets. Floor avoids pathologically short caps.
preConfirmedFetchTimeoutMultiplier = 4
Contributor

We have to work on your text. I have full context of this PR, and this comment still seems unnecessarily hard to understand.

Why not something like:

// Fetch timeout = max(pollInterval * multiplier, floor).

Comment thread sync/pending_polling.go
func (s *Synchronizer) preConfirmedFetchTimeout() time.Duration {
return max(s.preConfirmedPollInterval*preConfirmedFetchTimeoutMultiplier, preConfirmedFetchTimeoutFloor)
}

Contributor

Additionally, your constants are only used here. Why not define them right here and put the comment here as well?
