Skip to content

perf(driver): cache OpenLedger result to eliminate per-request DB queries#1355

Open
ariel-formance wants to merge 16 commits into
mainfrom
perf/openledger-cache
Open

perf(driver): cache OpenLedger result to eliminate per-request DB queries#1355
ariel-formance wants to merge 16 commits into
mainfrom
perf/openledger-cache

Conversation

@ariel-formance

Copy link
Copy Markdown
Contributor

Summary

  • Every HTTP request to the ledger service ran two DB queries (GetLedger + CountLedgersInBucket) inside OpenLedger before any business logic executed, consuming connection pool slots on every request
  • This worsens thundering-herd reconnect bursts: when the pool is under pressure, metadata queries compete with actual transaction queries for the same limited connections
  • There was already a // todo: keep the ledger in cache somewhere comment in driver.go acknowledging this

What this PR adds:

  1. TTL cache (60s default)Driver now holds a mutex-protected map[string]cachedLedger. Cache hits skip both DB queries entirely. Explicit eviction on UpdateLedgerMetadata and DeleteLedgerMetadata makes single-pod behaviour correct; the TTL is a cross-pod safety net.

  2. Singleflightgolang.org/x/sync/singleflight (already in go.mod) collapses concurrent cache-miss requests for the same ledger name into a single DB call. Without this, a post-TTL burst of N concurrent requests would still fan out to N×2 DB queries. There is a double-check inside the singleflight function to handle the case where a previous waiter has already populated the cache.

  3. aloneInBucket correctnessCountLedgersInBucket was the second query per request. It's safe to skip on cache hits because ledgerstore.DefaultFactory already maintains a shared *atomic.Bool per bucket; any CreateLedger call updates it immediately for all stores in the same bucket.

Test plan

  • 9 unit tests (driver_cache_test.go, no build tag, no DB) — hit before TTL, miss after TTL (1ns TTL), unknown key, eviction, TTL=0 disables cache, concurrent access with -race, isolation between two ledger names, evict non-existent key, update existing entry
  • TestOpenLedgerCacheHitNoDBQueries (integration) — bun query hook on a fresh bun.DB instance proves zero DB queries on warm-cache hit
  • TestOpenLedgerConcurrentColdCacheCoalesces (integration) — 50 concurrent goroutines on cold cache, all succeed, query count well below 50
  • TestOpenLedgerCacheTTLExpiry (integration) — 50ms TTL, metadata mutated via a second driver instance (bypassing cache), sleep, verify stale entry re-reads from DB
  • TestOpenLedgerCacheEvictionOnMetadataUpdate / ...OnMetadataDelete (integration) — eviction wired correctly on both write paths
  • Integration tests require Docker (go:build it)

Related

Part of connection pool saturation fix for prod-us-east-1-deriv. Companion PR: #1354 (retry jitter).

…ries

Every HTTP request previously ran GetLedger + CountLedgersInBucket before
any business logic, consuming connection slots during burst load.

Add a TTL-based in-memory cache (default 60s) on Driver with explicit
eviction on UpdateLedgerMetadata and DeleteLedgerMetadata. A singleflight
group collapses concurrent cache-miss requests for the same ledger into a
single DB call, preventing thundering-herd fan-out on cold-cache bursts.

aloneInBucket is already shared per-bucket via the ledgerstore Factory
atomic, so cache hits can skip CountLedgersInBucket entirely.

Scope-risk: low — cache is write-through evicted on all mutation paths;
TTL bounds cross-pod staleness to 60s for user-defined ledger metadata only.
Not-tested: integration tests require Docker (go:build it tag).
@ariel-formance ariel-formance requested a review from a team as a code owner May 22, 2026 11:19
@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8938dda7-57e5-48ab-9852-cf4cea5e20cb

📥 Commits

Reviewing files that changed from the base of the PR and between d737ef9 and a03a3c4.

📒 Files selected for processing (2)
  • internal/storage/driver/driver_cache_test.go
  • internal/storage/driver/driver_errors_test.go

Walkthrough

Driver now caches loaded ledgers in-memory with TTL expiry, coalesces concurrent cache-miss loads per ledger using singleflight, and evicts entries after metadata/bucket update or delete; OpenLedger serves cached stores when valid and reloads from systemStore on miss or TTL expiry.

Changes

Ledger Caching Implementation

Layer / File(s) Summary
Cache data structures and Driver fields
internal/storage/driver/driver.go
Adds singleflight import; defines cachedLedger (ledger + expiresAt); extends Driver with mu, ledgerCache, cacheGens, cacheTTL, and group; initializes maps in New and adds WithCacheTTL.
OpenLedger cache logic and singleflight coalescing
internal/storage/driver/driver.go
Adds get/set/evict helpers; rewrites OpenLedger to return cached entries when unexpired, coalesce concurrent misses with singleflight.Group using a loader timeout, load via systemStore.GetLedger, compute aloneInBucket, and conditionally populate cache guarded by generation snapshots.
Cache eviction, bucket ops, and controller/store wiring
internal/storage/driver/driver.go, internal/controller/system/controller.go, internal/controller/system/store.go
Evicts cached ledger entries after successful UpdateLedgerMetadata/DeleteLedgerMetadata; adds evictCachedLedgersByBucket and evicts on DeleteBucket/RestoreBucket; controller methods delegate to driver and Driver/Store interfaces adjusted to surface cache-aware mutation methods.
Cache unit tests
internal/storage/driver/driver_cache_test.go
Adds newTestDriver(ttl) helper and unit tests covering TTL hit/miss, unknown keys, eviction, zero-TTL disabling, concurrent access, per-ledger isolation, safe non-existent-key eviction, cache update overwrites, generation-based stale-write protection, and bucket-scoped eviction scenarios.
Driver error-path tests
internal/storage/driver/driver_errors_test.go
Adds stubbed system store and tests verifying mutation method error propagation, OpenLedger error propagation for GetLedger/CountLedgersInBucket, cold/warm cache behavior, and context-cancel handling for blocked lookups.
OpenLedger integration tests with query counting
internal/storage/driver/driver_test.go
Adds bun query-counting instrumentation and integration tests validating cache hits, eviction on metadata update/delete, warmed-cache zero DB queries, concurrent cold-cache coalescing, TTL expiry refresh, and DeleteBucket/RestoreBucket eviction effects.

Sequence Diagram — OpenLedger cache flow

sequenceDiagram
  participant Client
  participant Driver
  participant Cache as Driver.ledgerCache
  participant SF as singleflight.Group
  participant SystemStore

  Client->>Driver: OpenLedger(ctx, name)
  Driver->>Cache: getCachedLedger(name)
  alt cache hit (not expired)
    Cache-->>Driver: cached ledger
    Driver-->>Client: return ledger
  else cache miss
    Driver->>SF: Do(name, loader)
    SF->>SystemStore: GetLedger(name) [loader ctx ~30s]
    SystemStore-->>SF: ledger + metadata
    SF->>Driver: loader result
    Driver->>Cache: setCachedLedgerGen(name) (if gen unchanged)
    Driver-->>Client: return ledger (respect caller ctx)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I cached a ledger in a sunny patch,
Timed to nap until the TTL's latch,
Singleflight hums — one fetch for many,
Evict on change, fresh reads for any,
Rabbit hops off, the cache stays merry.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.39% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly summarizes the main change: adding a cache for OpenLedger results to eliminate repeated DB queries per request.
Description check ✅ Passed The description comprehensively explains the problem (DB queries on every request, connection pool saturation), the solution (TTL cache, singleflight, eviction strategy), and test coverage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/openledger-cache

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
internal/storage/driver/driver_test.go (1)

34-36: ⚡ Quick win

Keep the counting hook transparent to the caller context.

Returning context.Background() here drops cancellation, deadlines, and any values carried by logging.TestingContext(). This helper should return the incoming context unchanged so it only counts queries and does not alter behavior under test.

♻️ Suggested fix
-func (h *queryCounter) BeforeQuery(_ context.Context, _ *bun.QueryEvent) context.Context {
-	return context.Background()
+func (h *queryCounter) BeforeQuery(ctx context.Context, _ *bun.QueryEvent) context.Context {
+	return ctx
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_test.go` around lines 34 - 36, The BeforeQuery
hook in queryCounter currently returns context.Background(), which discards
cancellation, deadlines, and any values (e.g., logging.TestingContext()); change
queryCounter.BeforeQuery to return the incoming context parameter unchanged so
the hook remains transparent to the caller context and only performs counting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/storage/driver/driver_test.go`:
- Around line 334-338: The test's assertion on DB query coalescing is too loose;
replace the current require.Less on counter.n.Load() (used after the concurrent
OpenLedger calls) with a tight fixed upper bound reflecting the singleflight
contract (e.g., assert counter.n.Load() <= 2 to allow only the single cold-miss
load plus an optional count), so update the assertion that references
counter.n.Load(), goroutines and the concurrent OpenLedger calls to require a
small fixed maximum (2) instead of `< goroutines`.

In `@internal/storage/driver/driver.go`:
- Around line 107-126: getCachedLedger and setCachedLedger need to remove
expired entries and treat non‑positive TTL as "disabled": change setCachedLedger
(check d.cacheTTL <= 0) to return early so you never store entries when TTL is
zero or negative, and change getCachedLedger to delete expired entries from
d.ledgerCache instead of just returning a miss — when you detect entry.expiresAt
is before time.Now(), drop the map key under a write lock (e.g. RUnlock(),
Lock(), check again, delete, Unlock()) and then return (ledger.Ledger{}, false);
keep using the cachedLedger type and the ledgerCache map and preserve proper
locking around reads/writes.
- Around line 260-267: The current implementation in Driver.UpdateLedgerMetadata
and Driver.DeleteLedgerMetadata calls d.evictCachedLedger(name) before
performing the DB mutation, which can race and leave the cache stale; change
both methods to call the store update/delete first via
d.systemStoreFactory.Create(d.db).UpdateLedgerMetadata(...) and
.DeleteLedgerMetadata(...), check for a nil error, and only then call
d.evictCachedLedger(name) so eviction happens after a successful write (do not
evict if the write returned an error).
- Around line 149-174: The singleflight call currently uses d.group.Do(name,
func...) which binds the shared DB work to the caller's ctx and can cancel all
waiters; change this to use d.group.DoChan(name) and run the DB lookup inside
the singleflight winner with an internal context (e.g., ctxDB :=
context.WithTimeout(context.Background(), <reasonableTimeout>)) when calling
systemStore.GetLedger and systemStore.CountLedgersInBucket so the shared work
isn't tied to a single request's ctx; concurrently, have each waiter select on
their own ctx.Done() vs the DoChan result so they can return early on
cancellation, and still call d.setCachedLedger, create the store via
d.ledgerStoreFactory.Create/d.bucketFactory.Create and return openLedgerResult
as before when the winner completes. Ensure you keep existing behavior like
calling store.SetAloneInBucket(count == 1) and caching via
d.setCachedLedger(*ret).

---

Nitpick comments:
In `@internal/storage/driver/driver_test.go`:
- Around line 34-36: The BeforeQuery hook in queryCounter currently returns
context.Background(), which discards cancellation, deadlines, and any values
(e.g., logging.TestingContext()); change queryCounter.BeforeQuery to return the
incoming context parameter unchanged so the hook remains transparent to the
caller context and only performs counting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa9a83db-e252-423a-a2e6-e426c2900227

📥 Commits

Reviewing files that changed from the base of the PR and between 01ccc04 and 870272b.

📒 Files selected for processing (3)
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_cache_test.go
  • internal/storage/driver/driver_test.go

Comment thread internal/storage/driver/driver_test.go Outdated
Comment thread internal/storage/driver/driver.go
Comment thread internal/storage/driver/driver.go Outdated
Comment thread internal/storage/driver/driver.go Outdated
- getCachedLedger: delete expired entries from the map under a write lock
  (double-checked) instead of returning a silent miss; prevents unbounded
  growth of stale entries
- setCachedLedger: guard on cacheTTL <= 0 (not just == 0) to treat any
  non-positive TTL as disabled
- UpdateLedgerMetadata / DeleteLedgerMetadata: evict after a successful DB
  write, not before — evicting first lets a concurrent OpenLedger re-populate
  the cache with the old value before the mutation commits
- TestOpenLedgerConcurrentColdCacheCoalesces: tighten assertion to
  counter.n.Load() <= 2 (GetLedger + CountLedgersInBucket) instead of the
  loose < goroutines bound that would pass even with broken singleflight

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/storage/driver/driver.go (1)

168-180: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid re-caching a snapshot loaded before a metadata write committed.

A cold OpenLedger can read the old row, then UpdateLedgerMetadata/DeleteLedgerMetadata can succeed and evict the key, and finally this path still executes d.setCachedLedger(*ret). That restores stale metadata for the full TTL. You need a per-ledger invalidation/version check, or equivalent coordination, before writing back into ledgerCache.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver.go` around lines 168 - 180, The code is
re-caching a potentially-stale ledger snapshot after a concurrent metadata
write/delete; before calling d.setCachedLedger(*ret) in OpenLedger (after
systemStore.GetLedger/CountLedgersInBucket), check a per-ledger
version/invalidation token in ledgerCache (or add one) and only write back if
the cached entry is absent or its version equals the version returned by ret;
alternatively implement an atomic compare-and-set on the ledgerCache keyed by
ledger name (use a version/updatedAt/etag field on the Ledger struct returned by
systemStore.GetLedger) so concurrent UpdateLedgerMetadata/DeleteLedgerMetadata
that evict/increment version prevents overwriting with the old snapshot.
internal/storage/driver/driver_test.go (1)

372-372: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove the extra closing brace at line 372.

internal/storage/driver/driver_test.go has two consecutive } right after the final require.Equal(...); the extra one leaves invalid Go syntax and will stop this test package from compiling.

🛠️ Proposed fix
-}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_test.go` at line 372, There is an extra
closing brace after the final require.Equal(...) in
internal/storage/driver/driver_test.go which breaks Go syntax; remove the
redundant `}` so that the test function and enclosing block (the function
containing the require.Equal call) have balanced braces and the file compiles,
verifying the only remaining `}` closes the function or test scope referenced by
the require.Equal assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/storage/driver/driver_test.go`:
- Line 372: There is an extra closing brace after the final require.Equal(...)
in internal/storage/driver/driver_test.go which breaks Go syntax; remove the
redundant `}` so that the test function and enclosing block (the function
containing the require.Equal call) have balanced braces and the file compiles,
verifying the only remaining `}` closes the function or test scope referenced by
the require.Equal assertion.

In `@internal/storage/driver/driver.go`:
- Around line 168-180: The code is re-caching a potentially-stale ledger
snapshot after a concurrent metadata write/delete; before calling
d.setCachedLedger(*ret) in OpenLedger (after
systemStore.GetLedger/CountLedgersInBucket), check a per-ledger
version/invalidation token in ledgerCache (or add one) and only write back if
the cached entry is absent or its version equals the version returned by ret;
alternatively implement an atomic compare-and-set on the ledgerCache keyed by
ledger name (use a version/updatedAt/etag field on the Ledger struct returned by
systemStore.GetLedger) so concurrent UpdateLedgerMetadata/DeleteLedgerMetadata
that evict/increment version prevents overwriting with the old snapshot.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f46c0d36-1aa5-4853-bf31-f163c38e2805

📥 Commits

Reviewing files that changed from the base of the PR and between 870272b and 76b52da.

📒 Files selected for processing (2)
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_test.go

…Chan

Switch OpenLedger from group.Do to group.DoChan so the shared DB queries
run under an independent context (30s timeout). A single request
cancellation can no longer abort the in-flight lookup for all other
waiters sharing the same singleflight key.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/storage/driver/driver.go (2)

164-186: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Prevent stale misses from repopulating the cache after a metadata write.

A miss can read the old ledger, then UpdateLedgerMetadata or DeleteLedgerMetadata succeeds and evicts the key, and then Line 186 writes the stale row back into ledgerCache. That makes the post-write eviction ineffective under concurrency and can serve old metadata for the full TTL. You need a write barrier here — e.g. a per-ledger invalidation generation/version captured before the load and checked again before setCachedLedger.

Also applies to: 283-295

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver.go` around lines 164 - 186, The open-path read
can race with metadata writes so a stale read may be written back into the
ledger cache after a concurrent UpdateLedgerMetadata/DeleteLedgerMetadata
evicted it; to fix, introduce a per-ledger invalidation generation/version (or
reuse an existing ledgerCache generation) and capture its value before the DB
load inside d.group.DoChan, then after retrieving ret and before calling
d.setCachedLedger(*ret) compare the current generation for that ledger name to
the captured one and skip setCachedLedger if they differ; apply the same fix at
the other occurrence referenced (the block around the 283-295 region) and update
getCachedLedger/setCachedLedger logic to maintain and check the generation on
evictions/updates.

164-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return fresh objects to each coalesced caller.

Line 164 builds a single *openLedgerResult, and Line 201 hands that same *ledgerstore.Store / *ledger.Ledger to every waiter on a cold miss. Warm-cache hits return fresh instances, so this introduces cross-request aliasing only on misses. Please have the singleflight closure return immutable data and construct the store per waiter after the receive.

Proposed fix
 type openLedgerResult struct {
-	store  *ledgerstore.Store
-	ledger *ledger.Ledger
+	ledger        ledger.Ledger
+	aloneInBucket bool
 }
@@
 	ch := d.group.DoChan(name, func() (any, error) {
@@
 		if l, ok := d.getCachedLedger(name); ok {
-			store := d.ledgerStoreFactory.Create(d.bucketFactory.Create(l.Bucket), l)
-			return &openLedgerResult{store: store, ledger: &l}, nil
+			return openLedgerResult{ledger: l}, nil
 		}
@@
 		d.setCachedLedger(*ret)
-
-		store := d.ledgerStoreFactory.Create(d.bucketFactory.Create(ret.Bucket), *ret)
-		store.SetAloneInBucket(count == 1)
-
-		return &openLedgerResult{store: store, ledger: ret}, nil
+		return openLedgerResult{
+			ledger:        *ret,
+			aloneInBucket: count == 1,
+		}, nil
 	})
@@
 	case result := <-ch:
 		if result.Err != nil {
 			return nil, nil, result.Err
 		}
-		res := result.Val.(*openLedgerResult)
-		return res.store, res.ledger, nil
+		res := result.Val.(openLedgerResult)
+		store := d.ledgerStoreFactory.Create(d.bucketFactory.Create(res.ledger.Bucket), res.ledger)
+		store.SetAloneInBucket(res.aloneInBucket)
+		l := res.ledger
+		return store, &l, nil
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver.go` around lines 164 - 202, The singleflight
closure passed to d.group.DoChan currently returns a shared *openLedgerResult
containing a store and ledger pointer which is then handed to every waiter;
change the closure to return only immutable ledger metadata (e.g. the Ledger
value or a small struct with Bucket and other immutable fields) and any derived
flags (like count==1) but not a ledgerstore.Store or pointer to ledger; after
receiving the result in the select branch (where result := <-ch and res :=
result.Val.(*openLedgerResult)), create a fresh store for that specific waiter
using d.ledgerStoreFactory.Create(d.bucketFactory.Create(...)) and a fresh
ledger value (copy the returned metadata or call d.getCachedLedger as needed),
call store.SetAloneInBucket(...) with the derived flag, and then return that
freshly constructed store and ledger so each caller gets its own instances; keep
setCachedLedger as-is for caching the metadata only.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/storage/driver/driver.go`:
- Around line 164-186: The open-path read can race with metadata writes so a
stale read may be written back into the ledger cache after a concurrent
UpdateLedgerMetadata/DeleteLedgerMetadata evicted it; to fix, introduce a
per-ledger invalidation generation/version (or reuse an existing ledgerCache
generation) and capture its value before the DB load inside d.group.DoChan, then
after retrieving ret and before calling d.setCachedLedger(*ret) compare the
current generation for that ledger name to the captured one and skip
setCachedLedger if they differ; apply the same fix at the other occurrence
referenced (the block around the 283-295 region) and update
getCachedLedger/setCachedLedger logic to maintain and check the generation on
evictions/updates.
- Around line 164-202: The singleflight closure passed to d.group.DoChan
currently returns a shared *openLedgerResult containing a store and ledger
pointer which is then handed to every waiter; change the closure to return only
immutable ledger metadata (e.g. the Ledger value or a small struct with Bucket
and other immutable fields) and any derived flags (like count==1) but not a
ledgerstore.Store or pointer to ledger; after receiving the result in the select
branch (where result := <-ch and res := result.Val.(*openLedgerResult)), create
a fresh store for that specific waiter using
d.ledgerStoreFactory.Create(d.bucketFactory.Create(...)) and a fresh ledger
value (copy the returned metadata or call d.getCachedLedger as needed), call
store.SetAloneInBucket(...) with the derived flag, and then return that freshly
constructed store and ledger so each caller gets its own instances; keep
setCachedLedger as-is for caching the metadata only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cfa1fe2c-21d8-4fee-bc59-cbbf23eb0e33

📥 Commits

Reviewing files that changed from the base of the PR and between 76b52da and 4fa2a69.

📒 Files selected for processing (1)
  • internal/storage/driver/driver.go

Track a monotonic generation counter per ledger name. evictCachedLedger
bumps it; OpenLedger snapshots it before the DB queries and skips the
write-back if the generation advanced (i.e. a metadata mutation evicted
the entry while the query was in-flight), preventing stale snapshots
from re-populating the cache for up to the TTL period.
The singleflight closure previously embedded a *ledgerstore.Store in its
return value, which was then shared with every waiter. Change the result
type to carry only the immutable ledger.Ledger value (plus an optional
aloneInBucket flag from the count query). Each waiter now constructs its
own fresh Store after receiving the result, eliminating the shared
pointer and making the intent explicit for future Store mutations.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/storage/driver/driver_cache_test.go (2)

121-133: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Missing test coverage for generation counter behavior.

The commit message describes a monotonic generation counter that prevents stale cache writes: evictions bump the generation, and singleflight write-backs skip populating the cache if the generation has advanced. This is a critical correctness mechanism to prevent races where a slow DB load overwrites a fresher eviction. However, there are no unit tests verifying:

  1. That eviction increments the generation counter for that ledger
  2. That a cache write is rejected when the generation has advanced

While TestCacheUpdateEntry verifies overwriting an existing entry, it doesn't test the stale-write prevention logic.

🧪 Suggested test outline
func TestCacheGenerationPreventsStaleWrites(t *testing.T) {
	d := newTestDriver(time.Minute)
	l := ledger.MustNewWithDefault("test-ledger")
	
	// Simulate: set entry (gen=0), evict (gen=1), attempt stale set
	d.setCachedLedger(l)
	gen0 := d.cacheGens[l.Name] // snapshot generation
	
	d.evictCachedLedger(l.Name)
	gen1 := d.cacheGens[l.Name]
	require.Greater(t, gen1, gen0, "eviction should bump generation")
	
	// Test that a write with old generation is rejected
	// (may require exposing generation-aware set method or
	// testing through OpenLedger integration test instead)
}

Note: If cacheGens is not exported, consider testing the observable behavior through integration tests where concurrent OpenLedger calls race with evictions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_cache_test.go` around lines 121 - 133, Add a
unit test that asserts eviction increments the per-ledger generation and that a
write with an older generation is rejected: create
TestCacheGenerationPreventsStaleWrites which uses newTestDriver/new ledger, call
setCachedLedger(l), snapshot the generation for l.Name from the driver
(cacheGens), call evictCachedLedger(l.Name) and require the generation
increased, then attempt a stale write and assert it does not populate the cache
(using getCachedLedger to confirm absence). If cacheGens or a generation-aware
set is not accessible, add a small test-only accessor or expose a
generation-aware set method used by the singleflight writeback path so the test
can simulate an older-generation write and verify it’s ignored.

35-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Flaky test: nanosecond-level timing is unreliable.

The test uses a 1-nanosecond TTL and sleeps for 2 nanoseconds. OS scheduling granularity, GC pauses, and timing precision can make this test flaky in CI environments, even with the 2× margin.

⏱️ Proposed fix using millisecond-scale timing
 func TestCacheMissAfterTTL(t *testing.T) {
-	d := newTestDriver(time.Nanosecond)
+	d := newTestDriver(10 * time.Millisecond)
 	l := ledger.MustNewWithDefault("test-ledger")
 
 	d.setCachedLedger(l)
-	time.Sleep(2 * time.Nanosecond)
+	time.Sleep(20 * time.Millisecond)
 
 	_, ok := d.getCachedLedger(l.Name)
 	require.False(t, ok)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_cache_test.go` around lines 35 - 39, The test
is flaky due to nanosecond TTL/sleep; change the test to use millisecond-scale
timing by calling newTestDriver with time.Millisecond (or another ms-level TTL)
instead of time.Nanosecond and replace the Sleep(2 * time.Nanosecond) with
Sleep(2 * time.Millisecond) (or an equivalent >TTL margin) so that
setCachedLedger expiry behavior is tested reliably; update newTestDriver and the
Sleep call in the test that references setCachedLedger and d to use these
millisecond values.
internal/storage/driver/driver.go (1)

166-170: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return fresh Store and Ledger objects to each waiter.

singleflight.DoChan fan-outs the same Val to every waiter, so this cold-miss path now gives concurrent callers one shared *ledgerstore.Store and *ledger.Ledger. Cache hits still return fresh objects per call, so coalesced misses now have different ownership semantics and can race if either value is mutated downstream. Have the singleflight closure return immutable load data only, then build the store and take the ledger address after each caller receives the result.

Suggested refactor
-		if l, ok := d.getCachedLedger(name); ok {
-			store := d.ledgerStoreFactory.Create(d.bucketFactory.Create(l.Bucket), l)
-			return &openLedgerResult{store: store, ledger: &l}, nil
-		}
+		if l, ok := d.getCachedLedger(name); ok {
+			return &openLedgerResult{ledger: l}, nil
+		}
@@
-		store := d.ledgerStoreFactory.Create(d.bucketFactory.Create(ret.Bucket), *ret)
-		store.SetAloneInBucket(count == 1)
-
-		return &openLedgerResult{store: store, ledger: ret}, nil
+		aloneInBucket := count == 1
+		return &openLedgerResult{
+			ledger:        *ret,
+			aloneInBucket: &aloneInBucket,
+		}, nil
 	})
@@
-		res := result.Val.(*openLedgerResult)
-		return res.store, res.ledger, nil
+		res := result.Val.(*openLedgerResult)
+		store := d.ledgerStoreFactory.Create(d.bucketFactory.Create(res.ledger.Bucket), res.ledger)
+		if res.aloneInBucket != nil {
+			store.SetAloneInBucket(*res.aloneInBucket)
+		}
+		l := res.ledger
+		return store, &l, nil
 	}
 }

Also applies to: 207-221

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver.go` around lines 166 - 170, The singleflight
closure used in d.group.DoChan returns shared *ledgerstore.Store and
*ledger.Ledger (via openLedgerResult), causing coalesced waiters to receive the
same mutable objects; instead have the closure return only immutable load data
(e.g., a ledger value copy or identifier) and an error, and then inside each
waiter build its own store and take its own ledger address by calling
ledgerStoreFactory.Create and bucketFactory.Create after receiving the DoChan
result; update the closure around d.group.DoChan, the getCachedLedger path, and
the similar coalescing block later (the other DoChan block) so openLedgerResult
holds only immutable data and each caller constructs its own Store and *Ledger
locally.
🧹 Nitpick comments (1)
internal/storage/driver/driver_cache_test.go (1)

73-94: ⚡ Quick win

Consider adding final state assertions.

The test validates thread-safety (no panics/deadlocks), which is valuable. However, it could also verify correctness properties such as checking the final cache state or that concurrent operations don't corrupt data.

📊 Example: add final state check
 	}
 	wg.Wait()
+	
+	// Verify cache state is consistent (not corrupted)
+	if cached, ok := d.getCachedLedger(l.Name); ok {
+		require.Equal(t, l.Name, cached.Name, "cached ledger should have correct name")
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_cache_test.go` around lines 73 - 94, The test
TestCacheConcurrentAccess currently only ensures no panics; update it to assert
final cache correctness by calling d.getCachedLedger(l.Name) after wg.Wait() and
asserting the returned value is either nil (if evict should win) or a valid
ledger equal to ledger.MustNewWithDefault("test-ledger"), and additionally
verify any internal flag/state on the driver (via newTestDriver helpers) that
indicates cache emptiness or size; use the existing methods setCachedLedger,
getCachedLedger, and evictCachedLedger to perform the check and fail the test if
the cache is corrupted or inconsistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/storage/driver/driver_cache_test.go`:
- Around line 121-133: Add a unit test that asserts eviction increments the
per-ledger generation and that a write with an older generation is rejected:
create TestCacheGenerationPreventsStaleWrites which uses newTestDriver/new
ledger, call setCachedLedger(l), snapshot the generation for l.Name from the
driver (cacheGens), call evictCachedLedger(l.Name) and require the generation
increased, then attempt a stale write and assert it does not populate the cache
(using getCachedLedger to confirm absence). If cacheGens or a generation-aware
set is not accessible, add a small test-only accessor or expose a
generation-aware set method used by the singleflight writeback path so the test
can simulate an older-generation write and verify it’s ignored.
- Around line 35-39: The test is flaky due to nanosecond TTL/sleep; change the
test to use millisecond-scale timing by calling newTestDriver with
time.Millisecond (or another ms-level TTL) instead of time.Nanosecond and
replace the Sleep(2 * time.Nanosecond) with Sleep(2 * time.Millisecond) (or an
equivalent >TTL margin) so that setCachedLedger expiry behavior is tested
reliably; update newTestDriver and the Sleep call in the test that references
setCachedLedger and d to use these millisecond values.

In `@internal/storage/driver/driver.go`:
- Around line 166-170: The singleflight closure used in d.group.DoChan returns
shared *ledgerstore.Store and *ledger.Ledger (via openLedgerResult), causing
coalesced waiters to receive the same mutable objects; instead have the closure
return only immutable load data (e.g., a ledger value copy or identifier) and an
error, and then inside each waiter build its own store and take its own ledger
address by calling ledgerStoreFactory.Create and bucketFactory.Create after
receiving the DoChan result; update the closure around d.group.DoChan, the
getCachedLedger path, and the similar coalescing block later (the other DoChan
block) so openLedgerResult holds only immutable data and each caller constructs
its own Store and *Ledger locally.

---

Nitpick comments:
In `@internal/storage/driver/driver_cache_test.go`:
- Around line 73-94: The test TestCacheConcurrentAccess currently only ensures
no panics; update it to assert final cache correctness by calling
d.getCachedLedger(l.Name) after wg.Wait() and asserting the returned value is
either nil (if evict should win) or a valid ledger equal to
ledger.MustNewWithDefault("test-ledger"), and additionally verify any internal
flag/state on the driver (via newTestDriver helpers) that indicates cache
emptiness or size; use the existing methods setCachedLedger, getCachedLedger,
and evictCachedLedger to perform the check and fail the test if the cache is
corrupted or inconsistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31fd6c44-f96c-4ef4-8850-b997aa7e9fa4

📥 Commits

Reviewing files that changed from the base of the PR and between 4fa2a69 and 194f767.

📒 Files selected for processing (2)
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_cache_test.go

- Extract inline singleflight write-back into setCachedLedgerGen so the
  generation check path can be exercised directly in unit tests
- Add TestCacheGenerationPreventsStaleWrites: sets cache, snapshots
  generation, evicts, verifies generation bumped, then confirms a
  setCachedLedgerGen call with the old generation is silently rejected
- Fix TestCacheMissAfterTTL: nanosecond TTL/sleep was unreliable on
  loaded CI; switch to millisecond scale
- Extend TestCacheConcurrentAccess: assert the cache holds a valid
  ledger or nothing after concurrent set/get/evict (was panic-only before)
@flemzord

Copy link
Copy Markdown
Member

This PR adds an in-memory cache around Driver.OpenLedger to avoid reading _system.ledgers and running CountLedgersInBucket on every ledger open. It also adds singleflight so concurrent cold-cache opens for the same ledger are collapsed into one DB lookup. The goal is clearly to reduce DB pressure on hot paths that repeatedly call OpenLedger.

The performance intent makes sense, but this introduces a new invalidation contract: every mutation of _system.ledgers must either go through a cache-aware driver method or explicitly evict the affected cache entries. That contract is not fully respected today by the production controller path (GetSystemStore().UpdateLedgerMetadata/DeleteLedgerMetadata) nor by bucket-level mutations such as delete/restore, so OpenLedger can return stale ledger state until the TTL expires.

There is also an operational impact in multi-instance deployments: this cache is local to each process. If one Ledger instance updates metadata, deletes/restores a bucket, or otherwise changes _system.ledgers, the other instances will keep their own independent cached values until their local TTL expires. That means different Ledger instances can temporarily disagree about the same ledger’s metadata or deleted/restored state.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/storage/driver/driver.go (1)

160-165: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Break the in-flight singleflight entry when you evict.

setCachedLedgerGen only prevents writing stale data back into ledgerCache (cacheGens[name] check), but OpenLedger still returns the DB snapshot (ret) to the caller even if eviction advanced the generation mid-flight. Because UpdateLedgerMetadata/DeleteLedgerMetadata call evictCachedLedger without also clearing the singleflight in-flight entry, a later OpenLedger for the same name can join the pre-eviction DoChan and observe old ledger/metadata until the flight completes.

Suggested fix
func (d *Driver) evictCachedLedger(name string) {
-	d.mu.Lock()
-	defer d.mu.Unlock()
+	d.mu.Lock()
 	delete(d.ledgerCache, name)
 	d.cacheGens[name]++
+	d.mu.Unlock()
+
+	d.group.Forget(name)
}

Applies to internal/storage/driver/driver.go around 160-165 and the OpenLedger cache-miss/singleflight path around 184-216.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver.go` around lines 160 - 165, When evicting a
ledger you must also break any in-flight singleflight entry so concurrent
OpenLedger callers can't join a pre-eviction flight and receive stale snapshots;
update evictCachedLedger to not only delete d.ledgerCache[name] and bump
d.cacheGens[name] but also call the singleflight "forget" for that key (e.g.
d.sf.Forget(name) or d.singleflight.Forget(name) depending on the group's field
name) so the OpenLedger cache-miss/singleflight path (around OpenLedger and
setCachedLedgerGen) will start a fresh flight after eviction; ensure the call is
no-op if the group field is nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/storage/driver/driver.go`:
- Around line 160-165: When evicting a ledger you must also break any in-flight
singleflight entry so concurrent OpenLedger callers can't join a pre-eviction
flight and receive stale snapshots; update evictCachedLedger to not only delete
d.ledgerCache[name] and bump d.cacheGens[name] but also call the singleflight
"forget" for that key (e.g. d.sf.Forget(name) or d.singleflight.Forget(name)
depending on the group's field name) so the OpenLedger cache-miss/singleflight
path (around OpenLedger and setCachedLedgerGen) will start a fresh flight after
eviction; ensure the call is no-op if the group field is nil.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 24d01ba2-be59-4314-a137-5d471fb1e4e7

📥 Commits

Reviewing files that changed from the base of the PR and between 194f767 and aa3cb7f.

📒 Files selected for processing (3)
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_cache_test.go
  • internal/storage/driver/driver_test.go
💤 Files with no reviewable changes (1)
  • internal/storage/driver/driver_test.go

…tations

UpdateLedgerMetadata and DeleteLedgerMetadata in the system controller
were calling ctrl.driver.GetSystemStore().xxx() directly, bypassing the
Driver methods that call evictCachedLedger — making the eviction code
effectively dead in the production path.

DeleteBucket and RestoreBucket had no eviction path at all: they soft-
delete/restore every ledger in a bucket via a single SQL UPDATE, but
nothing touched the per-ledger cache entries.

Fixes:
- Add evictCachedLedgersByBucket to Driver: scans ledgerCache under lock
  and bumps cacheGens for every entry whose bucket matches, ensuring any
  in-flight singleflight write-backs for those ledgers are also rejected.
- Add DeleteBucket and RestoreBucket to driver.Driver, each delegating
  to the system store and then calling evictCachedLedgersByBucket.
- Add UpdateLedgerMetadata, DeleteLedgerMetadata, DeleteBucket, and
  RestoreBucket to the controller Driver interface so the system
  controller can call them directly instead of going through GetSystemStore.
- Route all four mutation methods in DefaultController through
  ctrl.driver.xxx so eviction is always triggered.
After evicting a cache entry, call d.group.Forget(name) so any callers
that arrive after the eviction start a fresh singleflight instead of
joining a pre-eviction in-flight query that will return stale data.

Forget is called after releasing d.mu in both evictCachedLedger and
evictCachedLedgersByBucket to avoid a lock-ordering dependency with the
singleflight closure, which acquires d.mu.RLock while executing.
@codecov

codecov Bot commented May 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.19820% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.15%. Comparing base (d81ce89) to head (a03a3c4).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
internal/storage/driver/driver.go 98.13% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1355      +/-   ##
==========================================
+ Coverage   80.36%   81.15%   +0.78%     
==========================================
  Files         205      205              
  Lines       11104    11195      +91     
==========================================
+ Hits         8924     9085     +161     
+ Misses       1585     1570      -15     
+ Partials      595      540      -55     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/system/store.go (1)

14-30: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Don't keep cache-bypassing mutators on GetSystemStore().

Driver now has cache-aware mutation methods, but GetSystemStore() still hands out a Store with UpdateLedgerMetadata, DeleteLedgerMetadata, DeleteBucket, and RestoreBucket. Any caller that uses that older path can still skip eviction and reintroduce stale reads, so this boundary should be narrowed to a read-only store surface.

♻️ Suggested API direction
 type Store interface {
 	GetLedger(ctx context.Context, name string) (*ledger.Ledger, error)
 	Ledgers() common.PaginatedResource[ledger.Ledger, system.ListLedgersQueryPayload]
-	UpdateLedgerMetadata(ctx context.Context, name string, m metadata.Metadata) error
-	DeleteLedgerMetadata(ctx context.Context, param string, key string) error
-	DeleteBucket(ctx context.Context, bucket string) error
-	RestoreBucket(ctx context.Context, bucket string) error
 }

Callers that need mutations should go through the cache-aware Driver methods instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/system/store.go` around lines 14 - 30, The Store
interface currently exposes mutators (UpdateLedgerMetadata,
DeleteLedgerMetadata, DeleteBucket, RestoreBucket) which allows callers using
GetSystemStore() to bypass cache-aware logic on Driver; remove these mutation
methods from the Store interface so Store becomes read-only (keep GetLedger and
Ledgers only), update GetSystemStore() to return this read-only Store type, and
ensure all mutation callers use the cache-aware Driver methods (CreateLedger,
UpdateLedgerMetadata, DeleteLedgerMetadata, DeleteBucket, RestoreBucket)
instead; also update any concrete implementations to implement the reduced Store
interface and adjust call sites to call Driver mutation methods where needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/storage/driver/driver.go`:
- Around line 331-345: evictCachedLedgersByBucket only bumps generations for
entries already in ledgerCache, allowing concurrent cold OpenLedger to
repopulate a stale snapshot; fix by introducing a bucket-level generation
counter (e.g., cacheBucketGens map[string]uint64) guarded by d.mu, increment
cacheBucketGens[bucket] inside evictCachedLedgersByBucket (in addition to
deleting entries and calling d.group.Forget), and update setCachedLedgerGen and
the OpenLedger hot/cold-paths to consult the bucket generation (compare
ledger-specific gen + bucket gen) and refuse to install a cache entry if the
bucket gen changed. Ensure all accesses to cacheBucketGens, ledgerCache and
cacheGens use d.mu to avoid races.

---

Outside diff comments:
In `@internal/controller/system/store.go`:
- Around line 14-30: The Store interface currently exposes mutators
(UpdateLedgerMetadata, DeleteLedgerMetadata, DeleteBucket, RestoreBucket) which
allows callers using GetSystemStore() to bypass cache-aware logic on Driver;
remove these mutation methods from the Store interface so Store becomes
read-only (keep GetLedger and Ledgers only), update GetSystemStore() to return
this read-only Store type, and ensure all mutation callers use the cache-aware
Driver methods (CreateLedger, UpdateLedgerMetadata, DeleteLedgerMetadata,
DeleteBucket, RestoreBucket) instead; also update any concrete implementations
to implement the reduced Store interface and adjust call sites to call Driver
mutation methods where needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d83ddd2a-d7e9-447d-80a6-59e661341946

📥 Commits

Reviewing files that changed from the base of the PR and between 194f767 and 0a53a91.

📒 Files selected for processing (5)
  • internal/controller/system/controller.go
  • internal/controller/system/store.go
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_cache_test.go
  • internal/storage/driver/driver_test.go
💤 Files with no reviewable changes (1)
  • internal/storage/driver/driver_test.go

Comment thread internal/storage/driver/driver.go
Unit tests (no DB):
- TestSetCachedLedgerGenValidGeneration: happy path — matching gen writes entry
- TestSetCachedLedgerGenDisabledWhenTTLZero: TTL=0 skips write
- TestEvictCachedLedgersByBucketEvictsMatchingEntries: all ledgers in bucket removed
- TestEvictCachedLedgersByBucketPreservesOtherBuckets: other buckets unaffected
- TestEvictCachedLedgersByBucketBumpsGenerations: per-ledger generation incremented
- TestEvictCachedLedgersByBucketGenerationPreventsStaleWrites: stale gen rejected
- TestEvictCachedLedgersByBucketEmptyBucketIsNoOp: no match is a safe no-op

Integration tests (//go:build it):
- TestDeleteBucketEvictsCache: after DeleteBucket, OpenLedger re-queries DB and
  gets not-found (proving cache was evicted, not a stale cache hit)
- TestRestoreBucketEvictsCache: after a cross-driver delete+restore, the local
  cache is evicted and OpenLedger returns fresh data

Coverage for new cache methods: getCachedLedger, setCachedLedger,
setCachedLedgerGen, evictCachedLedger, evictCachedLedgersByBucket all 100%.
Remove UpdateLedgerMetadata, DeleteLedgerMetadata, DeleteBucket, and
RestoreBucket from the Store interface returned by GetSystemStore().
These mutations now live exclusively on the cache-aware Driver methods,
so exposing them on Store created a path to bypass cache eviction.

The only GetSystemStore() call sites in the controller already use only
GetLedger and Ledgers — no callers are affected.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
internal/storage/driver/driver.go (1)

331-345: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Bucket eviction still misses in-flight cold loads.

evictCachedLedgersByBucket only bumps cacheGens for names already present in ledgerCache. A concurrent cold OpenLedger for another ledger in this bucket snapshots an unchanged generation at Line 196 and can still pass setCachedLedgerGen at Line 215 after DeleteBucket or RestoreBucket, repopulating pre-mutation state until TTL expiry. This still needs bucket-scoped invalidation, or another way to advance generations for all ledgers in the bucket, not just cached ones.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/storage/driver/driver_test.go`:
- Around line 413-445: The test currently can pass via a stale in-memory cache;
fix by asserting the post-restore OpenLedger actually hits the DB: create a
small test spy around ledgerstore.Factory (or wrap ledgerstore.NewFactory(db))
that increments a counter whenever the ledger load path performs a DB read,
construct d using that spy factory, warm the cache, perform DeleteBucket with
the other driver, call d.RestoreBucket(ctx, bucketName) and then call
d.OpenLedger(ctx, l.Name) and assert the spy's DB-read counter increased (i.e.
>0) to prove eviction triggered a re-read from the DB; reference symbols:
TestRestoreBucketEvictsCache, d.OpenLedger, driver.RestoreBucket,
ledgerstore.NewFactory.

In `@internal/storage/driver/driver.go`:
- Around line 160-167: The eviction helpers (e.g., evictCachedLedger) are
incrementing d.cacheGens even when caching is disabled (d.cacheTTL <= 0); add an
early-return to skip generation bookkeeping when cache is off: inside
evictCachedLedger (and the similar helper around lines 331-345), acquire the
mutex, perform any necessary delete from d.ledgerCache/d.bucketCache, then if
d.cacheTTL <= 0 release the mutex and call d.group.Forget(name) (or release and
return if group.Forget should be skipped) but do not touch d.cacheGens; in
short, guard updates to d.cacheGens with a d.cacheTTL > 0 check so names are not
accumulated when caching is disabled.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a436e4d-f3ca-4f7d-9a0d-be852adcc8b7

📥 Commits

Reviewing files that changed from the base of the PR and between 194f767 and fea7926.

📒 Files selected for processing (5)
  • internal/controller/system/controller.go
  • internal/controller/system/store.go
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_cache_test.go
  • internal/storage/driver/driver_test.go

Comment thread internal/storage/driver/driver_test.go
Comment thread internal/storage/driver/driver.go
Add driver_errors_test.go with a stub systemstore.Store so error-return
branches in UpdateLedgerMetadata, DeleteLedgerMetadata, DeleteBucket,
RestoreBucket, and OpenLedger (GetLedger / CountLedgersInBucket failures,
ctx.Done cancellation, cold/warm cache paths) are exercised without a DB.
Raises non-integration unit coverage from 17.6% to ~40%.

Guard cacheGens increments in evictCachedLedger and
evictCachedLedgersByBucket with cacheTTL > 0 so the map does not
accumulate entries when caching is disabled.

Strengthen TestRestoreBucketEvictsCache to use a query-counting bun.DB
and assert counter > before after the post-restore OpenLedger, proving
the cache was evicted and a real DB read occurred rather than a stale
cache hit returning the same ledger name.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/storage/driver/driver_cache_test.go (1)

34-43: ⚡ Quick win

Avoid fixed millisecond sleep for TTL expiry checks.

Line 39 uses a very tight sleep window (2ms for a 1ms TTL), which is prone to CI flakiness due to scheduler jitter. Prefer polling with a bounded timeout.

Proposed change
 func TestCacheMissAfterTTL(t *testing.T) {
 	d := newTestDriver(time.Millisecond)
 	l := ledger.MustNewWithDefault("test-ledger")

 	d.setCachedLedger(l)
-	time.Sleep(2 * time.Millisecond)
-
-	_, ok := d.getCachedLedger(l.Name)
-	require.False(t, ok)
+	require.Eventually(t, func() bool {
+		_, ok := d.getCachedLedger(l.Name)
+		return !ok
+	}, 200*time.Millisecond, 5*time.Millisecond)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_cache_test.go` around lines 34 - 43,
TestCacheMissAfterTTL uses a fixed Sleep(2ms) against a 1ms TTL which can flake;
replace the hard sleep with a polling loop that repeatedly calls
d.getCachedLedger(l.Name) until it returns missing or a bounded timeout elapses.
Update the TestCacheMissAfterTTL test to create d via newTestDriver(...), set
the cache with d.setCachedLedger(l), then poll (e.g., loop with short sleeps)
checking d.getCachedLedger(l.Name) and fail if the ledger is still present after
the timeout to make the test robust against scheduler jitter.
internal/storage/driver/driver_errors_test.go (1)

173-201: ⚡ Quick win

Warm-cache behavior is not actually asserted.

The second call currently validates only success/output. It doesn’t verify that GetLedger was skipped on warm hit, so cache regressions can still pass this test.

Proposed change
 func TestOpenLedgerColdCacheSuccess(t *testing.T) {
 	ctrl := gomock.NewController(t)
 	l := ledger.MustNewWithDefault("test-ledger")
+	getLedgerCalls := 0

 	mockBucket := NewMockBucket(ctrl)

 	bucketFact := NewBucketFactory(ctrl)
 	bucketFact.EXPECT().Create(l.Bucket).Return(mockBucket).AnyTimes()

 	ledgerFact := NewLedgerStoreFactory(ctrl)
 	ledgerFact.EXPECT().Create(gomock.Any(), gomock.Any()).Return(&ledgerstore.Store{}).AnyTimes()

 	stub := &stubSystemStore{
-		getLedgerFn:        func() (*ledger.Ledger, error) { return &l, nil },
+		getLedgerFn: func() (*ledger.Ledger, error) {
+			getLedgerCalls++
+			return &l, nil
+		},
 		countLedgersResult: 1,
 	}
@@
 	_, got2, err := d.OpenLedger(context.Background(), l.Name)
 	require.NoError(t, err)
 	require.Equal(t, l.Name, got2.Name)
+	require.Equal(t, 1, getLedgerCalls, "warm cache hit should not call GetLedger again")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/driver/driver_errors_test.go` around lines 173 - 201, The
test TestOpenLedgerColdCacheSuccess currently only checks outputs and not that
the warm-cache path skipped backing-store lookups; modify the test to assert
GetLedger was not called on the second OpenLedger invocation by instrumenting
the stubSystemStore/getLedgerFn (or replacing it before the second call) so any
invocation fails the test (e.g., set getLedgerFn = func() (*ledger.Ledger,
error) { t.Fatalf("GetLedger called on warm cache"); return nil, nil }) or by
recording a call counter on stubSystemStore and asserting it did not increase
after the warm call; keep all other assertions (OpenLedger success and
ledger.Name checks) intact and refer to TestOpenLedgerColdCacheSuccess,
d.OpenLedger and stubSystemStore.getLedgerFn when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/storage/driver/driver_cache_test.go`:
- Around line 131-145: The test TestCacheEvictionDoesNotBumpGenWhenDisabled
should assert the cache key is absent rather than conditionally checking its
value: after calling d.evictCachedLedger("test-ledger") twice, acquire the lock
(d.mu.RLock/d.mu.RUnlock) and assert that "test-ledger" does not exist in
d.cacheGens (i.e., existence is false) to ensure evictCachedLedger did not
create an entry when caching is disabled; update the assertion to explicitly
require non-existence of the key instead of only validating gen when exists is
true.

---

Nitpick comments:
In `@internal/storage/driver/driver_cache_test.go`:
- Around line 34-43: TestCacheMissAfterTTL uses a fixed Sleep(2ms) against a 1ms
TTL which can flake; replace the hard sleep with a polling loop that repeatedly
calls d.getCachedLedger(l.Name) until it returns missing or a bounded timeout
elapses. Update the TestCacheMissAfterTTL test to create d via
newTestDriver(...), set the cache with d.setCachedLedger(l), then poll (e.g.,
loop with short sleeps) checking d.getCachedLedger(l.Name) and fail if the
ledger is still present after the timeout to make the test robust against
scheduler jitter.

In `@internal/storage/driver/driver_errors_test.go`:
- Around line 173-201: The test TestOpenLedgerColdCacheSuccess currently only
checks outputs and not that the warm-cache path skipped backing-store lookups;
modify the test to assert GetLedger was not called on the second OpenLedger
invocation by instrumenting the stubSystemStore/getLedgerFn (or replacing it
before the second call) so any invocation fails the test (e.g., set getLedgerFn
= func() (*ledger.Ledger, error) { t.Fatalf("GetLedger called on warm cache");
return nil, nil }) or by recording a call counter on stubSystemStore and
asserting it did not increase after the warm call; keep all other assertions
(OpenLedger success and ledger.Name checks) intact and refer to
TestOpenLedgerColdCacheSuccess, d.OpenLedger and stubSystemStore.getLedgerFn
when making the change.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da1b7543-c505-4e32-9fab-952a3e7c3736

📥 Commits

Reviewing files that changed from the base of the PR and between fea7926 and 8ed94e8.

📒 Files selected for processing (4)
  • internal/storage/driver/driver.go
  • internal/storage/driver/driver_cache_test.go
  • internal/storage/driver/driver_errors_test.go
  • internal/storage/driver/driver_test.go

Comment thread internal/storage/driver/driver_cache_test.go
…assertion

Replace conditional existence check with an unconditional require.False so
the test fails if evictCachedLedger creates a cacheGens entry when caching
is disabled, rather than silently passing when the key is absent.
@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 1 unresolved review comment.

Files modified:

  • internal/storage/driver/driver_cache_test.go

Commit: d737ef97b83455db18857e31b0d5110006159d4c

The changes have been pushed to the perf/openledger-cache branch.

Time taken: 2m 9s

coderabbitai Bot and others added 2 commits May 22, 2026 14:28
Fixed 1 file(s) based on 1 unresolved review comment.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Replace polling loop in TestCacheMissAfterTTL with require.Eventually
for idiomatic testify-style flake resistance.

Replace fatal-sentinel swap in TestOpenLedgerColdCacheSuccess with a
getLedgerCalls counter asserted to equal 1 after both calls, giving a
diagnostic count on failure rather than an abrupt t.Fatal.
@ariel-formance ariel-formance requested a review from flemzord May 26, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants