perf(storage): sparse-wallet query fix — functional indexes + adaptive fallback#1408
ariel-formance wants to merge 8 commits into
Conversation
Walkthrough
Adds an adaptive probe-then-retry transactions-list paginator with configurable timeouts and wires it through FX/CLI. Separately introduces an indexed-metadata-keys feature flag enabling query predicate rewriting, adds ledger-scoped functional indices on wallet IDs, and includes integration tests for both features.
Changes
Adaptive transactions-list pagination and configuration
Indexed metadata keys feature
Sequence Diagram
sequenceDiagram
participant Client
participant Store
participant AdaptivePaginator
participant PostgresDB
Client->>Store: Transactions()
Store->>AdaptivePaginator: instantiate with TransactionListConfig
Client->>AdaptivePaginator: Paginate(query, cursor)
AdaptivePaginator->>PostgresDB: SELECT with SET LOCAL statement_timeout=firstAttemptMs
alt probe times out (SQLSTATE 57014)
PostgresDB-->>AdaptivePaginator: query canceled
AdaptivePaginator->>PostgresDB: SELECT with SET LOCAL statement_timeout=retryMs and SET LOCAL enable_indexscan=off
PostgresDB-->>AdaptivePaginator: rows
AdaptivePaginator->>AdaptivePaginator: record fallback metric
else probe succeeds
PostgresDB-->>AdaptivePaginator: rows
end
AdaptivePaginator-->>Client: page, nextCursor
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.12.2)
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
Actionable comments posted: 2
🧹 Nitpick comments (1)
internal/storage/common/resource.go (1)
345-350: 💤 Low value
Consider validating the OrderExpression format.
The code assumes paginator.OrderExpression() always returns "column direction" format and uses strings.Cut to split on space. If a future Paginator implementation returns a different format (e.g., no space), the parsing silently produces invalid SQL. Consider checking the boolean return value from strings.Cut or adding a comment documenting the required format.
🛡️ Defensive validation
 orderExpr := paginator.OrderExpression()
 col, dir, _ := strings.Cut(orderExpr, " ")
+if dir == "" {
+	return nil, fmt.Errorf("OrderExpression must return 'column direction' format, got: %q", orderExpr)
+}
 finalQuery = finalQuery.Order(fmt.Sprintf("dataset.%s %s", col, dir))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/common/resource.go` around lines 345 - 350, Validate the format returned by paginator.OrderExpression() before using strings.Cut: capture the boolean ok from strings.Cut(orderExpr, " ") and if false, handle the invalid format (e.g., return an error from the enclosing function or fall back to a safe default order) instead of blindly using col/dir; then only call finalQuery.Order(fmt.Sprintf("dataset.%s %s", col, dir)) when ok is true. Also consider adding a brief comment on the expected "column direction" format next to the paginator.OrderExpression() usage to document the requirement.
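A minimal sketch of the validation this comment asks for, written as a standalone helper so it can be run in isolation. The function name parseOrderExpression and the asc/desc whitelist are illustrative assumptions; the real code in resource.go may structure this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// parseOrderExpression splits an order expression of the form
// "column direction" and rejects anything that does not match it.
// Hypothetical helper: the name and the asc/desc check are assumptions.
func parseOrderExpression(expr string) (col, dir string, err error) {
	// strings.Cut reports ok=false when the separator is absent.
	col, dir, ok := strings.Cut(strings.TrimSpace(expr), " ")
	dir = strings.ToLower(strings.TrimSpace(dir))
	if !ok || col == "" || (dir != "asc" && dir != "desc") {
		return "", "", fmt.Errorf("order expression must be 'column direction', got %q", expr)
	}
	return col, dir, nil
}

func main() {
	for _, expr := range []string{"id desc", "id"} {
		col, dir, err := parseOrderExpression(expr)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("ORDER BY dataset.%s %s\n", col, dir)
	}
}
```

Checking ok rather than dir == "" also catches an input that has no separator at all, which is the case the comment is worried about.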
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/storage/ledger/store.go`:
- Around line 123-128: The Transactions() fast-path check treats only 0 as "no
timeout", causing negative StatementTimeoutMs to still use the wrapper; change
the guard in Store.Transactions from checking == 0 to <= 0 so negative/zero
values disable the hints and return transactionsBase(), and ensure the code that
applies statement_timeout (currently gated by > 0) remains unchanged; update the
condition involving txListConfig.StatementTimeoutMs in Transactions() (and any
analogous checks) to use <= 0 so behavior is consistent with the timeout
application in the RunInTx wrapper (transactionsPaginationWithHints).
- Around line 196-206: The current bun.Tx branch calls issueSetLocal on the
caller-managed transaction which leaves planner/session settings active for the
rest of that outer transaction; instead, start a nested transaction
(savepoint-style) for the Paginate work so the SET LOCAL doesn't leak.
Concretely: in the case bun.Tx branch, begin a new transaction from the outer tx
(create a nested bun.Tx via BeginTx on the store/outer tx), call issueSetLocal
on that nested tx, invoke transactionsBase() against that nested tx and call
Paginate(ctx, q), then commit/rollback the nested tx and return the Paginate
result; keep using the existing symbols issueSetLocal, transactionsBase(),
Paginate and the bun.Tx branch to locate where to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: de0298dd-34ef-4ba0-80be-243ec3dcfd96
📒 Files selected for processing (9)
- cmd/serve.go
- internal/storage/common/paginator.go
- internal/storage/common/paginator_column.go
- internal/storage/common/paginator_offset.go
- internal/storage/common/resource.go
- internal/storage/driver/module.go
- internal/storage/ledger/store.go
- internal/storage/ledger/transactions_planner_hints_test.go
- internal/storage/module.go
1012b5e to 577f762
♻️ Duplicate comments (1)
internal/storage/ledger/store.go (1)
271-278: ⚠️ Potential issue | 🟡 Minor | ⚖️ Poor tradeoff
SET LOCAL leaks to subsequent queries when caller is already in a transaction.
When store.db is already a bun.Tx (lines 273-278), the code issues SET LOCAL directly on the caller's transaction. Those settings (enable_indexscan = off, statement_timeout) persist for all subsequent statements within that outer transaction, not just this Paginate call.
This means if a caller does:
    store, tx, _ := baseStore.BeginTX(ctx, nil)
    defer tx.Rollback()
    store.Transactions().Paginate(ctx, q) // triggers fallback → enable_indexscan = off
    store.Logs().Paginate(ctx, q2)        // now runs with enable_indexscan = off
The workaround documented in the past review (nested transaction via savepoint) would fully isolate the settings. Given this is a defensive branch for an uncommon path, consider whether the current behavior is acceptable or if savepoint isolation is worth adding.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/ledger/store.go` around lines 271 - 278, The current defensive branch mutates the caller's bun.Tx by calling issueSetLocal directly on store.db (when it implements bun.Tx), leaking SET LOCAL state; instead, wrap the scope in a SAVEPOINT to isolate settings: on detecting store.db as bun.Tx, execute a SAVEPOINT (unique name), call issueSetLocal using that same tx, invoke a.store.transactionsBase().Paginate(ctx, q) while the savepoint is active, then ROLLBACK TO SAVEPOINT (to undo the session-local settings) and RELEASE the savepoint; update the branch handling store.db.(bun.Tx) around issueSetLocal and transactionsBase().Paginate to use this savepoint pattern so outer transaction settings are not mutated.
🧹 Nitpick comments (1)
internal/storage/ledger/store.go (1)
621-638: ⚡ Quick win
DefaultTransactionListConfig() is not applied automatically: a zero-valued config disables fallback.
The struct field comment on lines 50-53 says "Default true", but defaultOptions (line 640) doesn't include WithTransactionListConfig(DefaultTransactionListConfig()). If a Store is created without explicitly passing WithTransactionListConfig, the zero-valued TransactionListConfig has EnableAdaptiveFallback: false.
This is fine for the CLI path (which always supplies the config via FX wiring), but direct callers of ledgerstore.New() must remember to pass the option if they want the documented default behavior. Consider either:
- Updating the doc comment to clarify "default when wired through FX" vs "zero-value when created directly", or
- Adding WithTransactionListConfig(DefaultTransactionListConfig()) to defaultOptions so the behavior matches the documentation.
Option 2: Apply default config automatically
 var defaultOptions = []Option{
 	WithMeter(noopmetrics.Meter{}),
 	WithTracer(nooptracer.Tracer{}),
+	WithTransactionListConfig(DefaultTransactionListConfig()),
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/ledger/store.go` around lines 621 - 638, DefaultTransactionListConfig() is never applied automatically so a zero-valued TransactionListConfig disables adaptive fallback; update the store defaults by adding WithTransactionListConfig(DefaultTransactionListConfig()) into the defaultOptions slice so new Store instances get the documented defaults (adjust symbols: DefaultTransactionListConfig, WithTransactionListConfig, defaultOptions, TransactionListConfig, Store).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f0313601-844c-4765-9ac0-8f2fc7402a49
📒 Files selected for processing (5)
- cmd/serve.go
- internal/storage/driver/module.go
- internal/storage/ledger/store.go
- internal/storage/ledger/transactions_planner_hints_test.go
- internal/storage/module.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/storage/driver/module.go
- internal/storage/module.go
…imeout
Background
----------
SELECT … ORDER BY id DESC LIMIT N with JSONB @> predicates leads Postgres to
an Index Scan Backward on the id B-tree. For dense/recent wallets that plan is
fast. For sparse wallets (few matching rows scattered across a large id range)
the scan walks most of the table before accumulating N results — observed ~50 s
at prod-us-east-1-deriv (ledger v2.4.9).
Forcing GIN bitmap scan globally would hurt dense wallets, so a deployment-wide
planner override is not viable.
Adaptive mitigation (this PR)
------------------------------
This is a stopgap, not the real fix. The real fix is a composite or
denormalised index that serves both the wallet filter and id ordering without a
full sort step. The deliberate cost is up to <firstAttemptTimeout> of wasted
work per uncached sparse-wallet request.
Strategy (exactly one retry, no loop):
1. Execute the transactions-list SELECT inside an explicit read-only transaction
with SET LOCAL statement_timeout = <firstAttemptTimeout> (default 5 s).
Dense wallets finish well inside the budget — no fallback, no overhead beyond
the transaction wrapper.
2. If Postgres cancels with SQLSTATE 57014 AND the request context is still alive
(i.e. our timeout fired, not the client disconnecting), roll back and retry
once with SET LOCAL enable_indexscan = off plus SET LOCAL statement_timeout =
<retryTimeout> (default 40 s). Disabling index scans forces the planner onto
the GIN bitmap path for the JSONB @> predicates.
3. Any other error, or a client-context cancellation, is returned unchanged.
Exactly one retry. Never a loop.
Transaction hygiene: SET LOCAL is strictly scoped to each explicit BEGIN/COMMIT
block, so no planner setting or timeout leaks to subsequent queries on the same
pooled connection.
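The three-step strategy above can be sketched as follows. This is an illustration under stated assumptions, not the code in store.go: sqlStateError stands in for the driver's error type, runFunc stands in for "run the SELECT in its own read-only transaction with these SET LOCAL settings", and simulate is a fake harness added only so the control flow can be exercised without a database.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sqlStateError is a hypothetical stand-in for a driver error carrying a SQLSTATE.
type sqlStateError struct{ Code string }

func (e *sqlStateError) Error() string { return "sqlstate " + e.Code }

const queryCanceled = "57014"

// attempt holds the settings that would be applied with SET LOCAL for one try.
type attempt struct {
	TimeoutMs        int64 // statement_timeout for this attempt
	DisableIndexScan bool  // enable_indexscan = off on the retry
}

// runFunc is assumed to open a transaction, apply the settings, run the
// SELECT, and commit or roll back before returning.
type runFunc func(ctx context.Context, a attempt) error

// paginateAdaptive runs the probe and, at most, one retry.
func paginateAdaptive(ctx context.Context, run runFunc, firstMs, retryMs int64) (fellBack bool, err error) {
	err = run(ctx, attempt{TimeoutMs: firstMs})
	if err == nil {
		return false, nil
	}
	var se *sqlStateError
	// Retry only when Postgres cancelled the query and the client is still there.
	if !errors.As(err, &se) || se.Code != queryCanceled || ctx.Err() != nil {
		return false, err
	}
	return true, run(ctx, attempt{TimeoutMs: retryMs, DisableIndexScan: true})
}

// simulate drives paginateAdaptive with a fake run function whose first
// attempt fails with probeCode (empty means success).
func simulate(probeCode string, clientGone bool) (attempts int, fellBack bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	run := func(_ context.Context, _ attempt) error {
		attempts++
		if attempts == 1 && probeCode != "" {
			if clientGone {
				cancel() // the client disconnects while the probe is running
			}
			return fmt.Errorf("paginate: %w", &sqlStateError{Code: probeCode})
		}
		return nil
	}
	fellBack, _ = paginateAdaptive(ctx, run, 5000, 40000)
	return attempts, fellBack
}

func main() {
	fmt.Println(simulate("", false))      // probe succeeds
	fmt.Println(simulate("57014", false)) // probe times out, one retry
	fmt.Println(simulate("57014", true))  // client gone, no retry
	fmt.Println(simulate("40001", false)) // unrelated error, no retry
}
```

Because each attempt is assumed to run in its own transaction, the settings from the probe cannot carry over into the retry or into later queries on the same pooled connection.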
Changes
-------
- internal/storage/ledger/store.go
· TransactionListConfig struct (EnableAdaptiveFallback, FirstAttemptTimeoutMs,
RetryTimeoutMs) + DefaultTransactionListConfig() + WithTransactionListConfig()
· transactionsAdaptivePaginator: probe-then-retry Paginate; GetOne/Count
delegate straight to the base repository (no adaptive logic needed)
· paginateInTx / issueSetLocal helpers for the explicit-tx + SET LOCAL dance
· OTel metrics: store.tx_list_fallback_total (counter),
store.tx_list_first_attempt_duration (histogram),
store.tx_list_retry_duration (histogram, outcome attribute)
· Structured log on every fallback event
- internal/storage/driver/module.go — ModuleConfig carries TransactionListConfig
- internal/storage/module.go — forwards TransactionListConfig to driver
- cmd/serve.go
· --tx-list-adaptive-fallback (default true)
· --tx-list-first-attempt-timeout-ms (default 5000)
· --tx-list-retry-timeout-ms (default 40000)
- internal/storage/ledger/transactions_planner_hints_test.go (//go:build it)
· FastPathUnchanged — generous timeout, dense wallet, result == baseline
· FallbackTriggered — 1 ms probe, retry succeeds, rows correct + ordered DESC
· FallbackRowsMatch — fallback result == baseline result row-for-row
· NoLeakage — pinned conn: enable_indexscan + statement_timeout restored
· RetryAlsoTimesOut — both timeouts 1 ms → error propagated, no loop
· DisabledFallback — EnableAdaptiveFallback=false → plain path, no overhead
· CursorIntegrity — next-page cursor from fallback path is decodable and correct
· GetOneAndCountUnaffected — adaptive store GetOne/Count == base store
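For orientation, the configuration surface listed above might look roughly like this. The struct, its field names, and the default values (true, 5000, 40000) come from the change list; the int64 field type and the Validate method are assumptions added for illustration and may not match store.go.

```go
package main

import (
	"errors"
	"fmt"
)

// TransactionListConfig mirrors the fields named in the change list.
type TransactionListConfig struct {
	EnableAdaptiveFallback bool
	FirstAttemptTimeoutMs  int64 // probe statement_timeout, in milliseconds
	RetryTimeoutMs         int64 // retry statement_timeout, in milliseconds
}

// DefaultTransactionListConfig returns the defaults documented for the CLI flags.
func DefaultTransactionListConfig() TransactionListConfig {
	return TransactionListConfig{
		EnableAdaptiveFallback: true,
		FirstAttemptTimeoutMs:  5000,
		RetryTimeoutMs:         40000,
	}
}

// Validate is a hypothetical addition: it rejects negative timeouts before
// they could reach a SET LOCAL statement.
func (c TransactionListConfig) Validate() error {
	if c.FirstAttemptTimeoutMs < 0 || c.RetryTimeoutMs < 0 {
		return errors.New("transaction list timeouts must be non-negative")
	}
	return nil
}

func main() {
	cfg := DefaultTransactionListConfig()
	fmt.Printf("defaults: %+v valid=%t\n", cfg, cfg.Validate() == nil)

	// The zero value has the fallback switched off, which is why a Store
	// built without the option behaves differently from the CLI path.
	var zero TransactionListConfig
	fmt.Printf("zero value: %+v\n", zero)
}
```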
577f762 to e27296c
Three gaps addressed:
1. ClientCancelNoRetry (new test): proves that a cancelled request context returns context.Canceled immediately and never triggers the retry. This is the critical client-disconnect safety property: if the ctx.Err() guard in Paginate were wrong, a dead-client request would fire a wasted retry.
2. RetryAlsoTimesOut: bumped data set from 3 to 200 rows so the 1 ms statement_timeout reliably fires SQLSTATE 57014 even on fast CI runners. With only 3 rows the probe could complete in < 1 ms and require.Error would fail spuriously.
3. FallbackTriggeredByTimeout / FallbackRowsMatchBaseline: same 200-row fix. Both tests used 1 ms probes with tiny data sets that might not time out, causing the assertions to pass without exercising the fallback path at all.
…d edge-case tests
Code fix
--------
paginateInTx previously issued SET LOCAL directly onto an outer bun.Tx when
store.db was already a transaction (BeginTX path). If the probe timed out
(57014), Postgres put the outer transaction into an error state. The retry then
attempted to issue another SET LOCAL on the aborted transaction, which failed
with InFailedSqlTransaction (25P02), leaving the caller's transaction
permanently unusable.
This path is unreachable today (list operations are never called within an
explicit BeginTX), but it was a silent trap for future callers.
Fix: when store.db is a bun.Tx, skip the adaptive machinery entirely and
delegate straight to the base paginator.
New tests
---------
- EmptyResultSet: zero wallet rows → empty cursor, no error, HasMore false.
Also confirms the fast path doesn't blow up on a zero-row response.
- AlreadyInTxSkipsAdaptive: opens an outer transaction via BeginTX, then calls
Transactions().Paginate() with a 1 ms probe timeout. Verifies the call
succeeds, returns the right rows, and that statement_timeout on the outer
transaction is still "0" (the adaptive SET LOCAL never ran).
The test was flaky on GitHub Actions because the GIN bitmap scan (used on the
retry path) completed in < 1ms even with 200 rows, so statement_timeout never
fired and require.Error always saw nil.
Add a testHookBeforePaginateSelect field to Store and a setter
SetTestHookBeforePaginateSelect that tests can use to inject arbitrary SQL
inside paginateInTx, sharing the same transaction and therefore the same SET
LOCAL statement_timeout scope. The test now installs SELECT pg_sleep(0.005),
which reliably exceeds a 1ms timeout regardless of how fast the real SELECT
would have been.
Also fix gofmt alignment on three struct fields in Store (extra spaces → gofmt
canonical alignment) and a double-space in the test (1, // → 1, //).
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1408      +/-   ##
==========================================
+ Coverage   80.55%   80.84%   +0.29%
==========================================
  Files         206      206
  Lines       11293    11372      +79
==========================================
+ Hits         9097     9194      +97
+ Misses       1646     1619      -27
- Partials      550      559       +9
The FallbackTriggeredByTimeout / FallbackRowsMatchBaseline / NoLeakage tests
relied on dataset size (200 rows) to make the 1 ms probe fire SQLSTATE 57014.
On a fast CI runner those queries complete in < 1 ms and the fallback never
triggers, leaving the retry-success branch in Paginate (the else { Infof } block)
and several issueSetLocal branches permanently uncovered.
Add three targeted tests that use the pg_sleep hook introduced in the previous
commit so the fallback fires deterministically regardless of query speed:
* FallbackRetrySucceedsViaPgSleep — hook makes the probe fire 57014 and is a
no-op on the retry, covering the retry-success path and its logging.
* CtxCancelAfterProbe — hook cancels the outer ctx then fires 57014, covering
the ctx.Err() != nil branch in Paginate (no retry when client disconnects).
* ZeroTimeoutNoSetLocal — FirstAttemptTimeoutMs = 0 skips SET LOCAL entirely,
covering the if timeoutMs > 0 { } false branch in issueSetLocal; also
exercises DefaultTransactionListConfig().
Also add a no-op hook to FastPathUnchanged to cover the "hook set but returns
nil" fall-through path in paginateInTx.
🧹 Nitpick comments (1)
internal/storage/ledger/transactions_planner_hints_test.go (1)
553-561: ⚡ Quick win
Strengthen no-retry verification in the cancel-after-probe test.
Line 559 currently verifies only that an error is returned. A retry regression could still pass this assertion. Track hook calls and assert exactly one attempt.
Suggested test hardening
 func TestTransactionListAdaptive_CtxCancelAfterProbe(t *testing.T) {
 @@
-	adaptive.SetTestHookBeforePaginateSelect(func(_ context.Context, tx bun.Tx) error {
+	callCount := 0
+	adaptive.SetTestHookBeforePaginateSelect(func(_ context.Context, tx bun.Tx) error {
+		callCount++
 		cancel()
 		_, err := tx.ExecContext(context.Background(), "SELECT pg_sleep(0.005)")
 		return err // SQLSTATE 57014; ctx is now cancelled
 	})
 @@
 	_, err := adaptive.Transactions().Paginate(ctx, walletQuery(15))
 	require.Error(t, err, "must return an error when context is cancelled after probe timeout")
+	require.Equal(t, 1, callCount, "must not retry after context cancellation")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/ledger/transactions_planner_hints_test.go` around lines 553 - 561, The test should verify the paginate hook is invoked exactly once to ensure no retries: add a counter (e.g. attempts := 0) captured by the closure passed to adaptive.SetTestHookBeforePaginateSelect and increment it each time the hook runs, leaving the existing cancel() and tx.ExecContext sleep behavior unchanged; after calling adaptive.Transactions().Paginate(ctx, walletQuery(15)) assert that an error was returned and that the counter equals 1 (no retries), referencing the existing hook setup function adaptive.SetTestHookBeforePaginateSelect, the call adaptive.Transactions().Paginate, and walletQuery(15) to locate where to insert the counter and final assertion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2a7d7c3f-8ea1-4db1-97d8-0932f9cf3b7e
⛔ Files ignored due to path filters (1)
go.mod is excluded by !**/*.mod
📒 Files selected for processing (2)
- internal/storage/ledger/store.go
- internal/storage/ledger/transactions_planner_hints_test.go
🛑 Changes requested: multi-model review
The adaptive fallback implementation is well-structured overall: the in-transaction bypass is a sound safety measure, the fx/config/metrics plumbing is clean, and the test coverage is a good foundation. However, several correctness and robustness issues require attention before merging.
The most critical concerns are: (1) the test hook field is read and written on a shared Store without synchronization, creating a real data race; (2) SQLSTATE 57014 is shared between statement_timeout, pg_cancel_backend, and other server-side cancellations, so the current classification can trigger spurious 40-second retries on non-timeout cancellations; (3) there is no integration test verifying that a real statement_timeout error propagates unwrapped through the bun/paginate stack to actually trigger the fallback; and (4) existing callers that construct ModuleConfig without the new TransactionListConfig field silently receive the zero value, disabling the fallback outside the cmd/serve path despite the documented defaults.
Minor issues include: the SET LOCAL statement_timeout = 0 omission meaning inherited session timeouts are not overridden as documented; no validation of negative or implausibly large timeout values in the config; metric instruments that are unconditionally dereferenced and could panic in non-standard construction paths; the in-transaction bypass being silent and unobservable; and probe timeouts being logged at error level for what is an expected, handled condition.
🟠 [major] testHookBeforePaginateSelect mutates shared Store without synchronization
SetTestHookBeforePaginateSelect writes store.testHookBeforePaginateSelect while paginateInTx reads it concurrently across goroutines serving HTTP requests. A Store is shared across goroutines, so even though the hook is intended for tests only, the nil-check and read on every Paginate call creates a data race. Exposing a test-only mutation method on the production Store type is also a maintenance and security smell.
Suggestion: Gate the hook behind a build tag, or inject it via an unexported field set only at construction time (e.g. an Option used solely in tests), so it cannot be mutated on a live shared Store. At minimum document that it must only be set before the store serves any traffic.
🟠 [major] SQLSTATE 57014 covers client/admin cancellation as well as statement_timeout
isStatementTimeout treats any pgerrcode.QueryCanceled (57014) as a probe timeout. Postgres returns 57014 for statement_timeout, pg_cancel_backend, and lock_timeout-adjacent cancellations alike. The ctx.Err()==nil guard only filters client disconnects detected on the Go side; a server-side cancel (e.g. an operator running pg_cancel_backend) would be misclassified as a probe timeout and trigger a spurious retry with enable_indexscan=off, potentially running for up to RetryTimeoutMs (40s) doing unwanted work.
Suggestion: Distinguish statement_timeout from generic cancellation by also inspecting pgErr.Message (e.g. checking for 'canceling statement due to statement timeout') in addition to the code. At minimum document this residual ambiguity in a comment on isStatementTimeout.
🟠 [major] Error classification depends on raw *pgconn.PgError surviving the bun/paginate stack
isStatementTimeout relies on errors.As reaching the underlying *pgconn.PgError. If any layer in the paginator stack (bun, the common paginate package, etc.) wraps or replaces the driver error with a non-wrapping type or calls ResolveError internally, errors.As will fail and the fallback will silently never trigger, so the exact pathological case this PR targets would still time out. There is no test asserting that a real statement_timeout from the SELECT is classified correctly through the full stack.
Suggestion: Add an integration test that drives a genuine statement_timeout on the actual transactions SELECT (without the hook) and asserts the fallback fires, to lock in that the error type propagates unwrapped through the paginator.
🟠 [major] Zero-value TransactionListConfig silently disables the adaptive fallback for existing callers
TransactionListConfig is forwarded unconditionally. Existing callers that construct storage.ModuleConfig without the new field will pass the zero value through to ledgerstore.WithTransactionListConfig, leaving EnableAdaptiveFallback=false and both timeouts at 0. This contradicts the PR's documented default of fallback enabled with 5s/40s timeouts and means the mitigation may be silently disabled outside the cmd serve path where Cobra provides defaults.
Suggestion: Normalize defaults at the module/store boundary. Initialize from ledgerstore.DefaultTransactionListConfig() and only override explicitly supplied values, or make ModuleConfig.TransactionListConfig a pointer so nil means defaults while an explicit config can disable the fallback.
🟡 [minor] No validation that probe+retry budgets are non-negative or fit within client timeout
The flags accept arbitrary int64 values. Negative timeouts would be passed straight into SET LOCAL statement_timeout = , which Postgres rejects at query time, surfacing as an opaque error. The design also requires RetryTimeoutMs ≤ client_timeout − FirstAttemptTimeoutMs, but nothing enforces or warns about this invariant.
Suggestion: Add validation in cfg.Validate() (or wherever serve config is validated) to reject negative timeouts and optionally warn when RetryTimeoutMs is implausibly large relative to known client timeouts.
🟡 [minor] timeoutMs=0 does not actually disable the statement timeout when the session already has one set
paginateInTx documents timeoutMs: 0 = no timeout, and issueSetLocal skips SET LOCAL statement_timeout when the value is 0. However, if the role, database, DSN, or session already has a nonzero statement_timeout, the query will still be bounded by that inherited timeout and may unexpectedly trigger the fallback.
Suggestion: If 0 is intended to disable the server-side timeout for that attempt, explicitly execute SET LOCAL statement_timeout = 0 when timeoutMs is 0. Otherwise update the comment and configuration documentation to clarify that 0 means 'do not override the existing session timeout'.
🟡 [minor] Fallback metric instruments may be nil if Store is constructed outside New()
txListFallbackCounter, txListFirstAttemptDurationMs, and txListRetryDurationMs are only initialized in New(). They are dereferenced unconditionally in the fallback branch. While current construction paths go through New(), any future struct-literal construction or refactor would cause a nil-pointer panic in the fallback branch.
Suggestion: Defensively nil-check the metric instruments before calling Add/Record, or add a comment and construction invariant guaranteeing they are always non-nil when EnableAdaptiveFallback is true.
🟡 [minor] In-transaction bypass of adaptive fallback is silent and unobservable
paginateInTx returns the plain base paginator whenever store.db is a bun.Tx, which is correct for safety, but means any transactions-list query inside an outer transaction silently loses the sparse-wallet protection and can still hit the long timeout. There is no metric or log indicating the mitigation was skipped, making it impossible for operators to tell whether a slow list ran with or without the fallback.
Suggestion: Emit a debug log or increment a counter when the adaptive path is bypassed due to an outer transaction, so the skipped-mitigation case is observable in production.
⚪ [nit] FirstAttemptTimeoutMs=0 silently makes the fallback unreachable
When FirstAttemptTimeoutMs is 0, issueSetLocal skips SET LOCAL statement_timeout, so the probe runs with no server-side timeout and can only be cancelled by the Go context. In that configuration the fallback can effectively never fire on the probe (no 57014 can be raised), making the adaptive machinery a pure read-only-transaction wrapper. This behaviour is non-obvious to operators.
Suggestion: Document that FirstAttemptTimeoutMs=0 disables the probe-timeout trigger entirely, rendering the fallback unreachable under normal conditions.
⚪ [nit] Probe timeout logged at error level for expected, handled condition
Every fallback event logs at Errorf level ('transactions list probe timed out'). For deployments with sparse wallets this is expected/benign behaviour, not an error, and will create noisy error-level log volume that may trip error-rate alerts.
Suggestion: Log the probe timeout at Warn or Info level since it is a handled, expected condition. Reserve Errorf for the retry-failure case.
Reviewed in parallel by claude (anthropic/claude-opus-4-8) and gpt (openai/gpt-5.5), then consolidated. This comment is updated on each push.
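The 57014 ambiguity raised in this review could be narrowed along the lines below. This is a sketch only: pgError is a stand-in with the two fields of interest (the PR reportedly uses *pgconn.PgError), and newPgError and wrap are hypothetical helpers for exercising it. Matching on message text also assumes the server emits English messages, which depends on its lc_messages setting.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// pgError is a hypothetical stand-in for the driver's error type.
type pgError struct {
	Code    string
	Message string
}

func (e *pgError) Error() string { return e.Code + ": " + e.Message }

// newPgError and wrap exist only to make the sketch easy to exercise.
func newPgError(code, msg string) error { return &pgError{Code: code, Message: msg} }
func wrap(err error) error              { return fmt.Errorf("paginate: %w", err) }

// isStatementTimeout reports whether err is a query cancellation that the
// server attributes to statement_timeout rather than to a user or admin cancel.
func isStatementTimeout(err error) bool {
	var pe *pgError
	if !errors.As(err, &pe) || pe.Code != "57014" {
		return false
	}
	return strings.Contains(pe.Message, "statement timeout")
}

func main() {
	timeout := wrap(newPgError("57014", "canceling statement due to statement timeout"))
	cancel := wrap(newPgError("57014", "canceling statement due to user request"))
	fmt.Println(isStatementTimeout(timeout), isStatementTimeout(cancel))
}
```

The wrap helper uses %w, so errors.As still reaches the inner error; a layer that formats with %v instead would break the classification, which is the concern behind the third major finding.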
…a queries
Adds INDEXED_METADATA_KEYS feature flag (comma-separated key names). When a
metadata key is flagged, the query builder emits
metadata ->> 'key' = 'value'
instead of
metadata @> '{"key": "value"}'
allowing Postgres to use a partial functional BTree index rather than falling
back to a full backward scan on the id index.
Migrations 53 and 54 create partial functional indexes scoped to ledger =
'deriv' for source_wallet_id and destination_wallet_id.
Simulation at 1 M rows (0.005% selectivity, matching Deriv production scale):
plain @> path: 497 ms
indexed ->> path: 1 ms
speedup: 452×
To activate for Deriv after migrations have run:
PATCH /ledgers/deriv {"features": {"INDEXED_METADATA_KEYS": "source_wallet_id,destination_wallet_id"}}
NumaryBot
left a comment
🛑 Changes requested — automated review
The adaptive retry path is broadly structured correctly, but the newly added functional-index path has planner-visibility issues that can make the intended indexes unusable in realistic deployments. These are performance bugs in the mitigation/index work and should be addressed.
@@ -0,0 +1,3 @@
create index {{ if not .Transactional }}concurrently{{end}} if not exists transactions_metadata_source_wallet_id_deriv
on "{{.Schema}}".transactions ((metadata->>'source_wallet_id'))
where ledger = 'deriv';
🟠 [major] Keep ledger predicate usable for partial indexes
When the bucket contains only the deriv ledger, newScopedSelect() deliberately omits the ledger = ? filter, so a transactions-list query on that bucket will not imply this partial-index predicate. PostgreSQL therefore cannot use these new indexes in the single-ledger-bucket case, which is a likely deployment shape for the targeted Deriv ledger and makes the migration ineffective for the sparse-wallet queries it is meant to help.
return "metadata @> ?", []any{map[string]any{
	match[0][1]: value,
}}, nil
if slices.Contains(h.store.ledger.GetIndexedMetadataKeys(), key) {
🟠 [major] Inline the indexed metadata key in the expression
When INDEXED_METADATA_KEYS is enabled, this emits metadata ->> $1 = $2 with the key as a bind parameter rather than the indexed expression metadata ->> 'source_wallet_id' = $1. That parameterized expression is not reliably matched to the new functional index on (metadata->>'source_wallet_id'), especially once prepared/generic plans are involved, so the optimization can silently miss the index for the sparse-wallet path.
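One way to make the key appear literally in the predicate is sketched below. This is an assumption-laden illustration, not the PR's code: `indexedPredicate` and the `safeKey` pattern are hypothetical, and the character whitelist is simply one conservative way to keep an inlined key from breaking out of the SQL string literal.

```go
package main

import (
	"fmt"
	"regexp"
)

// safeKey is an assumed allow-pattern for metadata keys that may be inlined
// into SQL text; anything else is rejected rather than escaped.
var safeKey = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// indexedPredicate returns a predicate whose left-hand side is textually the
// indexed expression metadata ->> '<key>'; only the value stays a placeholder.
func indexedPredicate(key string, value any) (string, []any, error) {
	if !safeKey.MatchString(key) {
		return "", nil, fmt.Errorf("metadata key %q cannot be inlined safely", key)
	}
	return fmt.Sprintf("metadata ->> '%s' = ?", key), []any{value}, nil
}
```

Because the key comes from ledger configuration rather than the request, rejecting unexpected characters outright is simpler than trying to escape them.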
Actionable comments posted: 5
🧹 Nitpick comments (1)
internal/storage/ledger/transactions_metadata_index_test.go (1)
36-40: ⚡ Quick win: Add a regression test for comma+space feature values.
Please add a case like "source_wallet_id, destination_wallet_id" to ensure indexed-key detection remains robust for human-edited config strings.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/storage/ledger/transactions_metadata_index_test.go` around lines 36 - 40, Add a new test case in the transactions_metadata_index_test.go file that uses the withIndexedMetadataKeys helper function with a comma+space separated string value like "source_wallet_id, destination_wallet_id" to verify that the indexed-key detection logic correctly handles human-edited configuration strings with spaces following commas, not just comma-separated values without spaces. This regression test should validate that the feature value parsing remains robust for both formats.
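A minimal sketch of the behaviour such a regression test would pin down, assuming a standalone helper (the name `parseIndexedMetadataKeys` is hypothetical; the real method is `GetIndexedMetadataKeys` on `Ledger`):

```go
package main

import "strings"

// parseIndexedMetadataKeys splits a comma-separated feature value, trims
// whitespace around each token, and drops empty tokens.
func parseIndexedMetadataKeys(val string) []string {
	var keys []string
	for _, k := range strings.Split(val, ",") {
		if k = strings.TrimSpace(k); k != "" {
			keys = append(keys, k)
		}
	}
	return keys
}
```

With this shape, "source_wallet_id, destination_wallet_id" and "source_wallet_id,destination_wallet_id" parse to the same two keys, and a value made only of commas and spaces parses to no keys.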
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/ledger.go`:
- Around line 41-50: The GetIndexedMetadataKeys method does not normalize the
tokens parsed from the comma-separated feature value, preserving whitespace
around delimiters. When splitting "source_wallet_id, destination_wallet_id" by
comma, the result includes " destination_wallet_id" with leading space, which
causes downstream slices.Contains checks to fail. Modify the return statement to
iterate through each token from strings.Split, trim whitespace from each token
using strings.TrimSpace, skip any empty entries after trimming, and return only
the cleaned, non-empty tokens.
In `@internal/storage/ledger/resource_transactions.go`:
- Around line 151-154: The code is checking if a key exists in indexed metadata
keys using slices.Contains with h.store.ledger.GetIndexedMetadataKeys(), but
indexed metadata keys may contain space padding (like "source_wallet_id,
destination_wallet_id"), causing the exact match to fail and silently skipping
the indexed branch. Normalize both the key parameter and the indexed metadata
keys by trimming whitespace before performing the membership check in the
slices.Contains call to ensure space-padded keys are properly matched.
In `@internal/storage/ledger/transactions_deriv_sim_test.go`:
- Around line 229-233: The totalRows variable obtained from getEnvInt for
DERIV_SIM_ROWS is used as a divisor in the selectivity calculation without
validation. If it is set to zero, the floating-point division yields +Inf or NaN
rather than a meaningful percentage, and other invalid values produce misleading
results. Add validation after retrieving totalRows and walletRows using getEnvInt
to ensure totalRows is greater than zero, walletRows is non-negative, and
walletRows does not exceed totalRows. If validation fails, either use sensible
defaults or fail the test early with a clear error message before attempting the
division in the t.Logf call.
- Around line 193-210: The EXPLAIN query in the test hardcodes the filter value
directly into the SQL string using fmt.Sprintf with %s in the else branch of the
filter condition, but the actual runtime paginator uses parameterized queries
with placeholders. To make the test query shape-equivalent to the real paginator
query, modify the predicate construction in the else block to use a
parameterized placeholder (?) instead of embedding the value directly with
string formatting, so the EXPLAIN plan accurately reflects what the runtime
query path will execute.
In `@pkg/features/features.go`:
- Around line 26-31: The FeatureIndexedMetadataKeys flag accepts arbitrary
metadata key lists without validation, but downstream code in ResolveFilter
rewrites these keys using text extraction (metadata ->> ? = ?) which only works
correctly for string-valued metadata. This causes incorrect behavior for
numeric, boolean, or object metadata values. Add a dedicated validator or parser
function that processes the INDEXED_METADATA_KEYS configuration value
(referenced at lines 56 and 65-66) to enforce that only string-valued indexed
metadata keys are allowed, either by validating against a whitelist of known
string-indexed keys like wallet IDs or by implementing a contract validator that
ensures type safety for the configured keys.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4764d899-30cb-4f91-b5c0-d40c0bd32f72
⛔ Files ignored due to path filters (2)
internal/storage/bucket/migrations/53-add-metadata-source-wallet-id-index-deriv/notes.yaml is excluded by !**/*.yaml
internal/storage/bucket/migrations/54-add-metadata-destination-wallet-id-index-deriv/notes.yaml is excluded by !**/*.yaml
📒 Files selected for processing (7)
internal/ledger.go
internal/storage/bucket/migrations/53-add-metadata-source-wallet-id-index-deriv/up.sql
internal/storage/bucket/migrations/54-add-metadata-destination-wallet-id-index-deriv/up.sql
internal/storage/ledger/resource_transactions.go
internal/storage/ledger/transactions_deriv_sim_test.go
internal/storage/ledger/transactions_metadata_index_test.go
pkg/features/features.go
// GetIndexedMetadataKeys returns the list of metadata keys for which the query
// builder will emit a functional-index-compatible predicate (metadata ->> 'key' = 'value')
// instead of the default JSONB containment form. The list is stored as a comma-separated
// string in Features[FeatureIndexedMetadataKeys].
func (l Ledger) GetIndexedMetadataKeys() []string {
	val := l.Features[features.FeatureIndexedMetadataKeys]
	if val == "" {
		return nil
	}
	return strings.Split(val, ",")
Normalize tokens from the comma-separated feature value.
Line 50 preserves whitespace and empty tokens, so a common config like source_wallet_id, destination_wallet_id returns " destination_wallet_id" and the downstream exact slices.Contains(..., key) check will not enable the indexed predicate for that key. Trim and skip empty entries before returning.
Proposed parsing fix
 func (l Ledger) GetIndexedMetadataKeys() []string {
 	val := l.Features[features.FeatureIndexedMetadataKeys]
 	if val == "" {
 		return nil
 	}
-	return strings.Split(val, ",")
+
+	keys := make([]string, 0, strings.Count(val, ",")+1)
+	for _, key := range strings.Split(val, ",") {
+		key = strings.TrimSpace(key)
+		if key != "" {
+			keys = append(keys, key)
+		}
+	}
+	return keys
 }
if slices.Contains(h.store.ledger.GetIndexedMetadataKeys(), key) {
	return "metadata ->> ? = ?", []any{key, value}, nil
}
return "metadata @> ?", []any{map[string]any{key: value}}, nil
Normalize indexed metadata keys before membership checks.
Space-padded feature values (for example "source_wallet_id, destination_wallet_id") won’t match key exactly, so the indexed branch is silently skipped.
Suggested fix
 // internal/ledger.go
 func (l Ledger) GetIndexedMetadataKeys() []string {
 	val := l.Features[features.FeatureIndexedMetadataKeys]
 	if val == "" {
 		return nil
 	}
-	return strings.Split(val, ",")
+	raw := strings.Split(val, ",")
+	keys := make([]string, 0, len(raw))
+	for _, k := range raw {
+		k = strings.TrimSpace(k)
+		if k != "" {
+			keys = append(keys, k)
+		}
+	}
+	return keys
 }
// Replicate what resource_transactions.ResolveFilter + BuildDataset produce.
// ledgerName and value are test-controlled alphanumeric strings, safe to embed.
var predicate string
if filter == "@>" {
	// %q adds Go double-quotes; JSON string values need double-quotes, so this is correct.
	predicate = fmt.Sprintf(`metadata @> '{"source_wallet_id": %q}'`, value)
} else {
	// SQL string literals use single quotes.
	predicate = fmt.Sprintf(`metadata ->> 'source_wallet_id' = '%s'`, value)
}

sql := fmt.Sprintf(`
	EXPLAIN (FORMAT TEXT)
	SELECT id FROM %q.transactions
	WHERE ledger = '%s'
	AND %s
	ORDER BY id DESC LIMIT 16
`, schema, ledgerName, predicate)
plan_after is not shape-equivalent to the real paginator query.
This EXPLAIN path hardcodes metadata ->> 'source_wallet_id' = ..., but runtime filtering uses metadata ->> ? = ?. That can make the logged plan unrepresentative of the actual query path.
Suggested fix
 func explainAnalyze(t *testing.T, store *ledgerstore.Store, filter string, value string) string {
@@
-	var predicate string
-	if filter == "@>" {
-		predicate = fmt.Sprintf(`metadata @> '{"source_wallet_id": %q}'`, value)
-	} else {
-		predicate = fmt.Sprintf(`metadata ->> 'source_wallet_id' = '%s'`, value)
-	}
-
-	sql := fmt.Sprintf(`
-		EXPLAIN (FORMAT TEXT)
-		SELECT id FROM %q.transactions
-		WHERE ledger = '%s'
-		AND %s
-		ORDER BY id DESC LIMIT 16
-	`, schema, ledgerName, predicate)
-
-	rows, err := store.GetDB().QueryContext(ctx, sql)
+	var (
+		sql  string
+		args []any
+	)
+	if filter == "@>" {
+		sql = fmt.Sprintf(`
+			EXPLAIN (FORMAT TEXT)
+			SELECT id FROM %q.transactions
+			WHERE ledger = $1
+			AND metadata @> $2::jsonb
+			ORDER BY id DESC LIMIT 16
+		`, schema)
+		args = []any{ledgerName, fmt.Sprintf(`{"source_wallet_id":%q}`, value)}
+	} else {
+		sql = fmt.Sprintf(`
+			EXPLAIN (FORMAT TEXT)
+			SELECT id FROM %q.transactions
+			WHERE ledger = $1
+			AND metadata ->> $2 = $3
+			ORDER BY id DESC LIMIT 16
+		`, schema)
+		args = []any{ledgerName, "source_wallet_id", value}
+	}
+
+	rows, err := store.GetDB().QueryContext(ctx, sql, args...)
 	require.NoError(t, err)
totalRows := getEnvInt("DERIV_SIM_ROWS", 1_000_000)
walletRows := getEnvInt("DERIV_SIM_WALLET", 50)

t.Logf("simulation: %d total rows, %d wallet rows (%.4f%% selectivity)",
	totalRows, walletRows, float64(walletRows)/float64(totalRows)*100)
Validate DERIV_SIM_ROWS / DERIV_SIM_WALLET bounds before use.
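totalRows is used as a divisor; with DERIV_SIM_ROWS=0 the floating-point division in the log line yields +Inf or NaN rather than a usable selectivity (Go does not panic on float division by zero), and invalid ranges can silently produce misleading simulations.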
Suggested guardrails
 totalRows := getEnvInt("DERIV_SIM_ROWS", 1_000_000)
 walletRows := getEnvInt("DERIV_SIM_WALLET", 50)
+require.Greater(t, totalRows, 0, "DERIV_SIM_ROWS must be > 0")
+require.GreaterOrEqual(t, walletRows, 0, "DERIV_SIM_WALLET must be >= 0")
+require.LessOrEqual(t, walletRows, totalRows, "DERIV_SIM_WALLET must be <= DERIV_SIM_ROWS")
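The same bounds can also be expressed as a small helper outside the test assertions. This is only a sketch under assumptions: `selectivityPercent` is a hypothetical name, and returning an error instead of failing the test is one possible design.

```go
package main

import "fmt"

// selectivityPercent validates the simulation sizes and returns the share of
// wallet rows as a percentage of total rows.
func selectivityPercent(walletRows, totalRows int) (float64, error) {
	if totalRows <= 0 {
		return 0, fmt.Errorf("total rows must be > 0, got %d", totalRows)
	}
	if walletRows < 0 || walletRows > totalRows {
		return 0, fmt.Errorf("wallet rows must be in [0, %d], got %d", totalRows, walletRows)
	}
	return float64(walletRows) / float64(totalRows) * 100, nil
}
```

With the defaults quoted above (50 wallet rows out of 1,000,000), this returns roughly 0.005, matching the selectivity stated for the simulation.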
// FeatureIndexedMetadataKeys is a comma-separated list of metadata keys for which the query builder
// emits a functional-index-compatible predicate (metadata ->> 'key' = 'value') instead of the default
// JSONB containment form (metadata @> '{"key":"value"}'). A matching partial functional index must
// exist on the ledger's transactions table for the rewrite to actually speed up the query.
// Value: comma-separated key names, e.g. "source_wallet_id,destination_wallet_id". Empty = disabled.
FeatureIndexedMetadataKeys = "INDEXED_METADATA_KEYS"
Constrain this open-ended flag to string-valued indexed metadata keys.
The nil entry on Line 56 plus Line 65-Line 66 accepts arbitrary key lists. Downstream ResolveFilter rewrites every configured key to metadata ->> ? = ?, which is text extraction and is not equivalent to JSONB containment for numeric/bool/object metadata values; misconfigured keys can return different rows or fail instead of falling back. Add a dedicated validator/parser for INDEXED_METADATA_KEYS that enforces the intended string-valued indexed-key contract, or restrict it to the currently indexed wallet-id keys.
Also applies to: 56-56, 65-66
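A rough sketch of the "restrict it to the currently indexed wallet-id keys" option follows. The function name `validateIndexedMetadataKeys` and the allowlist variable are hypothetical, and how such a validator would plug into the feature registry in pkg/features is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// stringValuedKeys is an assumed allowlist: the two wallet-id keys covered by
// migrations 53 and 54.
var stringValuedKeys = map[string]struct{}{
	"source_wallet_id":      {},
	"destination_wallet_id": {},
}

// validateIndexedMetadataKeys rejects any configured key that is not on the
// allowlist. Whitespace around tokens and empty tokens are tolerated.
func validateIndexedMetadataKeys(value string) error {
	for _, raw := range strings.Split(value, ",") {
		key := strings.TrimSpace(raw)
		if key == "" {
			continue
		}
		if _, ok := stringValuedKeys[key]; !ok {
			return fmt.Errorf("INDEXED_METADATA_KEYS: %q is not a known string-valued indexed key", key)
		}
	}
	return nil
}
```

An allowlist trades flexibility for safety: adding an index for a new key would then require a code change as well as a migration, which may or may not be acceptable for this flag.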
Background
Deriv experiences 20+ second transaction-list queries. Root cause: metadata @> '{"key":"val"}' gives the Postgres planner no selectivity signal, so it falls back to a backward index scan on id DESC, scanning 2M+ rows to return ~50.
Part 1: Adaptive fallback (interim mitigation)
Merged earlier on this branch. Probe-then-retry: run the query with a 5 s timeout; on SQLSTATE 57014, retry once with enable_indexscan=off, forcing the GIN bitmap path. Dense wallets pay only the read-only transaction wrapper overhead. Full detail in the original description below.
Part 2: Functional-index routing (proper fix)
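What changed
INDEXED_METADATA_KEYS per-ledger feature flag (comma-separated key names).
When a key is listed, the query builder emits:
metadata ->> 'key' = 'value'
instead of:
metadata @> '{"key": "value"}'
Migrations 53 & 54 create partial functional indexes scoped to ledger = 'deriv' for source_wallet_id and destination_wallet_id.
Simulation: 1M rows, 0.005% selectivity (Deriv production scale)
plain @> path: 497 ms
indexed ->> path: 1 ms
speedup: 452×
Run it yourself:
Activating for Deriv
After migrations 53/54 have run on Deriv's bucket:
PATCH /ledgers/deriv {"features": {"INDEXED_METADATA_KEYS": "source_wallet_id,destination_wallet_id"}}
The flag is per-ledger and opt-in. No other ledger is affected. The adaptive fallback remains active as a safety net in case the flag is not yet set.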
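For callers that set the flag programmatically, the request body from the commit message (`PATCH /ledgers/deriv` with a `features` object) could be built as in this sketch. The helper name `indexedKeysPatchBody` is hypothetical, and the base URL, authentication, and HTTP client are deliberately left out because they are not specified here.

```go
package main

import (
	"encoding/json"
	"strings"
)

// indexedKeysPatchBody builds the JSON body that sets INDEXED_METADATA_KEYS
// to the given keys, joined with commas and no surrounding spaces.
func indexedKeysPatchBody(keys ...string) ([]byte, error) {
	return json.Marshal(map[string]map[string]string{
		"features": {"INDEXED_METADATA_KEYS": strings.Join(keys, ",")},
	})
}
```

Joining without spaces sidesteps the whitespace-sensitivity in key parsing that the review flags elsewhere in this thread.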
Files changed (Part 2)
pkg/features/features.go: INDEXED_METADATA_KEYS constant; open-ended (nil) validation
internal/ledger.go: GetIndexedMetadataKeys() []string
internal/storage/ledger/resource_transactions.go: ->> for flagged keys
migrations/53-…/up.sql: source_wallet_id for Deriv
migrations/54-…/up.sql: destination_wallet_id for Deriv
transactions_metadata_index_test.go
transactions_deriv_sim_test.go
Original adaptive-fallback description
What this is
Adaptive mitigation, not the real fix.
The real fix is a composite or denormalised index that serves both the wallet filter predicate and the id ORDER BY without forcing a sort step. This PR buys time until that index work is scheduled and deployed to prod-us-east-1-deriv.
Background
SELECT … ORDER BY id DESC LIMIT N with JSONB @> predicates leads Postgres to an Index Scan Backward on the id B-tree. For dense/recent wallets this plan is fast (matches cluster near the head of the index). For sparse wallets (few matching rows scattered across a large id range) the scan walks most of the table before accumulating N results: observed ~50 s at prod-us-east-1-deriv (ledger v2.4.9).
Forcing a GIN bitmap scan globally would hurt dense wallets, so a deployment-wide planner override is not viable.
Approach: probe-then-retry
Detect the pathological plan at runtime and recover within the same request.
- Probe: SET LOCAL statement_timeout = <firstAttemptTimeout> (default 5 s).
- Retry on SQLSTATE 57014: SET LOCAL enable_indexscan = off and SET LOCAL statement_timeout = <retryTimeout> (default 40 s).
Transaction hygiene:
SET LOCALis strictly scoped to each explicitBEGIN/COMMITblock.