Skip to content

perf(storage): sparse-wallet query fix — functional indexes + adaptive fallback#1408

Open
ariel-formance wants to merge 8 commits into
mainfrom
fix/v2.4/tx-list-planner-hints
Open

perf(storage): sparse-wallet query fix — functional indexes + adaptive fallback#1408
ariel-formance wants to merge 8 commits into
mainfrom
fix/v2.4/tx-list-planner-hints

Conversation

@ariel-formance

@ariel-formance ariel-formance commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Background

Deriv experiences 20+ second transaction-list queries. Root cause: metadata @> '{"key":"val"}' gives the Postgres planner no selectivity signal — it falls back to a backward index scan on id DESC, scanning 2M+ rows to return ~50.


Part 1 — Adaptive fallback (interim mitigation)

Merged earlier on this branch. Probe-then-retry: run the query with a 5 s timeout; on SQLSTATE 57014 retry once with enable_indexscan=off forcing the GIN bitmap path. Dense wallets pay only the read-only transaction wrapper overhead. Full detail in the original description below.


Part 2 — Functional-index routing (proper fix)

What changed

INDEXED_METADATA_KEYS per-ledger feature flag (comma-separated key names).

When a key is listed, the query builder emits:

metadata ->> 'key' = 'value'   -- BTree equality, matches partial functional index

instead of:

metadata @> '{"key": "value"}' -- GIN-only, breaks ORDER BY id DESC planner path

Migrations 53 & 54 — partial functional indexes scoped to ledger = 'deriv':

CREATE INDEX ON transactions ((metadata->>'source_wallet_id'))      WHERE ledger = 'deriv';
CREATE INDEX ON transactions ((metadata->>'destination_wallet_id')) WHERE ledger = 'deriv';

Simulation — 1M rows, 0.005% selectivity (Deriv production scale)

plain  @>  path (no flag):        497 ms  — backward scan, 999 950 rows filtered
indexed ->> path (flag + index):    1 ms  — functional index, 50 rows fetched
speedup: 452×

Run it yourself:

DOCKER_HOST=unix:///~/.colima/default/docker.sock \
  go test -tags it -run TestDerivSparseWalletSimulation -v -timeout 600s \
  ./internal/storage/ledger/...
# DERIV_SIM_ROWS=20000 for a quick 6-second sanity run

Activating for Deriv

After migrations 53/54 have run on Deriv's bucket:

PATCH /ledgers/deriv
{"features": {"INDEXED_METADATA_KEYS": "source_wallet_id,destination_wallet_id"}}

The flag is per-ledger and opt-in. No other ledger is affected. The adaptive fallback remains active as a safety net in case the flag is not yet set.

Files changed (Part 2)

File Change
pkg/features/features.go INDEXED_METADATA_KEYS constant; open-ended (nil) validation
internal/ledger.go GetIndexedMetadataKeys() []string
internal/storage/ledger/resource_transactions.go Route metadata filter to ->> for flagged keys
migrations/53-…/up.sql Partial functional index on source_wallet_id for Deriv
migrations/54-…/up.sql Partial functional index on destination_wallet_id for Deriv
transactions_metadata_index_test.go 5 integration tests (flag on/off, semantic equivalence)
transactions_deriv_sim_test.go 1M-row simulation with GIN drop/rebuild for fast seeding

Original adaptive-fallback description

What this is

Adaptive mitigation, not the real fix.

The real fix is a composite or denormalised index that serves both the wallet filter predicate and the id ORDER BY without forcing a sort step. This PR buys time until that index work is scheduled and deployed to prod-us-east-1-deriv.

Background

SELECT … ORDER BY id DESC LIMIT N with JSONB @> predicates leads Postgres to an Index Scan Backward on the id B-tree. For dense/recent wallets this plan is fast (matches cluster near the head of the index). For sparse wallets (few matching rows scattered across a large id range) the scan walks most of the table before accumulating N results — observed ~50 s at prod-us-east-1-deriv (ledger v2.4.9).

Forcing a GIN bitmap scan globally would hurt dense wallets, so a deployment-wide planner override is not viable.

Approach: probe-then-retry

Detect the pathological plan at runtime and recover within the same request.

  1. Execute the transactions-list SELECT inside an explicit read-only transaction with SET LOCAL statement_timeout = <firstAttemptTimeout> (default 5 s).
  2. If Postgres cancels with SQLSTATE 57014 AND the request context is still alive, retry once with SET LOCAL enable_indexscan = off and SET LOCAL statement_timeout = <retryTimeout> (default 40 s).
  3. Any other error, or a client-context cancellation, is returned unchanged.

Transaction hygiene: SET LOCAL is strictly scoped to each explicit BEGIN/COMMIT block.

@ariel-formance ariel-formance requested a review from a team as a code owner June 11, 2026 11:45
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds an adaptive probe‑then‑retry transactions-list paginator with configurable timeouts and wires it through FX/CLI. Separately introduces an indexed-metadata-keys feature flag enabling query predicate rewriting, adds ledger-scoped functional indices on wallet IDs, and includes integration tests for both features.

Changes

Adaptive transactions-list pagination and configuration

Layer / File(s) Summary
TransactionListConfig contract and adaptive paginator implementation
internal/storage/ledger/store.go
Adds TransactionListConfig with enable flag and two statement-timeout durations; implements transactionsAdaptivePaginator that probes with a short timeout, detects Postgres statement timeouts, retries once with a longer timeout and enable_indexscan=off, scopes SET LOCAL in a read-only tx, and records fallback metrics. GetOne/Count delegate to base repo. Exposes DefaultTransactionListConfig() and WithTransactionListConfig().
FX module and storage configuration wiring
internal/storage/module.go, internal/storage/driver/module.go
Adds TransactionListConfig to ModuleConfig; changes driver.NewFXModule() to accept a ModuleConfig and forwards TransactionListConfig into ledgerstore factory options.
CLI flags and ServeCommandConfig
cmd/serve.go
Adds TxListAdaptiveFallback, TxListFirstAttemptTimeoutMs, TxListRetryTimeoutMs to ServeCommandConfig; exports flag name constants; registers three Cobra flags; wires CLI values into ledgerstore.TransactionListConfig when constructing the storage FX module.
Integration tests for adaptive transactions-list pagination
internal/storage/ledger/transactions_planner_hints_test.go
Adds integration tests validating: fast-path parity with baseline, fallback triggered by tight probe timeout, fallback results matching baseline, SET LOCAL no-leakage, retry timeout error propagation, disabled fallback path, pagination cursor integrity across fallback pages, GetOne/Count unaffected, and client-cancel causing no retry.

Indexed metadata keys feature

Layer / File(s) Summary
Feature flag definition and validation semantics
pkg/features/features.go
Adds FeatureIndexedMetadataKeys constant and FeatureConfigurations entry with nil value; changes ValidateFeatureWithValue to accept any value when possibleConfigurations is empty, enabling open-ended feature values.
Ledger feature accessor for indexed metadata keys
internal/ledger.go
Adds Ledger.GetIndexedMetadataKeys() method that parses the comma-separated FeatureIndexedMetadataKeys value from ledger features and returns a slice of indexed key names.
Database migrations for metadata indices
internal/storage/bucket/migrations/53-add-metadata-source-wallet-id-index-deriv/up.sql, internal/storage/bucket/migrations/54-add-metadata-destination-wallet-id-index-deriv/up.sql
Adds two migrations creating ledger-scoped (ledger = 'deriv') partial functional indices on metadata->>'source_wallet_id' and metadata->>'destination_wallet_id' using CREATE INDEX CONCURRENTLY IF NOT EXISTS.
Query predicate rewriting for indexed keys
internal/storage/ledger/resource_transactions.go
Updates ResolveFilter to branch on whether a metadata key is indexed: indexed keys use metadata ->> ? = ? equality predicates, while non-indexed keys fall back to metadata @> ? containment.
Integration tests for indexed metadata keys feature
internal/storage/ledger/transactions_metadata_index_test.go
Adds tests validating indexed-key query correctness, non-indexed-key containment fallback, semantic equivalence between indexed and non-indexed stores, destination wallet ID indexing, and no-flag behavior.
Deriv sparse-wallet query planner simulation tests
internal/storage/ledger/transactions_deriv_sim_test.go
Adds integration tests seeding Deriv sparse-wallet transaction patterns, comparing query plans between metadata @> and metadata ->> predicates, measuring pagination timing across both approaches, and validating semantic equivalence on deterministic datasets.

Sequence Diagram

sequenceDiagram
  participant Client
  participant Store
  participant AdaptivePaginator
  participant PostgresDB
  Client->>Store: Transactions()
  Store->>AdaptivePaginator: instantiate with TransactionListConfig
  Client->>AdaptivePaginator: Paginate(query, cursor)
  AdaptivePaginator->>PostgresDB: SELECT with SET LOCAL statement_timeout=firstAttemptMs
  alt probe times out (SQLSTATE 57014)
    PostgresDB-->>AdaptivePaginator: query canceled
    AdaptivePaginator->>PostgresDB: SELECT with SET LOCAL statement_timeout=retryMs and SET LOCAL enable_indexscan=off
    PostgresDB-->>AdaptivePaginator: rows
    AdaptivePaginator->>AdaptivePaginator: record fallback metric
  else probe succeeds
    PostgresDB-->>AdaptivePaginator: rows
  end
  AdaptivePaginator-->>Client: page, nextCursor
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • flemzord

Poem

🐰 I probed the timeout, then retried with grace,
Indexed metadata keys found their rightful place,
Index scans off, predicates rewritten clean,
Wallet queries faster than ever seen,
Hops of joy echo through this PR's embrace.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly identifies the two main changes: adaptive fallback (an interim mitigation for sparse-wallet timeout issues) and functional indexes (the proper fix). It is concise, specific, and matches the core implementation across the changeset.
Description check ✅ Passed The PR description thoroughly explains the background, root cause, and two-part solution (adaptive fallback + functional-index routing). It includes a simulation showing 452× speedup, deployment instructions, affected files, and original adaptive-fallback details. The description is directly related to all major changes in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/v2.4/tx-list-planner-hints

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
internal/storage/common/resource.go (1)

345-350: 💤 Low value

Consider validating the OrderExpression format.

The code assumes paginator.OrderExpression() always returns "column direction" format and uses strings.Cut to split on space. If a future Paginator implementation returns a different format (e.g., no space), the parsing silently produces invalid SQL. Consider checking the boolean return value from strings.Cut or adding a comment documenting the required format.

🛡️ Defensive validation
 orderExpr := paginator.OrderExpression()
 col, dir, _ := strings.Cut(orderExpr, " ")
+if dir == "" {
+    return nil, fmt.Errorf("OrderExpression must return 'column direction' format, got: %q", orderExpr)
+}
 finalQuery = finalQuery.Order(fmt.Sprintf("dataset.%s %s", col, dir))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/common/resource.go` around lines 345 - 350, Validate the
format returned by paginator.OrderExpression() before using strings.Cut: capture
the boolean ok from strings.Cut(orderExpr, " ") and if false, handle the invalid
format (e.g., return an error from the enclosing function or fall back to a safe
default order) instead of blindly using col/dir; then only call
finalQuery.Order(fmt.Sprintf("dataset.%s %s", col, dir)) when ok is true. Also
consider adding a brief comment on the expected "column direction" format next
to the paginator.OrderExpression() usage to document the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/storage/ledger/store.go`:
- Around line 123-128: The Transactions() fast-path check treats only 0 as "no
timeout", causing negative StatementTimeoutMs to still use the wrapper; change
the guard in Store.Transactions from checking == 0 to <= 0 so negative/zero
values disable the hints and return transactionsBase(), and ensure the code that
applies statement_timeout (currently gated by > 0) remains unchanged; update the
condition involving txListConfig.StatementTimeoutMs in Transactions() (and any
analogous checks) to use <= 0 so behavior is consistent with the timeout
application in the RunInTx wrapper (transactionsPaginationWithHints).
- Around line 196-206: The current bun.Tx branch calls issueSetLocal on the
caller-managed transaction which leaves planner/session settings active for the
rest of that outer transaction; instead, start a nested transaction
(savepoint-style) for the Paginate work so the SET LOCAL doesn't leak.
Concretely: in the case bun.Tx branch, begin a new transaction from the outer tx
(create a nested bun.Tx via BeginTx on the store/outer tx), call issueSetLocal
on that nested tx, invoke transactionsBase() against that nested tx and call
Paginate(ctx, q), then commit/rollback the nested tx and return the Paginate
result; keep using the existing symbols issueSetLocal, transactionsBase(),
Paginate and the bun.Tx branch to locate where to change.

---

Nitpick comments:
In `@internal/storage/common/resource.go`:
- Around line 345-350: Validate the format returned by
paginator.OrderExpression() before using strings.Cut: capture the boolean ok
from strings.Cut(orderExpr, " ") and if false, handle the invalid format (e.g.,
return an error from the enclosing function or fall back to a safe default
order) instead of blindly using col/dir; then only call
finalQuery.Order(fmt.Sprintf("dataset.%s %s", col, dir)) when ok is true. Also
consider adding a brief comment on the expected "column direction" format next
to the paginator.OrderExpression() usage to document the requirement.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: de0298dd-34ef-4ba0-80be-243ec3dcfd96

📥 Commits

Reviewing files that changed from the base of the PR and between 52a9943 and 1012b5e.

📒 Files selected for processing (9)
  • cmd/serve.go
  • internal/storage/common/paginator.go
  • internal/storage/common/paginator_column.go
  • internal/storage/common/paginator_offset.go
  • internal/storage/common/resource.go
  • internal/storage/driver/module.go
  • internal/storage/ledger/store.go
  • internal/storage/ledger/transactions_planner_hints_test.go
  • internal/storage/module.go

Comment thread internal/storage/ledger/store.go Outdated
Comment thread internal/storage/ledger/store.go Outdated
@ariel-formance ariel-formance force-pushed the fix/v2.4/tx-list-planner-hints branch from 1012b5e to 577f762 Compare June 11, 2026 12:41
@ariel-formance ariel-formance changed the title fix(storage): add tx-list planner hints to mitigate sparse-wallet timeout fix(storage): adaptive fallback for sparse-wallet transactions-list timeout Jun 11, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/storage/ledger/store.go (1)

271-278: ⚠️ Potential issue | 🟡 Minor | ⚖️ Poor tradeoff

SET LOCAL leaks to subsequent queries when caller is already in a transaction.

When store.db is already a bun.Tx (lines 273-278), the code issues SET LOCAL directly on the caller's transaction. Those settings (enable_indexscan = off, statement_timeout) persist for all subsequent statements within that outer transaction—not just this Paginate call.

This means if a caller does:

store, tx, _ := baseStore.BeginTX(ctx, nil)
defer tx.Rollback()
store.Transactions().Paginate(ctx, q) // triggers fallback → enable_indexscan = off
store.Logs().Paginate(ctx, q2)        // now runs with enable_indexscan = off

The workaround documented in the past review (nested transaction via savepoint) would fully isolate the settings. Given this is a defensive branch for an uncommon path, consider whether the current behavior is acceptable or if savepoint isolation is worth adding.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/store.go` around lines 271 - 278, The current
defensive branch mutates the caller's bun.Tx by calling issueSetLocal directly
on store.db (when it implements bun.Tx), leaking SET LOCAL state; instead, wrap
the scope in a SAVEPOINT to isolate settings: on detecting store.db as bun.Tx,
execute a SAVEPOINT (unique name), call issueSetLocal using that same tx, invoke
a.store.transactionsBase().Paginate(ctx, q) while the savepoint is active, then
ROLLBACK TO SAVEPOINT (to undo the session-local settings) and RELEASE the
savepoint; update the branch handling store.db.(bun.Tx) around issueSetLocal and
transactionsBase().Paginate to use this savepoint pattern so outer transaction
settings are not mutated.
🧹 Nitpick comments (1)
internal/storage/ledger/store.go (1)

621-638: ⚡ Quick win

DefaultTransactionListConfig() is not applied automatically — zero-valued config disables fallback.

The struct field comment on line 50-53 says "Default true", but defaultOptions (line 640) doesn't include WithTransactionListConfig(DefaultTransactionListConfig()). If a Store is created without explicitly passing WithTransactionListConfig, the zero-valued TransactionListConfig has EnableAdaptiveFallback: false.

This is fine for the CLI path (which always supplies the config via FX wiring), but direct callers of ledgerstore.New() must remember to pass the option if they want the documented default behavior. Consider either:

  1. Updating the doc comment to clarify "default when wired through FX" vs "zero-value when created directly", or
  2. Adding WithTransactionListConfig(DefaultTransactionListConfig()) to defaultOptions so the behavior matches the documentation.
Option 2: Apply default config automatically
 var defaultOptions = []Option{
 	WithMeter(noopmetrics.Meter{}),
 	WithTracer(nooptracer.Tracer{}),
+	WithTransactionListConfig(DefaultTransactionListConfig()),
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/store.go` around lines 621 - 638,
DefaultTransactionListConfig() is never applied automatically so a zero-valued
TransactionListConfig disables adaptive fallback; update the store defaults by
adding WithTransactionListConfig(DefaultTransactionListConfig()) into the
defaultOptions slice so new Store instances get the documented defaults (adjust
symbols: DefaultTransactionListConfig, WithTransactionListConfig,
defaultOptions, TransactionListConfig, Store).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@internal/storage/ledger/store.go`:
- Around line 271-278: The current defensive branch mutates the caller's bun.Tx
by calling issueSetLocal directly on store.db (when it implements bun.Tx),
leaking SET LOCAL state; instead, wrap the scope in a SAVEPOINT to isolate
settings: on detecting store.db as bun.Tx, execute a SAVEPOINT (unique name),
call issueSetLocal using that same tx, invoke
a.store.transactionsBase().Paginate(ctx, q) while the savepoint is active, then
ROLLBACK TO SAVEPOINT (to undo the session-local settings) and RELEASE the
savepoint; update the branch handling store.db.(bun.Tx) around issueSetLocal and
transactionsBase().Paginate to use this savepoint pattern so outer transaction
settings are not mutated.

---

Nitpick comments:
In `@internal/storage/ledger/store.go`:
- Around line 621-638: DefaultTransactionListConfig() is never applied
automatically so a zero-valued TransactionListConfig disables adaptive fallback;
update the store defaults by adding
WithTransactionListConfig(DefaultTransactionListConfig()) into the
defaultOptions slice so new Store instances get the documented defaults (adjust
symbols: DefaultTransactionListConfig, WithTransactionListConfig,
defaultOptions, TransactionListConfig, Store).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0313601-844c-4765-9ac0-8f2fc7402a49

📥 Commits

Reviewing files that changed from the base of the PR and between 1012b5e and 577f762.

📒 Files selected for processing (5)
  • cmd/serve.go
  • internal/storage/driver/module.go
  • internal/storage/ledger/store.go
  • internal/storage/ledger/transactions_planner_hints_test.go
  • internal/storage/module.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/storage/driver/module.go
  • internal/storage/module.go

@ariel-formance ariel-formance changed the base branch from release/v2.4 to main June 11, 2026 13:16
…imeout

Background
----------
SELECT … ORDER BY id DESC LIMIT N with JSONB @> predicates leads Postgres to
an Index Scan Backward on the id B-tree.  For dense/recent wallets that plan is
fast.  For sparse wallets (few matching rows scattered across a large id range)
the scan walks most of the table before accumulating N results — observed ~50 s
at prod-us-east-1-deriv (ledger v2.4.9).

Forcing GIN bitmap scan globally would hurt dense wallets, so a deployment-wide
planner override is not viable.

Adaptive mitigation (this PR)
------------------------------
This is a stopgap, not the real fix.  The real fix is a composite or
denormalised index that serves both the wallet filter and id ordering without a
full sort step.  The deliberate cost is up to <firstAttemptTimeout> of wasted
work per uncached sparse-wallet request.

Strategy (exactly one retry, no loop):

1. Execute the transactions-list SELECT inside an explicit read-only transaction
   with SET LOCAL statement_timeout = <firstAttemptTimeout> (default 5 s).
   Dense wallets finish well inside the budget — no fallback, no overhead beyond
   the transaction wrapper.

2. If Postgres cancels with SQLSTATE 57014 AND the request context is still alive
   (i.e. our timeout fired, not the client disconnecting), roll back and retry
   once with SET LOCAL enable_indexscan = off plus SET LOCAL statement_timeout =
   <retryTimeout> (default 40 s).  Disabling index scans forces the planner onto
   the GIN bitmap path for the JSONB @> predicates.

3. Any other error, or a client-context cancellation, is returned unchanged.
   Exactly one retry.  Never a loop.

Transaction hygiene: SET LOCAL is strictly scoped to each explicit BEGIN/COMMIT
block, so no planner setting or timeout leaks to subsequent queries on the same
pooled connection.

Changes
-------
- internal/storage/ledger/store.go
  · TransactionListConfig struct (EnableAdaptiveFallback, FirstAttemptTimeoutMs,
    RetryTimeoutMs) + DefaultTransactionListConfig() + WithTransactionListConfig()
  · transactionsAdaptivePaginator: probe-then-retry Paginate; GetOne/Count
    delegate straight to the base repository (no adaptive logic needed)
  · paginateInTx / issueSetLocal helpers for the explicit-tx + SET LOCAL dance
  · OTel metrics: store.tx_list_fallback_total (counter),
    store.tx_list_first_attempt_duration (histogram),
    store.tx_list_retry_duration (histogram, outcome attribute)
  · Structured log on every fallback event

- internal/storage/driver/module.go  — ModuleConfig carries TransactionListConfig
- internal/storage/module.go         — forwards TransactionListConfig to driver
- cmd/serve.go
  · --tx-list-adaptive-fallback (default true)
  · --tx-list-first-attempt-timeout-ms (default 5000)
  · --tx-list-retry-timeout-ms (default 40000)

- internal/storage/ledger/transactions_planner_hints_test.go (//go:build it)
  · FastPathUnchanged   — generous timeout, dense wallet, result == baseline
  · FallbackTriggered   — 1 ms probe, retry succeeds, rows correct + ordered DESC
  · FallbackRowsMatch   — fallback result == baseline result row-for-row
  · NoLeakage           — pinned conn: enable_indexscan + statement_timeout restored
  · RetryAlsoTimesOut   — both timeouts 1 ms → error propagated, no loop
  · DisabledFallback    — EnableAdaptiveFallback=false → plain path, no overhead
  · CursorIntegrity     — next-page cursor from fallback path is decodable and correct
  · GetOneAndCountUnaffected — adaptive store GetOne/Count == base store
@ariel-formance ariel-formance force-pushed the fix/v2.4/tx-list-planner-hints branch from 577f762 to e27296c Compare June 11, 2026 13:33
Three gaps addressed:

1. ClientCancelNoRetry (new test): proves that a cancelled request context
   returns context.Canceled immediately and never triggers the retry.  This
   is the critical client-disconnect safety property — if the ctx.Err() guard
   in Paginate were wrong, a dead-client request would fire a wasted retry.

2. RetryAlsoTimesOut: bumped data set from 3 to 200 rows so the 1 ms
   statement_timeout reliably fires SQLSTATE 57014 even on fast CI runners.
   With only 3 rows the probe could complete in < 1 ms and require.Error would
   fail spuriously.

3. FallbackTriggeredByTimeout / FallbackRowsMatchBaseline: same 200-row fix —
   both tests used 1 ms probes with tiny data sets that might not timeout,
   causing the assertions to pass without exercising the fallback path at all.
…d edge-case tests

Code fix
--------
paginateInTx previously issued SET LOCAL directly onto an outer bun.Tx when
store.db was already a transaction (BeginTX path).  If the probe timed out
(57014), Postgres put the outer transaction into an error state.  The retry
then attempted to issue another SET LOCAL on the aborted transaction, which
failed with InFailedSqlTransaction (25P02), leaving the caller's transaction
permanently unusable.

This path is unreachable today (list operations are never called within an
explicit BeginTX), but it was a silent trap for future callers.  Fix: when
store.db is a bun.Tx, skip the adaptive machinery entirely and delegate
straight to the base paginator.

New tests
---------
- EmptyResultSet: zero wallet rows → empty cursor, no error, HasMore false.
  Also confirms the fast path doesn't blow up on a zero-row response.

- AlreadyInTxSkipsAdaptive: opens an outer transaction via BeginTX, then
  calls Transactions().Paginate() with a 1 ms probe timeout.  Verifies the
  call succeeds, returns the right rows, and that statement_timeout on the
  outer transaction is still "0" (the adaptive SET LOCAL never ran).
The test was flaky on GitHub Actions because the GIN bitmap scan (used on
the retry path) completed in < 1ms even with 200 rows, so statement_timeout
never fired and require.Error always saw nil.

Add a testHookBeforePaginateSelect field to Store and a setter
SetTestHookBeforePaginateSelect that tests can use to inject arbitrary SQL
inside paginateInTx, sharing the same transaction and therefore the same SET
LOCAL statement_timeout scope. The test now installs SELECT pg_sleep(0.005),
which reliably exceeds a 1ms timeout regardless of how fast the real SELECT
would have been.

Also fix gofmt alignment on three struct fields in Store (extra spaces → gofmt
canonical alignment) and a double-space in the test (1,  // → 1, //).
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.83051% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.84%. Comparing base (140ed9b) to head (c68e786).

Files with missing lines Patch % Lines
internal/storage/ledger/store.go 88.00% 6 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1408      +/-   ##
==========================================
+ Coverage   80.55%   80.84%   +0.29%     
==========================================
  Files         206      206              
  Lines       11293    11372      +79     
==========================================
+ Hits         9097     9194      +97     
+ Misses       1646     1619      -27     
- Partials      550      559       +9     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The FallbackTriggeredByTimeout / FallbackRowsMatchBaseline / NoLeakage tests
relied on dataset size (200 rows) to make the 1 ms probe fire SQLSTATE 57014.
On a fast CI runner those queries complete in < 1 ms and the fallback never
triggers, leaving the retry-success branch in Paginate (the else { Infof } block)
and several issueSetLocal branches permanently uncovered.

Add three targeted tests that use the pg_sleep hook introduced in the previous
commit so the fallback fires deterministically regardless of query speed:

* FallbackRetrySucceedsViaPgSleep — hook makes the probe fire 57014 and is a
  no-op on the retry, covering the retry-success path and its logging.
* CtxCancelAfterProbe — hook cancels the outer ctx then fires 57014, covering
  the ctx.Err() != nil branch in Paginate (no retry when client disconnects).
* ZeroTimeoutNoSetLocal — FirstAttemptTimeoutMs = 0 skips SET LOCAL entirely,
  covering the if timeoutMs > 0 { } false branch in issueSetLocal; also
  exercises DefaultTransactionListConfig().

Also add a no-op hook to FastPathUnchanged to cover the "hook set but returns
nil" fall-through path in paginateInTx.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/storage/ledger/transactions_planner_hints_test.go (1)

553-561: ⚡ Quick win

Strengthen no-retry verification in the cancel-after-probe test.

Line 559 currently verifies only that an error is returned. A retry regression could still pass this assertion. Track hook calls and assert exactly one attempt.

Suggested test hardening
 func TestTransactionListAdaptive_CtxCancelAfterProbe(t *testing.T) {
@@
-	adaptive.SetTestHookBeforePaginateSelect(func(_ context.Context, tx bun.Tx) error {
+	callCount := 0
+	adaptive.SetTestHookBeforePaginateSelect(func(_ context.Context, tx bun.Tx) error {
+		callCount++
 		cancel()
 		_, err := tx.ExecContext(context.Background(), "SELECT pg_sleep(0.005)")
 		return err // SQLSTATE 57014; ctx is now cancelled
 	})
@@
 	_, err := adaptive.Transactions().Paginate(ctx, walletQuery(15))
 	require.Error(t, err, "must return an error when context is cancelled after probe timeout")
+	require.Equal(t, 1, callCount, "must not retry after context cancellation")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/transactions_planner_hints_test.go` around lines 553
- 561, The test should verify the paginate hook is invoked exactly once to
ensure no retries: add a counter (e.g. attempts := 0) captured by the closure
passed to adaptive.SetTestHookBeforePaginateSelect and increment it each time
the hook runs, leaving the existing cancel() and tx.ExecContext sleep behavior
unchanged; after calling adaptive.Transactions().Paginate(ctx, walletQuery(15))
assert that an error was returned and that the counter equals 1 (no retries),
referencing the existing hook setup function
adaptive.SetTestHookBeforePaginateSelect, the call
adaptive.Transactions().Paginate, and walletQuery(15) to locate where to insert
the counter and final assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/storage/ledger/transactions_planner_hints_test.go`:
- Around line 553-561: The test should verify the paginate hook is invoked
exactly once to ensure no retries: add a counter (e.g. attempts := 0) captured
by the closure passed to adaptive.SetTestHookBeforePaginateSelect and increment
it each time the hook runs, leaving the existing cancel() and tx.ExecContext
sleep behavior unchanged; after calling adaptive.Transactions().Paginate(ctx,
walletQuery(15)) assert that an error was returned and that the counter equals 1
(no retries), referencing the existing hook setup function
adaptive.SetTestHookBeforePaginateSelect, the call
adaptive.Transactions().Paginate, and walletQuery(15) to locate where to insert
the counter and final assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2a7d7c3f-8ea1-4db1-97d8-0932f9cf3b7e

📥 Commits

Reviewing files that changed from the base of the PR and between 7cbc3b5 and 9f1a0c2.

⛔ Files ignored due to path filters (1)
  • go.mod is excluded by !**/*.mod
📒 Files selected for processing (2)
  • internal/storage/ledger/store.go
  • internal/storage/ledger/transactions_planner_hints_test.go

@NumaryBot

Copy link
Copy Markdown
Contributor

🛑 Changes requested — multi-model review

The adaptive fallback implementation is well-structured overall — the in-transaction bypass is a sound safety measure, the fx/config/metrics plumbing is clean, and the test coverage is a good foundation. However, several correctness and robustness issues require attention before merging.

The most critical concerns are: (1) the test hook field is read and written on a shared Store without synchronization, creating a real data race; (2) SQLSTATE 57014 is shared between statement_timeout, pg_cancel_backend, and other server-side cancellations, so the current classification can trigger spurious 40-second retries on non-timeout cancellations; (3) there is no integration test verifying that a real statement_timeout error propagates unwrapped through the bun/paginate stack to actually trigger the fallback; and (4) existing callers that construct ModuleConfig without the new TransactionListConfig field silently receive the zero value, disabling the fallback outside the cmd/serve path despite the documented defaults.

Minor issues include: the SET LOCAL statement_timeout = 0 omission meaning inherited session timeouts are not overridden as documented; no validation of negative or implausibly large timeout values in the config; metric instruments that are unconditionally dereferenced and could panic in non-standard construction paths; the in-transaction bypass being silent and unobservable; and probe timeouts being logged at error level for what is an expected, handled condition.

🟠 [major] testHookBeforePaginateSelect mutates shared Store without synchronization

internal/storage/ledger/store.go:85 — reported by claude

SetTestHookBeforePaginateSelect writes store.testHookBeforePaginateSelect while paginateInTx reads it concurrently across goroutines serving HTTP requests. A Store is shared across goroutines, so even though the hook is intended for tests only, the nil-check and read on every Paginate call creates a data race. Exposing a test-only mutation method on the production Store type is also a maintenance and security smell.

Suggestion: Gate the hook behind a build tag, or inject it via an unexported field set only at construction time (e.g. an Option used solely in tests), so it cannot be mutated on a live shared Store. At minimum document that it must only be set before the store serves any traffic.

🟠 [major] SQLSTATE 57014 covers client/admin cancellation as well as statement_timeout

internal/storage/ledger/store.go:74 — reported by claude

isStatementTimeout treats any pgerrcode.QueryCanceled (57014) as a probe timeout. Postgres returns 57014 for statement_timeout, pg_cancel_backend, and lock_timeout-adjacent cancellations alike. The ctx.Err()==nil guard only filters client disconnects detected on the Go side; a server-side cancel (e.g. an operator running pg_cancel_backend) would be misclassified as a probe timeout and trigger a spurious retry with enable_indexscan=off, potentially running for up to RetryTimeoutMs (40s) doing unwanted work.

Suggestion: Distinguish statement_timeout from generic cancellation by also inspecting pgErr.Message (e.g. checking for 'canceling statement due to statement timeout') in addition to the code. At minimum document this residual ambiguity in a comment on isStatementTimeout.

🟠 [major] Error classification depends on raw *pgconn.PgError surviving the bun/paginate stack

internal/storage/ledger/store.go:80 — reported by claude

isStatementTimeout relies on errors.As reaching the underlying *pgconn.PgError. If any layer in the paginator stack (bun, the common paginate package, etc.) wraps or replaces the driver error with a non-wrapping type or calls ResolveError internally, errors.As will fail and the fallback will silently never trigger — the exact pathological case this PR targets would still time out. There is no test asserting that a real statement_timeout from the SELECT is classified correctly through the full stack.

Suggestion: Add an integration test that drives a genuine statement_timeout on the actual transactions SELECT (without the hook) and asserts the fallback fires, to lock in that the error type propagates unwrapped through the paginator.

🟠 [major] Zero-value TransactionListConfig silently disables the adaptive fallback for existing callers

internal/storage/module.go:35 — reported by gpt

TransactionListConfig is forwarded unconditionally. Existing callers that construct storage.ModuleConfig without the new field will pass the zero value through to ledgerstore.WithTransactionListConfig, leaving EnableAdaptiveFallback=false and both timeouts at 0. This contradicts the PR's documented default of fallback enabled with 5s/40s timeouts and means the mitigation may be silently disabled outside the cmd serve path where Cobra provides defaults.

Suggestion: Normalize defaults at the module/store boundary. Initialize from ledgerstore.DefaultTransactionListConfig() and only override explicitly supplied values, or make ModuleConfig.TransactionListConfig a pointer so nil means defaults while an explicit config can disable the fallback.

🟡 [minor] No validation that probe+retry budgets are non-negative or fit within client timeout

cmd/serve.go:255 — reported by claude

The flags accept arbitrary int64 values. Negative timeouts would be passed straight into SET LOCAL statement_timeout = , which Postgres rejects at query time, surfacing as an opaque error. The design also requires RetryTimeoutMs ≤ client_timeout − FirstAttemptTimeoutMs, but nothing enforces or warns about this invariant.

Suggestion: Add validation in cfg.Validate() (or wherever serve config is validated) to reject negative timeouts and optionally warn when RetryTimeoutMs is implausibly large relative to known client timeouts.

🟡 [minor] timeoutMs=0 does not actually disable the statement timeout when the session already has one set

internal/storage/ledger/store.go:381 — reported by gpt

paginateInTx documents timeoutMs: 0 = no timeout, and issueSetLocal skips SET LOCAL statement_timeout when the value is 0. However, if the role, database, DSN, or session already has a nonzero statement_timeout, the query will still be bounded by that inherited timeout and may unexpectedly trigger the fallback.

Suggestion: If 0 is intended to disable the server-side timeout for that attempt, explicitly execute SET LOCAL statement_timeout = 0 when timeoutMs is 0. Otherwise update the comment and configuration documentation to clarify that 0 means 'do not override the existing session timeout'.

🟡 [minor] Fallback metric instruments may be nil if Store is constructed outside New()

internal/storage/ledger/store.go:175 — reported by claude

txListFallbackCounter, txListFirstAttemptDurationMs, and txListRetryDurationMs are only initialized in New(). They are dereferenced unconditionally in the fallback branch. While current construction paths go through New(), any future struct-literal construction or refactor would cause a nil-pointer panic in the fallback branch.

Suggestion: Defensively nil-check the metric instruments before calling Add/Record, or add a comment and construction invariant guaranteeing they are always non-nil when EnableAdaptiveFallback is true.

🟡 [minor] In-transaction bypass of adaptive fallback is silent and unobservable

internal/storage/ledger/store.go:230 — reported by claude

paginateInTx returns the plain base paginator whenever store.db is a bun.Tx, which is correct for safety, but means any transactions-list query inside an outer transaction silently loses the sparse-wallet protection and can still hit the long timeout. There is no metric or log indicating the mitigation was skipped, making it impossible for operators to tell whether a slow list ran with or without the fallback.

Suggestion: Emit a debug log or increment a counter when the adaptive path is bypassed due to an outer transaction, so the skipped-mitigation case is observable in production.

⚪ [nit] FirstAttemptTimeoutMs=0 silently makes the fallback unreachable

internal/storage/ledger/store.go:135 — reported by claude

When FirstAttemptTimeoutMs is 0, issueSetLocal skips SET LOCAL statement_timeout, so the probe runs with no server-side timeout and can only be cancelled by the Go context. In that configuration the fallback can effectively never fire on the probe (no 57014 can be raised), making the adaptive machinery a pure read-only-transaction wrapper. This behaviour is non-obvious to operators.

Suggestion: Document that FirstAttemptTimeoutMs=0 disables the probe-timeout trigger entirely, rendering the fallback unreachable under normal conditions.

⚪ [nit] Probe timeout logged at error level for expected, handled condition

internal/storage/ledger/store.go:115 — reported by claude

Every fallback event logs at Errorf level ('transactions list probe timed out'). For deployments with sparse wallets this is expected/benign behaviour, not an error, and will create noisy error-level log volume that may trip error-rate alerts.

Suggestion: Log the probe timeout at Warn or Info level since it is a handled, expected condition. Reserve Errorf for the retry-failure case.


Reviewed in parallel by claude (anthropic/claude-opus-4-8) and gpt (openai/gpt-5.5), then consolidated. This comment is updated on each push.

…a queries

Adds INDEXED_METADATA_KEYS feature flag (comma-separated key names). When a
metadata key is flagged, the query builder emits

  metadata ->> 'key' = 'value'

instead of

  metadata @> '{"key": "value"}'

allowing Postgres to use a partial functional BTree index rather than falling
back to a full backward scan on the id index.

Migrations 53 and 54 create partial functional indexes scoped to ledger =
'deriv' for source_wallet_id and destination_wallet_id.

Simulation at 1 M rows (0.005% selectivity, matching Deriv production scale):
  plain @> path:   497 ms
  indexed ->> path:  1 ms
  speedup: 452×

To activate for Deriv after migrations have run:
  PATCH /ledgers/deriv  {"features": {"INDEXED_METADATA_KEYS": "source_wallet_id,destination_wallet_id"}}
@ariel-formance ariel-formance changed the title fix(storage): adaptive fallback for sparse-wallet transactions-list timeout perf(storage): sparse-wallet query fix — functional indexes + adaptive fallback Jun 18, 2026

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Changes requested — automated review

The adaptive retry path is broadly structured correctly, but the newly added functional-index path has planner-visibility issues that can make the intended indexes unusable in realistic deployments. These are performance bugs in the mitigation/index work and should be addressed.

@@ -0,0 +1,3 @@
create index {{ if not .Transactional }}concurrently{{end}} if not exists transactions_metadata_source_wallet_id_deriv
on "{{.Schema}}".transactions ((metadata->>'source_wallet_id'))
where ledger = 'deriv';

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] Keep ledger predicate usable for partial indexes

When the bucket contains only the deriv ledger, newScopedSelect() deliberately omits the ledger = ? filter, so a transactions-list query on that bucket will not imply this partial-index predicate. PostgreSQL therefore cannot use these new indexes in the single-ledger-bucket case, which is a likely deployment shape for the targeted Deriv ledger and makes the migration ineffective for the sparse-wallet queries it is meant to help.

return "metadata @> ?", []any{map[string]any{
match[0][1]: value,
}}, nil
if slices.Contains(h.store.ledger.GetIndexedMetadataKeys(), key) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] Inline the indexed metadata key in the expression

When INDEXED_METADATA_KEYS is enabled, this emits metadata ->> $1 = $2 with the key as a bind parameter rather than the indexed expression metadata ->> 'source_wallet_id' = $1. That parameterized expression is not reliably matched to the new functional index on (metadata->>'source_wallet_id'), especially once prepared/generic plans are involved, so the optimization can silently miss the index for the sparse-wallet path.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
internal/storage/ledger/transactions_metadata_index_test.go (1)

36-40: ⚡ Quick win

Add a regression test for comma+space feature values.

Please add a case like "source_wallet_id, destination_wallet_id" to ensure indexed-key detection remains robust for human-edited config strings.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/transactions_metadata_index_test.go` around lines 36
- 40, Add a new test case in the transactions_metadata_index_test.go file that
uses the withIndexedMetadataKeys helper function with a comma+space separated
string value like "source_wallet_id, destination_wallet_id" to verify that the
indexed-key detection logic correctly handles human-edited configuration strings
with spaces following commas, not just comma-separated values without spaces.
This regression test should validate that the feature value parsing remains
robust for both formats.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/ledger.go`:
- Around line 41-50: The GetIndexedMetadataKeys method does not normalize the
tokens parsed from the comma-separated feature value, preserving whitespace
around delimiters. When splitting "source_wallet_id, destination_wallet_id" by
comma, the result includes " destination_wallet_id" with leading space, which
causes downstream slices.Contains checks to fail. Modify the return statement to
iterate through each token from strings.Split, trim whitespace from each token
using strings.TrimSpace, skip any empty entries after trimming, and return only
the cleaned, non-empty tokens.

In `@internal/storage/ledger/resource_transactions.go`:
- Around line 151-154: The code is checking if a key exists in indexed metadata
keys using slices.Contains with h.store.ledger.GetIndexedMetadataKeys(), but
indexed metadata keys may contain space padding (like "source_wallet_id,
destination_wallet_id"), causing the exact match to fail and silently skipping
the indexed branch. Normalize both the key parameter and the indexed metadata
keys by trimming whitespace before performing the membership check in the
slices.Contains call to ensure space-padded keys are properly matched.

In `@internal/storage/ledger/transactions_deriv_sim_test.go`:
- Around line 229-233: The totalRows variable obtained from getEnvInt for
DERIV_SIM_ROWS is used as a divisor in the selectivity calculation without
validation, which will cause a panic if set to zero and could produce misleading
results if set to an invalid value. Add validation after retrieving totalRows
and walletRows using getEnvInt to ensure totalRows is greater than zero,
walletRows is positive, and walletRows does not exceed totalRows. If validation
fails, either use sensible defaults or fail the test early with a clear error
message before attempting the division in the t.Logf call.
- Around line 193-210: The EXPLAIN query in the test hardcodes the filter value
directly into the SQL string using fmt.Sprintf with %s in the else branch of the
filter condition, but the actual runtime paginator uses parameterized queries
with placeholders. To make the test query shape-equivalent to the real paginator
query, modify the predicate construction in the else block to use a
parameterized placeholder (?) instead of embedding the value directly with
string formatting, so the EXPLAIN plan accurately reflects what the runtime
query path will execute.

In `@pkg/features/features.go`:
- Around line 26-31: The FeatureIndexedMetadataKeys flag accepts arbitrary
metadata key lists without validation, but downstream code in ResolveFilter
rewrites these keys using text extraction (metadata ->> ? = ?) which only works
correctly for string-valued metadata. This causes incorrect behavior for
numeric, boolean, or object metadata values. Add a dedicated validator or parser
function that processes the INDEXED_METADATA_KEYS configuration value
(referenced at lines 56 and 65-66) to enforce that only string-valued indexed
metadata keys are allowed, either by validating against a whitelist of known
string-indexed keys like wallet IDs or by implementing a contract validator that
ensures type safety for the configured keys.

---

Nitpick comments:
In `@internal/storage/ledger/transactions_metadata_index_test.go`:
- Around line 36-40: Add a new test case in the
transactions_metadata_index_test.go file that uses the withIndexedMetadataKeys
helper function with a comma+space separated string value like
"source_wallet_id, destination_wallet_id" to verify that the indexed-key
detection logic correctly handles human-edited configuration strings with spaces
following commas, not just comma-separated values without spaces. This
regression test should validate that the feature value parsing remains robust
for both formats.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4764d899-30cb-4f91-b5c0-d40c0bd32f72

📥 Commits

Reviewing files that changed from the base of the PR and between c68e786 and 3efcdc3.

⛔ Files ignored due to path filters (2)
  • internal/storage/bucket/migrations/53-add-metadata-source-wallet-id-index-deriv/notes.yaml is excluded by !**/*.yaml
  • internal/storage/bucket/migrations/54-add-metadata-destination-wallet-id-index-deriv/notes.yaml is excluded by !**/*.yaml
📒 Files selected for processing (7)
  • internal/ledger.go
  • internal/storage/bucket/migrations/53-add-metadata-source-wallet-id-index-deriv/up.sql
  • internal/storage/bucket/migrations/54-add-metadata-destination-wallet-id-index-deriv/up.sql
  • internal/storage/ledger/resource_transactions.go
  • internal/storage/ledger/transactions_deriv_sim_test.go
  • internal/storage/ledger/transactions_metadata_index_test.go
  • pkg/features/features.go

Comment thread internal/ledger.go
Comment on lines +41 to +50
// GetIndexedMetadataKeys returns the list of metadata keys for which the query
// builder will emit a functional-index-compatible predicate (metadata ->> 'key' = 'value')
// instead of the default JSONB containment form. The list is stored as a comma-separated
// string in Features[FeatureIndexedMetadataKeys].
func (l Ledger) GetIndexedMetadataKeys() []string {
val := l.Features[features.FeatureIndexedMetadataKeys]
if val == "" {
return nil
}
return strings.Split(val, ",")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize tokens from the comma-separated feature value.

Line 50 preserves whitespace and empty tokens, so a common config like source_wallet_id, destination_wallet_id returns " destination_wallet_id" and the downstream exact slices.Contains(..., key) check will not enable the indexed predicate for that key. Trim and skip empty entries before returning.

Proposed parsing fix
 func (l Ledger) GetIndexedMetadataKeys() []string {
 	val := l.Features[features.FeatureIndexedMetadataKeys]
 	if val == "" {
 		return nil
 	}
-	return strings.Split(val, ",")
+
+	keys := make([]string, 0, strings.Count(val, ",")+1)
+	for _, key := range strings.Split(val, ",") {
+		key = strings.TrimSpace(key)
+		if key != "" {
+			keys = append(keys, key)
+		}
+	}
+	return keys
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// GetIndexedMetadataKeys returns the list of metadata keys for which the query
// builder will emit a functional-index-compatible predicate (metadata ->> 'key' = 'value')
// instead of the default JSONB containment form. The list is stored as a comma-separated
// string in Features[FeatureIndexedMetadataKeys].
func (l Ledger) GetIndexedMetadataKeys() []string {
val := l.Features[features.FeatureIndexedMetadataKeys]
if val == "" {
return nil
}
return strings.Split(val, ",")
// GetIndexedMetadataKeys returns the list of metadata keys for which the query
// builder will emit a functional-index-compatible predicate (metadata ->> 'key' = 'value')
// instead of the default JSONB containment form. The list is stored as a comma-separated
// string in Features[FeatureIndexedMetadataKeys].
func (l Ledger) GetIndexedMetadataKeys() []string {
val := l.Features[features.FeatureIndexedMetadataKeys]
if val == "" {
return nil
}
keys := make([]string, 0, strings.Count(val, ",")+1)
for _, key := range strings.Split(val, ",") {
key = strings.TrimSpace(key)
if key != "" {
keys = append(keys, key)
}
}
return keys
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/ledger.go` around lines 41 - 50, The GetIndexedMetadataKeys method
does not normalize the tokens parsed from the comma-separated feature value,
preserving whitespace around delimiters. When splitting "source_wallet_id,
destination_wallet_id" by comma, the result includes " destination_wallet_id"
with leading space, which causes downstream slices.Contains checks to fail.
Modify the return statement to iterate through each token from strings.Split,
trim whitespace from each token using strings.TrimSpace, skip any empty entries
after trimming, and return only the cleaned, non-empty tokens.

Comment on lines +151 to +154
if slices.Contains(h.store.ledger.GetIndexedMetadataKeys(), key) {
return "metadata ->> ? = ?", []any{key, value}, nil
}
return "metadata @> ?", []any{map[string]any{key: value}}, nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize indexed metadata keys before membership checks.

Space-padded feature values (for example "source_wallet_id, destination_wallet_id") won’t match key exactly, so the indexed branch is silently skipped.

Suggested fix
// internal/ledger.go
func (l Ledger) GetIndexedMetadataKeys() []string {
	val := l.Features[features.FeatureIndexedMetadataKeys]
	if val == "" {
		return nil
	}
-	return strings.Split(val, ",")
+	raw := strings.Split(val, ",")
+	keys := make([]string, 0, len(raw))
+	for _, k := range raw {
+		k = strings.TrimSpace(k)
+		if k != "" {
+			keys = append(keys, k)
+		}
+	}
+	return keys
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if slices.Contains(h.store.ledger.GetIndexedMetadataKeys(), key) {
return "metadata ->> ? = ?", []any{key, value}, nil
}
return "metadata @> ?", []any{map[string]any{key: value}}, nil
func (l Ledger) GetIndexedMetadataKeys() []string {
val := l.Features[features.FeatureIndexedMetadataKeys]
if val == "" {
return nil
}
raw := strings.Split(val, ",")
keys := make([]string, 0, len(raw))
for _, k := range raw {
k = strings.TrimSpace(k)
if k != "" {
keys = append(keys, k)
}
}
return keys
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/resource_transactions.go` around lines 151 - 154, The
code is checking if a key exists in indexed metadata keys using slices.Contains
with h.store.ledger.GetIndexedMetadataKeys(), but indexed metadata keys may
contain space padding (like "source_wallet_id, destination_wallet_id"), causing
the exact match to fail and silently skipping the indexed branch. Normalize both
the key parameter and the indexed metadata keys by trimming whitespace before
performing the membership check in the slices.Contains call to ensure
space-padded keys are properly matched.

Comment on lines +193 to +210
// Replicate what resource_transactions.ResolveFilter + BuildDataset produce.
// ledgerName and value are test-controlled alphanumeric strings, safe to embed.
var predicate string
if filter == "@>" {
// %q adds Go double-quotes; JSON string values need double-quotes, so this is correct.
predicate = fmt.Sprintf(`metadata @> '{"source_wallet_id": %q}'`, value)
} else {
// SQL string literals use single quotes.
predicate = fmt.Sprintf(`metadata ->> 'source_wallet_id' = '%s'`, value)
}

sql := fmt.Sprintf(`
EXPLAIN (FORMAT TEXT)
SELECT id FROM %q.transactions
WHERE ledger = '%s'
AND %s
ORDER BY id DESC LIMIT 16
`, schema, ledgerName, predicate)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

plan_after is not shape-equivalent to the real paginator query.

This EXPLAIN path hardcodes metadata ->> 'source_wallet_id' = ..., but runtime filtering uses metadata ->> ? = ?. That can make the logged plan unrepresentative of the actual query path.

Suggested fix
 func explainAnalyze(t *testing.T, store *ledgerstore.Store, filter string, value string) string {
@@
-	var predicate string
-	if filter == "@>" {
-		predicate = fmt.Sprintf(`metadata @> '{"source_wallet_id": %q}'`, value)
-	} else {
-		predicate = fmt.Sprintf(`metadata ->> 'source_wallet_id' = '%s'`, value)
-	}
-
-	sql := fmt.Sprintf(`
-		EXPLAIN (FORMAT TEXT)
-		SELECT id FROM %q.transactions
-		WHERE ledger = '%s'
-		AND %s
-		ORDER BY id DESC LIMIT 16
-	`, schema, ledgerName, predicate)
-
-	rows, err := store.GetDB().QueryContext(ctx, sql)
+	var (
+		sql  string
+		args []any
+	)
+	if filter == "@>" {
+		sql = fmt.Sprintf(`
+			EXPLAIN (FORMAT TEXT)
+			SELECT id FROM %q.transactions
+			WHERE ledger = $1
+			AND metadata @> $2::jsonb
+			ORDER BY id DESC LIMIT 16
+		`, schema)
+		args = []any{ledgerName, fmt.Sprintf(`{"source_wallet_id":%q}`, value)}
+	} else {
+		sql = fmt.Sprintf(`
+			EXPLAIN (FORMAT TEXT)
+			SELECT id FROM %q.transactions
+			WHERE ledger = $1
+			AND metadata ->> $2 = $3
+			ORDER BY id DESC LIMIT 16
+		`, schema)
+		args = []any{ledgerName, "source_wallet_id", value}
+	}
+
+	rows, err := store.GetDB().QueryContext(ctx, sql, args...)
 	require.NoError(t, err)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Replicate what resource_transactions.ResolveFilter + BuildDataset produce.
// ledgerName and value are test-controlled alphanumeric strings, safe to embed.
var predicate string
if filter == "@>" {
// %q adds Go double-quotes; JSON string values need double-quotes, so this is correct.
predicate = fmt.Sprintf(`metadata @> '{"source_wallet_id": %q}'`, value)
} else {
// SQL string literals use single quotes.
predicate = fmt.Sprintf(`metadata ->> 'source_wallet_id' = '%s'`, value)
}
sql := fmt.Sprintf(`
EXPLAIN (FORMAT TEXT)
SELECT id FROM %q.transactions
WHERE ledger = '%s'
AND %s
ORDER BY id DESC LIMIT 16
`, schema, ledgerName, predicate)
// Replicate what resource_transactions.ResolveFilter + BuildDataset produce.
// ledgerName and value are test-controlled alphanumeric strings, safe to embed.
var (
sql string
args []any
)
if filter == "@>" {
sql = fmt.Sprintf(`
EXPLAIN (FORMAT TEXT)
SELECT id FROM %q.transactions
WHERE ledger = $1
AND metadata @> $2::jsonb
ORDER BY id DESC LIMIT 16
`, schema)
args = []any{ledgerName, fmt.Sprintf(`{"source_wallet_id":%q}`, value)}
} else {
sql = fmt.Sprintf(`
EXPLAIN (FORMAT TEXT)
SELECT id FROM %q.transactions
WHERE ledger = $1
AND metadata ->> $2 = $3
ORDER BY id DESC LIMIT 16
`, schema)
args = []any{ledgerName, "source_wallet_id", value}
}
rows, err := store.GetDB().QueryContext(ctx, sql, args...)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/transactions_deriv_sim_test.go` around lines 193 -
210, The EXPLAIN query in the test hardcodes the filter value directly into the
SQL string using fmt.Sprintf with %s in the else branch of the filter condition,
but the actual runtime paginator uses parameterized queries with placeholders.
To make the test query shape-equivalent to the real paginator query, modify the
predicate construction in the else block to use a parameterized placeholder (?)
instead of embedding the value directly with string formatting, so the EXPLAIN
plan accurately reflects what the runtime query path will execute.

Comment on lines +229 to +233
totalRows := getEnvInt("DERIV_SIM_ROWS", 1_000_000)
walletRows := getEnvInt("DERIV_SIM_WALLET", 50)

t.Logf("simulation: %d total rows, %d wallet rows (%.4f%% selectivity)",
totalRows, walletRows, float64(walletRows)/float64(totalRows)*100)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate DERIV_SIM_ROWS / DERIV_SIM_WALLET bounds before use.

totalRows is used as a divisor; DERIV_SIM_ROWS=0 will panic, and invalid ranges can silently produce misleading simulations.

Suggested guardrails
 totalRows := getEnvInt("DERIV_SIM_ROWS", 1_000_000)
 walletRows := getEnvInt("DERIV_SIM_WALLET", 50)
+require.Greater(t, totalRows, 0, "DERIV_SIM_ROWS must be > 0")
+require.GreaterOrEqual(t, walletRows, 0, "DERIV_SIM_WALLET must be >= 0")
+require.LessOrEqual(t, walletRows, totalRows, "DERIV_SIM_WALLET must be <= DERIV_SIM_ROWS")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
totalRows := getEnvInt("DERIV_SIM_ROWS", 1_000_000)
walletRows := getEnvInt("DERIV_SIM_WALLET", 50)
t.Logf("simulation: %d total rows, %d wallet rows (%.4f%% selectivity)",
totalRows, walletRows, float64(walletRows)/float64(totalRows)*100)
totalRows := getEnvInt("DERIV_SIM_ROWS", 1_000_000)
walletRows := getEnvInt("DERIV_SIM_WALLET", 50)
require.Greater(t, totalRows, 0, "DERIV_SIM_ROWS must be > 0")
require.GreaterOrEqual(t, walletRows, 0, "DERIV_SIM_WALLET must be >= 0")
require.LessOrEqual(t, walletRows, totalRows, "DERIV_SIM_WALLET must be <= DERIV_SIM_ROWS")
t.Logf("simulation: %d total rows, %d wallet rows (%.4f%% selectivity)",
totalRows, walletRows, float64(walletRows)/float64(totalRows)*100)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/storage/ledger/transactions_deriv_sim_test.go` around lines 229 -
233, The totalRows variable obtained from getEnvInt for DERIV_SIM_ROWS is used
as a divisor in the selectivity calculation without validation, which will cause
a panic if set to zero and could produce misleading results if set to an invalid
value. Add validation after retrieving totalRows and walletRows using getEnvInt
to ensure totalRows is greater than zero, walletRows is positive, and walletRows
does not exceed totalRows. If validation fails, either use sensible defaults or
fail the test early with a clear error message before attempting the division in
the t.Logf call.

Comment thread pkg/features/features.go
Comment on lines +26 to +31
// FeatureIndexedMetadataKeys is a comma-separated list of metadata keys for which the query builder
// emits a functional-index-compatible predicate (metadata ->> 'key' = 'value') instead of the default
// JSONB containment form (metadata @> '{"key":"value"}'). A matching partial functional index must
// exist on the ledger's transactions table for the rewrite to actually speed up the query.
// Value: comma-separated key names, e.g. "source_wallet_id,destination_wallet_id". Empty = disabled.
FeatureIndexedMetadataKeys = "INDEXED_METADATA_KEYS"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Constrain this open-ended flag to string-valued indexed metadata keys.

The nil entry on Line 56 plus Line 65-Line 66 accepts arbitrary key lists. Downstream ResolveFilter rewrites every configured key to metadata ->> ? = ?, which is text extraction and is not equivalent to JSONB containment for numeric/bool/object metadata values; misconfigured keys can return different rows or fail instead of falling back. Add a dedicated validator/parser for INDEXED_METADATA_KEYS that enforces the intended string-valued indexed-key contract, or restrict it to the currently indexed wallet-id keys.

Also applies to: 56-56, 65-66

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/features/features.go` around lines 26 - 31, The
FeatureIndexedMetadataKeys flag accepts arbitrary metadata key lists without
validation, but downstream code in ResolveFilter rewrites these keys using text
extraction (metadata ->> ? = ?) which only works correctly for string-valued
metadata. This causes incorrect behavior for numeric, boolean, or object
metadata values. Add a dedicated validator or parser function that processes the
INDEXED_METADATA_KEYS configuration value (referenced at lines 56 and 65-66) to
enforce that only string-valued indexed metadata keys are allowed, either by
validating against a whitelist of known string-indexed keys like wallet IDs or
by implementing a contract validator that ensures type safety for the configured
keys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants