feat(rpcconsumer): provider whitelist restricting relays per chain and geolocation #2310

Open: avitenzer wants to merge 8 commits into main from feat/provider-whitelist

avitenzer (Collaborator) commented Jun 1, 2026

Summary

Adds an operator-controlled provider whitelist to rpcconsumer that restricts which providers the consumer is willing to relay to, keyed by (provider, chain, geolocation).

  • List present (configured + loaded): the consumer relays only to whitelisted providers for its chain and its own region (--geolocation).
  • List absent (flag empty, or not yet loaded): the consumer keeps its current relay behavior (relay to any paired provider).

The list is a JSON document, refreshed hourly (interval configurable) in the background, sourced from GitHub/GitLab or a local file. When the flag is empty, no refresh loop runs at all.

Why geolocation (MAG-1987)

The Cosmos pilot needs per-region control — "1 spec per geolocation and per chain" — so a top-performing provider can be whitelisted for one region (e.g. EU) and blocked from others. Because Lava's on-chain geolocation pairing filter is disabled, a consumer's pairing already spans all regions; this client-side filter is what pins each per-region gateway to its chosen providers.

Configuration

--providers-whitelist-config            local JSON file path OR a GitHub/GitLab directory URL
--providers-whitelist-refresh-interval  default 1h
--providers-whitelist-token             dedicated token for a remote source in a different repo;
                                         falls back to --github-token / --gitlab-token when empty

GitHub/GitLab fetching reuses the exact same machinery as specs (URL parsing, contents-API directory listing, authenticated raw download, timeouts). A remote source lists one directory and loads only the top-level .json files in it (subdirectories and non-json files ignored); their chains maps are unioned.
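
The token fallback described for --providers-whitelist-token can be sketched as below. This is a minimal illustration, not the PR's code: resolveToken is a hypothetical stand-in for the resolveTokenForSource helper (whose real signature is not shown here), and detecting the provider by host substring is an assumption made for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveToken is a hypothetical helper, not the PR's resolveTokenForSource.
// A dedicated whitelist token wins when set. Otherwise the host in the source
// URL selects the GitHub or GitLab token. A local file path gets no token.
func resolveToken(source, whitelistToken, githubToken, gitlabToken string) string {
	if whitelistToken != "" {
		return whitelistToken
	}
	switch {
	case strings.Contains(source, "github.com"): // assumption: host substring match
		return githubToken
	case strings.Contains(source, "gitlab.com"): // assumption: host substring match
		return gitlabToken
	default:
		return ""
	}
}

func main() {
	// Placeholder URL and token values, for illustration only.
	src := "https://github.com/example-org/whitelists/tree/main/lists"
	fmt.Println(resolveToken(src, "", "gh-token", "gl-token"))
}
```

With this shape, a single-credential setup keeps working unchanged because the dedicated token is simply left empty.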

JSON schema

Keyed chain → geolocation → providers:

{
  "chains": {
    "ETH1": { "EU": ["lava@1abc...", "lava@1def..."], "AS": ["lava@1ghi...", "lava@1abc..."] },
    "LAV1": { "EU": ["lava@1abc..."] }
  }
}

Region keys are the on-chain codes parsed by planstypes.ParseGeoEnum: a name (EU, AS, AF, AU, USC/USE/USW), a raw bitmask (2), a comma-combined set (EU,AS), or GL (any region). Addresses are matched exactly (on-chain bech32). The same provider may appear under multiple regions.
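
As a rough sketch of how a file in this schema could be decoded, the snippet below unmarshals the chains map and rejects the old top-level "providers" layout. The type and function names are hypothetical, and region keys are left as raw strings because the planstypes.ParseGeoEnum signature is not shown in this description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// whitelistFile mirrors the documented layout: chain -> region key -> providers.
// Providers is declared only so the old top-level schema can be detected.
type whitelistFile struct {
	Chains    map[string]map[string][]string `json:"chains"`
	Providers json.RawMessage                `json:"providers"`
}

// parseWhitelist is a hypothetical helper: it returns the chains map, or an
// error for malformed JSON or for a file still using the old schema.
func parseWhitelist(data []byte) (map[string]map[string][]string, error) {
	var f whitelistFile
	if err := json.Unmarshal(data, &f); err != nil {
		return nil, fmt.Errorf("malformed whitelist JSON: %w", err)
	}
	if f.Providers != nil {
		return nil, errors.New(`old "providers" schema found; migrate to "chains"`)
	}
	return f.Chains, nil
}

func main() {
	doc := []byte(`{"chains":{"LAV1":{"EU":["lava@1abc..."]}}}`)
	chains, err := parseWhitelist(doc)
	fmt.Println(chains, err)
}
```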

Geolocation matching

Each per-chain ConsumerSessionManager (CSM) queries the whitelist with its own ChainID and the consumer's --geolocation. A GL bucket matches any region; otherwise the consumer's geolocation must overlap the region bitmask (which reduces to equality for a single-region gateway). This is a per-region-gateway model: production runs one consumer per region (e.g. EU = --geolocation 2), each relaying only to the providers listed for its region.
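
The matching rule can be illustrated with plain bitmasks. This is a sketch under stated assumptions: only EU = 2 is confirmed by the text above, while the AS and GL values and the regionMatches name are hypothetical and are not read from planstypes.

```go
package main

import "fmt"

// Illustrative bit values. Only EU = 2 comes from the description;
// geoAS and geoGL are assumptions for this sketch.
const (
	geoEU uint64 = 2
	geoAS uint64 = 32
	geoGL uint64 = 0xFFFF // assumed "any region" value
)

// regionMatches reports whether a whitelist bucket applies to this consumer.
// A GL bucket always matches; otherwise the two bitmasks must share a bit.
func regionMatches(bucket, consumerGeo uint64) bool {
	if bucket == geoGL {
		return true
	}
	return bucket&consumerGeo != 0
}

func main() {
	fmt.Println(regionMatches(geoEU, geoEU))       // same single region
	fmt.Println(regionMatches(geoAS, geoEU))       // different regions
	fmt.Println(regionMatches(geoEU|geoAS, geoAS)) // combined "EU,AS" bucket
}
```

For a gateway configured with a single region bit, the overlap test can only succeed when the bucket contains that same bit, which is why it behaves like equality in the per-region-gateway model.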

Design notes

  • ProviderWhitelist (lavasession): immutable index (chain → geo → provider set) behind an atomic.Pointer, so the per-relay IsAllowed hot path is lock-free; the hourly refresh swaps the pointer atomically. Passthrough until the first successful load.
  • Selection paths guarded (4): optimizer, header-select, and sticky session (all via the single validAddresses filter), plus blocked-provider recovery (per-candidate guard — the list refreshes on its own clock, so a provider whitelisted when blocked can be de-listed before recovery). Note: the emergency backup-provider fallback is not whitelist-filtered.
  • Filtering is applied at the selection call site, not inside CalculateAddonValidAddresses, so an empty intersection degrades to a clean PairingListEmptyError rather than triggering validAddresses reset loops (covered by a regression test).
  • specfetcher refactor is behavior-preserving: FetchAllSpecs layers on FetchAllRawFiles; existing spec behavior is unchanged.
  • Token resolution: --providers-whitelist-token wins when set; otherwise falls back to --github-token / --gitlab-token by provider (testable resolveTokenForSource).
  • Fail-closed semantics: a startup fetch failure retries on a short interval until the first success (the consumer stays in passthrough until then); transient or malformed refreshes keep the last-known-good list; an empty (but loaded) list is logged loudly and relays to nobody.
  • Old-schema rejection: an old top-level {"providers":[...]} file is now rejected with a hard error (not silently skipped), so an un-migrated file can't drop the consumer into passthrough.
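
The atomic-snapshot design from the first note can be sketched roughly as follows. Names are hypothetical and the geolocation level is omitted to keep the example short; it only shows the pattern of an immutable index swapped behind an atomic.Pointer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// index is an immutable snapshot: chain -> set of allowed providers.
// It must never be mutated after it has been published.
type index map[string]map[string]struct{}

type whitelist struct {
	current atomic.Pointer[index] // nil until the first successful load
}

// IsAllowed is lock-free: one atomic load, then reads on an immutable map.
func (w *whitelist) IsAllowed(chain, provider string) bool {
	snap := w.current.Load()
	if snap == nil {
		return true // passthrough until the first successful load
	}
	_, ok := (*snap)[chain][provider]
	return ok
}

// Replace publishes a new snapshot. Concurrent readers observe either the
// old snapshot or the new one, never a partially built index.
func (w *whitelist) Replace(next index) {
	w.current.Store(&next)
}

func main() {
	w := &whitelist{}
	fmt.Println(w.IsAllowed("ETH1", "lava@1abc...")) // not loaded yet
	w.Replace(index{"ETH1": {"lava@1abc...": {}}})
	fmt.Println(w.IsAllowed("ETH1", "lava@1def..."))
}
```

Because each refresh builds a fresh index and swaps the pointer, the per-relay check never takes a lock, and an empty but loaded index denies every provider rather than falling back to passthrough.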

Migration

The schema changed from {"providers":[...]} (chain-only) to {"chains":{...}} (chain → geolocation → providers). Old-format files are rejected loudly — the whitelist file must be converted to the new structure and rolled out together with this build.

Scope

rpcconsumer only. rpcsmartrouter shares the selection code but is never injected a whitelist, so its behavior is unchanged.

Docs

docs/provider-whitelist.md covers purpose, flags, JSON schema, region codes, the per-region-gateway model, repo layout, GitHub fetching (same mechanism as specs), the guarded selection paths, migration, and fail modes.

Testing

  • Value type: parse/empty/malformed, hit/miss, per-chain and per-geolocation isolation, same-provider-in-multiple-regions, GL wildcard, named-vs-numeric region keys, invalid-region rejection, old-schema rejection, union + skip non-conforming across files.
  • CSM filter: exclusion across the guarded selection paths, filter-by-consumer-geolocation, a GetSessions end-to-end test, and an empty-intersection "no reset storm" regression test.
  • Fetcher: local-file load, last-known-good on malformed refresh, passthrough when the source is missing, and token resolution.
  • specfetcher: FetchAllRawFiles over a mocked GitHub API, and a spec-path parity test.
  • go build ./..., go vet, gofmt, and the lavasession / rpcconsumer / specfetcher suites all pass.

🤖 Generated with Claude Code

Adds an operator-controlled whitelist that restricts which providers the
consumer relays to, keyed by (provider, chain). When configured and loaded,
the consumer relays only to whitelisted providers; when absent it keeps the
current relay behavior.

- ProviderWhitelist value (lavasession): atomic-snapshot, lock-free IsAllowed,
  passthrough until the first successful load.
- Hourly fetcher (rpcconsumer): a GitHub/GitLab directory URL fetched the same
  way as specs, or a local JSON file; short-interval retry until the first load,
  last-known-good on transient/malformed refresh. Only started when configured,
  so an empty flag means no refresh loop runs.
- specfetcher: behavior-preserving refactor exposing FetchAllRawFiles /
  FetchAllFilesFromRemote, shared by specs and the whitelist (specs keep
  identical behavior; only the terminal JSON unmarshal differs).
- CSM filtering at all five selection paths (optimizer, header-select, sticky,
  backup, and blocked-provider recovery); an empty intersection degrades to a
  clean PairingListEmptyError without triggering validAddresses resets.
- New flags: --providers-whitelist-config and --providers-whitelist-refresh-interval
  (reuses the existing --github-token / --gitlab-token).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Took 4 hours 17 minutes

Review Summary by Qodo

Add provider whitelist to restrict relays per chain with hourly refresh

✨ Enhancement

Walkthroughs

Description
• Adds operator-controlled provider whitelist restricting relays per (provider, chain) pair
• Whitelist loaded hourly from GitHub/GitLab URL or local JSON file with atomic refresh
• Filters all five selection paths: optimizer, header-select, sticky, backup, blocked-recovery
• Refactors specfetcher to expose shared raw-file fetch for specs and whitelist parsing
• New flags: --providers-whitelist-config and --providers-whitelist-refresh-interval
Diagram
flowchart LR
  Config["Config Flags<br/>whitelist-config<br/>refresh-interval"]
  Fetcher["ProviderWhitelistFetcher<br/>GitHub/GitLab/Local"]
  Whitelist["ProviderWhitelist<br/>atomic.Pointer"]
  CSM["ConsumerSessionManager<br/>per-chain"]
  Selection["Selection Paths<br/>optimizer, header, sticky<br/>backup, blocked-recovery"]
  
  Config -- "source + interval" --> Fetcher
  Fetcher -- "hourly refresh" --> Whitelist
  Whitelist -- "injected" --> CSM
  CSM -- "filters candidates" --> Selection

File Changes

1. protocol/common/cobra_common.go ⚙️ Configuration changes +16/-0
   Add whitelist config and refresh interval flags

2. protocol/lavasession/consumer_session_manager.go ✨ Enhancement +58/-0
   Inject whitelist and guard five selection paths

3. protocol/lavasession/provider_whitelist.go ✨ Enhancement +185/-0
   Immutable atomic whitelist with lock-free IsAllowed

4. protocol/lavasession/provider_whitelist_filter_test.go 🧪 Tests +201/-0
   Test whitelist filtering across all selection paths

5. protocol/lavasession/provider_whitelist_test.go 🧪 Tests +140/-0
   Test whitelist parsing, loading, and atomic updates

6. protocol/rpcconsumer/provider_whitelist_fetcher.go ✨ Enhancement +139/-0
   Hourly fetcher with short-interval retry and last-known-good

7. protocol/rpcconsumer/provider_whitelist_fetcher_test.go 🧪 Tests +55/-0
   Test local file and remote fetch with malformed handling

8. protocol/rpcconsumer/rpcconsumer.go ✨ Enhancement +39/-14
   Build and inject whitelist, start background fetcher

9. utils/specfetcher/api.go ✨ Enhancement +13/-0
   Expose FetchAllFilesFromRemote for raw file fetching

10. utils/specfetcher/fetcher.go ✨ Enhancement +72/-26
    Refactor to separate raw fetch from spec parsing

11. utils/specfetcher/github.go ✨ Enhancement +8/-8
    Rename to fetchRawFromGitHub, remove spec parsing

12. utils/specfetcher/gitlab.go ✨ Enhancement +8/-8
    Rename to fetchRawFromGitLab, remove spec parsing

13. utils/specfetcher/raw_fetch_test.go 🧪 Tests +73/-0
    Test raw file fetch and spec parity via mock GitHub API

qodo-code-review Bot commented Jun 1, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)

Action required

1. Unfiltered valid-provider count 🐞 Bug ≡ Correctness
Description
ConsumerSessionManager.GetNumberOfValidProviders() returns len(csm.validAddresses) even when a
whitelist is loaded, so callers can overestimate how many providers are actually selectable under
the whitelist. This can let cross-validation capacity checks pass with an impossible
MaxParticipants and can skew retry/reset logic that relies on the provider count.
Code

protocol/lavasession/consumer_session_manager.go[R105-106]

Evidence
The whitelist is only applied during provider selection (getValidProviderAddresses), but
GetNumberOfValidProviders still reports the unfiltered validAddresses length; meanwhile
rpcconsumer_server uses this count to validate cross-validation capacity and to drive retry/reset
heuristics. Therefore, when the whitelist excludes providers, these call sites can make decisions
based on a provider count that is higher than the actually-selectable set.

protocol/lavasession/consumer_session_manager.go[102-106]
protocol/lavasession/consumer_session_manager.go[1125-1134]
protocol/rpcconsumer/rpcconsumer_server.go[261-270]
protocol/rpcconsumer/rpcconsumer_server.go[303-311]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`GetNumberOfValidProviders()` currently returns the size of `csm.validAddresses` (the pairing/valid set), but whitelist filtering is applied later (in `getValidProviderAddresses`). When a whitelist is loaded and excludes some (or most) providers, the reported "number of valid providers" no longer matches reality for selection.

This breaks assumptions in downstream logic that uses `GetNumberOfValidProviders()` as a capacity signal (e.g., cross-validation `MaxParticipants` validation and retry/reset thresholds).

### Issue Context
- Whitelist filtering is applied inside `getValidProviderAddresses` via `filterAllowedProviders`, but `GetNumberOfValidProviders` does not consult the whitelist.
- `rpcconsumer_server.validateCrossValidationCapacity` compares `MaxParticipants` against `GetNumberOfValidProviders()`, so it can accept a `MaxParticipants` that exceeds the whitelisted subset.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[102-106]
- protocol/lavasession/consumer_session_manager.go[1125-1134]
- protocol/rpcconsumer/rpcconsumer_server.go[261-270]

### Suggested fix
Update `GetNumberOfValidProviders()` to return the count of *whitelisted* providers when:
1) a whitelist is configured (`csm.providerWhitelist != nil`), and
2) it has loaded at least once (`snapshot() != nil`).

Implementation sketch:
- Under the existing `csm.lock.RLock()`, snapshot the whitelist once.
- If snapshot is nil (not loaded) or whitelist is nil, return `len(csm.validAddresses)` (passthrough semantics).
- Otherwise iterate `csm.validAddresses` and count entries allowed for `csm.rpcEndpoint.ChainID`.

This keeps behavior unchanged when the feature is disabled/unloaded, and makes capacity checks consistent once the whitelist is active.
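
The implementation sketch above could look roughly like this in Go. Everything here is a simplified stand-in: sessionManager, allowFunc, and numberOfValidProviders are hypothetical names, not the real ConsumerSessionManager fields or methods.

```go
package main

import (
	"fmt"
	"sync"
)

// allowFunc is a hypothetical stand-in for a loaded whitelist snapshot.
// A nil value means "not configured, or not loaded yet".
type allowFunc func(chainID, provider string) bool

// sessionManager is a simplified stand-in for the consumer session manager.
type sessionManager struct {
	lock           sync.RWMutex
	chainID        string
	validAddresses []string
	allowed        allowFunc
}

// numberOfValidProviders counts only providers that are actually selectable.
func (m *sessionManager) numberOfValidProviders() int {
	m.lock.RLock()
	defer m.lock.RUnlock()
	if m.allowed == nil {
		return len(m.validAddresses) // passthrough semantics
	}
	count := 0
	for _, addr := range m.validAddresses {
		if m.allowed(m.chainID, addr) {
			count++
		}
	}
	return count
}

func main() {
	m := &sessionManager{
		chainID:        "ETH1",
		validAddresses: []string{"lava@1abc...", "lava@1def...", "lava@1ghi..."},
		allowed:        func(_, provider string) bool { return provider == "lava@1abc..." },
	}
	fmt.Println(m.numberOfValidProviders())
}
```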

codecov Bot commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 62.45487% with 104 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| protocol/rpcconsumer/provider_whitelist_fetcher.go | 42.66% | 42 Missing and 1 partial ⚠️ |
| protocol/rpcconsumer/rpcconsumer.go | 0.00% | 28 Missing ⚠️ |
| utils/specfetcher/fetcher.go | 56.41% | 12 Missing and 5 partials ⚠️ |
| utils/specfetcher/api.go | 0.00% | 6 Missing ⚠️ |
| utils/specfetcher/gitlab.go | 0.00% | 5 Missing ⚠️ |
| protocol/lavasession/consumer_session_manager.go | 90.90% | 1 Missing and 1 partial ⚠️ |
| protocol/lavasession/provider_whitelist.go | 97.93% | 1 Missing and 1 partial ⚠️ |
| utils/specfetcher/github.go | 80.00% | 1 Missing ⚠️ |

| Flag | Coverage Δ |
|---|---|
| consensus | 9.05% <47.27%> (+0.09%) ⬆️ |
| protocol | 35.90% <66.21%> (+0.34%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| protocol/common/cobra_common.go | 0.00% <ø> (ø) |
| utils/specfetcher/github.go | 63.46% <80.00%> (+63.46%) ⬆️ |
| protocol/lavasession/consumer_session_manager.go | 71.86% <90.90%> (+4.07%) ⬆️ |
| protocol/lavasession/provider_whitelist.go | 97.93% <97.93%> (ø) |
| utils/specfetcher/gitlab.go | 0.00% <0.00%> (ø) |
| utils/specfetcher/api.go | 33.33% <0.00%> (-6.07%) ⬇️ |
| utils/specfetcher/fetcher.go | 73.21% <56.41%> (+31.03%) ⬆️ |
| protocol/rpcconsumer/rpcconsumer.go | 7.24% <0.00%> (-0.18%) ⬇️ |
| protocol/rpcconsumer/provider_whitelist_fetcher.go | 42.66% <42.66%> (ø) |

... and 8 files with indirect coverage changes


- Add a dedicated --providers-whitelist-token for the remote whitelist source,
  used when the whitelist lives in a different repo than the specs. When empty
  it falls back to --github-token / --gitlab-token (selected by provider), so
  existing single-credential setups are unaffected. Token resolution is a
  testable resolveTokenForSource helper.
- Add docs/provider-whitelist.md covering purpose, flags, JSON schema,
  GitHub-like-specs fetching, the five guarded selection paths, fail modes,
  and behavior when the list is absent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Commit time for manual adjustment:
# Took 10 minutes

github-actions Bot commented Jun 1, 2026

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit cd58861. ± Comparison against base commit b8cd292.

♻️ This comment has been updated with latest results.

avitenzer added 4 commits June 2, 2026 12:50
The remote whitelist refresh could silently replace the last-known-good
list with a partial one, contradicting the documented behavior and
potentially taking chains offline:

- A transient per-file fetch failure (500/timeout) was tolerated, so
  UpdateFromFiles swapped in a union missing the failed file's chains.
- A file that downloaded OK but had a corrupt body was silently skipped,
  with the partial union still swapped in.

Fixes:
- Add Config.FailOnPartial to the spec fetcher; FetchAllFilesFromRemote
  (whitelist-only) sets it so any per-file fetch failure aborts the whole
  fetch. Spec fetching keeps its partial-tolerant default by construction.
- In UpdateFromFiles, distinguish a non-whitelist file (valid JSON, no
  "providers" key -> still skipped for mixed-repo support) from a
  malformed-JSON file (now a hard error that keeps the last-known-good
  snapshot).
- Tighten fetcher failure logging: branch on whether a list has loaded so
  the message reflects the real effect (startup -> passthrough/allow-all;
  steady state -> keeping last-known-good).
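
The skip-versus-abort distinction in the second fix can be sketched as follows. The helper name and signature are hypothetical, and the whitelist key is passed in as a parameter rather than hardcoded. Note one simplification: a file whose top level is not a JSON object is also treated as an error here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// selectWhitelistFiles returns the names of files that carry the whitelist key.
// Valid JSON without the key is skipped (mixed-repo support), while malformed
// JSON aborts the whole update so the caller keeps its last-known-good list.
func selectWhitelistFiles(files map[string][]byte, key string) ([]string, error) {
	var accepted []string
	for name, body := range files {
		var top map[string]json.RawMessage
		if err := json.Unmarshal(body, &top); err != nil {
			return nil, fmt.Errorf("%s: malformed JSON, update aborted: %w", name, err)
		}
		if _, ok := top[key]; !ok {
			continue // not a whitelist file
		}
		accepted = append(accepted, name)
	}
	sort.Strings(accepted) // map iteration order is random; sort for stable output
	return accepted, nil
}

func main() {
	files := map[string][]byte{
		"whitelist.json": []byte(`{"providers":[]}`),
		"other.json":     []byte(`{"unrelated":true}`),
	}
	fmt.Println(selectWhitelistFiles(files, "providers"))
}
```

Returning an error for the whole batch, rather than skipping the bad file, is what prevents a partial union from replacing the previous snapshot.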

Cleanup:
- Remove the dead provider-whitelist guard in the backup-provider path:
  rpcconsumer never supplies backup providers and rpcsmartrouter (the only
  source of backups) is never injected a whitelist, so the guard could
  never fire. Remove the obsolete test and doc row.
- Drop the rpcsmartrouter reference from the whitelist doc scope section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Took 1 hour 52 minutes
…a_only_with_node_three_providers.sh`

Took 34 minutes
…th provider whitelist management

- Introduces `init_lava_only_with_node_three_providers_github.sh`, a variant of the existing node setup script.
- Enables automation for managing provider whitelist through a GitHub-hosted repository, including dynamic per-run updates.
- Supports full configuration via environment variables for flexibility and avoids hardcoding repository-specific details.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Took 5 minutes
Comment thread app/app.go
Comment thread protocol/lavasession/provider_whitelist.go
Comment thread protocol/rpcconsumer/provider_whitelist_fetcher.go Outdated
Extract the initial-load retry loop from Start() into retryInitialLoad() so
its time.Ticker is stopped via defer the moment the first load succeeds,
instead of lingering (and firing uselessly) for the whole fetcher lifetime
because the old defer was bound to Start()'s much longer scope. Behavior is
unchanged; build and existing fetcher/whitelist tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sotskov-do previously approved these changes Jun 4, 2026
…hain

The consumer provider whitelist previously keyed allowed providers by
(chain, provider) only. Per MAG-1987 the Cosmos pilot needs per-region
control -- "1 spec per geolocation and per chain" -- so a top provider can
be whitelisted for one region and blocked from others.

Change the schema to chain -> geolocation -> providers (mirroring how the
requirement was expressed) and filter using the consumer's own
--geolocation. Each per-region gateway now relays only to the providers
listed for its region; a provider absent from a region's bucket is dropped,
and an empty (chain, region) bucket fails cleanly with PairingListEmptyError.

- Region keys are parsed with planstypes.ParseGeoEnum (names like "EU", raw
  codes, "EU,AS", or "GL" = any region), so the file speaks the same geo
  vocabulary as the --geolocation flag.
- Match is bitwise overlap of the consumer geo against the region, reducing
  to equality for a single-region gateway.
- The old top-level "providers" schema is now rejected with a hard error,
  so an un-migrated file can't silently drop the consumer into passthrough.

Updates the two filter call sites, tests, docs, and the init scripts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Took 3 hours 17 minutes
avitenzer changed the title from "feat(rpcconsumer): add provider whitelist to restrict relays per chain" to "feat(rpcconsumer): provider whitelist restricting relays per chain and geolocation" on Jun 9, 2026