feat(rpcconsumer): provider whitelist restricting relays per chain and geolocation#2310
avitenzer wants to merge 8 commits into
Adds an operator-controlled whitelist that restricts which providers the consumer relays to, keyed by (provider, chain). When configured and loaded, the consumer relays only to whitelisted providers; when absent it keeps the current relay behavior.
- ProviderWhitelist value (lavasession): atomic-snapshot, lock-free IsAllowed, passthrough until the first successful load.
- Hourly fetcher (rpcconsumer): a GitHub/GitLab directory URL fetched the same way as specs, or a local JSON file; short-interval retry until the first load, last-known-good on transient/malformed refresh. Only started when configured, so an empty flag means no refresh loop runs.
- specfetcher: behavior-preserving refactor exposing FetchAllRawFiles / FetchAllFilesFromRemote, shared by specs and the whitelist (specs keep identical behavior; only the terminal JSON unmarshal differs).
- CSM filtering at all five selection paths (optimizer, header-select, sticky, backup, and blocked-provider recovery); an empty intersection degrades to a clean PairingListEmptyError without triggering validAddresses resets.
- New flags: --providers-whitelist-config and --providers-whitelist-refresh-interval (reuses the existing --github-token / --gitlab-token).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Took 4 hours 17 minutes
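The atomic-snapshot idea in the first bullet can be illustrated with a minimal sketch. It is not the PR's code: the type and method names echo the commit message, but the index layout, the `Update` signature, and the example addresses are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// whitelistIndex is an immutable snapshot: chain ID -> set of provider addresses.
// (Assumed layout for this sketch; the PR's real index may differ.)
type whitelistIndex map[string]map[string]struct{}

// ProviderWhitelist holds the current snapshot behind an atomic pointer.
// A nil pointer means nothing has loaded yet, which is treated as passthrough.
type ProviderWhitelist struct {
	current atomic.Pointer[whitelistIndex]
}

// Update builds a fresh index and swaps it in with a single atomic store.
// Readers never observe a half-built index.
func (w *ProviderWhitelist) Update(chains map[string][]string) {
	idx := make(whitelistIndex, len(chains))
	for chain, providers := range chains {
		set := make(map[string]struct{}, len(providers))
		for _, p := range providers {
			set[p] = struct{}{}
		}
		idx[chain] = set
	}
	w.current.Store(&idx)
}

// IsAllowed takes no lock: it loads the pointer once and reads the snapshot.
func (w *ProviderWhitelist) IsAllowed(provider, chainID string) bool {
	idx := w.current.Load()
	if idx == nil {
		return true // passthrough until the first successful load
	}
	_, ok := (*idx)[chainID][provider]
	return ok
}

func main() {
	var wl ProviderWhitelist
	fmt.Println(wl.IsAllowed("lava@1abc", "ETH1")) // nothing loaded yet
	wl.Update(map[string][]string{"ETH1": {"lava@1abc"}})
	fmt.Println(wl.IsAllowed("lava@1abc", "ETH1"))
	fmt.Println(wl.IsAllowed("lava@1def", "ETH1"))
}
```

Because each refresh publishes a brand-new map and never mutates the old one, a relay that loaded the previous pointer keeps reading a consistent snapshot.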
Review Summary by Qodo
Add provider whitelist to restrict relays per chain with hourly refresh
Walkthroughs
Description
• Adds operator-controlled provider whitelist restricting relays per (provider, chain) pair
• Whitelist loaded hourly from GitHub/GitLab URL or local JSON file with atomic refresh
• Filters all five selection paths: optimizer, header-select, sticky, backup, blocked-recovery
• Refactors specfetcher to expose shared raw-file fetch for specs and whitelist parsing
• New flags: --providers-whitelist-config and --providers-whitelist-refresh-interval
Diagram
flowchart LR
Config["Config Flags<br/>whitelist-config<br/>refresh-interval"]
Fetcher["ProviderWhitelistFetcher<br/>GitHub/GitLab/Local"]
Whitelist["ProviderWhitelist<br/>atomic.Pointer"]
CSM["ConsumerSessionManager<br/>per-chain"]
Selection["Selection Paths<br/>optimizer, header, sticky<br/>backup, blocked-recovery"]
Config -- "source + interval" --> Fetcher
Fetcher -- "hourly refresh" --> Whitelist
Whitelist -- "injected" --> CSM
CSM -- "filters candidates" --> Selection
File Changes
1. protocol/common/cobra_common.go
Code Review by Qodo
1. Unfiltered valid-provider count
- Add a dedicated --providers-whitelist-token for the remote whitelist source, used when the whitelist lives in a different repo than the specs. When empty it falls back to --github-token / --gitlab-token (selected by provider), so existing single-credential setups are unaffected. Token resolution is a testable resolveTokenForSource helper.
- Add docs/provider-whitelist.md covering purpose, flags, JSON schema, GitHub-like-specs fetching, the five guarded selection paths, fail modes, and behavior when the list is absent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Took 10 minutes
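The token fallback described in this commit might look roughly like the sketch below. `resolveToken` is a hypothetical stand-in for `resolveTokenForSource`, whose real signature is not shown on this page; passing the provider as a plain string is also an assumption.

```go
package main

import "fmt"

// resolveToken is a hypothetical stand-in for the PR's resolveTokenForSource.
// The dedicated whitelist token wins when set; otherwise the token matching
// the source's hosting provider is used.
func resolveToken(provider, whitelistToken, githubToken, gitlabToken string) string {
	if whitelistToken != "" {
		return whitelistToken
	}
	switch provider {
	case "github":
		return githubToken
	case "gitlab":
		return gitlabToken
	default:
		return "" // unknown provider: no credential to offer
	}
}

func main() {
	// Single-credential setup: the whitelist reuses the spec token.
	fmt.Println(resolveToken("github", "", "gh-token", "gl-token"))
	// Whitelist in a different repo: the dedicated token takes precedence.
	fmt.Println(resolveToken("github", "wl-token", "gh-token", "gl-token"))
}
```

Keeping the resolution in a pure function with no I/O is what makes it easy to unit test.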
	return len(csm.validAddresses)
}
1. Unfiltered valid-provider count 🐞 Bug ≡ Correctness
ConsumerSessionManager.GetNumberOfValidProviders() returns len(csm.validAddresses) even when a whitelist is loaded, so callers can overestimate how many providers are actually selectable under the whitelist. This can let cross-validation capacity checks pass with an impossible MaxParticipants and can skew retry/reset logic that relies on the provider count.
Agent Prompt
### Issue description
`GetNumberOfValidProviders()` currently returns the size of `csm.validAddresses` (the pairing/valid set), but whitelist filtering is applied later (in `getValidProviderAddresses`). When a whitelist is loaded and excludes some (or most) providers, the reported "number of valid providers" no longer matches reality for selection.
This breaks assumptions in downstream logic that uses `GetNumberOfValidProviders()` as a capacity signal (e.g., cross-validation `MaxParticipants` validation and retry/reset thresholds).
### Issue Context
- Whitelist filtering is applied inside `getValidProviderAddresses` via `filterAllowedProviders`, but `GetNumberOfValidProviders` does not consult the whitelist.
- `rpcconsumer_server.validateCrossValidationCapacity` compares `MaxParticipants` against `GetNumberOfValidProviders()`, so it can accept a `MaxParticipants` that exceeds the whitelisted subset.
### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[102-106]
- protocol/lavasession/consumer_session_manager.go[1125-1134]
- protocol/rpcconsumer/rpcconsumer_server.go[261-270]
### Suggested fix
Update `GetNumberOfValidProviders()` to return the count of *whitelisted* providers when:
1) a whitelist is configured (`csm.providerWhitelist != nil`), and
2) it has loaded at least once (`snapshot() != nil`).
Implementation sketch:
- Under the existing `csm.lock.RLock()`, snapshot the whitelist once.
- If snapshot is nil (not loaded) or whitelist is nil, return `len(csm.validAddresses)` (passthrough semantics).
- Otherwise iterate `csm.validAddresses` and count entries allowed for `csm.rpcEndpoint.ChainID`.
This keeps behavior unchanged when the feature is disabled/unloaded, and makes capacity checks consistent once the whitelist is active.
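The implementation sketch above could be expressed along these lines. This is a simplified model, not the project's `ConsumerSessionManager`: the struct, its fields, and the `isAllowed` callback are assumptions standing in for the real whitelist snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionManager is a reduced, hypothetical model of the session manager.
type sessionManager struct {
	lock           sync.RWMutex
	validAddresses []string
	chainID        string
	// isAllowed is nil when no whitelist is configured or none has loaded yet.
	isAllowed func(provider, chainID string) bool
}

// numberOfValidProviders returns the count of providers that are actually
// selectable: the whole valid set in passthrough mode, otherwise only the
// whitelisted subset for this chain.
func (m *sessionManager) numberOfValidProviders() int {
	m.lock.RLock()
	defer m.lock.RUnlock()

	if m.isAllowed == nil {
		return len(m.validAddresses) // passthrough semantics
	}
	count := 0
	for _, addr := range m.validAddresses {
		if m.isAllowed(addr, m.chainID) {
			count++
		}
	}
	return count
}

func main() {
	m := &sessionManager{
		validAddresses: []string{"lava@1abc", "lava@1def", "lava@1ghi"},
		chainID:        "ETH1",
	}
	fmt.Println(m.numberOfValidProviders()) // no whitelist: full valid set

	m.isAllowed = func(provider, chainID string) bool {
		return chainID == "ETH1" && provider == "lava@1abc"
	}
	fmt.Println(m.numberOfValidProviders()) // whitelist active: filtered count
}
```

With the filtered count, a capacity check that compares `MaxParticipants` against this number can no longer accept a value larger than the whitelisted subset.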
The remote whitelist refresh could silently replace the last-known-good list with a partial one, contradicting the documented behavior and potentially taking chains offline:
- A transient per-file fetch failure (500/timeout) was tolerated, so UpdateFromFiles swapped in a union missing the failed file's chains.
- A file that downloaded OK but had a corrupt body was silently skipped, with the partial union still swapped in.
Fixes:
- Add Config.FailOnPartial to the spec fetcher; FetchAllFilesFromRemote (whitelist-only) sets it so any per-file fetch failure aborts the whole fetch. Spec fetching keeps its partial-tolerant default by construction.
- In UpdateFromFiles, distinguish a non-whitelist file (valid JSON, no "providers" key -> still skipped for mixed-repo support) from a malformed-JSON file (now a hard error that keeps the last-known-good snapshot).
- Tighten fetcher failure logging: branch on whether a list has loaded so the message reflects the real effect (startup -> passthrough/allow-all; steady state -> keeping last-known-good).
Cleanup:
- Remove the dead provider-whitelist guard in the backup-provider path: rpcconsumer never supplies backup providers and rpcsmartrouter (the only source of backups) is never injected a whitelist, so the guard could never fire. Remove the obsolete test and doc row.
- Drop the rpcsmartrouter reference from the whitelist doc scope section.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Took 1 hour 52 minutes
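The distinction drawn in this commit between a non-whitelist file and a malformed one can be sketched as follows. The function names and the file map are hypothetical; only the rule itself (valid JSON without a "providers" key is skipped, unparseable JSON aborts the refresh) comes from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// isWhitelistFile reports whether body is a whitelist document.
// Valid JSON without a "providers" key is not an error: it is simply skipped,
// which allows the whitelist to share a directory with other JSON files.
// Anything that does not parse as a JSON object is a hard error.
func isWhitelistFile(body []byte) (bool, error) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(body, &raw); err != nil {
		return false, fmt.Errorf("malformed JSON: %w", err)
	}
	_, ok := raw["providers"]
	return ok, nil
}

// selectWhitelistFiles returns the names of the files to load. On any
// malformed file it returns an error so the caller can keep the
// last-known-good snapshot instead of swapping in a partial union.
func selectWhitelistFiles(files map[string][]byte) ([]string, error) {
	var names []string
	for name, body := range files {
		ok, err := isWhitelistFile(body)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
		if ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	files := map[string][]byte{
		"whitelist.json": []byte(`{"providers": []}`),
		"spec.json":      []byte(`{"proposal": {}}`),
	}
	fmt.Println(selectWhitelistFiles(files))

	files["broken.json"] = []byte(`{"providers": [`)
	_, err := selectWhitelistFiles(files)
	fmt.Println(err != nil)
}
```

Note that this sketch also treats valid JSON that is not an object (for example a top-level array) as malformed, which is a simplification.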
… staked addresses Took 1 hour 5 minutes
…a_only_with_node_three_providers.sh` Took 34 minutes
…th provider whitelist management
- Introduces `init_lava_only_with_node_three_providers_github.sh`, a variant of the existing node setup script.
- Enables automation for managing provider whitelist through a GitHub-hosted repository, including dynamic per-run updates.
- Supports full configuration via environment variables for flexibility and avoids hardcoding repository-specific details.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Took 5 minutes
Extract the initial-load retry loop from Start() into retryInitialLoad() so its time.Ticker is stopped via defer the moment the first load succeeds, instead of lingering (and firing uselessly) for the whole fetcher lifetime because the old defer was bound to Start()'s much longer scope. Behavior is unchanged; build and existing fetcher/whitelist tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hain
The consumer provider whitelist previously keyed allowed providers by (chain, provider) only. Per MAG-1987 the Cosmos pilot needs per-region control -- "1 spec per geolocation and per chain" -- so a top provider can be whitelisted for one region and blocked from others.
Change the schema to chain -> geolocation -> providers (mirroring how the requirement was expressed) and filter using the consumer's own --geolocation. Each per-region gateway now relays only to the providers listed for its region; a provider absent from a region's bucket is dropped, and an empty (chain, region) bucket fails cleanly with PairingListEmptyError.
- Region keys are parsed with planstypes.ParseGeoEnum (names like "EU", raw codes, "EU,AS", or "GL" = any region), so the file speaks the same geo vocabulary as the --geolocation flag.
- Match is bitwise overlap of the consumer geo against the region, reducing to equality for a single-region gateway.
- The old top-level "providers" schema is now rejected with a hard error, so an un-migrated file can't silently drop the consumer into passthrough.
Updates the two filter call sites, tests, docs, and the init scripts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Took 3 hours 17 minutes
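The bitwise-overlap match from this commit can be illustrated with a small sketch. The index type and function names are hypothetical. `EU = 2` matches the `--geolocation 2` example in the PR description; the values used here for `AS` and `GL` are assumptions for illustration only.

```go
package main

import "fmt"

type geo uint32

const (
	geoEU geo = 2      // consistent with "EU = --geolocation 2" in the description
	geoAS geo = 32     // assumed value, for illustration only
	geoGL geo = 0xFFFF // assumed "any region" wildcard value
)

// regionMatches reports whether a whitelist bucket applies to a consumer.
// A GL bucket matches any consumer; otherwise the two bitmasks must overlap,
// which reduces to equality when both sides name a single region.
func regionMatches(bucket, consumer geo) bool {
	if bucket == geoGL {
		return true
	}
	return bucket&consumer != 0
}

// geoIndex maps chain ID -> region bucket -> set of provider addresses.
type geoIndex map[string]map[geo]map[string]struct{}

// isAllowed checks every bucket of the chain that applies to the consumer.
func (idx geoIndex) isAllowed(provider, chainID string, consumer geo) bool {
	for bucket, providers := range idx[chainID] {
		if !regionMatches(bucket, consumer) {
			continue
		}
		if _, ok := providers[provider]; ok {
			return true
		}
	}
	return false
}

func set(addrs ...string) map[string]struct{} {
	s := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		s[a] = struct{}{}
	}
	return s
}

func main() {
	idx := geoIndex{"ETH1": {
		geoEU: set("lava@1abc", "lava@1def"),
		geoAS: set("lava@1ghi", "lava@1abc"),
	}}
	fmt.Println(idx.isAllowed("lava@1def", "ETH1", geoEU)) // listed for EU
	fmt.Println(idx.isAllowed("lava@1ghi", "ETH1", geoEU)) // listed for AS only
}
```

A chain with no entry yields no buckets at all, so every provider is rejected for it, which corresponds to the empty (chain, region) case described above.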
Summary
Adds an operator-controlled provider whitelist to `rpcconsumer` that restricts which providers the consumer is willing to relay to, keyed by (provider, chain, geolocation). Filtering uses the consumer's own region (`--geolocation`).
The list is a JSON document, refreshed hourly (interval configurable) in the background, sourced from GitHub/GitLab or a local file. When the flag is empty, no refresh loop runs at all.
Why geolocation (MAG-1987)
The Cosmos pilot needs per-region control — "1 spec per geolocation and per chain" — so a top-performing provider can be whitelisted for one region (e.g. EU) and blocked from others. Because Lava's on-chain geolocation pairing filter is disabled, a consumer's pairing already spans all regions; this client-side filter is what pins each per-region gateway to its chosen providers.
Configuration
GitHub/GitLab fetching reuses the exact same machinery as specs (URL parsing, contents-API directory listing, authenticated raw download, timeouts). A remote source lists one directory and loads only the top-level `.json` files in it (subdirectories and non-json files ignored); their `chains` maps are unioned.
JSON schema
Keyed chain → geolocation → providers:
{
  "chains": {
    "ETH1": {
      "EU": ["lava@1abc...", "lava@1def..."],
      "AS": ["lava@1ghi...", "lava@1abc..."]
    },
    "LAV1": {
      "EU": ["lava@1abc..."]
    }
  }
}
Region keys are the on-chain codes parsed by `planstypes.ParseGeoEnum`: a name (`EU`, `AS`, `AF`, `AU`, `USC`/`USE`/`USW`), a raw bitmask (`2`), a comma-combined set (`EU,AS`), or `GL` (any region). Addresses are matched exactly (on-chain bech32). The same provider may appear under multiple regions.
Geolocation matching
Each per-chain CSM queries the whitelist with its own `ChainID` and the consumer's `--geolocation`. A `GL` bucket matches any region; otherwise the consumer's geolocation must overlap the region bitmask (which reduces to equality for a single-region gateway). This is a per-region-gateway model: production runs one consumer per region (e.g. EU = `--geolocation 2`), each relaying only to the providers listed for its region.
Design notes
- `ProviderWhitelist` (`lavasession`): immutable index (chain → geo → provider set) behind an `atomic.Pointer`, so the per-relay `IsAllowed` hot path is lock-free; the hourly refresh swaps the pointer atomically. Passthrough until the first successful load.
- Selection paths (`validAddresses` filter), plus blocked-provider recovery (per-candidate guard: the list refreshes on its own clock, so a provider whitelisted when blocked can be de-listed before recovery). Note: the emergency backup-provider fallback is not whitelist-filtered.
- `CalculateAddonValidAddresses`: an empty intersection degrades to a clean `PairingListEmptyError` rather than triggering `validAddresses` reset loops (covered by a regression test).
- `FetchAllSpecs` layers on `FetchAllRawFiles`; existing spec behavior is unchanged.
- `--providers-whitelist-token` wins when set; otherwise falls back to `--github-token` / `--gitlab-token` by provider (testable `resolveTokenForSource`).
- An old-schema `{"providers":[...]}` file is now rejected with a hard error (not silently skipped), so an un-migrated file can't drop the consumer into passthrough.
Migration
The schema changed from `{"providers":[...]}` (chain-only) to `{"chains":{...}}` (chain → geolocation → providers). Old-format files are rejected loudly: the whitelist file must be converted to the new structure and rolled out together with this build.
Scope
`rpcconsumer` only. `rpcsmartrouter` shares the selection code but is injected no whitelist, so its current behavior is unchanged.
Docs
`docs/provider-whitelist.md` covers purpose, flags, JSON schema, region codes, the per-region-gateway model, repo layout, GitHub-like-specs fetching, the guarded selection paths, migration, and fail modes.
Testing
- `GL` wildcard, named-vs-numeric region keys, invalid-region rejection, old-schema rejection, union + skip non-conforming across files.
- `GetSessions` end-to-end test, and an empty-intersection "no reset storm" regression test.
- `FetchAllRawFiles` over a mocked GitHub API, and a spec-path parity test.
- `go build ./...`, `go vet`, gofmt, and the `lavasession` / `rpcconsumer` / `specfetcher` suites all pass.
🤖 Generated with Claude Code