starknet_transaction_prover: retry storage-proof RPC fetches with backoff #14396

Open
l-henri wants to merge 1 commit into starkware-libs:main from l-henri:storage-proofs-rpc-retry-main

@l-henri l-henri commented Jun 5, 2026

Problem

Proving a virtual block fetches a storage proof per 100-key chunk of touched state via RpcStorageProofsProvider::fetch_single_proof (crates/starknet_transaction_prover/src/running/storage_proofs.rs). Against a public RPC node, a burst of these gets rate-limited. The call is a bare get_storage_proof().await? with no retry, so a single HTTP 429/5xx aborts a prove that has already run for minutes.

The failure is also opaque: a throttling node returns a non-JSON body, which starknet-rs fails to deserialize. The error surfaces as ProviderError::Other(..) carrying the message "expected value at line 1 column 1", not as a structured JSON-RPC error.

Field data: a controlled replay of 150 small requests against a public node returned 137× HTTP 429. A single 429 anywhere in input collection killed the whole prove.

Fix

Retry transient get_storage_proof failures with exponential backoff:

  • 5 attempts, 0.5s → 8s (doubling, capped), configurable via STARKNET_PROVER_RPC_MAX_RETRIES.
  • is_retryable_rpc_error() retries rate-limit / 5xx / transport / non-JSON-parse failures.
  • Deterministic structured JSON-RPC errors (block-not-found, etc.) are explicitly NOT retried — detected via the same JsonRpcClientError::JsonRpcError downcast already used in impl From<ProviderError> for ProofProviderError, so real errors still fail fast.
  • A tracing::warn! is emitted per retry with attempt / backoff / error context.

No new dependencies (uses tokio::time + tracing, both already in the crate).
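
The policy described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code in the PR: `backoff_delay` and `retry_with_backoff` are hypothetical names, the sleep function is injected so the sketch needs no async runtime (the PR itself uses tokio::time), and the retry budget is counted as retries after the first call, which is an assumption.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (0-based): 0.5s, doubling, capped at 8s.
/// The constants mirror the PR description; the function name is hypothetical.
fn backoff_delay(retry: u32) -> Duration {
    const INITIAL_MS: u64 = 500;
    const MAX_MS: u64 = 8_000;
    // checked_shl avoids a shift overflow for very large retry counts.
    let factor = 1u64.checked_shl(retry).unwrap_or(u64::MAX);
    Duration::from_millis(INITIAL_MS.saturating_mul(factor).min(MAX_MS))
}

/// Calls `op` until it succeeds, the error is not retryable, or
/// `max_retries` retries have been spent. `sleep` is injected so the
/// caller decides how to wait (assumption: the real code awaits a timer).
fn retry_with_backoff<T, E>(
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retries = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Transient failure with budget left: wait, then try again.
            Err(err) if retries < max_retries && is_retryable(&err) => {
                sleep(backoff_delay(retries));
                retries += 1;
            }
            // Deterministic failure, or budget exhausted: fail fast.
            Err(err) => return Err(err),
        }
    }
}
```

Passing `std::thread::sleep` as `sleep` gives blocking behavior; injecting a recorder instead makes the schedule easy to check in a unit test without waiting.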

Testing

  • cargo check -p starknet_transaction_prover passes. (Compiled on the PRIVACY-0.14.2-RC.6 line, where this was found; main's error model — the From<ProviderError> downcast — and deps are identical, and the patch applies to main unchanged.)
  • Not yet exercised against a live throttling node; backoff behavior under real 429s is unverified.

…koff

Proving a virtual block fetches a storage proof per chunk of touched state
(fetch_single_proof). Against a public node these bursts get rate-limited: a
single HTTP 429/5xx arrives as a non-JSON body that starknet-rs fails to parse,
surfacing as `ProviderError::Other(expected value at line 1 column 1)`. With no
retry, one such transient failure anywhere in input collection aborted a prove
that had already run for minutes.

Retry transient get_storage_proof failures with exponential backoff (5 attempts,
0.5s..8s; configurable via STARKNET_PROVER_RPC_MAX_RETRIES). Structured upstream
JSON-RPC errors (block-not-found, etc.) are deterministic and explicitly NOT
retried, mirroring the downcast in `impl From<ProviderError> for ProofProviderError`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor Bot commented Jun 5, 2026

PR Summary

Low Risk
Scoped to RPC fetch resilience in the prover; no changes to proof logic or state commitment, though string-based retry heuristics could occasionally retry non-transient errors.

Overview
Adds transient retry with exponential backoff around each get_storage_proof call in RpcStorageProofsProvider::fetch_single_proof, so a single HTTP 429/5xx or transport/parse failure during proof input collection no longer aborts a long-running prove.

Retry policy defaults to 5 attempts with delays 0.5s → 8s (doubling, capped), overridable via STARKNET_PROVER_RPC_MAX_RETRIES. New is_retryable_rpc_error treats rate limits, gateway errors, timeouts, and non-JSON throttle pages as retryable, but skips retries for structured JSON-RPC errors (same JsonRpcClientError::JsonRpcError downcast as ProofProviderError conversion). Each retry logs a tracing::warn! with block, attempt, backoff, and error.

Reviewed by Cursor Bugbot for commit 16bb5d2.

l-henri commented Jun 5, 2026

Henri writing, not an agent.
This is obviously an automated PR. It does solve a problem I'm facing. I figured I'd open the PR in case it's useful to others. I'm happy to discuss this further with maintainers and do what's needed to get the PR in, if it's useful.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


    .is_some_and(|e| matches!(e, JsonRpcClientError::JsonRpcError(_)));
if is_structured_rpc_error {
    return false;
}

JSON-RPC rate limits not retried

Medium Severity

is_retryable_rpc_error treats every ProviderError::Other downcast to JsonRpcClientError::JsonRpcError as non-retryable. Nodes that answer throttling with a valid JSON-RPC error (e.g. “too many requests”) therefore fail on the first attempt, even though the change is meant to retry rate limits.
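
One possible way to address this, sketched under simplifying assumptions: inspect the message of a structured error before rejecting it. `RpcFailure` is a hypothetical stand-in for the real starknet-rs error types, the matched phrases are illustrative guesses at what a throttling node might send, and treating every unstructured failure as retryable is a simplification of the PR's heuristics.

```rust
/// Hypothetical, simplified stand-in for the provider error.
/// These are NOT the starknet-rs types used by the PR.
#[derive(Debug)]
enum RpcFailure {
    /// Transport failure or a non-JSON body (e.g. an HTML throttle page).
    Unstructured(String),
    /// A well-formed JSON-RPC error object.
    Structured { message: String },
}

/// Returns true when a message reads like throttling.
/// The phrase list is an assumption; real nodes may word this differently.
fn looks_rate_limited(message: &str) -> bool {
    let lower = message.to_ascii_lowercase();
    ["too many requests", "rate limit"]
        .iter()
        .any(|needle| lower.contains(needle))
}

/// Unstructured failures are retried in this sketch; structured errors are
/// retried only if they look like rate limiting, so deterministic errors
/// such as block-not-found still fail fast.
fn is_retryable(err: &RpcFailure) -> bool {
    match err {
        RpcFailure::Unstructured(_) => true,
        RpcFailure::Structured { message } => looks_rate_limited(message),
    }
}
```

Matching on message text is brittle; if the node documents a dedicated error code for throttling, matching on that code would likely be more robust.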

match result {
    Ok(storage_proof) => return Ok(storage_proof),
    Err(err) if attempt < max_retries && is_retryable_rpc_error(&err) => {
        attempt += 1;

Retry cap exceeds documented attempts

Low Severity

Docs and the PR describe STARKNET_PROVER_RPC_MAX_RETRIES as the number of attempts (default five), but fetch_single_proof retries while attempt < max_retries after each failure, yielding one initial call plus up to max_retries retries—six RPC calls and five backoff sleeps by default, not the documented five attempts and ~7.5s wait.
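
The arithmetic behind the two readings can be checked with a short sketch. The function names are hypothetical; the schedule (0.5s, doubling, capped at 8s) is taken from the PR description.

```rust
use std::time::Duration;

/// Total time spent sleeping across `sleeps` backoff pauses,
/// using 0.5s, doubling, capped at 8s.
fn total_backoff(sleeps: u32) -> Duration {
    (0..sleeps)
        // 500 << 4 == 8000, so clamping the shift at 4 applies the cap
        // and also prevents overflow for large counts.
        .map(|i| Duration::from_millis(500u64 << i.min(4)))
        .sum()
}

/// Budget read as total attempts: N calls, N - 1 sleeps.
fn calls_and_sleeps_as_attempts(max_attempts: u32) -> (u32, u32) {
    (max_attempts, max_attempts.saturating_sub(1))
}

/// Budget read as retries after the first call: N + 1 calls, N sleeps.
fn calls_and_sleeps_as_retries(max_retries: u32) -> (u32, u32) {
    (max_retries.saturating_add(1), max_retries)
}
```

With a budget of 5, the "attempts" reading gives 5 calls and 4 sleeps (0.5 + 1 + 2 + 4 = 7.5s), while the "retries" reading gives 6 calls and 5 sleeps (7.5 + 8 = 15.5s).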
