
feat(sts): cache AssumeRoleWithWebIdentity responses across isolates #175

Draft
alukach wants to merge 1 commit into fix/sts-request-timeout from fix/sts-credential-cache



alukach commented Jun 30, 2026

Contributor

Draft / stacked on fix/sts-request-timeout (#172). The post_form change
builds on that PR's timeout, and the two are complementary (cache removes STS
from the hot path; timeout bounds the rare cold miss). Retarget to main once
#172 merges.

Problem

Private products federate to AWS STS (AssumeRoleWithWebIdentity) on the cold path. multistore's credential cache lives in per-isolate memory (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates — so a large fraction of requests re-run the STS exchange on the request hot path. When that exchange stalls, the worker hangs until the edge kills it, surfacing to the app as an unparseable 503. This is the root cause behind the intermittent "product won't load, self-heals on reload" reports.

Approach

Add an L2 cache for the STS response, shared across isolates within a colo via the Cloudflare Cache API — the same pattern already used for Source API responses in source_api/cache.rs. It sits under multistore's in-isolate L1 cache:

  • L1 (multistore, per-isolate): caches typed BackendCredentials, single-flights within an isolate.
  • L2 (this PR, per-colo): caches the raw STS response body, keyed by RoleArn (L1's own cache key). On a hit, the proxy skips the STS round-trip entirely.

The only seam data.source.coop controls in the mint path is FetchHttpExchange::post_form (the outbound STS call) — get_credentials and the L1 cache live inside multistore. So the L2 cache wraps post_form.
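
The read-through shape described above can be sketched as follows. This is a minimal, synchronous illustration, not the PR's code: `L2Cache`, `get_or_fetch`, and the `cacheable` predicate are hypothetical names, a `HashMap` stands in for the per-colo Cloudflare Cache API (which is async), and TTL handling is left out.

```rust
use std::collections::HashMap;

/// Stand-in for the per-colo cache tier: a map from RoleArn to the raw STS
/// response body. (Hypothetical; the real tier is the Cloudflare Cache API.)
#[derive(Default)]
pub struct L2Cache {
    entries: HashMap<String, String>,
}

impl L2Cache {
    /// Read-through lookup around the slow call.
    /// `fetch` plays the role of the outbound STS request; `cacheable`
    /// decides whether a fresh response may be stored.
    pub fn get_or_fetch<F, C>(&mut self, role_arn: Option<&str>, fetch: F, cacheable: C) -> String
    where
        F: FnOnce() -> String,
        C: Fn(&str) -> bool,
    {
        // Forms without a RoleArn bypass the cache entirely.
        let arn = match role_arn {
            Some(arn) => arn,
            None => return fetch(),
        };
        // L2 hit: return the stored body and skip the network call.
        if let Some(body) = self.entries.get(arn) {
            return body.clone();
        }
        // L2 miss: perform the call, then store only approved responses.
        let body = fetch();
        if cacheable(&body) {
            self.entries.insert(arn.to_string(), body.clone());
        }
        body
    }
}
```

Because the wrapper returns the same raw body either way, the caller (here, multistore's L1 path) cannot tell a hit from a miss, which is what lets the tier sit behind `post_form` without an API change.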

Effect: STS goes from ~once per isolate per credential lifetime → ~once per colo. The slow exchange leaves the user hot path almost entirely.

What's cached (and not)

  • Only AssumeRoleWithWebIdentity forms (role_arn_from_form returns None for other actions / Azure-GCP flows → bypass).
  • TTL = time to the response's <Expiration> minus a 300s lead (≥ multistore's 60s refresh lead, so an L2 entry always expires before L1 would call the derived credential stale).
  • STS error documents are never cached: ttl_secs returns None when there's no parseable <Expiration>.
  • On an L2 hit we still mint the (cheap, local) JWT; only the slow STS network call is skipped. Skipping the mint too would need an L1-level hook in multistore (noted below).
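
The caching rules in this list can be sketched with two pure helpers. The names `role_arn_from_form` and `ttl_secs` come from this description, but the signatures, the `(key, value)` form representation, the hand-rolled timestamp parser (`parse_utc`, `days_from_civil`), and the assumption that `<Expiration>` always has the form `YYYY-MM-DDTHH:MM:SSZ` are illustrative guesses rather than the code in sts_cache.rs.

```rust
/// Lead subtracted from the credential expiry (300s, per the description).
pub const LEAD_SECS: i64 = 300;

/// Return the RoleArn only for AssumeRoleWithWebIdentity forms; any other
/// action, or a missing field, yields None so the caller bypasses the cache.
pub fn role_arn_from_form<'a>(form: &[(&'a str, &'a str)]) -> Option<&'a str> {
    let field = |name: &str| form.iter().find(|(k, _)| *k == name).map(|(_, v)| *v);
    if field("Action")? != "AssumeRoleWithWebIdentity" {
        return None;
    }
    field("RoleArn")
}

/// Days since 1970-01-01 for a Gregorian date (days-from-civil algorithm).
pub fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let mp = (m + 9) % 12; // March is month 0 in this scheme
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse "YYYY-MM-DDTHH:MM:SSZ" into Unix seconds; None for anything else.
pub fn parse_utc(ts: &str) -> Option<i64> {
    let b = ts.as_bytes();
    let seps = [(4, b'-'), (7, b'-'), (10, b'T'), (13, b':'), (16, b':'), (19, b'Z')];
    if b.len() != 20 || seps.iter().any(|&(i, c)| b[i] != c) {
        return None;
    }
    let num = |r: std::ops::Range<usize>| -> Option<i64> { ts.get(r)?.parse::<i64>().ok() };
    let (y, mo, d) = (num(0..4)?, num(5..7)?, num(8..10)?);
    let (h, mi, s) = (num(11..13)?, num(14..16)?, num(17..19)?);
    let in_range = (1..=12).contains(&mo)
        && (1..=31).contains(&d)
        && (0..=23).contains(&h)
        && (0..=59).contains(&mi)
        && (0..=59).contains(&s);
    if !in_range {
        return None;
    }
    Some(days_from_civil(y, mo, d) * 86_400 + h * 3_600 + mi * 60 + s)
}

/// TTL for a response body: seconds until <Expiration> minus the lead.
/// None when there is no parseable <Expiration> (for example an STS error
/// document) or when the credential is already inside the lead window.
pub fn ttl_secs(body: &str, now_epoch: i64) -> Option<u64> {
    let open = "<Expiration>";
    let start = body.find(open)? + open.len();
    let end = start + body[start..].find("</Expiration>")?;
    let expires = parse_utc(body[start..end].trim())?;
    let ttl = expires - now_epoch - LEAD_SECS;
    if ttl > 0 { Some(ttl as u64) } else { None }
}
```

With a credential that expires 3600s from now, this yields a TTL of 3300s; once fewer than 300s remain, the function returns None and the response is not stored.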

Security

The cached values are short-lived, role-scoped temporary credentials, stored under a synthetic non-routable cache key (https://sts-creds.cache.internal/…, never a real edge request URL, so not externally addressable), with TTL ≤ credential lifetime, per-colo. If a deployment needs global reach, encryption-at-rest, or true cross-isolate single-flight (a cold colo can still see a small STS herd), the same cache_key/ttl_secs helpers drop into KV (global, encrypted) or a Durable Object (global, single-flight).

Follow-up (not here)

Cleaner long-term: give multistore's CredentialCache::get_or_fetch an optional runtime L2 hook (the crate doc already anticipates "a runtime can layer an additional cache tier inside the closure"). That caches typed creds at L1 and skips the JWT mint on hits too — but it's a cross-repo API change + release, vs. this which ships from data.source.coop today.

Verification

  • cargo test --test sts_cache: 8/8 (role/key/ttl helpers, incl. error-doc and near-expiry → not cached).
  • cargo check --target wasm32-unknown-unknown — clean.
  • cargo clippy --target wasm32-unknown-unknown -- -D warnings — clean.

🤖 Generated with Claude Code

multistore's credential cache is per-isolate (OIDC_PROVIDER is a OnceLock),
and Cloudflare spins up many short-lived isolates — so a large fraction of
requests re-run the STS exchange on the request hot path. When that stalls,
the worker hangs until the edge kills it, surfacing as an unparseable 503.

Layer an L2 (cross-isolate, per-colo) Cloudflare Cache API tier UNDER
multistore's in-isolate cache, keyed by RoleArn, wrapping the only seam this
repo controls in the mint path: FetchHttpExchange::post_form. On a hit the
proxy skips the slow STS round-trip entirely.

Pure helpers (role/key/ttl) live in sts_cache.rs (host-tested); the Cache API
I/O lives in lib.rs. TTL derives from the response <Expiration> minus a 300s
lead (>= multistore's 60s refresh lead, so L2 never outlives L1's freshness).
STS error documents are never cached.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

claude Bot commented Jun 30, 2026

Contributor

Claude finished @alukach's task in 5m 2s


Review complete

  • Gather context (read diff, changed files, CLAUDE.md)
  • Review src/sts_cache.rs
  • Review src/lib.rs integration
  • Review tests
  • Post final review

✅ No blocking issues — safe to merge.

Reviewed the diff against origin/fix/sts-request-timeout (5 files, +218/-2). The L2 cache wraps post_form cleanly, only caches a successful AssumeRoleWithWebIdentity response with a parseable <Expiration> (error documents and near-expiry responses are excluded — covered by tests/sts_cache.rs), and the 300s lead vs. multistore's 60s L1 lead leaves a safe margin so L2 always expires before L1 would consider the credential stale.

A couple of non-blocking observations for awareness, not asking for changes:

  • src/sts_cache.rs keys the cache solely on RoleArn (mirroring multistore's own L1 key). This is safe today because apply_backend_auth (src/backend_auth.rs:131-138) never passes a per-connection session Policy/tags into the STS form — the role's own IAM policy is the only thing that determines the resulting credential's permissions, and oidc_subject (scv1:conn:{id}) only gates the trust-policy check, not the grant. If a per-connection session-scoping policy is ever added to the form, this key (at both L1 and L2) would need to widen accordingly, since two connections sharing one RoleArn would then incorrectly share a colo-wide cached credential instead of just an isolate-wide one.
  • sts_cache_get (src/lib.rs:134-137) swallows any Cache API error silently via .ok()??, while sts_cache_put and the analogous cached_fetch in src/source_api/cache.rs log a tracing::warn on failure. Worth a similar tracing::warn on the read path so a systemic Cache API problem doesn't go unnoticed (purely an observability nit — behavior is correctly best-effort either way).
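
The observability suggestion in the second bullet could look roughly like this. It is a sketch under assumptions: `read_or_warn` is a hypothetical helper, the `Result<Option<T>, E>` shape is a guess at what the Cache API read returns before `.ok()??` flattens it, and `eprintln!` stands in for `tracing::warn!` so the example needs no external crates.

```rust
use std::fmt::Display;

/// Flatten a best-effort cache read. A hit or a clean miss passes through
/// unchanged; an error is logged and then treated as a miss, so behaviour
/// stays best-effort while failures become visible.
pub fn read_or_warn<T, E: Display>(result: Result<Option<T>, E>, key: &str) -> Option<T> {
    match result {
        Ok(hit) => hit,
        Err(err) => {
            // Stand-in for tracing::warn!(...) in the real worker.
            eprintln!("warn: sts cache read failed for {key}: {err}");
            None
        }
    }
}
```

The caller still falls through to the STS request on any failure; the only difference from `.ok()??` is the log line.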

Tests (tests/sts_cache.rs) look thorough for the pure helpers (role/key/ttl, including the error-doc and near-expiry-not-cached cases). I could not run cargo test/cargo clippy myself in this sandbox (no tool access to cargo), so I relied on static review — CI hasn't reported results yet at the time of this review.

@github-actions


🚀 Latest commit deployed to https://source-data-proxy-pr-175.source-coop.workers.dev

  • Date: 2026-06-30T23:08:10Z
  • Commit: d33036f
