feat(sts): cache AssumeRoleWithWebIdentity responses across isolates #175
alukach wants to merge 1 commit into
multistore's credential cache is per-isolate (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates, so a large fraction of requests re-run the STS exchange on the request hot path. When that stalls, the worker hangs until the edge kills it, surfacing as an unparseable 503.

Layer an L2 (cross-isolate, per-colo) Cloudflare Cache API tier UNDER multistore's in-isolate cache, keyed by RoleArn, wrapping the only seam this repo controls in the mint path: FetchHttpExchange::post_form. On a hit the proxy skips the slow STS round-trip entirely.

Pure helpers (role/key/ttl) live in sts_cache.rs (host-tested); the Cache API I/O lives in lib.rs. TTL derives from the response <Expiration> minus a 300s lead (>= multistore's 60s refresh lead, so L2 never outlives L1's freshness). STS error documents are never cached.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude finished @alukach's task in 5m 2s. Review complete.
✅ No blocking issues — safe to merge. Reviewed the diff against A couple of non-blocking observations for awareness, not asking for changes:
Tests (
🚀 Latest commit deployed to https://source-data-proxy-pr-175.source-coop.workers.dev
Problem
Private products federate to AWS STS (AssumeRoleWithWebIdentity) on the cold path. multistore's credential cache lives in per-isolate memory (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates, so a large fraction of requests re-run the STS exchange on the request hot path. When that exchange stalls, the worker hangs until the edge kills it, surfacing to the app as an unparseable 503. This is the root cause behind the intermittent "product won't load, self-heals on reload" reports.
Approach
Add an L2 cache for the STS response, shared across isolates within a colo via the Cloudflare Cache API, the same pattern already used for Source API responses in source_api/cache.rs. It sits under multistore's in-isolate L1 cache:
L1 (multistore, in-isolate): caches BackendCredentials, single-flights within an isolate.
L2 (Cloudflare Cache API, per-colo): caches the STS response, keyed by RoleArn (L1's own cache key). On a hit, the proxy skips the STS round-trip entirely.
The only seam data.source.coop controls in the mint path is FetchHttpExchange::post_form (the outbound STS call); get_credentials and the L1 cache live inside multistore. So the L2 cache wraps post_form.
Effect: STS goes from ~once per isolate per credential lifetime → ~once per colo. The slow exchange leaves the user hot path almost entirely.
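The read-through shape described above can be sketched on the host as follows. This is a minimal illustration, not the PR's code: the L2Store trait, MemoryStore, and post_form_cached are hypothetical names, the store is an in-memory map standing in for the Cache API, and the STS call is passed in as a closure rather than performed over HTTP.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cross-isolate store such as the Cache API.
trait L2Store {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&mut self, key: &str, body: &str, ttl_secs: u64);
}

/// In-memory store, used only so the sketch runs on the host.
#[derive(Default)]
struct MemoryStore {
    entries: HashMap<String, String>,
}

impl L2Store for MemoryStore {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }

    fn put(&mut self, key: &str, body: &str, _ttl_secs: u64) {
        // A real store would expire the entry after the TTL; this one does not.
        self.entries.insert(key.to_string(), body.to_string());
    }
}

/// Read-through wrapper around the outbound STS call (assumed shape).
/// `role_arn` is None for forms that must bypass the cache.
/// `ttl_for` returns None for responses that must not be stored.
fn post_form_cached<S, T, F>(
    store: &mut S,
    role_arn: Option<&str>,
    ttl_for: T,
    call_sts: F,
) -> Result<String, String>
where
    S: L2Store,
    T: Fn(&str) -> Option<u64>,
    F: FnOnce() -> Result<String, String>,
{
    // Bypass: no role ARN means this is not a cacheable form.
    let Some(key) = role_arn else {
        return call_sts();
    };
    // Hit: return the stored response and skip the STS round-trip.
    if let Some(body) = store.get(key) {
        return Ok(body);
    }
    // Miss: call STS, then store only responses that yield a TTL.
    let body = call_sts()?;
    if let Some(ttl) = ttl_for(&body) {
        store.put(key, &body, ttl);
    }
    Ok(body)
}
```

Note that this sketch has no single-flight: two isolates missing at the same moment would both call STS, which matches the "small STS herd" caveat for a cold colo.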
What's cached (and not)
Cached: only AssumeRoleWithWebIdentity forms (role_arn_from_form returns None for other actions / Azure-GCP flows → bypass).
TTL: the response <Expiration> minus a 300s lead (≥ multistore's 60s refresh lead, so an L2 entry always expires before L1 would call the derived credential stale).
Not cached: STS error documents. ttl_secs returns None when there's no parseable <Expiration>.
… multistore (noted below).
Security
The cached values are short-lived, role-scoped temporary credentials, stored under a synthetic non-routable cache key (https://sts-creds.cache.internal/…, never a real edge request URL, so not externally addressable), with TTL ≤ credential lifetime, per-colo. If a deployment needs global reach, encryption-at-rest, or true cross-isolate single-flight (a cold colo can still see a small STS herd), the same cache_key/ttl_secs helpers drop into KV (global, encrypted) or a Durable Object (global, single-flight).
Follow-up (not here)
Cleaner long-term: give multistore's CredentialCache::get_or_fetch an optional runtime L2 hook (the crate doc already anticipates "a runtime can layer an additional cache tier inside the closure"). That caches typed creds at L1 and skips the JWT mint on hits too, but it's a cross-repo API change + release, whereas this PR ships from data.source.coop today.
Verification
cargo test --test sts_cache: 8/8 (role/key/ttl helpers, incl. error-doc and near-expiry → not cached).
cargo check --target wasm32-unknown-unknown: clean.
cargo clippy --target wasm32-unknown-unknown -- -D warnings: clean.
🤖 Generated with Claude Code
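For readers without the diff at hand, the role and TTL rules that the sts_cache tests exercise might look roughly like the sketch below. It is an assumption-laden illustration, not the code in sts_cache.rs: the function names carry a _sketch suffix to mark them as hypothetical, the form is modeled as a slice of key/value pairs, the current time is passed in as Unix seconds, and <Expiration> is assumed to use the YYYY-MM-DDTHH:MM:SSZ form.

```rust
/// Lead subtracted from <Expiration>, per the PR description.
const LEAD_SECS: i64 = 300;

/// Returns the trimmed text between <tag> and </tag>, if both are present.
fn tag_text<'a>(xml: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = xml.find(&open)? + open.len();
    let end = xml[start..].find(&close)? + start;
    Some(xml[start..end].trim())
}

/// Days since 1970-01-01 for a Gregorian date (days-from-civil algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0, so leap days fall at year end
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parses `YYYY-MM-DDTHH:MM:SSZ` into Unix seconds; other formats give None.
fn parse_utc(ts: &str) -> Option<i64> {
    let b = ts.as_bytes();
    let shape_ok = b.len() == 20
        && b.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            10 => *c == b'T',
            13 | 16 => *c == b':',
            19 => *c == b'Z',
            _ => c.is_ascii_digit(),
        });
    if !shape_ok {
        return None;
    }
    let num = |r: std::ops::Range<usize>| ts[r].parse::<i64>().ok();
    let (y, mo, d) = (num(0..4)?, num(5..7)?, num(8..10)?);
    let (h, mi, s) = (num(11..13)?, num(14..16)?, num(17..19)?);
    if !(1..=12).contains(&mo) || !(1..=31).contains(&d) || h > 23 || mi > 59 || s > 60 {
        return None;
    }
    Some(days_from_civil(y, mo, d) * 86_400 + h * 3_600 + mi * 60 + s)
}

/// Seconds a response may stay in L2, or None if it must not be cached
/// (no parseable <Expiration>, or already inside the lead window).
fn ttl_secs_sketch(xml: &str, now_unix: i64) -> Option<u64> {
    let expires = parse_utc(tag_text(xml, "Expiration")?)?;
    let ttl = expires - now_unix - LEAD_SECS;
    if ttl > 0 { Some(ttl as u64) } else { None }
}

/// Looks up a form field by name.
fn field<'a>(fields: &[(&str, &'a str)], name: &str) -> Option<&'a str> {
    fields.iter().find(|(k, _)| *k == name).map(|(_, v)| *v)
}

/// Returns the RoleArn only for AssumeRoleWithWebIdentity forms.
fn role_arn_sketch(fields: &[(&str, &str)]) -> Option<String> {
    if field(fields, "Action")? != "AssumeRoleWithWebIdentity" {
        return None;
    }
    field(fields, "RoleArn").filter(|arn| !arn.is_empty()).map(str::to_string)
}
```

With an expiration one hour out, ttl_secs_sketch yields 3300 seconds (3600 minus the 300s lead); an error document has no <Expiration> and so yields None.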