feat(sts): cache AssumeRoleWithWebIdentity responses across isolates by alukach · Pull Request #175 · source-cooperative/data.source.coop

alukach · 2026-06-30T23:08:10Z

Draft / stacked on fix/sts-request-timeout (#172). The post_form change
builds on that PR's timeout, and the two are complementary (cache removes STS
from the hot path; timeout bounds the rare cold miss). Retarget to main once
#172 merges.

Problem

Private products federate to AWS STS (AssumeRoleWithWebIdentity) on the cold path. multistore's credential cache lives in per-isolate memory (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates — so a large fraction of requests re-run the STS exchange on the request hot path. When that exchange stalls, the worker hangs until the edge kills it, surfacing to the app as an unparseable 503. This is the root cause behind the intermittent "product won't load, self-heals on reload" reports.

Approach

Add an L2 cache for the STS response, shared across isolates within a colo via the Cloudflare Cache API — the same pattern already used for Source API responses in source_api/cache.rs. It sits under multistore's in-isolate L1 cache:

L1 (multistore, per-isolate): caches typed BackendCredentials, single-flights within an isolate.
L2 (this PR, per-colo): caches the raw STS response body, keyed by RoleArn (L1's own cache key). On a hit, the proxy skips the STS round-trip entirely.

The only seam data.source.coop controls in the mint path is FetchHttpExchange::post_form (the outbound STS call) — get_credentials and the L1 cache live inside multistore. So the L2 cache wraps post_form.

Effect: STS goes from ~once per isolate per credential lifetime → ~once per colo per credential lifetime. The slow exchange leaves the user hot path almost entirely.
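The layering above can be sketched as a cache-aside wrapper around the outbound call. This is a host-runnable illustration under stated assumptions, not the PR's code: `L2Cache`, `post_form_cached`, and the `cacheable` predicate are hypothetical names, a `HashMap` stands in for the Cloudflare Cache API, and TTL handling is left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-colo L2 store. The PR uses the
/// Cloudflare Cache API; a HashMap keeps this sketch runnable on the host.
#[derive(Default)]
pub struct L2Cache {
    entries: HashMap<String, String>,
}

/// Cache-aside wrapper around the outbound STS call (assumed shape).
/// - `role_arn`: cache key, or `None` when the request must bypass the cache.
/// - `fetch`: performs the real network call and returns the raw response body.
/// - `cacheable`: decides whether a response body may be stored.
pub fn post_form_cached<F, C>(
    cache: &mut L2Cache,
    role_arn: Option<&str>,
    fetch: F,
    cacheable: C,
) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
    C: Fn(&str) -> bool,
{
    // No key means this is not a cacheable flow: go straight to the network.
    let Some(key) = role_arn else {
        return fetch();
    };

    // L2 hit: skip the STS round-trip entirely.
    if let Some(hit) = cache.entries.get(key) {
        return Ok(hit.clone());
    }

    // L2 miss: call STS, then store the body only if it qualifies.
    let body = fetch()?;
    if cacheable(&body) {
        cache.entries.insert(key.to_string(), body.clone());
    }
    Ok(body)
}
```

Because the wrapper only sees a key and a raw body, it can sit at the `post_form` seam without knowing anything about typed credentials, which stay in L1.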

What's cached (and not)

Only AssumeRoleWithWebIdentity forms are cached: role_arn_from_form returns None for other actions and for Azure/GCP flows, which bypass the cache.
TTL = time to the response's <Expiration> minus a 300s lead (≥ multistore's 60s refresh lead, so an L2 entry always expires before L1 would call the derived credential stale).
STS error documents are never cached — ttl_secs returns None when there's no parseable <Expiration>.
On an L2 hit we still mint the (cheap, local) JWT; only the slow STS network call is skipped. Skipping the mint too would need an L1-level hook in multistore (noted below).
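The bypass rule in the first item can be illustrated with a small form parser. This is a sketch under assumptions, not the code in sts_cache.rs: the signature of `role_arn_from_form` is guessed, `percent_decode` is a hypothetical helper, and the form is assumed to be an `application/x-www-form-urlencoded` body carrying the standard `Action` and `RoleArn` STS parameters.

```rust
/// Hypothetical helper: decode one form-encoded value (`+` and `%XX`).
/// Returns `None` on a malformed escape or invalid UTF-8.
fn percent_decode(s: &str) -> Option<String> {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                out.push(b' ');
                i += 1;
            }
            b'%' => {
                // Two hex digits must follow the percent sign.
                let hex = s.get(i + 1..i + 3)?;
                if !hex.bytes().all(|c| c.is_ascii_hexdigit()) {
                    return None;
                }
                out.push(u8::from_str_radix(hex, 16).ok()?);
                i += 3;
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    String::from_utf8(out).ok()
}

/// Assumed shape of the role helper: return the RoleArn only when the form
/// is an AssumeRoleWithWebIdentity request; anything else yields `None`,
/// which the caller treats as "bypass the cache".
pub fn role_arn_from_form(body: &str) -> Option<String> {
    let mut action = None;
    let mut role_arn = None;
    for pair in body.split('&') {
        let Some((key, value)) = pair.split_once('=') else {
            continue; // ignore fragments without a value
        };
        match key {
            "Action" => action = percent_decode(value),
            "RoleArn" => role_arn = percent_decode(value),
            _ => {}
        }
    }
    if action.as_deref() == Some("AssumeRoleWithWebIdentity") {
        role_arn
    } else {
        None
    }
}
```

A form for a different action, or a token-exchange form with no `Action` field at all, produces `None` here, so it never reaches the cache lookup.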

Security

The cached values are short-lived, role-scoped temporary credentials, stored under a synthetic non-routable cache key (https://sts-creds.cache.internal/…, never a real edge request URL, so not externally addressable), with TTL ≤ credential lifetime, per-colo. If a deployment needs global reach, encryption-at-rest, or true cross-isolate single-flight (a cold colo can still see a small STS herd), the same cache_key/ttl_secs helpers drop into KV (global, encrypted) or a Durable Object (global, single-flight).

Follow-up (not here)

Cleaner long-term: give multistore's CredentialCache::get_or_fetch an optional runtime L2 hook (the crate doc already anticipates "a runtime can layer an additional cache tier inside the closure"). That caches typed creds at L1 and skips the JWT mint on hits too — but it's a cross-repo API change + release, vs. this which ships from data.source.coop today.

Verification

cargo test --test sts_cache — 8/8 (role/key/ttl helpers, incl. error-doc and near-expiry → not cached).
cargo check --target wasm32-unknown-unknown — clean.
cargo clippy --target wasm32-unknown-unknown -- -D warnings — clean.

🤖 Generated with Claude Code

multistore's credential cache is per-isolate (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates, so a large fraction of requests re-run the STS exchange on the request hot path. When that stalls, the worker hangs until the edge kills it, surfacing as an unparseable 503.

Layer an L2 (cross-isolate, per-colo) Cloudflare Cache API tier UNDER multistore's in-isolate cache, keyed by RoleArn, wrapping the only seam this repo controls in the mint path: FetchHttpExchange::post_form. On a hit the proxy skips the slow STS round-trip entirely.

Pure helpers (role/key/ttl) live in sts_cache.rs (host-tested); the Cache API I/O lives in lib.rs. TTL derives from the response <Expiration> minus a 300s lead (>= multistore's 60s refresh lead, so L2 never outlives L1's freshness). STS error documents are never cached.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

claude · 2026-06-30T23:08:24Z

Claude finished @alukach's task in 5m 2s

Review complete

✅ No blocking issues — safe to merge.

Reviewed the diff against origin/fix/sts-request-timeout (5 files, +218/-2). The L2 cache wraps post_form cleanly, only caches a successful AssumeRoleWithWebIdentity response with a parseable <Expiration> (error documents and near-expiry responses are excluded — covered by tests/sts_cache.rs), and the 300s lead vs. multistore's 60s L1 lead leaves a safe margin so L2 always expires before L1 would consider the credential stale.
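The TTL rule and its margin can be worked through in a short sketch. This is an illustration under assumptions rather than the reviewed code: the `ttl_secs` signature (body plus current epoch seconds) is guessed, `parse_utc_epoch` and `days_from_civil` are hypothetical helpers, and `<Expiration>` is assumed to use the fixed `YYYY-MM-DDTHH:MM:SSZ` shape with only loose range validation.

```rust
/// Lead subtracted from the credential expiry (300s, per the PR description).
const L2_EXPIRY_LEAD_SECS: i64 = 300;

/// Days since 1970-01-01 for a Gregorian date (standard days-from-civil arithmetic).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year }; // Jan/Feb count toward the prior year
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let shifted_month = (month + 9) % 12; // March = 0
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Parse `YYYY-MM-DDTHH:MM:SSZ` into Unix epoch seconds.
fn parse_utc_epoch(ts: &str) -> Option<i64> {
    let b = ts.as_bytes();
    if b.len() != 20
        || b[4] != b'-'
        || b[7] != b'-'
        || b[10] != b'T'
        || b[13] != b':'
        || b[16] != b':'
        || b[19] != b'Z'
    {
        return None;
    }
    let num = |range: std::ops::Range<usize>| -> Option<i64> {
        ts.get(range)?.parse::<i64>().ok()
    };
    let (year, month, day) = (num(0..4)?, num(5..7)?, num(8..10)?);
    let (hour, minute, second) = (num(11..13)?, num(14..16)?, num(17..19)?);
    let in_range = (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && (0..=23).contains(&hour)
        && (0..=59).contains(&minute)
        && (0..=59).contains(&second);
    if !in_range {
        return None;
    }
    Some(days_from_civil(year, month, day) * 86_400 + hour * 3_600 + minute * 60 + second)
}

/// Seconds an STS response may stay in L2, or `None` if it must not be cached.
/// Error documents have no `<Expiration>` and fall out at the first `?`.
pub fn ttl_secs(body: &str, now_epoch: i64) -> Option<u64> {
    let start = body.find("<Expiration>")? + "<Expiration>".len();
    let end = start + body[start..].find("</Expiration>")?;
    let expires = parse_utc_epoch(body[start..end].trim())?;
    let ttl = expires - now_epoch - L2_EXPIRY_LEAD_SECS;
    // Responses inside the lead window are not worth caching.
    (ttl > 0).then_some(ttl as u64)
}
```

For a credential that expires 3600s from now, the L2 entry lives 3300s, while L1 with its 60s lead would treat the credential as stale at 3540s. The L2 entry is therefore gone 240s before L1 would refresh from it.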

A couple of non-blocking observations for awareness, not asking for changes:

src/sts_cache.rs keys the cache solely on RoleArn (mirroring multistore's own L1 key). This is safe today because apply_backend_auth (src/backend_auth.rs:131-138) never passes a per-connection session Policy/tags into the STS form — the role's own IAM policy is the only thing that determines the resulting credential's permissions, and oidc_subject (scv1:conn:{id}) only gates the trust-policy check, not the grant. If a per-connection session-scoping policy is ever added to the form, this key (at both L1 and L2) would need to widen accordingly, since two connections sharing one RoleArn would then incorrectly share a colo-wide cached credential instead of just an isolate-wide one.
sts_cache_get (src/lib.rs:134-137) swallows any Cache API error silently via .ok()??, while sts_cache_put and the analogous cached_fetch in src/source_api/cache.rs log a tracing::warn on failure. Worth a similar tracing::warn on the read path so a systemic Cache API problem doesn't go unnoticed (purely an observability nit — behavior is correctly best-effort either way).
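The second observation could be addressed along these lines. This is only a sketch: `best_effort_get` is a hypothetical name, the `lookup` value stands in for whatever the Cache API read returns, and the `warn` closure stands in for a `tracing::warn!` call so the example runs on the host without extra crates.

```rust
/// Best-effort cache read that reports failures instead of discarding them.
/// A failed lookup is still treated as a miss, so behavior stays unchanged;
/// only the silent `.ok()` is replaced with a logged branch.
pub fn best_effort_get<T, E: std::fmt::Display>(
    lookup: Result<Option<T>, E>,
    mut warn: impl FnMut(String),
) -> Option<T> {
    match lookup {
        // Hit or clean miss: pass through untouched.
        Ok(found) => found,
        // Cache API failure: log it, then fall back to a miss.
        Err(err) => {
            warn(format!("sts cache read failed: {err}"));
            None
        }
    }
}
```

In the worker, the closure body would presumably be a single `tracing::warn!` invocation, matching what the write path already does.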

Tests (tests/sts_cache.rs) look thorough for the pure helpers (role/key/ttl, including the error-doc and near-expiry-not-cached cases). I could not run cargo test/cargo clippy myself in this sandbox (no tool access to cargo), so I relied on static review — CI hasn't reported results yet at the time of this review.

github-actions · 2026-06-30T23:12:24Z

🚀 Latest commit deployed to https://source-data-proxy-pr-175.source-coop.workers.dev

Date: 2026-06-30T23:08:10Z
Commit: d33036f

alukach temporarily deployed to preview June 30, 2026 23:08 with GitHub Actions (inactive)

source-release-bot Bot added the feat label Jun 30, 2026

alukach wants to merge 1 commit into fix/sts-request-timeout from fix/sts-credential-cache
