Perf: page cache hits on S3 still cost a full bundle GET because metadata is loaded outside the cache #408

@yumike

Summary

The page renderer's local cache (crates/rw-site/src/page.rs:285-310) is designed to skip markdown re-rendering when the source mtime is unchanged, but it loads page metadata before the cache check, and metadata is not stored in the cache. With the S3 storage backend, meta(path) does a full page-bundle GET via fetch_page_bundle (crates/rw-storage-s3/src/storage.rs:196-198), and that bundle includes the markdown body. So every cache hit on S3 still incurs a full bundle GET: the local cache saves render CPU but provides essentially zero S3 bandwidth/latency benefit.

For Backstage-backed deployments (the main S3 consumer via @rwdocs/core), this means hot traffic on a 50-page entity docs site generates one full S3 GET per page view, regardless of whether anything has changed.

What each call actually costs

| Call | FsStorage | S3Storage |
| --- | --- | --- |
| mtime(path) | git (gix) lookup via vcs.mtime(&paths) | in-memory HashMap lookup (storage.rs:186-194); cheap |
| meta(path) | walks ancestor chain reading each meta.yaml | fetch_page_bundle(path): full S3 GET of the entire page bundle |
| read(path) | local file read | full S3 GET (markdown body) |

The cache key (source_mtime.to_string()) is intentionally cheap on both backends. The expensive call on S3 is meta(), which render_content makes unconditionally before any cache lookup.

Where the cost lives

crates/rw-site/src/page.rs:285-310:

fn render_content(&self, path: &str, page: &Page, breadcrumbs: Vec<BreadcrumbItem>, ctx: &RenderContext)
    -> Result<PageRenderResult, RenderError>
{
    let source_mtime = self.storage.mtime(path).map_err(RenderError::from)?;   // cheap on S3
    let metadata = self.load_metadata(path);                                    // S3 GET of full bundle
    let etag = source_mtime.to_string();

    if let Some(cached) = self.page_bucket.get_json::<CachedPage>(path, &etag) {
        return Ok(PageRenderResult {
            html: cached.html,
            ...
            metadata,                  // metadata from the just-fetched bundle
        });
    }

    let markdown_text = self.storage.read(path)?;                               // would be another GET
    ...
}

And CachedPage/CachedPageRef only carry html / title / toc — no metadata:

#[derive(Deserialize)]
struct CachedPage {
    html: String,
    title: Option<String>,
    toc: Vec<TocEntry>,
}

So on cache hit:

  1. mtime — cheap.
  2. meta — full S3 page-bundle GET (markdown body fetched and discarded).
  3. page_bucket.get_json — local hit, returns html/title/toc.
  4. markdown_text is not fetched a second time (the cache hit returns early before storage.read), but the bundle GET in step 2 already included it.

Net result: the markdown body crosses the network on every render, just for the sake of metadata extraction.
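A minimal sketch of how this could be pinned down in a test, assuming a counting test double in place of the real backend. CountingStorage, render_current_order, and gets_for_hits are hypothetical names for illustration, and the u64 mtime type is an assumption (the real type only needs to support to_string()). The function mirrors the call order quoted above rather than the actual render_content body.

```rust
use std::cell::Cell;
use std::collections::HashMap;

// Hypothetical test double; not the real S3Storage.
struct CountingStorage {
    mtimes: HashMap<String, u64>,
    bundle_gets: Cell<u32>,
}

impl CountingStorage {
    fn new(pages: &[(&str, u64)]) -> Self {
        Self {
            mtimes: pages.iter().map(|(p, m)| (p.to_string(), *m)).collect(),
            bundle_gets: Cell::new(0),
        }
    }

    // Cheap: in-memory lookup, as described for the S3 backend's mtime().
    fn mtime(&self, path: &str) -> Option<u64> {
        self.mtimes.get(path).copied()
    }

    // Expensive: models one full page-bundle GET per call.
    fn meta(&self, _path: &str) -> String {
        self.bundle_gets.set(self.bundle_gets.get() + 1);
        String::from("metadata")
    }
}

// Mirrors the current order: mtime, then meta, then the cache lookup.
fn render_current_order(
    storage: &CountingStorage,
    cache: &HashMap<(String, String), String>,
    path: &str,
) -> Option<(String, String)> {
    let etag = storage.mtime(path)?.to_string();
    let metadata = storage.meta(path); // paid even when the lookup below hits
    let html = cache.get(&(path.to_string(), etag))?.clone();
    Some((html, metadata))
}

// Counts the bundle GETs incurred by `views` consecutive cache hits.
fn gets_for_hits(views: u32) -> u32 {
    let path = "domain/billing/api";
    let storage = CountingStorage::new(&[(path, 42)]);
    let mut cache = HashMap::new();
    cache.insert((path.to_string(), "42".to_string()), "<p>hi</p>".to_string());
    for _ in 0..views {
        render_current_order(&storage, &cache, path).expect("cache hit");
    }
    storage.bundle_gets.get()
}
```

With this model, gets_for_hits(n) grows linearly with n even though every lookup is a hit. A fix should bring that count to zero for an unchanged page.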

Why metadata isn't already cached

FsStorage::meta (rw-storage-fs/src/lib.rs:646-) walks the ancestor chain reading each meta.yaml and merging — metadata for domain/billing/api depends on meta.yaml files at root, domain/, domain/billing/, and domain/billing/api/. The page cache is keyed only by the page's own mtime, which does not capture changes to ancestor meta.yaml. Caching metadata under that key would let ancestor edits go undetected.
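To make the dependency concrete, here is a small sketch of which directories contribute a meta.yaml for a given page. ancestor_chain is a hypothetical helper, not the function FsStorage::meta actually uses, and representing the docs root as an empty string is an assumption.

```rust
// Lists the directories whose meta.yaml would contribute to `page_path`,
// root first. The empty string stands for the docs root (an assumption).
fn ancestor_chain(page_path: &str) -> Vec<String> {
    let mut chain = vec![String::new()];
    let mut current = String::new();
    for segment in page_path.split('/').filter(|s| !s.is_empty()) {
        if !current.is_empty() {
            current.push('/');
        }
        current.push_str(segment);
        chain.push(current.clone());
    }
    chain
}
```

For domain/billing/api this yields the root, domain, domain/billing, and domain/billing/api. An edit to the meta.yaml in any of those directories changes the merged metadata without touching the page's own mtime.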

So the design isn't accidental — there's a reason metadata isn't in CachedPage — but the S3 cost on cache hits feels like an oversight.

Possible directions

Listing for discussion; none of these is a one-liner.

  1. Bundle metadata into CachedPage, with an enriched etag.
    Add a storage.meta_mtime(path) method that returns the max mtime across the ancestor meta.yaml chain (cheap: HashMap lookup on S3 if the manifest exposes it; one-time stat walk on FS). Use format!("{page_mtime}-{meta_mtime}") as the etag. Then CachedPage can include metadata: Option<Metadata> and the cache hit skips load_metadata entirely. Solves the S3 round-trip.

  2. Separate metadata bucket.
    Cache Metadata under a key that reflects the ancestor chain (e.g., the concatenated ancestor mtimes). Lots of pages share the same metadata hierarchy, so a separate bucket would dedupe well. More moving parts; probably overkill unless meta becomes a bigger hot path.

  3. S3-side optimization: HEAD before GET.
    Cheaper but doesn't fully solve it — meta() returns parsed JSON, so a HEAD-then-conditional-GET would still need to fetch the body unless the bundle JSON were split. Probably not worth the complexity vs. option 1.

  4. Restructure the bundle to separate metadata from content on S3.
    Publisher writes a small <path>.meta.json alongside the full bundle; S3Storage::meta fetches only the small object. Means more S3 keys but lets metadata fetches be cheap. Bigger change.

My preference would be option 1: pass a composite etag derived from (page_mtime, ancestor_meta_mtime) and bundle metadata into the cache entry. The meta_mtime API is small and useful in its own right, and it preserves cache correctness across ancestor meta edits.
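A rough sketch of what the composite etag in option 1 could look like, assuming the backend can expose a map from directory path to that directory's meta.yaml mtime. MetaMtimes, the free-function form of meta_mtime, and the u64 mtime type are all assumptions for illustration; the real API would live on the storage trait and might look quite different.

```rust
use std::collections::HashMap;

// Hypothetical manifest view: directory path ("" = root) -> meta.yaml mtime.
type MetaMtimes = HashMap<String, u64>;

// Max mtime across the ancestor meta.yaml chain of `page_path`.
// Returns 0 when no ancestor has a meta.yaml (an assumption of this sketch).
fn meta_mtime(manifest: &MetaMtimes, page_path: &str) -> u64 {
    let mut newest = manifest.get("").copied().unwrap_or(0);
    let mut dir = String::new();
    for segment in page_path.split('/').filter(|s| !s.is_empty()) {
        if !dir.is_empty() {
            dir.push('/');
        }
        dir.push_str(segment);
        if let Some(&mtime) = manifest.get(&dir) {
            newest = newest.max(mtime);
        }
    }
    newest
}

// Etag that changes when either the page or any ancestor meta.yaml changes.
fn composite_etag(page_mtime: u64, meta_mtime: u64) -> String {
    format!("{page_mtime}-{meta_mtime}")
}
```

With an etag like this, editing domain/billing/meta.yaml invalidates the cached entry for domain/billing/api even though the page's own mtime is unchanged. One caveat: a max-based value does not change when an ancestor meta.yaml other than the newest one is deleted, so deletions may need separate handling.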

Why it matters

  • For Backstage deployments, the cache is one of the main selling points of @rwdocs/core — it's supposed to make repeated page views cheap. Currently the rendering CPU is cheap on a hit but the S3 round trip isn't avoided, so the user-perceived latency win is much smaller than the cache hit rate suggests.
  • For local rw serve, the cost is small (filesystem + git), but ancestor-meta walks scale with directory depth and YAML parsing is non-trivial — measurable on large monorepo docs trees.
  • Not a correctness bug, no data loss — purely a perf/scaling design issue that's worth tracking now while the cache architecture is still simple.
