Perf: page cache hits on S3 still cost a full bundle GET because metadata is loaded outside the cache

## Summary

The page renderer's local cache (`crates/rw-site/src/page.rs:285-310`) is designed to skip markdown re-rendering when the source mtime is unchanged, but it loads page metadata **before** the cache check and metadata is **not** stored in the cache. With the S3 storage backend, `meta(path)` does a full page-bundle GET (`crates/rw-storage-s3/src/storage.rs:196-198` → `fetch_page_bundle`), which includes the markdown body. So every cache hit on S3 still incurs a full bundle GET — the local cache saves render CPU but provides essentially zero S3 bandwidth/latency benefit.

For Backstage-backed deployments (the main S3 consumer via `@rwdocs/core`), this means hot traffic on a 50-page entity docs site generates one full S3 GET per page view, regardless of whether anything has changed.

## What each call actually costs

| Call | FsStorage | S3Storage |
|------|-----------|-----------|
| `mtime(path)` | git (`gix`) lookup via `vcs.mtime(&paths)` | in-memory `HashMap` lookup (`storage.rs:186-194`) — **cheap** |
| `meta(path)` | walks ancestor chain reading each `meta.yaml` | `fetch_page_bundle(path)` — **full S3 GET of the entire page bundle** |
| `read(path)` | local file read | full S3 GET (markdown body) |

The cache key (`source_mtime.to_string()`) is intentionally cheap on both backends. The expensive call on S3 is `meta()`, which `render_content` makes unconditionally before any cache lookup.

## Where the cost lives

`crates/rw-site/src/page.rs:285-310`:

```rust
fn render_content(&self, path: &str, page: &Page, breadcrumbs: Vec<BreadcrumbItem>, ctx: &RenderContext)
    -> Result<PageRenderResult, RenderError>
{
    let source_mtime = self.storage.mtime(path).map_err(RenderError::from)?;   // cheap on S3
    let metadata = self.load_metadata(path);                                    // S3 GET of full bundle
    let etag = source_mtime.to_string();

    if let Some(cached) = self.page_bucket.get_json::<CachedPage>(path, &etag) {
        return Ok(PageRenderResult {
            html: cached.html,
            ...
            metadata,                  // metadata from the just-fetched bundle
        });
    }

    let markdown_text = self.storage.read(path)?;                               // would be another GET
    ...
}
```

And `CachedPage`/`CachedPageRef` only carry html / title / toc — no metadata:

```rust
#[derive(Deserialize)]
struct CachedPage {
    html: String,
    title: Option<String>,
    toc: Vec<TocEntry>,
}
```

So on cache hit:
1. `mtime` — cheap.
2. `meta` — full S3 page-bundle GET (markdown body fetched and discarded).
3. `page_bucket.get_json` — local hit, returns html/title/toc.
4. `markdown_text` is **not** fetched a second time (the cache hit returns early before `storage.read`), but the bundle GET in step 2 already included it.

Net result: the markdown body crosses the network on every render, just for the sake of metadata extraction.
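
The sequence above could be pinned down with a small counting fake, which might also serve as the seed of a regression test. This is a minimal sketch, not code from the repo: `CountingStorage`, `render`, and `bundle_gets_for` are hypothetical names, the mtime is assumed to be a `u64`, and the miss path is simplified to a single render with no `storage.read` call.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical stand-in for the S3 backend that counts bundle GETs.
struct CountingStorage {
    mtimes: HashMap<String, u64>,
    bundle_gets: Cell<u32>,
}

impl CountingStorage {
    fn new(path: &str, mtime: u64) -> Self {
        Self {
            mtimes: HashMap::from([(path.to_string(), mtime)]),
            bundle_gets: Cell::new(0),
        }
    }

    /// Cheap: in-memory lookup, no network.
    fn mtime(&self, path: &str) -> Option<u64> {
        self.mtimes.get(path).copied()
    }

    /// Expensive: stands in for `fetch_page_bundle`, one GET per call.
    fn meta(&self, path: &str) -> String {
        self.bundle_gets.set(self.bundle_gets.get() + 1);
        format!("metadata for {path}")
    }
}

/// Same order as `render_content` today: mtime, then meta, then cache lookup.
fn render(
    storage: &CountingStorage,
    cache: &mut HashMap<(String, String), String>,
    path: &str,
) -> Option<(String, String)> {
    let etag = storage.mtime(path)?.to_string();
    let metadata = storage.meta(path); // GET issued before the cache is consulted
    let html = cache
        .entry((path.to_string(), etag))
        .or_insert_with(|| format!("<p>rendered {path}</p>")) // miss: render once
        .clone();
    Some((html, metadata))
}

/// Renders the same unchanged page `views` times and reports the GET count.
fn bundle_gets_for(views: u32) -> u32 {
    let storage = CountingStorage::new("domain/billing/api", 100);
    let mut cache = HashMap::new();
    for _ in 0..views {
        render(&storage, &mut cache, "domain/billing/api");
    }
    storage.bundle_gets.get()
}
```

Under these assumptions `bundle_gets_for(n)` grows linearly with `n`, even though every view after the first is a cache hit. A fix along any of the directions below would be expected to hold that count at one.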

## Why metadata isn't already cached

`FsStorage::meta` (`rw-storage-fs/src/lib.rs:646-`) walks the ancestor chain reading each `meta.yaml` and merging — metadata for `domain/billing/api` depends on `meta.yaml` files at root, `domain/`, `domain/billing/`, and `domain/billing/api/`. The page cache is keyed only by the page's own mtime, which does not capture changes to ancestor `meta.yaml`. Caching metadata under that key would let ancestor edits go undetected.

So leaving metadata out of `CachedPage` is deliberate: it keeps ancestor edits visible. The side effect, a full S3 bundle GET on every cache hit, looks like an oversight.
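
To make the dependency concrete, here is a rough sketch of the ancestor walk and merge described above. It is not the `FsStorage::meta` implementation: `ancestor_chain`, `merged_meta`, and the `Meta` alias are hypothetical, metadata is modelled as flat string pairs, and "deeper directories override shallower ones" is an assumed merge rule.

```rust
use std::collections::BTreeMap;

/// Simplified metadata: flat key/value pairs (an assumption for this sketch).
type Meta = BTreeMap<String, String>;

/// Directories whose `meta.yaml` contributes to `path`, root first.
/// The empty string stands for the docs root.
fn ancestor_chain(path: &str) -> Vec<String> {
    let mut chain = vec![String::new()];
    let mut current = String::new();
    for segment in path.split('/').filter(|s| !s.is_empty()) {
        if !current.is_empty() {
            current.push('/');
        }
        current.push_str(segment);
        chain.push(current.clone());
    }
    chain
}

/// Merges every `meta.yaml` on the chain; later (deeper) entries win.
/// `meta_files` maps a directory to its already-parsed `meta.yaml`.
fn merged_meta(path: &str, meta_files: &BTreeMap<String, Meta>) -> Meta {
    let mut merged = Meta::new();
    for dir in ancestor_chain(path) {
        if let Some(meta) = meta_files.get(&dir) {
            merged.extend(meta.clone());
        }
    }
    merged
}
```

Because `merged_meta` reads every directory on the chain, an edit to the root `meta.yaml` changes the result for `domain/billing/api` without touching that page's own mtime, which is exactly the change the current cache key cannot see.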

## Possible directions

Listing for discussion; none of these is a one-liner.

1. **Bundle metadata into `CachedPage`, with an enriched etag.**
   Add a `storage.meta_mtime(path)` method that returns the max mtime across the ancestor `meta.yaml` chain (cheap: HashMap lookup on S3 if the manifest exposes it; one-time stat walk on FS). Use `format!("{page_mtime}-{meta_mtime}")` as the etag. Then `CachedPage` can include `metadata: Option<Metadata>` and the cache hit skips `load_metadata` entirely. Solves the S3 round-trip.

2. **Separate metadata bucket.**
   Cache `Metadata` under a key that reflects the ancestor chain (e.g., the concatenated ancestor mtimes). Lots of pages share the same metadata hierarchy, so a separate bucket would dedupe well. More moving parts; probably overkill unless meta becomes a bigger hot path.

3. **S3-side optimization: HEAD before GET.**
   Cheaper but doesn't fully solve it — `meta()` returns parsed JSON, so a HEAD-then-conditional-GET would still need to fetch the body unless the bundle JSON were split. Probably not worth the complexity vs. option 1.

4. **Restructure the bundle to separate metadata from content on S3.**
   Publisher writes a small `<path>.meta.json` alongside the full bundle; `S3Storage::meta` fetches only the small object. Means more S3 keys but lets metadata fetches be cheap. Bigger change.

My preference would be option 1: pass a composite etag derived from `(page_mtime, ancestor_meta_mtime)` and bundle metadata into the cache entry. The `meta_mtime` API is small and useful in its own right, and it preserves cache correctness across ancestor meta edits.
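
A rough sketch of what the composite etag in option 1 might look like. Everything here is an assumption for illustration: the `Manifest` struct, its two maps, the `u64` mtime type, and the convention that a missing `meta.yaml` counts as mtime 0. The real manifest may not expose per-directory `meta.yaml` mtimes at all.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory manifest, as the S3 backend might hold it.
struct Manifest {
    /// Page path -> page mtime.
    page_mtimes: HashMap<String, u64>,
    /// Directory -> mtime of its `meta.yaml`; "" is the docs root.
    meta_mtimes: HashMap<String, u64>,
}

impl Manifest {
    /// Proposed `meta_mtime`: newest `meta.yaml` mtime on the ancestor chain.
    /// Directories without a `meta.yaml` contribute 0 (an assumption).
    fn meta_mtime(&self, path: &str) -> u64 {
        let mut newest = self.meta_mtimes.get("").copied().unwrap_or(0);
        let mut dir = String::new();
        for segment in path.split('/').filter(|s| !s.is_empty()) {
            if !dir.is_empty() {
                dir.push('/');
            }
            dir.push_str(segment);
            let mtime = self.meta_mtimes.get(dir.as_str()).copied().unwrap_or(0);
            newest = newest.max(mtime);
        }
        newest
    }

    /// Composite etag: changes when the page or any ancestor `meta.yaml` changes.
    fn etag(&self, path: &str) -> Option<String> {
        let page_mtime = self.page_mtimes.get(path)?;
        let meta_mtime = self.meta_mtime(path);
        Some(format!("{page_mtime}-{meta_mtime}"))
    }
}
```

With an etag of this shape, a cached entry that also stores the metadata stays valid only while both the page and every ancestor `meta.yaml` are unchanged, so the hit path would need no bundle GET. One caveat of taking the max: deleting the newest `meta.yaml` or restoring an older copy could leave the etag unchanged, so a hash over the chain might be safer if that case matters.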

## Why it matters

- For Backstage deployments, the cache is one of the main selling points of `@rwdocs/core`: it is supposed to make repeated page views cheap. Today a hit saves the render CPU but not the S3 round trip, so the user-perceived latency win is much smaller than the cache hit rate suggests.
- For local `rw serve`, the cost is small (filesystem + git), but ancestor-meta walks scale with directory depth and YAML parsing is non-trivial — measurable on large monorepo docs trees.
- Not a correctness bug, no data loss — purely a perf/scaling design issue that's worth tracking now while the cache architecture is still simple.
