Summary
The page renderer's local cache (crates/rw-site/src/page.rs:285-310) is designed to skip markdown re-rendering when the source mtime is unchanged, but it loads page metadata before the cache check and metadata is not stored in the cache. With the S3 storage backend, meta(path) does a full page-bundle GET (crates/rw-storage-s3/src/storage.rs:196-198 → fetch_page_bundle), which includes the markdown body. So every cache hit on S3 still incurs a full bundle GET — the local cache saves render CPU but provides essentially zero S3 bandwidth/latency benefit.
For Backstage-backed deployments (the main S3 consumer via @rwdocs/core), this means hot traffic on a 50-page entity docs site generates one full S3 GET per page view, regardless of whether anything has changed.
What each call actually costs
| Call | FsStorage | S3Storage |
| --- | --- | --- |
| mtime(path) | git (gix) lookup via vcs.mtime(&paths) | in-memory HashMap lookup (storage.rs:186-194), cheap |
| meta(path) | walks ancestor chain reading each meta.yaml | fetch_page_bundle(path): full S3 GET of the entire page bundle |
| read(path) | local file read | full S3 GET (markdown body) |
The cache key (source_mtime.to_string()) is intentionally cheap on both backends. The expensive call on S3 is meta(), which render_content makes unconditionally before any cache lookup.
Where the cost lives
crates/rw-site/src/page.rs:285-310:
    fn render_content(&self, path: &str, page: &Page, breadcrumbs: Vec<BreadcrumbItem>, ctx: &RenderContext)
        -> Result<PageRenderResult, RenderError>
    {
        let source_mtime = self.storage.mtime(path).map_err(RenderError::from)?; // cheap on S3
        let metadata = self.load_metadata(path); // S3 GET of full bundle
        let etag = source_mtime.to_string();
        if let Some(cached) = self.page_bucket.get_json::<CachedPage>(path, &etag) {
            return Ok(PageRenderResult {
                html: cached.html,
                ...
                metadata, // metadata from the just-fetched bundle
            });
        }
        let markdown_text = self.storage.read(path)?; // would be another GET
        ...
    }
And CachedPage/CachedPageRef only carry html / title / toc — no metadata:
    #[derive(Deserialize)]
    struct CachedPage {
        html: String,
        title: Option<String>,
        toc: Vec<TocEntry>,
    }
So on cache hit:
1. mtime: cheap.
2. meta: full S3 page-bundle GET (markdown body fetched and discarded).
3. page_bucket.get_json: local hit, returns html/title/toc.
4. markdown_text is not fetched a second time (the cache hit returns early before storage.read), but the bundle GET in step 2 already included it.
Net result: the markdown body crosses the network on every render, just for the sake of metadata extraction.
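To make the cost concrete, here is a minimal simulation of the current call order. Everything in it is a stand-in: MockS3, render_current and gets_for_views are hypothetical names, the mock counts bundle GETs instead of talking to S3, and the rendered HTML is a fixed string. On a miss the real code would also call storage.read; that is left out to keep the sketch focused on cache hits.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical stand-in for the S3 backend; it only counts bundle GETs.
struct MockS3 {
    mtimes: HashMap<String, u64>,
    bundle_gets: Cell<u32>,
}

impl MockS3 {
    /// Models mtime(): an in-memory lookup with no network traffic.
    fn mtime(&self, path: &str) -> u64 {
        self.mtimes.get(path).copied().unwrap_or(0)
    }

    /// Models meta(): every call costs one full page-bundle GET.
    fn meta(&self, _path: &str) -> String {
        self.bundle_gets.set(self.bundle_gets.get() + 1);
        "title: Billing API".to_string()
    }
}

/// Mirrors the order of calls in render_content: mtime, then meta,
/// and only then the local cache lookup.
fn render_current(
    s3: &MockS3,
    cache: &mut HashMap<(String, String), String>,
    path: &str,
) -> (String, String) {
    let etag = s3.mtime(path).to_string();
    let metadata = s3.meta(path); // unconditional, before the cache check
    let html = cache
        .entry((path.to_string(), etag))
        .or_insert_with(|| "<p>rendered</p>".to_string())
        .clone();
    (html, metadata)
}

/// Returns how many bundle GETs `views` views of one unchanged page cost.
fn gets_for_views(views: u32) -> u32 {
    let s3 = MockS3 {
        mtimes: HashMap::from([("domain/billing/api".to_string(), 1_700_000_000)]),
        bundle_gets: Cell::new(0),
    };
    let mut cache = HashMap::new();
    for _ in 0..views {
        render_current(&s3, &mut cache, "domain/billing/api");
    }
    s3.bundle_gets.get()
}
```

In this model, gets_for_views(5) reports five GETs even though only the first view misses the local cache, which is the behaviour described in the steps above.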
Why metadata isn't already cached
FsStorage::meta (rw-storage-fs/src/lib.rs:646-) walks the ancestor chain reading each meta.yaml and merging — metadata for domain/billing/api depends on meta.yaml files at root, domain/, domain/billing/, and domain/billing/api/. The page cache is keyed only by the page's own mtime, which does not capture changes to ancestor meta.yaml. Caching metadata under that key would let ancestor edits go undetected.
So the design isn't accidental — there's a reason metadata isn't in CachedPage — but the S3 cost on cache hits feels like an oversight.
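As an illustration of why the page's own mtime is not enough, the sketch below lists the files that feed one page's merged metadata. ancestor_meta_files is a hypothetical helper, not a function from rw-storage-fs, and it assumes each level contributes a file literally named meta.yaml as described above.

```rust
/// Lists the meta.yaml files that contribute to a page's merged metadata,
/// ordered from the docs root down to the page's own directory.
/// Hypothetical helper for illustration only.
fn ancestor_meta_files(page_path: &str) -> Vec<String> {
    let mut files = vec!["meta.yaml".to_string()]; // root-level meta.yaml
    let mut prefix = String::new();
    for segment in page_path.split('/').filter(|s| !s.is_empty()) {
        prefix.push_str(segment);
        prefix.push('/');
        files.push(format!("{prefix}meta.yaml"));
    }
    files
}
```

For domain/billing/api this yields four files. An edit to any of the first three changes the merged metadata without touching the page's mtime, so a cache entry keyed on that mtime alone would keep serving the old metadata.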
Possible directions
Listing for discussion; none of these is a one-liner.
1. Bundle metadata into CachedPage, with an enriched etag.
   Add a storage.meta_mtime(path) method that returns the max mtime across the ancestor meta.yaml chain (cheap: HashMap lookup on S3 if the manifest exposes it; one-time stat walk on FS). Use format!("{page_mtime}-{meta_mtime}") as the etag. Then CachedPage can include metadata: Option<Metadata> and the cache hit skips load_metadata entirely. Solves the S3 round-trip.
2. Separate metadata bucket.
   Cache Metadata under a key that reflects the ancestor chain (e.g., the concatenated ancestor mtimes). Lots of pages share the same metadata hierarchy, so a separate bucket would dedupe well. More moving parts; probably overkill unless meta becomes a bigger hot path.
3. S3-side optimization: HEAD before GET.
   Cheaper but doesn't fully solve it: meta() returns parsed JSON, so a HEAD-then-conditional-GET would still need to fetch the body unless the bundle JSON were split. Probably not worth the complexity vs. option 1.
4. Restructure the bundle to separate metadata from content on S3.
   Publisher writes a small <path>.meta.json alongside the full bundle; S3Storage::meta fetches only the small object. Means more S3 keys but lets metadata fetches be cheap. Bigger change.
My preference would be option 1: pass a composite etag derived from (page_mtime, ancestor_meta_mtime) and bundle metadata into the cache entry. The meta_mtime API is small and useful in its own right, and it preserves cache correctness across ancestor meta edits.
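A rough sketch of what option 1 could look like, under stated assumptions: meta_mtime does not exist today and is modelled here as a plain field on a mock, MockStorage, fetch_bundle, render_cached and demo are hypothetical names, and the cache is a HashMap standing in for page_bucket. The real CachedPage also carries title and toc, which are omitted here.

```rust
use std::cell::Cell;
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Metadata {
    title: String,
}

/// Cache entry extended with metadata, as option 1 proposes.
#[derive(Clone)]
struct CachedPage {
    html: String,
    metadata: Option<Metadata>,
}

/// Hypothetical storage mock that counts bundle GETs.
struct MockStorage {
    page_mtime: Cell<u64>,
    meta_mtime: Cell<u64>, // assumed: max mtime over the ancestor meta.yaml chain
    bundle_gets: Cell<u32>,
}

impl MockStorage {
    fn mtime(&self, _path: &str) -> u64 {
        self.page_mtime.get()
    }

    /// Proposed API (an assumption, not an existing method).
    fn meta_mtime(&self, _path: &str) -> u64 {
        self.meta_mtime.get()
    }

    /// Models one full bundle GET returning markdown plus metadata.
    fn fetch_bundle(&self, _path: &str) -> (String, Metadata) {
        self.bundle_gets.set(self.bundle_gets.get() + 1);
        ("# Billing API".to_string(), Metadata { title: "Billing API".to_string() })
    }
}

/// Checks the cache first, using a composite etag; only a miss touches the bundle.
fn render_cached(
    storage: &MockStorage,
    cache: &mut HashMap<String, (String, CachedPage)>,
    path: &str,
) -> CachedPage {
    let etag = format!("{}-{}", storage.mtime(path), storage.meta_mtime(path));
    if let Some((cached_etag, page)) = cache.get(path) {
        if *cached_etag == etag {
            return page.clone(); // hit: no bundle GET, metadata comes from the cache
        }
    }
    let (markdown, metadata) = storage.fetch_bundle(path);
    let page = CachedPage {
        html: format!("<p>{markdown}</p>"), // placeholder for real rendering
        metadata: Some(metadata),
    };
    cache.insert(path.to_string(), (etag, page.clone()));
    page
}

/// Renders a page three times, bumps the ancestor meta mtime, renders once more.
/// Returns the GET count before and after the meta edit, plus the last page.
fn demo() -> (u32, u32, CachedPage) {
    let storage = MockStorage {
        page_mtime: Cell::new(100),
        meta_mtime: Cell::new(40),
        bundle_gets: Cell::new(0),
    };
    let mut cache = HashMap::new();
    for _ in 0..3 {
        render_cached(&storage, &mut cache, "domain/billing/api");
    }
    let after_hits = storage.bundle_gets.get();
    storage.meta_mtime.set(41); // simulate an edit to an ancestor meta.yaml
    let page = render_cached(&storage, &mut cache, "domain/billing/api");
    (after_hits, storage.bundle_gets.get(), page)
}
```

In this model three views cost one GET, and the ancestor meta edit triggers exactly one more, so the composite etag keeps the invalidation behaviour that motivated leaving metadata out of the cache in the first place. Whether the S3 manifest can actually supply meta_mtime cheaply is the open question.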
Why it matters
- For Backstage deployments, the cache is one of the main selling points of @rwdocs/core: it's supposed to make repeated page views cheap. Currently the rendering CPU is cheap on a hit but the S3 round trip isn't avoided, so the user-perceived latency win is much smaller than the cache hit rate suggests.
- For local rw serve, the cost is small (filesystem + git), but ancestor-meta walks scale with directory depth and YAML parsing is non-trivial, so it is measurable on large monorepo docs trees.
- Not a correctness bug, no data loss — purely a perf/scaling design issue that's worth tracking now while the cache architecture is still simple.