
memory-system-page#205

Merged
AndreasAbdi merged 5 commits into main from memory-system-page on Jun 22, 2026


@AndreasAbdi

Contributor

{
"project": "Model Atlas — Memory Canonical System Page",
"branchName": "memory-system-page",
"description": "Publish the missing canonical English memory system page, backed by a canonical registry identity, a page-local graph-backed system flow, and localized messages, so readers can understand how serving memory constrains inference through weight residency, KV-cache growth, bandwidth limits, allocator pressure, and the tradeoffs around batching, spilling, and splitting work.",
"context": {
"customerAsk": "Customer ask alignment: keep the queue near the 3-4 active worker target with one fresh, narrow, customer-visible systems-page slice now that continuous-batching-system-page landed cleanly on current main. Add the missing canonical English memory system page from the reader-facing systems bundle so customers can understand why serving performance is constrained by weight residency, KV-cache growth, bandwidth, and allocator pressure rather than treating memory as a vague hardware footnote. Keep this to one mergeable page slice on current main. Scope: create the system page under src/content/docs/systems/memory/ with page.mdx, messages/en.json, and any required assets.json; add the backing system registry record and graph record if missing; classify it as a system; connect it to already-shipped nearby pages such as inference engine, batching, routing, deployment, on-disk-kv-cache, prefill, decode, and KV cache when those pages exist on current main by implementation time; and add only the focused validation or tests needed for the touched content and registry surfaces. The page should explain what memory means in an inference-serving stack, why weights and KV cache dominate different parts of the request path, how bandwidth and fragmentation shape latency and throughput, what operators trade when they batch, spill, or split work, and where memory sits relative to routing and deployment decisions. Acceptance criteria: the registry-backed memory system page renders on current main, search and related links can discover it, the diff stays page-local, and the focused touched checks pass.",
"problem": "The site already has nearby serving pages such as inference engine, batching, routing, deployment, and on-disk-kv-cache, plus glossary pages for prefill, decode, and KV cache, but it does not yet ship one canonical page that explains memory as a serving system concern. Readers can encounter memory-adjacent ideas in those pages without getting one plain-language explanation of what memory actually means in inference serving, why weights and KV cache dominate different stages, or why bandwidth and fragmentation can cap throughput even when compute looks available. That leaves a gap in the serving bundle: one of the main operational constraints behind latency, concurrency, and capacity is fragmented across neighboring pages and harder to discover directly.",
"solution": "Create a canonical /docs/systems/memory page using the standard system-page contract, colocated English messages, and a page-local system-flow asset backed by a matching graph record that teaches memory residency, cache growth, bandwidth motion, and allocator pressure clearly. Back the page with a canonical system.memory registry record, classify it as a system, and connect it to already-shipped nearby pages through registry-backed aliases, tags, and related-doc relationships. Keep the slice focused on plain-language serving behavior, request-path memory pressure, operator tradeoffs, and focused discovery proof rather than broad hardware-taxonomy, performance-benchmark, or allocator-framework expansion."
},
"acceptanceCriteria": [
"A published canonical docs page exists for memory under src/content/docs/systems/memory/ with kind: \"system\", a matching canonical registry record, English messages, and any required local assets.",
"The page follows the canonical system-page contract and docs-writing standards with one folded openingSummary, plain-language system framing, distinct section jobs, and no page-meta prose.",
"The page explains what memory means in an inference-serving stack and why it is a serving-system concern rather than only a hardware specification.",
"The page explains why model weights and KV cache dominate different parts of the request path, including prefill and decode.",
"The page explains how memory bandwidth and allocator fragmentation shape latency, throughput, and concurrency even when raw compute exists.",
"The page explains the main operator tradeoffs around batching, spilling, splitting work, and related serving decisions, and connects the reader to nearby shipped pages for inference engine, batching, routing, deployment, on-disk-kv-cache, prefill, decode, and KV cache when those targets exist on current main.",
"The implementation remains English-only and does not broaden into hardware benchmarking, allocator-library tutorials, or unrelated serving-taxonomy cleanup beyond what is required for the canonical memory page to ship cleanly.",
"Quality gate: typecheck, lint, and focused touched tests pass."
],
"userStories": [
{
"id": "memory-system-page-001",
"title": "Establish the canonical memory system identity",
"description": "As a reader searching for inference-serving memory behavior, I want one canonical serving-system identity for memory so discovery surfaces lead me to a dedicated explainer instead of scattered references across neighboring pages.",
"acceptanceCriteria": [
"A canonical system.memory registry identity exists with kind: \"system\", slug memory, controlled aliases, tags, and related ids appropriate for a serving-system topic.",
"A matching graph record exists for the page-local system-flow teaching aid so the memory page can render the memory-path explanation through the standard graph-backed surface.",
"The memory record is classified as a system, not a concept, module, or glossary entry, and it fits the current registry and classification conventions for shipped system pages.",
"Registry relationships connect the record to already-shipped nearby docs for inference engine, batching, routing, deployment, on-disk-kv-cache, prefill, decode, and KV cache when those canonical targets exist on current main.",
"Representative reader queries such as memory, serving memory, weight residency, KV cache growth, and memory bandwidth can resolve to the canonical page through the normal discovery path.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": "2026-06-22: Added the canonical system.memory registry record, graph.memory-system-flow, the published /docs/systems/memory page bundle, and focused discovery tests so representative search queries now resolve to the canonical page."
},
{
"id": "memory-system-page-002",
"title": "Publish the canonical memory system page",
"description": "As a technical layperson learning inference serving, I want a dedicated memory page so I can understand the serving constraint before I read narrower runtime or deployment pages.",
"acceptanceCriteria": [
"A canonical system page exists at /docs/systems/memory with matching frontmatter, messages/en.json, and a colocated assets.json wired to the page-local system-flow graph.",
"The page follows the shared system-page structure and keeps page.mdx structural, with reader-facing copy resolved through messages/en.json.",
"The page opens with one folded openingSummary and explains in plain language that serving memory is the live state the runtime must keep or move while answering requests, not just the advertised capacity number on a device spec sheet.",
"The page explains where memory sits in the serving stack, including how it constrains runtime choices without changing the underlying model architecture itself.",
"The page renders in the standard docs shell and is understandable in isolation before narrowing into bandwidth, cache, or deployment tradeoffs.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": "2026-06-22: Audited the published /docs/systems/memory bundle against the page-focused acceptance criteria, reran lint/typecheck plus focused registry/page tests through the prepared content-runtime path, and re-verified the route in-browser on http://127.0.0.1:3456/docs/systems/memory."
},
{
"id": "memory-system-page-003",
"title": "Teach request-path memory dominance with a graph-backed flow",
"description": "As a reader trying to understand why memory pressure changes over time, I want the page to show how weights and KV cache dominate different stages so I can connect prefill and decode behavior to concrete memory load.",
"acceptanceCriteria": [
"The page clearly explains that model weights must stay resident before useful work can begin, while KV cache grows with prompt and generation length as requests stay active.",
"The page clearly explains how prefill and decode stress memory differently, including why prompt processing and token-by-token generation do not create identical memory pressure.",
"The page-local system-flow graph and asset wiring render a reader-visible flow that shows weight residency, request execution, KV-cache accumulation, and memory pressure moving across the serving path without decorative extra graph churn.",
"The page explains why memory remains a live serving concern even after the model is loaded, because active requests keep changing the amount and placement of state the runtime must hold.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": "2026-06-22: Expanded the canonical memory page to teach the different memory pressure phases explicitly, updated the page-local system-flow graph labels and layout to emphasize weight residency, prefill burst, decode reuse, and ongoing live pressure, and extended the focused page tests plus browser verification for those teaching points."
},
{
"id": "memory-system-page-004",
"title": "Explain bandwidth, fragmentation, and operator tradeoffs",
"description": "As a reader deciding why memory matters operationally, I want the page to explain both the performance limits and the operator decisions they force so I can understand why batching, spilling, and routing choices are tied to memory.",
"acceptanceCriteria": [
"The page clearly explains how memory bandwidth affects first-token latency, inter-token latency, or throughput by limiting how quickly the runtime can read weights and move active state.",
"The page clearly explains allocator pressure or fragmentation in plain language, including why enough total memory does not always mean enough usable contiguous or well-placed memory for the next serving step.",
"The page clearly explains the practical tradeoffs of batching, spilling cache, or splitting prefill and decode, including what each choice gains and what memory cost or latency cost it introduces.",
"The page explains where memory sits relative to routing and deployment decisions, including why those systems often exist to fit work into available memory shape rather than only to save compute.",
"The page exposes clear related navigation to inference engine, batching, routing, deployment, on-disk-kv-cache, prefill, decode, and KV cache when those pages exist on current main.",
"At least one discovery surface presents memory as a reachable destination so readers can find it without typing the exact slug.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 4,
"passes": true,
"notes": "2026-06-22: Expanded the page with dedicated bandwidth/fragmentation and operator-tradeoff sections, clarified first-token and inter-token latency consequences plus routing/deployment memory-fit behavior, extended focused runtime assertions, and re-verified the route in-browser on localhost."
},
{
"id": "memory-system-page-005",
"title": "Add focused validation for the memory page slice",
"description": "As a maintainer, I want targeted automated proof for the memory slice so route, registry, graph, messages, and discoverability regressions are caught without expanding into unrelated suite work.",
"acceptanceCriteria": [
"Validation or tests confirm the /docs/systems/memory route, the canonical memory system record, and the default English messages resolve together.",
"Focused validation confirms the page-local graph record and asset reference resolve correctly for the canonical route.",
"Coverage asserts at least one memory-specific discoverability outcome and at least one related-link, tag, alias, or search expectation for the page.",
"Focused checks stay limited to touched content and discovery integrity rather than inventory snapshots, broad serving-family audits, or meta-test scaffolding.",
"Typecheck passes",
"Tests pass"
],
"priority": 5,
"passes": true,
"notes": "2026-06-22: Revalidated the shipped memory page slice with focused registry/page tests plus lint and typecheck, confirmed the route, English messages, local graph asset wiring, and discoverability expectations stay aligned, and recorded the content-runtime prep race as a workflow gotcha rather than a page regression."
}
]
}
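
Story memory-system-page-003 says KV cache grows with prompt and generation length. A minimal sketch of that growth, using the common per-token estimate for a standard attention cache (keys plus values, per layer, per KV head). The model shape below is hypothetical and does not come from this PR or from the shipped page:

```typescript
// Hypothetical model shape for illustration only; not taken from this PR.
interface ModelShape {
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads per layer
  headDim: number;       // dimension of each head
  bytesPerValue: number; // 2 for fp16 or bf16
}

// Bytes of KV cache one request holds once `tokens` tokens are in context.
// The leading factor of 2 covers the key tensor and the value tensor.
function kvCacheBytes(shape: ModelShape, tokens: number): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue * tokens;
}

const exampleShape: ModelShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

// Prefill fills the cache for the whole prompt; decode then adds one token per step.
const afterPrefill = kvCacheBytes(exampleShape, 1000);
const afterDecode = kvCacheBytes(exampleShape, 1000 + 500);

console.log({ afterPrefill, afterDecode });
```

The estimate is linear in context length, which is why a request that stays active keeps adding to memory pressure after the weights are already resident.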

@AndreasAbdi

Contributor Author

Follow-up push for mergeability on the current PR head:

  • restored the canonical system page section contract for /docs/systems/memory
  • kept the bandwidth, fragmentation, and operator-tradeoff teaching points by folding them into the existing how-it-works and practical-impact sections
  • updated the focused memory-page test to assert the same content against the contract-compliant structure
  • re-ran bun run validate-data, bun run lint, bun run typecheck, and the focused memory registry/page tests locally

@AndreasAbdi

Contributor Author

Review summary for PR #205

Result: PASS
Blocking status: no BLOCKING issues found.

Evidence:

  • make test: PASS locally (2464 pass, 0 fail)
  • Live CI on head b82ef938a518c6564fb882392d671b3fd5aa890e: PASS (gh pr checks shows 11/11 passed)
  • Browser verification: PASS on http://127.0.0.1:3456/docs/systems/memory using the webpack-backed dev server because Turbopack mis-inferred the worktree root in this nested workspace. The page rendered the expected sections, folded opening summary, graph labels, and nearby related links. GET /api/search?query=memory also returned /docs/systems/memory as the top hit.
  • Note: docs/internal/processes/manual-qa.md is absent in this worktree, so I could not read that process doc and verified the route directly instead.
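
The search check in the evidence above could be expressed as a small helper. This is only a sketch: the real /api/search response payload is not shown in this PR, so the `SearchResponse` shape and the `path` field are assumptions made for illustration.

```typescript
// Assumed response shape; the actual /api/search payload is not shown in this PR.
interface SearchResponse {
  results: { path: string }[];
}

// Returns the path of the first result, or undefined when there are no results.
function topHitPath(response: SearchResponse): string | undefined {
  return response.results[0]?.path;
}

// Hand-written sample standing in for a parsed response to
// GET /api/search?query=memory (hypothetical data, not captured output).
const sample: SearchResponse = {
  results: [{ path: "/docs/systems/memory" }, { path: "/docs/systems/batching" }],
};

console.log(topHitPath(sample) === "/docs/systems/memory");
```

In a focused test, the sample would be replaced by the parsed JSON body of the real endpoint, with the field names adjusted to whatever that endpoint returns.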

Project acceptance criteria:

  • PASS: A published canonical docs page exists for memory under src/content/docs/systems/memory/ with kind: "system", a matching canonical registry record, English messages, and any required local assets. The diff adds page.mdx, messages/en.json, assets.json, src/content/registry/systems/memory.json, and src/content/registry/graphs/memory-system-flow.json with matching IDs and namespaces.
  • PASS: The page follows the canonical system-page contract and docs-writing standards with one folded openingSummary, plain-language system framing, distinct section jobs, and no page-meta prose. The page matches the existing system-page structure, keeps prose in messages, and the rendered shell includes the folded opening summary.
  • PASS: The page explains what memory means in an inference-serving stack and why it is a serving-system concern rather than only a hardware specification. The whatItIs, whereItSits, and opening summary copy do this directly in plain language.
  • PASS: The page explains why model weights and KV cache dominate different parts of the request path, including prefill and decode. The howItWorks section and graph cover weight residency, prefill burst, decode reuse, and ongoing cache growth.
  • PASS: The page explains how memory bandwidth and allocator fragmentation shape latency, throughput, and concurrency even when raw compute exists. The howItWorks and practicalImpact sections explain bandwidth, usable-memory limits, first-token latency, inter-token latency, queue growth, and safe concurrency.
  • PASS: The page explains the main operator tradeoffs around batching, spilling, splitting work, and related serving decisions, and connects the reader to nearby shipped pages for inference engine, batching, routing, deployment, on-disk-kv-cache, prefill, decode, and KV cache when those targets exist on current main. Those tradeoffs are explicit in practicalImpact, related IDs are wired in the registry, and the rendered page exposes the nearby links.
  • PASS: The implementation remains English-only and does not broaden into hardware benchmarking, allocator-library tutorials, or unrelated serving-taxonomy cleanup beyond what is required for the canonical memory page to ship cleanly. Scope stayed page-local and English-only.
  • PASS: Quality gate: typecheck, lint, and focused touched tests pass. CI is green and local make test passed.
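
The bandwidth and usable-memory criteria above can be illustrated with rough arithmetic. This is a sketch under stated assumptions, not the page's own formula: it assumes a decode step at batch size 1 must read every weight byte at least once, and it models free memory as a plain list of contiguous block sizes. All numbers are hypothetical.

```typescript
// Lower bound on one decode step, in milliseconds, if every weight byte must be
// read once per step. Ignores KV-cache reads, compute, and overlap (simplification).
function decodeStepFloorMs(weightBytes: number, bytesPerSecond: number): number {
  return (weightBytes * 1000) / bytesPerSecond;
}

// Sum of all free blocks, regardless of where they sit.
function totalFree(freeBlocks: number[]): number {
  return freeBlocks.reduce((sum, block) => sum + block, 0);
}

// True only if a single free block is large enough for a contiguous allocation.
function fitsContiguously(freeBlocks: number[], needed: number): boolean {
  return freeBlocks.some((block) => block >= needed);
}

// Hypothetical numbers: 14 GB of weights on a device with 1 TB/s of bandwidth.
const floorMs = decodeStepFloorMs(14e9, 1e12);

// Hypothetical free list in GB: 12 GB free in total, but no single 6 GB block.
const freeBlocksGb = [4, 3, 5];
const enoughInTotal = totalFree(freeBlocksGb) >= 6;
const enoughContiguous = fitsContiguously(freeBlocksGb, 6);

console.log({ floorMs, enoughInTotal, enoughContiguous });
```

The first function shows why inter-token latency can be capped by memory bandwidth even when compute is idle; the free-list check shows how total free memory can exceed a request's need while no usable contiguous region exists.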

User stories and behavioral-assertion check:

  • PASS: memory-system-page-001 includes behavioral assertions, especially discovery queries resolving to the canonical page and related-doc resolution. The implementation satisfies the story.
  • PASS: memory-system-page-002 includes behavioral assertions, especially rendered route behavior and browser verification. The implementation satisfies the story.
  • PASS: memory-system-page-003 includes behavioral assertions, especially rendered graph content and request-path explanations. The implementation satisfies the story.
  • PASS: memory-system-page-004 includes behavioral assertions, especially observable related navigation and discovery-surface reachability. The implementation satisfies the story.
  • PASS: memory-system-page-005 includes behavioral assertions, especially route/search/asset/registry integration behavior instead of pure inventory-only checks. The implementation satisfies the story.

Docs-writing standards:

  • PASS: The page is understandable in isolation and does not define the topic only through one architecture slot, one historical example, or one adjacent page.
  • PASS: The narrative body stays focused on the concept and contains no self-referential, site-structure, process, phase, or page-meta copy.
  • PASS: The first sections explain both what the concept is and why it matters in plain language for a technical layperson.
  • PASS: The title and first narrative mention use the full name before acronyms or shorthand. It expands key-value cache before leaning on KV cache shorthand in the narrative.
  • PASS: Each section has a distinct job and does not restate the same thesis with slightly different wording.
  • PASS: Mathematically heavy pages include the equations, notation, or symbolic derivations needed to teach the idea accurately. This page is not math-heavy; the lightweight equation is additive rather than missing necessary formalism.
  • PASS: Visually, structurally, or conceptually heavy pages include the best graph, diagram, chart, comparison view, or algorithm presentation needed to teach the idea accurately, and those assets follow graphing standards.
  • PASS: Math sections keep concise symbol-only definitions directly under equations and avoid concept rows such as projections, grouping mechanics, or implementation steps. The page uses a single concise block formula without verbose definition rows.
  • PASS: Customer-facing copy contains no reader-shortcut callouts, no "on this page" framing, and no internal workflow language.
  • PASS: References and citations are present where factual claims need support, and every cited reference is correct. The page cites citation.orca-serving-system and citation.deepseek-v4-paper, both appropriate to serving-memory/runtime discussion.
  • PASS: Related docs, tags, and citations support discovery, but the page body does not depend on hand-held cross-page explanation to make sense.
  • PASS: The copy is concise, direct, and conformant with the technical-writing baseline in this document.

Graphing standards:

  • PASS: The graph properly expresses the idea and is the most suitable of the possible presentations. A node-flow graph is appropriate for showing where memory pressure shifts across the serving path.
  • PASS: The graph properly expresses math formulas by applying KaTeX-based formatting. No chart math formatting was needed; no missing formula rendering issue was introduced.
  • PASS: The graph properly presents arrows in flow direction. The rendered flow shows directional transitions from weights to prefill to cache/decode to pressure.
  • PASS: The graph is renderable and is visible. Verified in-browser.
  • PASS: The graph MUST fit with the other graph presentations. It uses the shared SystemFlowGraph and RegistryGraphFlow patterns already used by nearby system pages.
  • PASS: It MUST be immediately obvious to the customer what the graph is trying to present. Labels are direct and reader-facing.
  • PASS: The graph MUST be the most appropriate presentation. This is a relationship/flow explanation, so the flow graph is a good fit.
  • PASS: The graph MUST have a legend. Rendered legend was present.
  • PASS: The graph MUST have a title. Rendered title was present as Memory System Flow.
  • PASS: The graph MUST choose WCAG accessible colors. It uses the shared graph component palette already covered by existing a11y coverage and showed no obvious contrast issue in manual QA.
  • PASS: Node graphs MUST be legible. No overlap or unreadable node text was observed.
  • PASS: Edge flows MUST be visible. The shared renderer kept them visible.
  • PASS: Node graphs MUST use color to emphasize the most important nodes. The shared system-flow visual treatment remained consistent and readable without decorative churn.

General website standards:

  • PASS: Architecture and state. The change stays within the existing page/registry/graph/content-runtime architecture and does not add ad hoc frontend state.
  • PASS: Components and interaction. It reuses shared docs components with explicit contracts instead of inventing page-specific UI.
  • PASS: Styling and visual consistency. It relies on shared docs shell and shared graph rendering.
  • PASS: Accessibility. Existing shared components are reused; no new accessibility regression was found in route verification.
  • PASS: Responsive design. No page-local layout regression was evident; the page uses the standard docs shell and shared graph surface.
  • PASS: Performance and resilience. The slice is static content plus existing shared components and does not introduce new runtime fetch or fragile behavior.
  • PASS: Browser compatibility and progressive enhancement. No new browser-specific code was introduced.
  • PASS: Localization. User-facing copy is correctly isolated in messages/en.json with stable keys and explicit English-only scope.
  • PASS: Testing and diagnostics. Coverage includes rendered behavior, route/search integration, registry wiring, and page-local asset resolution.

Review-rules check:

  • PASS: Correctness before style. I found no correctness issue in page wiring, discovery, related links, or graph rendering.
  • PASS: Solves stated problem without obvious regression. The canonical memory page now exists and is discoverable without broad unrelated churn.
  • PASS: Architecture and dependency fit. The slice follows existing content, registry, graph, and docs-shell patterns.
  • PASS: Readability and maintainability. The page is message-driven and uses the established system-page structure.
  • PASS: Tests and quality evidence. Automated and browser evidence are adequate for the scope.
  • PASS: High-risk AI-code targets. I did not find hallucinated APIs, stale patterns, hidden side effects, dead code, or oversized unclear helpers in the diff.

No further changes requested. Earlier blocking feedback does not apply on the current head.

AndreasAbdi merged commit 25d5678 into main on Jun 22, 2026
11 checks passed
