memory-system-page#205
Merged
… with a graph-backed flow]
…nd operator tradeoffs]
Follow-up push for mergeability on the current PR head:
Review summary for PR #205
Result: PASS
Evidence:
Project acceptance criteria:
User stories and behavioral-assertion check:
Docs-writing standards:
Graphing standards:
General website standards:
Review-rules check:
No further changes requested. Earlier blocking feedback does not apply on the current head.
{
"project": "Model Atlas — Memory Canonical System Page",
"branchName": "memory-system-page",
"description": "Publish the missing canonical English `memory` system page, backed by a canonical registry identity, a page-local graph-backed system flow, and localized messages, so readers can understand how serving memory constrains inference through weight residency, KV-cache growth, bandwidth limits, allocator pressure, and the tradeoffs around batching, spilling, and splitting work.",
"context": {
"customerAsk": "Customer ask alignment: keep the queue near the 3-4 active worker target with one fresh, narrow, customer-visible systems-page slice now that `continuous-batching-system-page` landed cleanly on current `main`. Add the missing canonical English `memory` system page from the reader-facing systems bundle so customers can understand why serving performance is constrained by weight residency, KV-cache growth, bandwidth, and allocator pressure rather than treating memory as a vague hardware footnote. Keep this to one mergeable page slice on current `main`. Scope: create the system page under `src/content/docs/systems/memory/` with `page.mdx`, `messages/en.json`, and any required `assets.json`; add the backing system registry record and graph record if missing; classify it as a `system`; connect it to already-shipped nearby pages such as `inference engine`, `batching`, `routing`, `deployment`, `on-disk-kv-cache`, `prefill`, `decode`, and `KV cache` when those pages exist on current `main` by implementation time; and add only the focused validation or tests needed for the touched content and registry surfaces. The page should explain what memory means in an inference-serving stack, why weights and KV cache dominate different parts of the request path, how bandwidth and fragmentation shape latency and throughput, what operators trade when they batch, spill, or split work, and where memory sits relative to routing and deployment decisions. Acceptance criteria: the registry-backed `memory` system page renders on current `main`, search and related links can discover it, the diff stays page-local, and the focused touched checks pass.",
"problem": "The site already has nearby serving pages such as `inference engine`, `batching`, `routing`, `deployment`, and `on-disk-kv-cache`, plus glossary pages for `prefill`, `decode`, and `KV cache`, but it does not yet ship one canonical page that explains memory as a serving system concern. Readers can encounter memory-adjacent ideas in those pages without getting one plain-language explanation of what memory actually means in inference serving, why weights and KV cache dominate different stages, or why bandwidth and fragmentation can cap throughput even when compute looks available. That leaves a gap in the serving bundle: one of the main operational constraints behind latency, concurrency, and capacity is fragmented across neighboring pages and harder to discover directly.",
"solution": "Create a canonical `/docs/systems/memory` page using the standard system-page contract, colocated English messages, and a page-local system-flow asset backed by a matching graph record that teaches memory residency, cache growth, bandwidth motion, and allocator pressure clearly. Back the page with a canonical `system.memory` registry record, classify it as a `system`, and connect it to already-shipped nearby pages through registry-backed aliases, tags, and related-doc relationships. Keep the slice focused on plain-language serving behavior, request-path memory pressure, operator tradeoffs, and focused discovery proof rather than broad hardware-taxonomy, performance-benchmark, or allocator-framework expansion."
},
"acceptanceCriteria": [
"A published canonical docs page exists for `memory` under `src/content/docs/systems/memory/` with `kind: \"system\"`, a matching canonical registry record, English messages, and any required local assets.",
"The page follows the canonical system-page contract and docs-writing standards with one folded `openingSummary`, plain-language system framing, distinct section jobs, and no page-meta prose.",
"The page explains what memory means in an inference-serving stack and why it is a serving-system concern rather than only a hardware specification.",
"The page explains why model weights and KV cache dominate different parts of the request path, including prefill and decode.",
"The page explains how memory bandwidth and allocator fragmentation shape latency, throughput, and concurrency even when raw compute exists.",
"The page explains the main operator tradeoffs around batching, spilling, splitting work, and related serving decisions, and connects the reader to nearby shipped pages for `inference engine`, `batching`, `routing`, `deployment`, `on-disk-kv-cache`, `prefill`, `decode`, and `KV cache` when those targets exist on current `main`.",
"The implementation remains English-only and does not broaden into hardware benchmarking, allocator-library tutorials, or unrelated serving-taxonomy cleanup beyond what is required for the canonical `memory` page to ship cleanly.",
"Quality gate: typecheck, lint, and focused touched tests pass."
],
"userStories": [
{
"id": "memory-system-page-001",
"title": "Establish the canonical memory system identity",
"description": "As a reader searching for inference-serving memory behavior, I want one canonical serving-system identity for `memory` so discovery surfaces lead me to a dedicated explainer instead of scattered references across neighboring pages.",
"acceptanceCriteria": [
"A canonical `system.memory` registry identity exists with `kind: \"system\"`, slug `memory`, controlled aliases, tags, and related ids appropriate for a serving-system topic.",
"A matching graph record exists for the page-local system-flow teaching aid so the memory page can render the memory-path explanation through the standard graph-backed surface.",
"The memory record is classified as a `system`, not a concept, module, or glossary entry, and it fits the current registry and classification conventions for shipped system pages.",
"Registry relationships connect the record to already-shipped nearby docs for `inference engine`, `batching`, `routing`, `deployment`, `on-disk-kv-cache`, `prefill`, `decode`, and `KV cache` when those canonical targets exist on current `main`.",
"Representative reader queries such as `memory`, `serving memory`, `weight residency`, `KV cache growth`, and `memory bandwidth` can resolve to the canonical page through the normal discovery path.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": "2026-06-22: Added the canonical `system.memory` registry record, `graph.memory-system-flow`, the published `/docs/systems/memory` page bundle, and focused discovery tests so representative search queries now resolve to the canonical page."
},
{
"id": "memory-system-page-002",
"title": "Publish the canonical memory system page",
"description": "As a technical layperson learning inference serving, I want a dedicated memory page so I can understand the serving constraint before I read narrower runtime or deployment pages.",
"acceptanceCriteria": [
"A canonical system page exists at `/docs/systems/memory` with matching frontmatter, `messages/en.json`, and a colocated `assets.json` wired to the page-local system-flow graph.",
"The page follows the shared system-page structure and keeps `page.mdx` structural, with reader-facing copy resolved through `messages/en.json`.",
"The page opens with one folded `openingSummary` and explains in plain language that serving memory is the live state the runtime must keep or move while answering requests, not just the advertised capacity number on a device spec sheet.",
"The page explains where memory sits in the serving stack, including how it constrains runtime choices without changing the underlying model architecture itself.",
"The page renders in the standard docs shell and is understandable in isolation before narrowing into bandwidth, cache, or deployment tradeoffs.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": "2026-06-22: Audited the published `/docs/systems/memory` bundle against the page-focused acceptance criteria, reran lint/typecheck plus focused registry/page tests through the prepared content-runtime path, and re-verified the route in-browser on `http://127.0.0.1:3456/docs/systems/memory`."
},
{
"id": "memory-system-page-003",
"title": "Teach request-path memory dominance with a graph-backed flow",
"description": "As a reader trying to understand why memory pressure changes over time, I want the page to show how weights and KV cache dominate different stages so I can connect prefill and decode behavior to concrete memory load.",
"acceptanceCriteria": [
"The page clearly explains that model weights must stay resident before useful work can begin, while KV cache grows with prompt and generation length as requests stay active.",
"The page clearly explains how prefill and decode stress memory differently, including why prompt processing and token-by-token generation do not create identical memory pressure.",
"The page-local system-flow graph and asset wiring render a reader-visible flow that shows weight residency, request execution, KV-cache accumulation, and memory pressure moving across the serving path without decorative extra graph churn.",
"The page explains why memory remains a live serving concern even after the model is loaded, because active requests keep changing the amount and placement of state the runtime must hold.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": "2026-06-22: Expanded the canonical memory page to teach the different memory pressure phases explicitly, updated the page-local system-flow graph labels and layout to emphasize weight residency, prefill burst, decode reuse, and ongoing live pressure, and extended the focused page tests plus browser verification for those teaching points."
},
{
"id": "memory-system-page-004",
"title": "Explain bandwidth, fragmentation, and operator tradeoffs",
"description": "As a reader deciding why memory matters operationally, I want the page to explain both the performance limits and the operator decisions they force so I can understand why batching, spilling, and routing choices are tied to memory.",
"acceptanceCriteria": [
"The page clearly explains how memory bandwidth affects first-token latency, inter-token latency, or throughput by limiting how quickly the runtime can read weights and move active state.",
"The page clearly explains allocator pressure or fragmentation in plain language, including why enough total memory does not always mean enough usable contiguous or well-placed memory for the next serving step.",
"The page clearly explains the practical tradeoffs of batching, spilling cache, or splitting prefill and decode, including what each choice gains and what memory cost or latency cost it introduces.",
"The page explains where memory sits relative to routing and deployment decisions, including why those systems often exist to fit work into available memory shape rather than only to save compute.",
"The page exposes clear related navigation to `inference engine`, `batching`, `routing`, `deployment`, `on-disk-kv-cache`, `prefill`, `decode`, and `KV cache` when those pages exist on current `main`.",
"At least one discovery surface presents `memory` as a reachable destination so readers can find it without typing the exact slug.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 4,
"passes": true,
"notes": "2026-06-22: Expanded the page with dedicated bandwidth/fragmentation and operator-tradeoff sections, clarified first-token and inter-token latency consequences plus routing/deployment memory-fit behavior, extended focused runtime assertions, and re-verified the route in-browser on localhost."
},
{
"id": "memory-system-page-005",
"title": "Add focused validation for the memory page slice",
"description": "As a maintainer, I want targeted automated proof for the memory slice so route, registry, graph, messages, and discoverability regressions are caught without expanding into unrelated suite work.",
"acceptanceCriteria": [
"Validation or tests confirm the `/docs/systems/memory` route, the canonical memory system record, and the default English messages resolve together.",
"Focused validation confirms the page-local graph record and asset reference resolve correctly for the canonical route.",
"Coverage asserts at least one memory-specific discoverability outcome and at least one related-link, tag, alias, or search expectation for the page.",
"Focused checks stay limited to touched content and discovery integrity rather than inventory snapshots, broad serving-family audits, or meta-test scaffolding.",
"Typecheck passes",
"Tests pass"
],
"priority": 5,
"passes": true,
"notes": "2026-06-22: Revalidated the shipped memory page slice with focused registry/page tests plus lint and typecheck, confirmed the route, English messages, local graph asset wiring, and discoverability expectations stay aligned, and recorded the content-runtime prep race as a workflow gotcha rather than a page regression."
}
]
}
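
The stories above state that weights must stay resident, that KV cache grows with prompt and generation length, and that bandwidth limits how quickly the runtime can read weights and move active state. A minimal sketch of that arithmetic follows. It is not part of this PR: the `ModelShape` fields, the example dimensions, and the 1 TB/s bandwidth figure are all hypothetical, and the formulas assume a dense transformer that caches unquantized keys and values for every token, with no paging or sharing between requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    params: int           # total parameter count
    layers: int           # number of transformer layers
    kv_heads: int         # key/value heads per layer
    head_dim: int         # width of each head
    bytes_per_value: int  # 2 for fp16 or bf16


def weight_bytes(shape: ModelShape) -> int:
    """Memory that must stay resident before any request can run."""
    return shape.params * shape.bytes_per_value


def kv_cache_bytes(shape: ModelShape, tokens: int) -> int:
    """KV-cache size for one request holding `tokens` prompt plus generated tokens.

    Each token stores one key and one value vector (hence the factor 2)
    per KV head in every layer, so the cache grows linearly with tokens.
    """
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * tokens


def decode_step_floor_seconds(shape: ModelShape, cached_tokens: int,
                              bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on one decode step at batch size 1.

    Assumes every weight and every cached key/value is read once per
    generated token and ignores compute, so real steps are slower.
    """
    moved = weight_bytes(shape) + kv_cache_bytes(shape, cached_tokens)
    return moved / bandwidth_bytes_per_s


# Assumed example: a 7B-parameter model in 16-bit precision.
example = ModelShape(params=7_000_000_000, layers=32, kv_heads=32,
                     head_dim=128, bytes_per_value=2)

print("weights (bytes):", weight_bytes(example))
print("KV cache at 4096 tokens (bytes):", kv_cache_bytes(example, 4096))
print("decode floor at 1 TB/s (s):", decode_step_floor_seconds(example, 4096, 1e12))
```

With these assumed dimensions each cached token costs 512 KiB, so a single 4096-token request holds 2 GiB of cache on top of the resident weights. That is the sense in which memory stays a live serving concern after the model is loaded.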