ppo-training-regime-page#130
…s training regime]
… nearby alignment navigation]
|
BLOCKING review for Local verification I ran on the current head (
BLOCKING findings
Acceptance criteria
User-story acceptance criteria
Website standard checks
Docs writing standard checks
Review-rule summary
Please fix the citation runtime wiring, add a behavioral test for rendered PPO references, then rebase and resolve the current merge conflicts before re-requesting review. |
|
Addressed the blocking PR conversation items on head
Local validation on this merged head:
GitHub CI has been kicked off for this push and is still in progress. |
…e PPO page contract]
|
Follow-up mergeability fix on head
The PR now points at |
|
Mergeability follow-up on head
Waiting on the fresh CI rerun now. |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for head |
|
Mergeability follow-up on head
Waiting on the re-queued required checks now. |
|
Mergeability follow-up on new head
Waiting on the fresh required checks for |
# Conflicts:
#	src/lib/content/published-docs-registry-manifest.ts
#	src/lib/content/registry-runtime.generated.ts
|
Mergeability follow-up on head
GitHub CI for this new head started at |
|
Mergeability follow-up on new head .
Waiting on the fresh CI run for now. |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head .

1. I rechecked the PR conversation and the earlier blocking items are still explicitly addressed by the later follow-up replies on this PR, so there is no new unresolved review feedback.
2. The only remaining blocker on head was stale required CI: the Actions jobs API showed , , and still marked from , , and UTC respectively, which is well past the 15-minute no-progress threshold.
3. I canceled workflow run and used the smallest reliable retrigger for this repo: an empty mergeability-only commit with message .
4. No PPO page or registry code changed in that commit. I verified after the push that the PPO slice is still present in the PR diff.
5. GitHub has already attached a fresh required-check set to the new head on workflow run ; the required checks are now queued on . |
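The stale-check test described in item 2 can be sketched as below. This is only an illustration under stated assumptions: the `CiJob` shape, the `findStaleJobs` helper, and the status names are hypothetical and are not taken from this repo or from any specific API response; only the 15-minute no-progress threshold comes from the comment above.

```typescript
// Hypothetical, simplified view of one CI job (not a real API type).
interface CiJob {
  name: string;
  status: "queued" | "in_progress" | "completed";
  startedAt: string; // ISO 8601 timestamp in UTC
}

// The 15-minute no-progress threshold mentioned in the comment above.
const STALE_THRESHOLD_MS = 15 * 60 * 1000;

// Return every job that has not completed and started longer ago than the threshold.
function findStaleJobs(
  jobs: CiJob[],
  now: Date,
  thresholdMs: number = STALE_THRESHOLD_MS,
): CiJob[] {
  return jobs.filter((job) => {
    if (job.status === "completed") return false;
    const started = Date.parse(job.startedAt);
    // Skip jobs whose timestamp cannot be parsed instead of guessing.
    if (Number.isNaN(started)) return false;
    return now.getTime() - started > thresholdMs;
  });
}
```

If `findStaleJobs` returns any job that is a required check, canceling the run and pushing a new head (for example, an empty commit) is one way to get a fresh check set, which is the approach the comment describes.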
|
Mergeability follow-up on new head
|
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head .
Waiting on the fresh required checks for now. |
|
Mergeability follow-up on new head
Waiting on the fresh required checks for |
|
Mergeability follow-up on new head
|
|
Mergeability follow-up on new head .
Waiting on the fresh CI run for now. |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head
|
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head
Waiting on the fresh CI run for |
|
Mergeability follow-up on new head
|
{
"project": "Model Atlas — PPO Training Regime Canonical Page",
"branchName": "ppo-training-regime-page",
"description": "Publish one canonical English ppo training-regime page, backed by stable registry data and focused validation, so readers can understand Proximal Policy Optimization as the clipped policy-update method commonly used in RLHF-style post-training and navigate cleanly to nearby alignment methods.",
"context": {
"customerAsk": "Add the canonical docs page under src/content/docs/training/ppo/ with page.mdx, messages/en.json, and assets.json following the current training-regime template and writing standards. Add or update the matching structured registry data under src/content/registry/training-regimes/ so the page has a stable registryId and search metadata. Explain PPO in plain language as the clipped policy-update method often used inside RLHF pipelines, including why it is used and why it can be operationally heavy. Cross-link PPO to RLHF, DPO, GRPO, reward-model or alignment pages, and representative model or paper surfaces when appropriate. Keep the slice narrow and reviewable, touching only the files needed for this page and its structured data.",
"problem": "The site is missing a canonical PPO training-regime page even though readers encounter PPO repeatedly when tracing RLHF-style alignment workflows. Without one dedicated page, readers have to infer PPO from scattered references, and search or related-doc surfaces cannot route them to an authoritative explanation of what PPO is, why it was used, and why teams often describe it as operationally heavy.",
"solution": "Create a canonical ppo training-regime page with English message-driven content, a matching published registry record, and focused validation. Classify PPO as a training regime rather than a glossary or duplicate concept page, explain it in plain language as the clipped policy-update step often used in RLHF, and wire aliases, tags, and related links so nearby alignment and paper surfaces can lead readers into the page."
},
"acceptanceCriteria": [
"A published canonical docs page exists for ppo at the training-regime route with matching page.mdx, messages/en.json, and assets.json.",
"A stable published registry record exists for training-regime.ppo under src/content/registry/training-regimes/, and the page frontmatter resolves to that record through registryId.",
"The page is classified as training-regime, not as a duplicate concept or glossary page, and uses alignment-oriented search and related-doc metadata.",
"The page explains PPO in plain language as a clipped policy-update method used in RLHF-style post-training, including why it is used and why it is operationally heavy.",
"Readers can navigate from PPO to nearby alignment topics such as RLHF, DPO, GRPO, reward-model, alignment, and representative paper or model surfaces where those canonical targets already exist, without broken links.",
"Focused validation covers the touched page and registry contract plus at least one PPO-specific discovery behavior, without broad unrelated taxonomy churn.",
"Quality gate: make typecheck, make lint, and make test pass."
],
"userStories": [
{
"id": "ppo-training-regime-page-001",
"title": "Establish PPO as a first-class training regime",
"description": "As a reader searching for PPO, I want the site to treat it as a canonical training-regime record so I can find one authoritative explainer instead of inferring the method from nearby alignment pages.",
"acceptanceCriteria": [
"A published registry record exists with stable id training-regime.ppo, canonical slug ppo, and kind: training-regime.",
"Registry aliases cover representative query forms such as PPO, Proximal Policy Optimization, and PPO RLHF.",
"Registry classification uses training-regime metadata appropriate for alignment and post-training discovery and does not classify PPO as a glossary or broad concept page.",
"Registry relationships connect PPO to nearby alignment surfaces such as RLHF, DPO, GRPO, reward-model, alignment, and representative papers or model pages where those canonical targets exist and the relationship is concrete.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": ""
},
{
"id": "ppo-training-regime-page-002",
"title": "Publish the canonical PPO explainer page",
"description": "As a technical layperson learning alignment workflows, I want one dedicated PPO page so I can understand what PPO is, why RLHF pipelines used it, and why teams often describe it as expensive or operationally heavy.",
"acceptanceCriteria": [
"A canonical training-regime page exists at /docs/training/ppo with matching frontmatter, messages/en.json, and local assets.json.",
"The page opens with one concise openingSummary and explains Proximal Policy Optimization in isolation before narrowing into RLHF usage.",
"The page explains PPO as a clipped policy-update method used to keep reinforcement-learning updates from changing the model policy too abruptly.",
"The page explains why PPO was used in RLHF-style loops and why it is operationally heavy, including the need for rollouts, reward-model scoring, and repeated policy updates.",
"The page follows training-regime writing standards, uses plain language, expands the full name before the acronym in narrative copy, and avoids page-meta or workflow-internal prose.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": ""
},
{
"id": "ppo-training-regime-page-003",
"title": "Make PPO discoverable through nearby alignment navigation",
"description": "As a reader exploring alignment methods, I want related docs, tags, and search metadata to route me into PPO and back out to nearby methods so I can compare RLHF-era and post-RLHF alternatives without dead ends.",
"acceptanceCriteria": [
"The PPO page renders tag and related-doc surfaces that connect it to nearby alignment pages such as RLHF, DPO, GRPO, reward-model, alignment, and representative papers or models where those canonical targets exist.",
"Representative PPO search forms such as ppo, proximal policy optimization, and rlhf ppo return the canonical PPO page as a direct relevant result when the query intent is to find PPO.",
"At least one shipped neighboring discovery surface or registry-backed related-doc path can lead a reader into PPO without typing the exact slug.",
"Browser-visible rendering shows title, summary, tags, related links, and references without missing-content placeholders or broken links.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": ""
},
{
"id": "ppo-training-regime-page-004",
"title": "Add focused validation for the PPO page contract",
"description": "As a maintainer, I want targeted automated proof for the PPO page slice so route, registry, messages, and nearby discovery behavior regressions are caught without unrelated test expansion.",
"acceptanceCriteria": [
"Automated coverage confirms the canonical PPO page route, frontmatter, registry record, and default English messages resolve together.",
"Coverage asserts at least one PPO-specific discovery behavior, such as related-doc derivation or search visibility for a representative PPO query.",
"Validation stays focused on observable behavior for the touched page and structured data rather than inventory snapshots, topology audits, or meta-test scaffolding.",
"Typecheck passes",
"Tests pass"
],
"priority": 4,
"passes": true,
"notes": ""
}
]
}
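To make the registry and search criteria above more concrete, here is a minimal sketch of a record with the stable id, slug, kind, and aliases named in the user stories, plus a simple alias matcher. The `TrainingRegimeRecord` type, the `normalizeQuery` and `matchesAlias` helpers, and the order-insensitive token matching are assumptions made for illustration; the repo's real registry schema and search implementation may differ.

```typescript
// Hypothetical record shape; only the field values come from the acceptance criteria.
interface TrainingRegimeRecord {
  id: string;
  slug: string;
  kind: "training-regime";
  aliases: string[];
}

const ppoRecord: TrainingRegimeRecord = {
  id: "training-regime.ppo",
  slug: "ppo",
  kind: "training-regime",
  aliases: ["PPO", "Proximal Policy Optimization", "PPO RLHF"],
};

// Lowercase, split on whitespace, and sort tokens so word order does not matter.
function normalizeQuery(text: string): string {
  return text.toLowerCase().split(/\s+/).filter(Boolean).sort().join(" ");
}

// True when the query equals the slug or any alias after normalization.
function matchesAlias(record: TrainingRegimeRecord, query: string): boolean {
  const wanted = normalizeQuery(query);
  if (wanted === "") return false;
  return [record.slug, ...record.aliases].some(
    (candidate) => normalizeQuery(candidate) === wanted,
  );
}
```

Sorting the tokens is a deliberate simplification so that the representative query `rlhf ppo` matches the alias `PPO RLHF`; a real search index would likely rank results instead of requiring an exact token-set match.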