rlhf-training-regime-page#131
|
Validated the RLHF page slice end to end on the current PR head. Focused coverage in |
|
BLOCKING: this change is functionally in good shape, but the PR is currently not mergeable.
Quality checks and runtime verification:
Project acceptance criteria:
Behavioral assertion check for stories marked
General website standards:
Docs writing standards unified checklist:
No independent correctness defects surfaced in the content or test coverage. The remaining blocker is operational: rebase, resolve the merge conflicts, and push the updated branch so the PR becomes mergeable. |
85193ea to 39d880b
|
Addressed the blocking mergeability feedback on the current PR head. What changed:
Validation run locally on the rebased head:
PR diff verification:
CI note:
|
|
Follow-up on mergeability work for head |
|
Follow-up on mergeability work for head |
|
Follow-up on mergeability work for head
At this point the only remaining blocker is the fresh rerun of required GitHub CI on the same reviewed head. |
…l training-regime destination]
…raining-regime page]
…deoffs, and nearby methods with one primary flow]
…he RLHF page contract and discovery path]
…he RLHF page contract and discovery path]
39d880b to bbac52e
|
Addressed the renewed mergeability regression on the latest PR head. What changed:
Validation run locally on the rebased head:
PR diff verification:
CI note:
|
|
Follow-up on mergeability work for head
At this point the only remaining blocker is the fresh rerun of required GitHub CI on the same reviewed head. |
|
Follow-up on the CI rerun state for head
At this point the only remaining blocker is the fresh required CI rerun on the same reviewed head. |
|
Follow-up on mergeability work for head |
…he RLHF page contract and discovery path]
|
Addressed the remaining mergeability blockers on the latest PR head. What changed:
Validation run locally on this head:
PR diff verification:
CI note:
|
…he RLHF page contract and discovery path]
|
Addressed the remaining mergeability blocker on the latest PR head. What changed:
Why this addressed the blocker:
Validation run locally on this head:
PR diff verification:
CI note:
|
…he RLHF page contract and discovery path]
|
Addressed the current mergeability blocker on the latest PR head. What changed:
Why this addressed the blocker:
Validation run locally on this head:
PR diff verification:
CI note:
|
|
Follow-up on mergeability work for head
At the time of this comment ( |
|
Follow-up on mergeability work for head
At the time of this comment ( |
{
"project": "Model Atlas — RLHF Training-Regime Page",
"branchName": "rlhf-training-regime-page",
"description": "Publish one canonical English RLHF training-regime page, backed by stable registry data and localized messages, so readers can understand the post-training workflow, tradeoffs, and nearby alignment methods from one dedicated destination instead of relying on glossary or broad alignment references alone.",
"context": {
"customerAsk": "Create the training-regime page for RLHF so readers can understand the objective, workflow, tradeoffs, and links to nearby alignment methods without relying on the glossary alone. Add the canonical docs page under `src/content/docs/training/rlhf/` with `page.mdx`, `messages/en.json`, and `assets.json` following the current training-regime template and writing standards. Add or update the matching structured registry data under `src/content/registry/training-regimes/` so the page has a stable `registryId` and search metadata. Explain what RLHF is in layperson-friendly terms, where it sits after pretraining, why teams use it, and what tradeoffs it introduces. Link RLHF clearly to adjacent pages such as alignment, PPO, DPO, GRPO, models, papers, and serving or safety surfaces where relevant. Include the single primary graph or flow required by graphing standards if the topic needs it, and keep any math definitions symbol-only and minimal. Acceptance criteria: a reader searching RLHF can land on one canonical training-regime page rather than only glossary references, the page and registry validate cleanly with the repo's existing content expectations, and the implementation stays page-local and avoids reopening unrelated shell or locale infrastructure.",
"problem": "The repository already has an alignment concept page, but it does not yet offer a canonical RLHF training-regime page that explains the full post-training workflow in one place. That leaves a reader gap between broad alignment language and specific optimization-method names. A reader searching `RLHF` cannot reliably land on one page that explains the sequence from pretrained base model to human preference data to optimization loop, why teams use that workflow, and where its tradeoffs show up in helpfulness, safety, cost, and stability.",
"solution": "Create a canonical `rlhf` training-regime page using the standard training-regime structure, English-only localized messages, a page-local flow asset, and a stable training-regime registry record. Use the page to explain RLHF in isolation first, then connect it to alignment, PPO, DPO, GRPO, representative papers, and any relevant model or safety surfaces through focused registry relationships and adjacent links. Add only the narrow validation needed to prove route, registry, messages, and reader discovery behavior for this new canonical page."
},
"acceptanceCriteria": [
"A published canonical docs page exists for `rlhf` under the training docs tree, binds to a stable `training-regime.rlhf` registry record, and renders in the standard docs shell.",
"The page uses colocated `messages/en.json` and local `assets.json`, with reader-facing copy resolved through message keys rather than hard-coded prose in `page.mdx`.",
"The opening summary and primary sections explain, in plain language, what Reinforcement Learning from Human Feedback is, where it fits after pretraining, why teams use it, and what tradeoffs it introduces.",
"The page includes one primary RLHF workflow graph or flow that teaches the sequence of preference collection and policy optimization, with graph metadata that follows the existing training-regime and graphing standards.",
"Readers can move from the RLHF page to adjacent alignment pages such as `alignment`, `PPO`, `DPO`, `GRPO`, and relevant model, paper, serving, or safety surfaces where those pages already exist and are useful.",
"Search and registry metadata make `RLHF` and representative alias queries resolve to this canonical training-regime page instead of leaving readers on glossary-only paths.",
"Quality gate: typecheck, lint, and targeted tests pass."
],
"userStories": [
{
"id": "rlhf-training-regime-page-001",
"title": "Establish RLHF as a canonical training-regime destination",
"description": "As a reader searching for RLHF, I want one canonical training-regime destination so I can find a full explainer instead of only broad alignment or glossary references.",
"acceptanceCriteria": [
"A published training-regime registry record exists for `rlhf` with stable id, canonical slug, aliases covering representative queries such as `RLHF` and `reinforcement learning from human feedback`, and tags aligned to the training-alignment bundle.",
"Registry relationships connect RLHF to the alignment concept and any already-shipped adjacent methods, papers, models, or safety-related pages that genuinely improve reader navigation without duplicative noise.",
"Discovery metadata is scoped so `RLHF` resolves to the canonical training-regime surface rather than remaining only an alias on another page.",
"Typecheck passes",
"Tests pass"
],
"priority": 1,
"passes": true,
"notes": ""
},
{
"id": "rlhf-training-regime-page-002",
"title": "Publish the canonical RLHF training-regime page",
"description": "As a technical layperson learning alignment methods, I want a dedicated RLHF page so I can understand the workflow, why it happens after pretraining, and what problem it is trying to solve.",
"acceptanceCriteria": [
"A canonical training-regime page exists at `/docs/training/rlhf` with matching frontmatter, `messages/en.json`, and local `assets.json`.",
"The page opens with one folded `openingSummary` and explains in plain language that RLHF is a post-training workflow that uses human preference signals to steer a pretrained model toward preferred behavior.",
"The page explains the main RLHF stages in order, including a pretrained starting point, preference or ranking data collection, a learned or inferred preference signal, and a policy-updating step.",
"The page is understandable in isolation before linking outward to adjacent optimization methods or alignment concepts.",
"Typecheck passes",
"Verify in browser using the Browser plugin"
],
"priority": 2,
"passes": true,
"notes": ""
},
{
"id": "rlhf-training-regime-page-003",
"title": "Teach the RLHF workflow, tradeoffs, and nearby methods with one primary flow",
"description": "As a reader comparing alignment methods, I want the RLHF page to show the workflow and tradeoffs clearly so I can understand how it differs from nearby approaches such as PPO, DPO, and GRPO.",
"acceptanceCriteria": [
"The page includes one primary workflow graph or flow in the `How It Works` section that makes the RLHF loop obvious without decorative extra visuals.",
"Narrative copy explains why teams use RLHF, including behavior shaping, instruction following, or safety-policy alignment, in language that a technical layperson can follow.",
"The page describes practical tradeoffs such as human-data cost, reward or preference misspecification, optimization instability, slower iteration, or narrowed behavior.",
"The page compares RLHF to nearby regimes such as PPO, DPO, and GRPO in concise reader-facing language, without turning into a benchmark leaderboard or paper timeline.",
"Any math included is minimal and uses symbol-only definitions rather than concept-heavy derivations.",
"Typecheck passes",
"Tests pass",
"Verify in browser using the Browser plugin"
],
"priority": 3,
"passes": true,
"notes": ""
},
{
"id": "rlhf-training-regime-page-004",
"title": "Add focused validation for the RLHF page contract and discovery path",
"description": "As a maintainer, I want targeted automated proof for the RLHF page slice so route, registry, message, and adjacent discovery regressions are caught without unrelated infrastructure churn.",
"acceptanceCriteria": [
"Validation or tests confirm the RLHF docs route, training-regime registry record, and default English messages resolve together.",
"Validation or tests cover at least one RLHF-specific discovery expectation, such as alias resolution, related-doc presence, or search indexing behavior for `RLHF`.",
"Coverage stays focused on observable behavior for this page slice and does not require unrelated locale, shell, or route-inventory changes.",
"Typecheck passes",
"Tests pass"
],
"priority": 4,
"passes": true,
"notes": ""
}
]
}
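
The PRD above tracks each user story with a `priority` and a `passes` flag, so a reviewer may want a quick way to see which stories still block the branch. The following is a minimal sketch of that check, not tooling from this repository: the helper name `pending_stories` and the inline `PRD_JSON` sample are assumptions made for illustration, and the sample keeps only the `id`, `priority`, and `passes` fields that appear in the PRD.

```python
import json

# Trimmed, illustrative copy of the PRD above: only the fields this sketch reads.
# Stories are listed out of order on purpose to show the priority sort.
PRD_JSON = """
{
  "branchName": "rlhf-training-regime-page",
  "userStories": [
    {"id": "rlhf-training-regime-page-003", "priority": 3, "passes": true},
    {"id": "rlhf-training-regime-page-001", "priority": 1, "passes": true},
    {"id": "rlhf-training-regime-page-004", "priority": 4, "passes": true},
    {"id": "rlhf-training-regime-page-002", "priority": 2, "passes": true}
  ]
}
"""


def pending_stories(prd):
    """Return ids of stories whose "passes" flag is false, lowest priority number first.

    Hypothetical helper: it assumes every story carries "id", "priority",
    and "passes", as the stories in the PRD above do.
    """
    ordered = sorted(prd["userStories"], key=lambda story: story["priority"])
    return [story["id"] for story in ordered if not story["passes"]]


prd = json.loads(PRD_JSON)
# Prints [] because every story in the PRD above has "passes": true.
print(pending_stories(prd))
```

An empty list is consistent with the review comments above, which report no remaining content or test defects and leave only mergeability and the CI rerun as blockers.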