[rollout, trainer, cfg] feat: privileged-context teacher scoring for OPSD#6833
[rollout, trainer, cfg] feat: privileged-context teacher scoring for OPSD#6833HaozheZhang6 wants to merge 5 commits into
Conversation
…OPSD (verl-project#6827) On-Policy Self-Distillation (OPSD) has the teacher condition on the ground-truth solution (privileged context) while the student sees only the problem; the teacher scores the students own on-policy rollout. This adds the half plain OPD lacks. - privileged_context.py: two pure helpers -- build_privileged_sequence (splice the GT solution into the teacher input) and slice_privileged_teacher_to_student (realign the teachers response scores onto the students positions). CPU-tested. - DistillationConfig: self_distillation / privileged_solution_key / privileged_prefix / privileged_suffix. - agent_loop._compute_teacher_logprobs: when self_distillation is on, build the privileged teacher sequence and slice the scores back; otherwise unchanged. Reuses the existing external-teacher path with teacher.model_path == student path (a frozen self-teacher), so the OPD data/loss pipeline is unchanged. CPU tests cover the splice/slice-back shapes and the config fields. The end-to-end GPU run (teacher = student checkpoint, convergence) is the follow-up. Part of verl-project#6827.
There was a problem hiding this comment.
Code Review
This pull request introduces On-Policy Self-Distillation (OPSD) support, allowing the teacher model to condition on a privileged ground-truth solution while sharing weights with the student. It adds helper functions to construct the privileged sequence and realign the teacher's scores, updates the distillation configuration, integrates these changes into the agent loop, and adds corresponding CPU tests. The review feedback highlights two key improvements: handling potential null values for the pad token ID to prevent runtime crashes, and raising an explicit error if self-distillation is enabled but the privileged solution is missing to avoid silent failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student( | ||
| teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), self.tokenizer.pad_token_id | ||
| ) |
There was a problem hiding this comment.
If self.tokenizer.pad_token_id is None (which is common for several tokenizers like LLaMA unless explicitly configured), passing it directly to slice_privileged_teacher_to_student will cause a TypeError inside torch.full. We should fall back to eos_token_id or 0 to prevent runtime crashes.
| teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student( | |
| teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), self.tokenizer.pad_token_id | |
| ) | |
| pad_token_id = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else (self.tokenizer.eos_token_id if self.tokenizer.eos_token_id is not None else 0) | |
| teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student( | |
| teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), pad_token_id | |
| ) |
| if sample_kwargs is not None: | ||
| routing_value = sample_kwargs.get(self.teacher_key) | ||
| if routing_value is not None: | ||
| # Non-tensor batch values arrive as 0-d numpy objects / arrays; normalize to Python. | ||
| routing_key = routing_value.item() if hasattr(routing_value, "item") else routing_value | ||
| if self.self_distillation: | ||
| solution = sample_kwargs.get(self.privileged_solution_key) | ||
| if solution is not None: | ||
| solution = solution.item() if hasattr(solution, "item") else solution | ||
| solution_ids = self.tokenizer.encode(str(solution), add_special_tokens=False) |
There was a problem hiding this comment.
When self_distillation is enabled, the teacher model expects the privileged context (ground-truth solution). If the privileged solution key is missing from sample_kwargs or is None, the code silently falls back to standard distillation without privileged context. This can lead to silent failures where the user assumes OPSD is running but it is actually not. We should raise a clear ValueError to prevent this.
if self.self_distillation and (sample_kwargs is None or sample_kwargs.get(self.privileged_solution_key) is None):
raise ValueError(
f"self_distillation is enabled, but privileged solution key '{self.privileged_solution_key}' "
f"was not found in sample_kwargs or is None."
)
if sample_kwargs is not None:
routing_value = sample_kwargs.get(self.teacher_key)
if routing_value is not None:
# Non-tensor batch values arrive as 0-d numpy objects / arrays; normalize to Python.
routing_key = routing_value.item() if hasattr(routing_value, "item") else routing_value
if self.self_distillation:
solution = sample_kwargs.get(self.privileged_solution_key)
solution = solution.item() if hasattr(solution, "item") else solution
solution_ids = self.tokenizer.encode(str(solution), add_special_tokens=False)Add self_distillation / privileged_solution_key / privileged_prefix / privileged_suffix to the user-facing distillation.yaml and regenerate _generated_ppo_*.yaml so the autogen-trainer-cfg pre-commit hook stays green.
|
The def build_privileged_sequence(
prompt_ids: list[int],
response_ids: list[int],
solution_ids: list[int],
prefix_ids: list[int],
suffix_ids: list[int],
insert_before_token_ids: list[int] = None,
) -> list[int]:
"""Build the OPSD teacher's input token sequence.
Layout: ``prompt_prefix + prefix + solution + suffix + prompt_suffix + response``.
By default, ``insert_before_token_ids`` is None, so the solution is appended to the
prompt: ``prompt + prefix + solution + suffix + response``.
If ``insert_before_token_ids`` is provided, we search for the last occurrence of
this sequence in ``prompt_ids`` and insert the solution before it.
Args:
prompt_ids: the student's prompt (problem) token ids.
response_ids: the student's on-policy response token ids.
solution_ids: the ground-truth solution token ids.
prefix_ids: marker tokens placed before the solution.
suffix_ids: marker tokens placed after the solution.
insert_before_token_ids: if provided, the solution is inserted before the
last occurrence of this sequence in ``prompt_ids``.
Returns:
The concatenated teacher input token ids.
"""
if insert_before_token_ids:
# Search for the last occurrence of the sub-sequence
n, m = len(prompt_ids), len(insert_before_token_ids)
split_idx = -1
for i in range(n - m, -1, -1):
if prompt_ids[i : i + m] == insert_before_token_ids:
split_idx = i
break
if split_idx != -1:
return (
prompt_ids[:split_idx]
+ prefix_ids
+ solution_ids
+ suffix_ids
+ prompt_ids[split_idx:]
+ response_ids
)
return prompt_ids + prefix_ids + solution_ids + suffix_ids + response_ids |
…en privileged-context (verl-project#6827) Review fixes for verl-project#6833: - privileged_solution_key resolves nested keys (default reward_model.ground_truth) and raises when self_distillation is on but no solution resolves. verl stores ground truth nested under reward_model, so the old flat default silently matched nothing and OPSD degraded to plain OPD. - pad_token_id=None falls back to 0 in the slice helper. - response_length==0 slices from an explicit start (the negative-zero slice returned the whole teacher tensor). - teacher input length is checked against max_model_len before scoring; long solutions could silently overflow the teacher engine. - yaml marker defaults use single-line escaped scalars so the runtime value matches the dataclass. - config validates self_distillation requires enabled and a non-empty key; ground-truth lists/arrays normalize to a string. - build_privileged_sequence gains insert_before_token_ids to place the solution before the assistant turn instead of appending, per reviewer suggestion. CPU tests cover the nested resolver, the 0-length slice, the None pad, and insert-before.
|
Good point -- appending after the full chat-templated prompt isn't ideal. I added an optional Pushed a few related fixes while I was in there -- the default |
There was a problem hiding this comment.
Originally, the filtering was based on the length of the Student's prompt, which is why this error has a chance of occurring. Would it be better to define a separate filtering parameter in rl_dataset?
There was a problem hiding this comment.
Good point -- the overflow is because the privileged sequence is longer than the student prompt the dataset filtered on. I kept the teacher-side raise as a loud safety net for now, but a dataset-level filter (capping prompt+solution upfront, like filter_overlong_prompts) is cleaner and I think fits better as a follow-up PR. Open to either.
…+ lock alignment with an e2e test (verl-project#6827) Follow-up review fixes on verl-project#6833: - The default privileged_suffix framed the task as evaluating the response, but the tokens that follow are the student response, not an evaluation. Reframe it to ask the teacher to derive the answer using the reference (think step by step), so the per-token target stays on-distribution and matches the reference OPSD transition. - Document the signal-quality knobs: ground_truth is often just the final answer (use a full-solution field like extra_info.answer for a stronger signal), and OPSD is a supervised loss (use_policy_gradient=False, use_task_rewards=False). - Add an e2e alignment test: the privileged slice plus the _pad_teacher_outputs left/right pad land response token j at row prompt_width+j, identical to the non-privileged path.
| if solution_ids is not None: | ||
| # OPSD: the teacher conditions on the privileged ground-truth solution. |
There was a problem hiding this comment.
Can these lines be deleted?
There was a problem hiding this comment.
Yes, done -- folded the build into the self.self_distillation branch and gated the slice-back on the same flag, so the solution_ids is not None checks are gone.
…lution_ids sentinel (verl-project#6827) Per review: fold the privileged-sequence build into the self_distillation branch and gate the teacher slice-back on the same flag, dropping the solution_ids=None sentinel and the two is-not-None checks. Equivalent control flow -- self_distillation is true exactly when a solution was resolved, since a missing solution raises.
JacobHelwig
left a comment
There was a problem hiding this comment.
Nice idea to add this. Could you please show metrics from a sample training run without task rewards, along with the command used to run? The key metrics of interest would be the distillation loss, the training rewards, and the validation rewards.
|
Thanks! Will run a supervised OPSD training (no task rewards) on a small model and follow up here with the distillation loss and the training/validation rewards, plus the exact command. |
What does this PR do?
The first piece of On-Policy Self-Distillation (OPSD, #6827): the teacher conditions on the ground-truth solution (privileged context) while the student sees only the problem, and scores the student's own on-policy rollout. This is the half plain OPD lacks -- verl's OPD teacher sees only
prompt + response.verl/trainer/distillation/privileged_context.py-- two pure helpers:build_privileged_sequence(splice the GT solution, wrapped in marker tokens, into the teacher input) andslice_privileged_teacher_to_student(realign the teacher's per-token response scores back onto the student'sprompt + responsepositions). CPU-tested.DistillationConfig--self_distillation,privileged_solution_key,privileged_prefix,privileged_suffix.agent_loop._compute_teacher_logprobs-- whenself_distillationis on, build the privileged teacher sequence and slice the scores back; otherwise the path is unchanged (prompt + response).Route
Reuses the existing external-teacher path with
teacher.model_path == student model_path, i.e. a frozen self-teacher (the student's init checkpoint) scoring the rollout with privileged context. The whole OPD data + loss pipeline is reused unchanged -- only the teacher's input sequence and the realignment are new. The dynamic / EMA self-teacher (a second in-engine forward) is a separate, GPU-gated follow-up.Test
pytest tests/workers/test_opsd_privileged_context_on_cpu.py tests/workers/test_distillation_config_opsd_on_cpu.py-- splice layout + slice-back shapes/dtype + the config fields. CPU-only. The end-to-end GPU run (teacher = student checkpoint, top-k overlap rises, convergence) is the remaining validation, hence draft.Part of #6827.