[rollout, trainer, cfg] feat: privileged-context teacher scoring for OPSD by HaozheZhang6 · Pull Request #6833 · verl-project/verl

HaozheZhang6 · 2026-06-24T04:29:25Z

What does this PR do?

The first piece of On-Policy Self-Distillation (OPSD, #6827): the teacher conditions on the ground-truth solution (privileged context) while the student sees only the problem, and scores the student's own on-policy rollout. This is the half plain OPD lacks -- verl's OPD teacher sees only prompt + response.

verl/trainer/distillation/privileged_context.py -- two pure helpers: build_privileged_sequence (splice the GT solution, wrapped in marker tokens, into the teacher input) and slice_privileged_teacher_to_student (realign the teacher's per-token response scores back onto the student's prompt + response positions). CPU-tested.
DistillationConfig -- self_distillation, privileged_solution_key, privileged_prefix, privileged_suffix.
agent_loop._compute_teacher_logprobs -- when self_distillation is on, build the privileged teacher sequence and slice the scores back; otherwise the path is unchanged (prompt + response).
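
The realignment step can be pictured with plain lists. The sketch below is not the PR's slice_privileged_teacher_to_student; it assumes the teacher returns one log-prob per input token, that the teacher input is laid out as prompt + privileged span + response, and that prompt positions are filled with a pad id and a zero score. The function name and the fill values are hypothetical.

```python
def slice_back_sketch(teacher_ids, teacher_logprobs, prompt_len, response_len, pad_id=0):
    """Map teacher scores for prompt + privileged span + response onto prompt + response."""
    if len(teacher_ids) != len(teacher_logprobs):
        raise ValueError("teacher ids and log-probs must have the same length")
    if len(teacher_ids) < prompt_len + response_len:
        raise ValueError("teacher sequence is shorter than prompt + response")
    # Compute the start index explicitly so response_len == 0 yields an empty tail.
    start = len(teacher_ids) - response_len
    # Assumption: prompt positions carry no teacher signal, so they get filler values.
    ids = [pad_id] * prompt_len + list(teacher_ids[start:])
    logprobs = [0.0] * prompt_len + list(teacher_logprobs[start:])
    return ids, logprobs


# Toy layout: 2 prompt tokens, 3 privileged tokens, 2 response tokens.
teacher_ids = [11, 12, 90, 91, 92, 21, 22]
teacher_logprobs = [-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7]
ids, logprobs = slice_back_sketch(teacher_ids, teacher_logprobs, prompt_len=2, response_len=2)
print(ids)       # [0, 0, 21, 22]
print(logprobs)  # [0.0, 0.0, -0.6, -0.7]
```

The output has exactly prompt_len + response_len positions, which is what lets the rest of the OPD pipeline stay unchanged.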

Route

Reuses the existing external-teacher path with teacher.model_path == student model_path, i.e. a frozen self-teacher (the student's init checkpoint) scoring the rollout with privileged context. The whole OPD data + loss pipeline is reused unchanged -- only the teacher's input sequence and the realignment are new. The dynamic / EMA self-teacher (a second in-engine forward) is a separate, GPU-gated follow-up.

Test

pytest tests/workers/test_opsd_privileged_context_on_cpu.py tests/workers/test_distillation_config_opsd_on_cpu.py -- splice layout + slice-back shapes/dtype + the config fields. CPU-only. The end-to-end GPU run (teacher = student checkpoint, top-k overlap rises, convergence) is the remaining validation, hence draft.

Part of #6827.

…OPSD (verl-project#6827)

On-Policy Self-Distillation (OPSD) has the teacher condition on the ground-truth solution (privileged context) while the student sees only the problem; the teacher scores the student's own on-policy rollout. This adds the half plain OPD lacks.

- privileged_context.py: two pure helpers -- build_privileged_sequence (splice the GT solution into the teacher input) and slice_privileged_teacher_to_student (realign the teacher's response scores onto the student's positions). CPU-tested.
- DistillationConfig: self_distillation / privileged_solution_key / privileged_prefix / privileged_suffix.
- agent_loop._compute_teacher_logprobs: when self_distillation is on, build the privileged teacher sequence and slice the scores back; otherwise unchanged.

Reuses the existing external-teacher path with teacher.model_path == student path (a frozen self-teacher), so the OPD data/loss pipeline is unchanged. CPU tests cover the splice/slice-back shapes and the config fields. The end-to-end GPU run (teacher = student checkpoint, convergence) is the follow-up.

Part of verl-project#6827.

gemini-code-assist

Code Review

This pull request introduces On-Policy Self-Distillation (OPSD) support, allowing the teacher model to condition on a privileged ground-truth solution while sharing weights with the student. It adds helper functions to construct the privileged sequence and realign the teacher's scores, updates the distillation configuration, integrates these changes into the agent loop, and adds corresponding CPU tests. The review feedback highlights two key improvements: handling potential null values for the pad token ID to prevent runtime crashes, and raising an explicit error if self-distillation is enabled but the privileged solution is missing to avoid silent failures.

gemini-code-assist · 2026-06-24T04:31:01Z

+                teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student(
+                    teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), self.tokenizer.pad_token_id
+                )


If self.tokenizer.pad_token_id is None (which is common for several tokenizers like LLaMA unless explicitly configured), passing it directly to slice_privileged_teacher_to_student will cause a TypeError inside torch.full. We should fall back to eos_token_id or 0 to prevent runtime crashes.

Suggested change

-                teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student(
-                    teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), self.tokenizer.pad_token_id
-                )
+                pad_token_id = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else (self.tokenizer.eos_token_id if self.tokenizer.eos_token_id is not None else 0)
+                teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student(
+                    teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), pad_token_id
+                )
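
One detail about the fallback above: a shorter `pad_token_id or eos_token_id or 0` chain would behave differently, because a pad id of 0 is falsy in Python and would be skipped. The sketch below uses a stand-in tokenizer object (SimpleNamespace, not a real tokenizer class) and a hypothetical helper name to show the difference.

```python
from types import SimpleNamespace


def resolve_pad_token_id(tokenizer):
    """Return pad_token_id, else eos_token_id, else 0, using explicit None checks."""
    if tokenizer.pad_token_id is not None:
        return tokenizer.pad_token_id
    if tokenizer.eos_token_id is not None:
        return tokenizer.eos_token_id
    return 0


# Stand-in tokenizers for illustration only.
no_pad = SimpleNamespace(pad_token_id=None, eos_token_id=2)
zero_pad = SimpleNamespace(pad_token_id=0, eos_token_id=2)

print(resolve_pad_token_id(no_pad))    # 2
print(resolve_pad_token_id(zero_pad))  # 0
# The `or` chain skips a legitimate pad id of 0 and picks eos instead.
print(zero_pad.pad_token_id or zero_pad.eos_token_id or 0)  # 2
```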

gemini-code-assist · 2026-06-24T04:31:02Z

            if sample_kwargs is not None:
                routing_value = sample_kwargs.get(self.teacher_key)
                if routing_value is not None:
                    # Non-tensor batch values arrive as 0-d numpy objects / arrays; normalize to Python.
                    routing_key = routing_value.item() if hasattr(routing_value, "item") else routing_value
+                if self.self_distillation:
+                    solution = sample_kwargs.get(self.privileged_solution_key)
+                    if solution is not None:
+                        solution = solution.item() if hasattr(solution, "item") else solution
+                        solution_ids = self.tokenizer.encode(str(solution), add_special_tokens=False)


When self_distillation is enabled, the teacher model expects the privileged context (ground-truth solution). If the privileged solution key is missing from sample_kwargs or is None, the code silently falls back to standard distillation without privileged context. This can lead to silent failures where the user assumes OPSD is running but it is actually not. We should raise a clear ValueError to prevent this.

            if self.self_distillation and (sample_kwargs is None or sample_kwargs.get(self.privileged_solution_key) is None):
                raise ValueError(
                    f"self_distillation is enabled, but privileged solution key '{self.privileged_solution_key}' "
                    f"was not found in sample_kwargs or is None."
                )
            if sample_kwargs is not None:
                routing_value = sample_kwargs.get(self.teacher_key)
                if routing_value is not None:
                    # Non-tensor batch values arrive as 0-d numpy objects / arrays; normalize to Python.
                    routing_key = routing_value.item() if hasattr(routing_value, "item") else routing_value
                if self.self_distillation:
                    solution = sample_kwargs.get(self.privileged_solution_key)
                    solution = solution.item() if hasattr(solution, "item") else solution
                    solution_ids = self.tokenizer.encode(str(solution), add_special_tokens=False)

Add self_distillation / privileged_solution_key / privileged_prefix / privileged_suffix to the user-facing distillation.yaml and regenerate _generated_ppo_*.yaml so the autogen-trainer-cfg pre-commit hook stays green.

ytchx1999 · 2026-06-25T07:00:16Z

The prompt_ids here should be the result after apply_chat_template, so should they include the EOS token? When modifying the code, I added an insertion token (to insert the ground truth at the last occurrence of this token). This seems to enhance scalability. I'm not sure if my understanding is correct, and I welcome any clarification.

def build_privileged_sequence(
    prompt_ids: list[int],
    response_ids: list[int],
    solution_ids: list[int],
    prefix_ids: list[int],
    suffix_ids: list[int],
    insert_before_token_ids: list[int] | None = None,
) -> list[int]:
    """Build the OPSD teacher's input token sequence.

    Layout: ``prompt_prefix + prefix + solution + suffix + prompt_suffix + response``.
    By default, ``insert_before_token_ids`` is None, so the solution is appended to the
    prompt: ``prompt + prefix + solution + suffix + response``.
    If ``insert_before_token_ids`` is provided, we search for the last occurrence of
    this sequence in ``prompt_ids`` and insert the solution before it.

    Args:
        prompt_ids: the student's prompt (problem) token ids.
        response_ids: the student's on-policy response token ids.
        solution_ids: the ground-truth solution token ids.
        prefix_ids: marker tokens placed before the solution.
        suffix_ids: marker tokens placed after the solution.
        insert_before_token_ids: if provided, the solution is inserted before the
            last occurrence of this sequence in ``prompt_ids``.

    Returns:
        The concatenated teacher input token ids.
    """
    if insert_before_token_ids:
        # Search for the last occurrence of the sub-sequence
        n, m = len(prompt_ids), len(insert_before_token_ids)
        split_idx = -1
        for i in range(n - m, -1, -1):
            if prompt_ids[i : i + m] == insert_before_token_ids:
                split_idx = i
                break

        if split_idx != -1:
            return (
                prompt_ids[:split_idx]
                + prefix_ids
                + solution_ids
                + suffix_ids
                + prompt_ids[split_idx:]
                + response_ids
            )

    return prompt_ids + prefix_ids + solution_ids + suffix_ids + response_ids
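
To make the two layouts concrete, here is a toy run with small integer ids. The helper is a compact restatement of the function above written only for illustration (the name splice_solution and all token ids are made up); it is a sketch of the described behavior, not the PR's implementation.

```python
def splice_solution(prompt, response, solution, prefix, suffix, insert_before=None):
    """Insert prefix + solution + suffix before the last `insert_before` match, else append."""
    block = prefix + solution + suffix
    if insert_before:
        m = len(insert_before)
        # Scan right to left so the last occurrence wins.
        for i in range(len(prompt) - m, -1, -1):
            if prompt[i : i + m] == insert_before:
                return prompt[:i] + block + prompt[i:] + response
    # No marker given, or marker not found: plain append after the prompt.
    return prompt + block + response


prompt = [1, 5, 6, 7, 8]  # 7, 8 stand in for an assistant-turn header
response = [30, 31]
solution = [20]
prefix, suffix = [100], [101]

print(splice_solution(prompt, response, solution, prefix, suffix))
# [1, 5, 6, 7, 8, 100, 20, 101, 30, 31]
print(splice_solution(prompt, response, solution, prefix, suffix, insert_before=[7, 8]))
# [1, 5, 6, 100, 20, 101, 7, 8, 30, 31]
print(splice_solution(prompt, response, solution, prefix, suffix, insert_before=[9]))
# [1, 5, 6, 7, 8, 100, 20, 101, 30, 31]
```

In every case the response stays at the tail of the sequence, so a slice-back that takes the last len(response) positions is unaffected by where the solution is inserted.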

…en privileged-context (verl-project#6827)

Review fixes for verl-project#6833:

- privileged_solution_key resolves nested keys (default reward_model.ground_truth) and raises when self_distillation is on but no solution resolves. verl stores ground truth nested under reward_model, so the old flat default silently matched nothing and OPSD degraded to plain OPD.
- pad_token_id=None falls back to 0 in the slice helper.
- response_length==0 slices from an explicit start (the negative-zero slice returned the whole teacher tensor).
- teacher input length is checked against max_model_len before scoring; long solutions could silently overflow the teacher engine.
- yaml marker defaults use single-line escaped scalars so the runtime value matches the dataclass.
- config validates self_distillation requires enabled and a non-empty key; ground-truth lists/arrays normalize to a string.
- build_privileged_sequence gains insert_before_token_ids to place the solution before the assistant turn instead of appending, per reviewer suggestion.

CPU tests cover the nested resolver, the 0-length slice, the None pad, and insert-before.
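
The response_length==0 fix refers to a general Python slicing pitfall: -0 is just 0, so xs[-0:] is xs[0:] and returns everything rather than nothing. A minimal illustration with plain lists follows (tensor slicing follows the same rule); the function names are illustrative.

```python
scores = [-0.1, -0.2, -0.3, -0.4]


def tail_negative(xs, n):
    """Buggy for n == 0: xs[-0:] is xs[0:], the whole sequence."""
    return xs[-n:]


def tail_explicit(xs, n):
    """Correct for n == 0: the start index is computed explicitly."""
    return xs[len(xs) - n:]


print(tail_negative(scores, 2))  # [-0.3, -0.4]
print(tail_explicit(scores, 2))  # [-0.3, -0.4]
print(tail_negative(scores, 0))  # [-0.1, -0.2, -0.3, -0.4]
print(tail_explicit(scores, 0))  # []
```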

HaozheZhang6 · 2026-06-25T07:54:50Z

Good point -- appending after the full chat-templated prompt isn't ideal. I added an optional insert_before_token_ids to build_privileged_sequence (config privileged_insert_before): it inserts the solution before the last occurrence of a marker, and defaults to the plain append when unset.

Pushed a few related fixes while I was in there -- the default privileged_solution_key now resolves reward_model.ground_truth (the old flat default silently missed verl's nested ground truth, so OPSD quietly fell back to plain OPD), plus pad-token / max-len / empty-response guards. Thanks for the careful read.
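
The nested-key lookup described here can be sketched as a dotted-path resolver that fails loudly. This is an assumption about the shape of the logic, not the PR's code: the function name resolve_solution, the dotted-path convention, the space-joined list normalization, and the error text are all hypothetical.

```python
def resolve_solution(sample_kwargs, dotted_key):
    """Walk a dotted key such as 'reward_model.ground_truth' through nested dicts."""
    node = sample_kwargs
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or node.get(part) is None:
            raise ValueError(f"self_distillation is on but '{dotted_key}' did not resolve")
        node = node[part]
    # Assumption: list-valued ground truth is joined into a single string.
    if isinstance(node, (list, tuple)):
        return " ".join(str(item) for item in node)
    return str(node)


sample = {"reward_model": {"ground_truth": ["x = 4"]}}
print(resolve_solution(sample, "reward_model.ground_truth"))  # x = 4

try:
    # A flat key misses the nested value and raises instead of degrading silently.
    resolve_solution(sample, "ground_truth")
except ValueError as err:
    print(err)
```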

ytchx1999 · 2026-06-25T08:43:52Z

Originally, the filtering was based on the length of the Student's prompt, which is why this error has a chance of occurring. Would it be better to define a separate filtering parameter in rl_dataset?

Good point -- the overflow is because the privileged sequence is longer than the student prompt the dataset filtered on. I kept the teacher-side raise as a loud safety net for now, but a dataset-level filter (capping prompt+solution upfront, like filter_overlong_prompts) is cleaner and I think fits better as a follow-up PR. Open to either.
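
A dataset-level filter along the lines discussed might look like the sketch below. It is hypothetical: keep_sample, the whitespace tokenizer, and the budget parameters are illustrative stand-ins, not part of rl_dataset or this PR. The only point is that the cap applies to the teacher's longer sequence (prompt + markers + solution + response budget) rather than to the student prompt alone.

```python
def keep_sample(prompt, solution, encode, marker_tokens, max_response_len, max_model_len):
    """Return True if the worst-case teacher input fits in the teacher's context window."""
    teacher_len = len(encode(prompt)) + marker_tokens + len(encode(solution)) + max_response_len
    return teacher_len <= max_model_len


def encode(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


rows = [
    {"prompt": "What is 2 + 2 ?", "solution": "2 + 2 = 4"},
    {"prompt": "Prove the lemma .", "solution": "step " * 50},
]
kept = [
    row
    for row in rows
    if keep_sample(row["prompt"], row["solution"], encode,
                   marker_tokens=4, max_response_len=16, max_model_len=40)
]
print(len(kept))  # 1
```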

…+ lock alignment with an e2e test (verl-project#6827)

Follow-up review fixes on verl-project#6833:

- The default privileged_suffix framed the task as evaluating the response, but the tokens that follow are the student response, not an evaluation. Reframe it to ask the teacher to derive the answer using the reference (think step by step), so the per-token target stays on-distribution and matches the reference OPSD transition.
- Document the signal-quality knobs: ground_truth is often just the final answer (use a full-solution field like extra_info.answer for a stronger signal), and OPSD is a supervised loss (use_policy_gradient=False, use_task_rewards=False).
- Add an e2e alignment test: the privileged slice plus the _pad_teacher_outputs left/right pad land response token j at row prompt_width+j, identical to the non-privileged path.
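
The alignment property that the e2e test locks in can be illustrated with plain lists. The sketch assumes a layout where the prompt part is left-padded to a fixed prompt_width and the response part is right-padded to response_width; pad_row and to_row are hypothetical helpers standing in for whatever _pad_teacher_outputs does internally, and the scores are made-up numbers.

```python
def pad_row(prompt_part, response_part, prompt_width, response_width, fill=0.0):
    """Left-pad the prompt part and right-pad the response part into one fixed-width row."""
    left = [fill] * (prompt_width - len(prompt_part)) + list(prompt_part)
    right = list(response_part) + [fill] * (response_width - len(response_part))
    return left + right


prompt_len, response_len = 2, 2
prompt_width, response_width = 4, 3


def to_row(scores):
    """Keep the response tail of the teacher scores, then pad to the fixed layout."""
    start = len(scores) - response_len
    return pad_row([0.0] * prompt_len, scores[start:], prompt_width, response_width)


# Plain path: scores over prompt + response (4 tokens).
plain_row = to_row([-0.1, -0.2, -0.6, -0.7])
# Privileged path: 3 extra solution-span tokens sit between prompt and response.
privileged_row = to_row([-0.1, -0.2, -0.3, -0.4, -0.5, -0.05, -0.15])

# Response token j lands at index prompt_width + j in both paths.
print(plain_row[prompt_width], plain_row[prompt_width + 1])            # -0.6 -0.7
print(privileged_row[prompt_width], privileged_row[prompt_width + 1])  # -0.05 -0.15
print(len(plain_row) == len(privileged_row))                           # True
```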

JacobHelwig · 2026-06-26T06:07:22Z

+            if solution_ids is not None:
+                # OPSD: the teacher conditions on the privileged ground-truth solution.


Can these lines be deleted?

Yes, done -- folded the build into the self.self_distillation branch and gated the slice-back on the same flag, so the solution_ids is not None checks are gone.

…lution_ids sentinel (verl-project#6827)

Per review: fold the privileged-sequence build into the self_distillation branch and gate the teacher slice-back on the same flag, dropping the solution_ids=None sentinel and the two is-not-None checks. Equivalent control flow -- self_distillation is true exactly when a solution was resolved, since a missing solution raises.

JacobHelwig

Nice idea to add this. Could you please show metrics from a sample training run without task rewards, along with the command used to run? The key metrics of interest would be the distillation loss, the training rewards, and the validation rewards.

HaozheZhang6 · 2026-06-26T07:26:22Z

Thanks! Will run a supervised OPSD training (no task rewards) on a small model and follow up here with the distillation loss and the training/validation rewards, plus the exact command.

gemini-code-assist Bot reviewed Jun 24, 2026

wuxibin89 requested a review from JacobHelwig June 24, 2026 05:24

HaozheZhang6 mentioned this pull request Jun 24, 2026

[Feature] On-Policy Self-Distillation (OPSD): privileged-context self-teacher #6827

Open

ytchx1999 reviewed Jun 25, 2026

JacobHelwig reviewed Jun 26, 2026

JacobHelwig suggested changes Jun 26, 2026

HaozheZhang6 wants to merge 5 commits into verl-project:main from HaozheZhang6:feat/opsd-privileged-context

3 participants
