Skip to content

[rollout, trainer, cfg] feat: privileged-context teacher scoring for OPSD#6833

Draft
HaozheZhang6 wants to merge 5 commits into
verl-project:mainfrom
HaozheZhang6:feat/opsd-privileged-context
Draft

[rollout, trainer, cfg] feat: privileged-context teacher scoring for OPSD#6833
HaozheZhang6 wants to merge 5 commits into
verl-project:mainfrom
HaozheZhang6:feat/opsd-privileged-context

Conversation

@HaozheZhang6

Copy link
Copy Markdown
Contributor

What does this PR do?

The first piece of On-Policy Self-Distillation (OPSD, #6827): the teacher conditions on the ground-truth solution (privileged context) while the student sees only the problem, and scores the student's own on-policy rollout. This is the half plain OPD lacks -- verl's OPD teacher sees only prompt + response.

  • verl/trainer/distillation/privileged_context.py -- two pure helpers: build_privileged_sequence (splice the GT solution, wrapped in marker tokens, into the teacher input) and slice_privileged_teacher_to_student (realign the teacher's per-token response scores back onto the student's prompt + response positions). CPU-tested.
  • DistillationConfig -- self_distillation, privileged_solution_key, privileged_prefix, privileged_suffix.
  • agent_loop._compute_teacher_logprobs -- when self_distillation is on, build the privileged teacher sequence and slice the scores back; otherwise the path is unchanged (prompt + response).

Route

Reuses the existing external-teacher path with teacher.model_path == student model_path, i.e. a frozen self-teacher (the student's init checkpoint) scoring the rollout with privileged context. The whole OPD data + loss pipeline is reused unchanged -- only the teacher's input sequence and the realignment are new. The dynamic / EMA self-teacher (a second in-engine forward) is a separate, GPU-gated follow-up.

Test

pytest tests/workers/test_opsd_privileged_context_on_cpu.py tests/workers/test_distillation_config_opsd_on_cpu.py -- splice layout + slice-back shapes/dtype + the config fields. CPU-only. The end-to-end GPU run (teacher = student checkpoint, top-k overlap rises, convergence) is the remaining validation, hence draft.

Part of #6827.

…OPSD (verl-project#6827)

On-Policy Self-Distillation (OPSD) has the teacher condition on the ground-truth
solution (privileged context) while the student sees only the problem; the teacher
scores the students own on-policy rollout. This adds the half plain OPD lacks.

- privileged_context.py: two pure helpers -- build_privileged_sequence (splice the
  GT solution into the teacher input) and slice_privileged_teacher_to_student
  (realign the teachers response scores onto the students positions). CPU-tested.
- DistillationConfig: self_distillation / privileged_solution_key / privileged_prefix / privileged_suffix.
- agent_loop._compute_teacher_logprobs: when self_distillation is on, build the
  privileged teacher sequence and slice the scores back; otherwise unchanged.

Reuses the existing external-teacher path with teacher.model_path == student path
(a frozen self-teacher), so the OPD data/loss pipeline is unchanged.

CPU tests cover the splice/slice-back shapes and the config fields. The end-to-end
GPU run (teacher = student checkpoint, convergence) is the follow-up.

Part of verl-project#6827.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces On-Policy Self-Distillation (OPSD) support, allowing the teacher model to condition on a privileged ground-truth solution while sharing weights with the student. It adds helper functions to construct the privileged sequence and realign the teacher's scores, updates the distillation configuration, integrates these changes into the agent loop, and adds corresponding CPU tests. The review feedback highlights two key improvements: handling potential null values for the pad token ID to prevent runtime crashes, and raising an explicit error if self-distillation is enabled but the privileged solution is missing to avoid silent failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1052 to +1054
teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student(
teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), self.tokenizer.pad_token_id
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If self.tokenizer.pad_token_id is None (which is common for several tokenizers like LLaMA unless explicitly configured), passing it directly to slice_privileged_teacher_to_student will cause a TypeError inside torch.full. We should fall back to eos_token_id or 0 to prevent runtime crashes.

Suggested change
teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student(
teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), self.tokenizer.pad_token_id
)
pad_token_id = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else (self.tokenizer.eos_token_id if self.tokenizer.eos_token_id is not None else 0)
teacher_ids, teacher_logprobs = slice_privileged_teacher_to_student(
teacher_ids, teacher_logprobs, len(prompt_ids), len(response_ids), pad_token_id
)

Comment on lines +1027 to +1036
if sample_kwargs is not None:
routing_value = sample_kwargs.get(self.teacher_key)
if routing_value is not None:
# Non-tensor batch values arrive as 0-d numpy objects / arrays; normalize to Python.
routing_key = routing_value.item() if hasattr(routing_value, "item") else routing_value
if self.self_distillation:
solution = sample_kwargs.get(self.privileged_solution_key)
if solution is not None:
solution = solution.item() if hasattr(solution, "item") else solution
solution_ids = self.tokenizer.encode(str(solution), add_special_tokens=False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When self_distillation is enabled, the teacher model expects the privileged context (ground-truth solution). If the privileged solution key is missing from sample_kwargs or is None, the code silently falls back to standard distillation without privileged context. This can lead to silent failures where the user assumes OPSD is running but it is actually not. We should raise a clear ValueError to prevent this.

            if self.self_distillation and (sample_kwargs is None or sample_kwargs.get(self.privileged_solution_key) is None):
                raise ValueError(
                    f"self_distillation is enabled, but privileged solution key '{self.privileged_solution_key}' "
                    f"was not found in sample_kwargs or is None."
                )
            if sample_kwargs is not None:
                routing_value = sample_kwargs.get(self.teacher_key)
                if routing_value is not None:
                    # Non-tensor batch values arrive as 0-d numpy objects / arrays; normalize to Python.
                    routing_key = routing_value.item() if hasattr(routing_value, "item") else routing_value
                if self.self_distillation:
                    solution = sample_kwargs.get(self.privileged_solution_key)
                    solution = solution.item() if hasattr(solution, "item") else solution
                    solution_ids = self.tokenizer.encode(str(solution), add_special_tokens=False)

@wuxibin89 wuxibin89 requested a review from JacobHelwig June 24, 2026 05:24
Add self_distillation / privileged_solution_key / privileged_prefix / privileged_suffix
to the user-facing distillation.yaml and regenerate _generated_ppo_*.yaml so the
autogen-trainer-cfg pre-commit hook stays green.
@ytchx1999

Copy link
Copy Markdown

The prompt_ids here should be the result after apply_chat_template, so should they include the EOS token? When modifying the code, I added an insertion token (to insert the ground truth at the last occurrence of this token). This seems to enhance scalability. I'm not sure if my understanding is correct, and I welcome any clarification.

def build_privileged_sequence(
    prompt_ids: list[int],
    response_ids: list[int],
    solution_ids: list[int],
    prefix_ids: list[int],
    suffix_ids: list[int],
    insert_before_token_ids: list[int] = None,
) -> list[int]:
    """Build the OPSD teacher's input token sequence.
    Layout: ``prompt_prefix + prefix + solution + suffix + prompt_suffix + response``.
    By default, ``insert_before_token_ids`` is None, so the solution is appended to the
    prompt: ``prompt + prefix + solution + suffix + response``.
    If ``insert_before_token_ids`` is provided, we search for the last occurrence of
    this sequence in ``prompt_ids`` and insert the solution before it.
    Args:
        prompt_ids: the student's prompt (problem) token ids.
        response_ids: the student's on-policy response token ids.
        solution_ids: the ground-truth solution token ids.
        prefix_ids: marker tokens placed before the solution.
        suffix_ids: marker tokens placed after the solution.
        insert_before_token_ids: if provided, the solution is inserted before the
            last occurrence of this sequence in ``prompt_ids``.
    Returns:
        The concatenated teacher input token ids.
    """
    if insert_before_token_ids:
        # Search for the last occurrence of the sub-sequence
        n, m = len(prompt_ids), len(insert_before_token_ids)
        split_idx = -1
        for i in range(n - m, -1, -1):
            if prompt_ids[i : i + m] == insert_before_token_ids:
                split_idx = i
                break

        if split_idx != -1:
            return (
                prompt_ids[:split_idx]
                + prefix_ids
                + solution_ids
                + suffix_ids
                + prompt_ids[split_idx:]
                + response_ids
            )

    return prompt_ids + prefix_ids + solution_ids + suffix_ids + response_ids

…en privileged-context (verl-project#6827)

Review fixes for verl-project#6833:
- privileged_solution_key resolves nested keys (default reward_model.ground_truth) and raises when self_distillation is on but no solution resolves. verl stores ground truth nested under reward_model, so the old flat default silently matched nothing and OPSD degraded to plain OPD.
- pad_token_id=None falls back to 0 in the slice helper.
- response_length==0 slices from an explicit start (the negative-zero slice returned the whole teacher tensor).
- teacher input length is checked against max_model_len before scoring; long solutions could silently overflow the teacher engine.
- yaml marker defaults use single-line escaped scalars so the runtime value matches the dataclass.
- config validates self_distillation requires enabled and a non-empty key; ground-truth lists/arrays normalize to a string.
- build_privileged_sequence gains insert_before_token_ids to place the solution before the assistant turn instead of appending, per reviewer suggestion.

CPU tests cover the nested resolver, the 0-length slice, the None pad, and insert-before.
@HaozheZhang6

Copy link
Copy Markdown
Contributor Author

Good point -- appending after the full chat-templated prompt isn't ideal. I added an optional insert_before_token_ids to build_privileged_sequence (config privileged_insert_before): it inserts the solution before the last occurrence of a marker, and defaults to the plain append when unset.

Pushed a few related fixes while I was in there -- the default privileged_solution_key now resolves reward_model.ground_truth (the old flat default silently missed verl's nested ground truth, so OPSD quietly fell back to plain OPD), plus pad-token / max-len / empty-response guards. Thanks for the careful read.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, the filtering was based on the length of the Student's prompt, which is why this error has a chance of occurring. Would it be better to define a separate filtering parameter in rl_dataset?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point -- the overflow is because the privileged sequence is longer than the student prompt the dataset filtered on. I kept the teacher-side raise as a loud safety net for now, but a dataset-level filter (capping prompt+solution upfront, like filter_overlong_prompts) is cleaner and I think fits better as a follow-up PR. Open to either.

…+ lock alignment with an e2e test (verl-project#6827)

Follow-up review fixes on verl-project#6833:
- The default privileged_suffix framed the task as evaluating the response, but the tokens that follow are the student response, not an evaluation. Reframe it to ask the teacher to derive the answer using the reference (think step by step), so the per-token target stays on-distribution and matches the reference OPSD transition.
- Document the signal-quality knobs: ground_truth is often just the final answer (use a full-solution field like extra_info.answer for a stronger signal), and OPSD is a supervised loss (use_policy_gradient=False, use_task_rewards=False).
- Add an e2e alignment test: the privileged slice plus the _pad_teacher_outputs left/right pad land response token j at row prompt_width+j, identical to the non-privileged path.
Comment on lines +1047 to +1048
if solution_ids is not None:
# OPSD: the teacher conditions on the privileged ground-truth solution.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these lines be deleted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done -- folded the build into the self.self_distillation branch and gated the slice-back on the same flag, so the solution_ids is not None checks are gone.

…lution_ids sentinel (verl-project#6827)

Per review: fold the privileged-sequence build into the self_distillation branch and gate the teacher slice-back on the same flag, dropping the solution_ids=None sentinel and the two is-not-None checks. Equivalent control flow -- self_distillation is true exactly when a solution was resolved, since a missing solution raises.

@JacobHelwig JacobHelwig left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea to add this. Could you please show metrics from a sample training run without task rewards, along with the command used to run? The key metrics of interest would be the distillation loss, the training rewards, and the validation rewards.

@HaozheZhang6

Copy link
Copy Markdown
Contributor Author

Thanks! Will run a supervised OPSD training (no task rewards) on a small model and follow up here with the distillation loss and the training/validation rewards, plus the exact command.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants