fix(ppo): preserve raw KL metric tensor #1493

Open
EazyReal wants to merge 1 commit into radixark:main from EazyReal:upstream-pr/ppo-kl-inplace-metric

Conversation

EazyReal commented Jun 27, 2026

Port of THUDM/slime#2114 for miles' refactored advantage helper.

Summary:

  • build the PPO token-level rewards out of place (in new tensors) instead of mutating the tensors in the kl list
  • preserve the raw approximate KL tensor used by rollout/kl logging
  • add a CPU test directly on compute_advantages
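
The aliasing problem described in the summary can be illustrated with a minimal sketch. This is not the code from compute_advantages: the function names, the sign convention for `kl_coef`, and the choice to add the scalar reward to the last token are assumptions made for illustration, and plain Python lists stand in for tensors so the example runs without torch.

```python
# Hypothetical sketch: lists of floats stand in for per-sample KL tensors.
# None of these names come from the miles or slime code base.

def rewards_in_place(kl, rewards, kl_coef):
    """Problematic pattern: reuses each KL sequence as the reward buffer."""
    out = []
    for k, r in zip(kl, rewards):
        for i in range(len(k)):
            k[i] *= -kl_coef   # overwrites the caller's KL values
        k[-1] += r             # adds the scalar reward to the last token
        out.append(k)          # the KL object itself is returned as "rewards"
    return out


def rewards_out_of_place(kl, rewards, kl_coef):
    """Fixed pattern: builds new sequences and leaves kl untouched."""
    out = []
    for k, r in zip(kl, rewards):
        token_rewards = [-kl_coef * x for x in k]  # fresh list, no aliasing
        token_rewards[-1] += r
        out.append(token_rewards)
    return out


raw_kl = [[0.5, 0.25], [1.0, 2.0, 4.0]]
scalar_rewards = [1.0, -1.0]

# Out of place: the KL values are still available for logging afterwards.
kl_a = [list(k) for k in raw_kl]
token_rewards_a = rewards_out_of_place(kl_a, scalar_rewards, kl_coef=0.5)
kl_preserved = kl_a == raw_kl

# In place: the KL list now holds rewards, so any later KL metric is wrong.
kl_b = [list(k) for k in raw_kl]
token_rewards_b = rewards_in_place(kl_b, scalar_rewards, kl_coef=0.5)
kl_corrupted = kl_b != raw_kl
```

Both variants return the same token-level rewards. They differ only in whether the caller's kl list still holds the raw KL values afterwards, which is the property the new test is described as checking.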

Validation:

  • uv run --with pytest --with torch --with numpy pytest --confcutdir=tests/fast/backends/training_utils/loss tests/fast/backends/training_utils/loss/test_ppo_kl_metric.py -q -> 1 passed

gemini-code-assist (bot) left a comment

Code Review

This pull request stops compute_advantages from mutating the elements of the input kl list in place by building a new token_level_rewards variable instead. It also adds a unit test that verifies the raw KL metric is preserved. I have no feedback to provide.
