Add high-entropy DAPO: --entropy-top-frac (80/20 forking-tokens mask) #9
Open
RiddleHe wants to merge 1 commit into
Restrict the policy-gradient mask to the top-fraction of highest-entropy response tokens per batch (paper 2506.01939). Previously sat in a stash on rl-recipe-fix; resurrected and rebased onto current master (which since added gspo/cispo and _masked_sequence_mean).

- nanorl/loss.py: _apply_entropy_top_frac helper; grpo/dapo/reinforce accept entropy + entropy_top_frac. gspo/cispo unchanged for now (their per-sequence aggregation makes per-token entropy masking ill-defined).
- nanorl/rollout.py: get_logprobs additionally returns per-token entropy computed under no_grad (only used for thresholding, never backpropped).
- nanorl/scripts/train.py: --entropy-top-frac CLI flag, threaded through to loss_fn.
- nanorl/runs/train.sh: ENTROPY_TOP_FRAC env var, conditionally appended.

Verified end-to-end: dapo_loss with top_frac=0.2 restricts gradient to ~20% of response tokens (10/48 in smoke test), passes through unchanged when top_frac is None, and remains differentiable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
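For readers without the diff open, the masking idea can be sketched without any dependencies. This is not the code in nanorl/loss.py: `token_entropy` and `apply_entropy_top_frac` are hypothetical stand-ins that work on flat Python lists rather than tensors, and rounding the kept count up with `ceil` is an assumption (it does reproduce the 4/17 and 10/48 counts reported for the smoke test).

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def apply_entropy_top_frac(mask, entropy, top_frac):
    """Keep only the highest-entropy fraction of positions where mask is 1.

    mask and entropy are flat lists over every response position in the batch.
    top_frac=None returns the mask unchanged (identity passthrough).
    """
    if top_frac is None:
        return list(mask)
    active = [i for i, m in enumerate(mask) if m]
    if not active:
        return list(mask)
    # Assumption: round the kept count up, and always keep at least one token.
    k = max(1, math.ceil(top_frac * len(active)))
    ranked = sorted(active, key=lambda i: entropy[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(mask))]


# A uniform distribution has higher entropy than a peaked one.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])

# Synthetic 17-token mask with strictly increasing entropy values.
new_mask = apply_entropy_top_frac([1] * 17, [float(i) for i in range(17)], 0.2)
print(sum(new_mask))  # 4
```

Positions already excluded by the incoming mask are never selected, even when their entropy is high, so the result is always a subset of the original response mask.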
Summary
- Adds `--entropy-top-frac` to RL training: restricts the policy-gradient mask to the top fraction of highest-entropy response tokens per batch (paper 2506.01939, "forking tokens" / 80/20 rule).
- Previously sat in a stash on rl-recipe-fix; resurrected and rebased onto current master, which has since added gspo/cispo and `_masked_sequence_mean`.

What changed
- `nanorl/loss.py`: `_apply_entropy_top_frac` helper; `grpo_loss` / `dapo_loss` / `reinforce_loss` accept `entropy=` and `entropy_top_frac=` kwargs
- `nanorl/rollout.py`: `get_logprobs` additionally returns per-token entropy `H(π_t) = -Σ_v π_t(v) log π_t(v)`, computed under `no_grad` (only used for thresholding, never backpropped)
- `nanorl/scripts/train.py`: `--entropy-top-frac` CLI flag, threaded into `loss_fn`
- `nanorl/runs/train.sh`: `ENTROPY_TOP_FRAC` env var, conditionally appended via `${ENTROPY_TOP_FRAC:+...}`

Scope notes
- `gspo_loss` / `cispo_loss` are not wired up: they were added to master after this work and use per-sequence (not per-token) aggregation, so a per-token entropy mask is ill-defined for them. Their `**kwargs` swallows the new arguments, so passing `--entropy-top-frac` while running gspo/cispo silently has no effect (known limitation; the flag is documented as DAPO-targeted).
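To make the silent no-op concrete, here is a small sketch of how a `**kwargs` signature swallows the new arguments, plus one possible guard. Neither function is part of this PR: `sequence_level_loss` is a toy stand-in for a gspo/cispo-style loss, and `check_entropy_flag` and `ENTROPY_MASK_LOSSES` are hypothetical names.

```python
import warnings

# Hypothetical registry: the losses this PR wires up to the entropy mask.
ENTROPY_MASK_LOSSES = {"grpo", "dapo", "reinforce"}


def sequence_level_loss(logprobs, mask, **kwargs):
    """Toy stand-in for a gspo/cispo-style loss; extra kwargs are ignored."""
    total = sum(lp * m for lp, m in zip(logprobs, mask))
    return -total / max(1, sum(mask))


def check_entropy_flag(loss_name, entropy_top_frac):
    """Return True if the flag will be honored; warn and return False if not."""
    if entropy_top_frac is None:
        return True
    if loss_name not in ENTROPY_MASK_LOSSES:
        warnings.warn(
            f"--entropy-top-frac={entropy_top_frac} has no effect with loss '{loss_name}'"
        )
        return False
    return True


logprobs = [-0.5, -1.0, -2.0]
mask = [1, 1, 0]

# The entropy arguments are accepted but change nothing.
plain = sequence_level_loss(logprobs, mask)
with_flag = sequence_level_loss(
    logprobs, mask, entropy=[0.1, 0.9, 0.3], entropy_top_frac=0.2
)
```

A check like this could run once at startup in the training script, which would turn the known limitation into a visible warning without touching the gspo/cispo code paths.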
Usage
Verification
End-to-end smoke test:
- `top_frac=None` → identity passthrough ✓
- `top_frac=0.2` → restricts a synthetic 17-token mask to 4 positions (≈24%) ✓
- `dapo_loss(top_frac=0.2)` is differentiable and produces gradient on only 10/48 response positions vs 48/48 for the unmasked baseline ✓

- `bash -n nanorl/runs/train.sh` passes ✓
- `--entropy-top-frac` correctly exposed in `python -m nanorl.scripts.train --help` ✓

Test plan
- `ENTROPY_TOP_FRAC=0.2 NUM_STEPS=20 bash nanorl/runs/train.sh smoke_high_entropy` runs to completion on a 4+1 GPU split (4 train, 1 rollout) with Qwen2.5-1.5B
- `stage3_entropy_top20_500` results in `.nanochat/rl/`

🤖 Generated with Claude Code