Add high-entropy DAPO: --entropy-top-frac (80/20 forking-tokens mask) by RiddleHe · Pull Request #9 · RiddleHe/nanochat

RiddleHe · 2026-05-05T03:14:48Z

Summary

Adds --entropy-top-frac to RL training: restricts the policy-gradient
mask to the highest-entropy fraction of response tokens in each batch
(paper 2506.01939, "forking tokens" / 80/20 rule).
Resurrected from a stash that pre-dated the rl-recipe-fix merge and rebased
onto current master, which has since added gspo/cispo and _masked_sequence_mean.
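
For readers unfamiliar with the masking idea, here is a minimal framework-free sketch of a top-fraction entropy mask. It is not the PR's `_apply_entropy_top_frac` (which operates on tensors). The function name `apply_entropy_top_frac`, the nested-list layout, and the round-up rule for how many tokens to keep are all assumptions made for illustration.

```python
import math


def apply_entropy_top_frac(mask, entropy, top_frac=None):
    """Keep only the highest-entropy fraction of the active (response) tokens.

    mask, entropy: nested lists shaped [batch][time]; mask entries are 0 or 1.
    Hypothetical stand-in for the PR's tensor-based helper.
    """
    if top_frac is None:
        return mask  # identity passthrough: the full response mask is used
    if not 0.0 < top_frac <= 1.0:
        raise ValueError("top_frac must be in (0, 1]")

    # Gather (entropy, batch index, time index) for every active token in the batch.
    active = [
        (entropy[b][t], b, t)
        for b, row in enumerate(mask)
        for t, keep in enumerate(row)
        if keep
    ]
    out = [[0] * len(row) for row in mask]
    if not active:
        return out

    # Assumed rounding rule: keep ceil(top_frac * n) tokens, at least one.
    # The small epsilon guards against floating-point noise in the product.
    k = max(1, math.ceil(top_frac * len(active) - 1e-9))
    active.sort(key=lambda item: item[0], reverse=True)
    for _, b, t in active[:k]:
        out[b][t] = 1
    return out
```

Selection here is over the whole batch rather than per sequence, matching the "per batch" wording above; a sequence with uniformly low entropy can therefore end up with no selected tokens.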

What changed

| file | change |
|---|---|
| `nanorl/loss.py` | new `_apply_entropy_top_frac` helper; `grpo_loss` / `dapo_loss` / `reinforce_loss` accept `entropy=` and `entropy_top_frac=` kwargs |
| `nanorl/rollout.py` | `get_logprobs` additionally returns per-token entropy `H(π_t) = -Σ_v π(v) log π(v)`, computed under `no_grad` (only used for thresholding, never backpropped) |
| `nanorl/scripts/train.py` | new `--entropy-top-frac` CLI flag, threaded into `loss_fn` |
| `nanorl/runs/train.sh` | new `ENTROPY_TOP_FRAC` env var, conditionally appended via `${ENTROPY_TOP_FRAC:+...}` |
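
The entropy formula in the `nanorl/rollout.py` row can be illustrated with a small, numerically stable sketch. This is a plain-Python reference rather than the PR's code: `token_entropy` is a hypothetical name, and since no autograd framework is involved the `no_grad` detail does not appear here.

```python
import math


def token_entropy(logits):
    """Entropy H = -sum_v p(v) log p(v) of softmax(logits), in nats.

    logits: list of floats, one per vocabulary entry, for a single position.
    """
    # Subtract the max logit so exp() cannot overflow (log-sum-exp trick).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    log_z = math.log(z)
    # p(v) = exp(x - m) / z and log p(v) = (x - m) - log_z.
    return -sum((e / z) * ((x - m) - log_z) for e, x in zip(exps, logits))


# A uniform distribution has maximal entropy; a sharply peaked one is near zero.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([100.0, 0.0, 0.0])
```

Positions where the model is close to certain score near zero and are the first to be dropped by a top-fraction threshold.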

Scope notes

gspo_loss / cispo_loss are not wired up. They were added to master after
this work and use per-sequence (not per-token) aggregation, so a per-token
entropy mask is ill-defined for them. Their `**kwargs` parameter swallows the
new arguments, so passing --entropy-top-frac while running gspo/cispo silently
has no effect (known limitation; the flag is documented as DAPO-targeted).
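
The silent no-op can be reproduced in isolation, and a guard is one possible way to surface it. The sketch below is not part of this PR: `sequence_level_loss`, `entropy_flag_takes_effect`, and the loss-name strings are hypothetical and only illustrate the behaviour described above.

```python
import warnings

# Losses that accept the entropy kwargs according to the PR description.
# The exact names used to select a loss on the CLI are an assumption.
TOKEN_LEVEL_LOSSES = {"grpo", "dapo", "reinforce"}


def sequence_level_loss(logprobs, **kwargs):
    """Stand-in for a loss whose **kwargs absorbs entropy= / entropy_top_frac=."""
    return sum(logprobs) / len(logprobs)


def entropy_flag_takes_effect(loss_name, entropy_top_frac):
    """Return True if the flag is honoured; warn when it would be dropped."""
    if entropy_top_frac is None:
        return False
    if loss_name not in TOKEN_LEVEL_LOSSES:
        warnings.warn(
            f"--entropy-top-frac={entropy_top_frac} is ignored by {loss_name!r}",
            stacklevel=2,
        )
        return False
    return True


# The extra keyword is accepted without error and changes nothing.
baseline = sequence_level_loss([-1.0, -2.0])
with_flag = sequence_level_loss([-1.0, -2.0], entropy_top_frac=0.2)
```

A warning keeps existing gspo/cispo invocations working while making the ignored flag visible in the logs.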

Usage

# Baseline DAPO (full mask)
bash nanorl/runs/train.sh my_run

# High-entropy DAPO (top 20% of response tokens by entropy)
ENTROPY_TOP_FRAC=0.2 bash nanorl/runs/train.sh my_run
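
On the Python side, an optional fraction flag like this one could be declared with `argparse` roughly as follows. This is a sketch, not the contents of `nanorl/scripts/train.py`: the `unit_fraction` validator, `build_parser`, and the help text are assumptions, and only the flag name comes from the PR.

```python
import argparse


def unit_fraction(text):
    """Parse a float and require it to lie in (0, 1]."""
    value = float(text)
    if not 0.0 < value <= 1.0:
        raise argparse.ArgumentTypeError(
            f"expected a fraction in (0, 1], got {text!r}"
        )
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="RL training (sketch)")
    parser.add_argument(
        "--entropy-top-frac",
        type=unit_fraction,
        default=None,  # None means the full response mask is used
        help="keep only this fraction of highest-entropy response tokens",
    )
    return parser


args = build_parser().parse_args(["--entropy-top-frac", "0.2"])
```

Defaulting to `None` rather than `1.0` lets downstream code distinguish "flag not given" from "keep everything", which lines up with the identity passthrough for `top_frac=None`.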

Verification

End-to-end smoke test:

- top_frac=None → identity passthrough ✓
- top_frac=0.2 → restricts a synthetic 17-token mask to 4 positions (≈24%) ✓
- dapo_loss(top_frac=0.2) is differentiable and produces gradient on only 10/48 response positions vs 48/48 for the unmasked baseline ✓
- `bash -n nanorl/runs/train.sh` passes ✓
- `--entropy-top-frac` correctly exposed in `python -m nanorl.scripts.train --help` ✓
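
The reported counts overshoot 20% slightly (4/17 and 10/48), which is what one would expect if the number of kept tokens is rounded up. The arithmetic below assumes a `ceil(top_frac * n)` rule; the PR does not state the exact rounding or thresholding used by the helper, so treat `expected_kept` as a hypothetical consistency check only.

```python
import math


def expected_kept(n_tokens, top_frac):
    """Tokens kept under an assumed round-up rule (epsilon guards float noise)."""
    return math.ceil(top_frac * n_tokens - 1e-9)


for n in (17, 48):
    k = expected_kept(n, 0.2)
    print(f"{n} tokens -> {k} kept ({k / n:.1%})")
# 17 tokens -> 4 kept (23.5%)
# 48 tokens -> 10 kept (20.8%)
```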

Test plan

- `ENTROPY_TOP_FRAC=0.2 NUM_STEPS=20 bash nanorl/runs/train.sh smoke_high_entropy` runs to completion on a 4+1 GPU split (4 train, 1 rollout) with Qwen2.5-1.5B
- Confirm the saved checkpoint loads back into vLLM via the in-place reload
- (Already validated previously) Full 500-step run reproduces the stage3_entropy_top20_500 results in .nanochat/rl/

🤖 Generated with Claude Code

Restrict the policy-gradient mask to the top-fraction of highest-entropy response tokens per batch (paper 2506.01939). Previously sat in a stash on rl-recipe-fix; resurrected and rebased onto current master (which since added gspo/cispo and _masked_sequence_mean).

- nanorl/loss.py: _apply_entropy_top_frac helper; grpo/dapo/reinforce accept entropy + entropy_top_frac. gspo/cispo unchanged for now (their per-sequence aggregation makes per-token entropy masking ill-defined).
- nanorl/rollout.py: get_logprobs additionally returns per-token entropy computed under no_grad (only used for thresholding, never backpropped).
- nanorl/scripts/train.py: --entropy-top-frac CLI flag, threaded through to loss_fn.
- nanorl/runs/train.sh: ENTROPY_TOP_FRAC env var, conditionally appended.

Verified end-to-end: dapo_loss with top_frac=0.2 restricts gradient to ~20% of response tokens (10/48 in smoke test), passes through unchanged when top_frac is None, and remains differentiable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

RiddleHe wants to merge 1 commit into master from yuchen/high-entropy-dapo
