Add high-entropy DAPO: --entropy-top-frac (80/20 forking-tokens mask)#9

Open: RiddleHe wants to merge 1 commit into master from yuchen/high-entropy-dapo
RiddleHe (Owner) commented May 5, 2026

Summary

  • Adds --entropy-top-frac to RL training: restricts the policy-gradient
    mask to the top fraction of highest-entropy response tokens per batch
    (arXiv:2506.01939, "forking tokens" / 80/20 rule).
  • Resurrected from a stash that predated the rl-recipe-fix merge; rebased
    onto current master, which has since added gspo/cispo and
    _masked_sequence_mean.

What changed

| File | Change |
|---|---|
| nanorl/loss.py | new _apply_entropy_top_frac helper; grpo_loss / dapo_loss / reinforce_loss accept entropy= and entropy_top_frac= kwargs |
| nanorl/rollout.py | get_logprobs additionally returns per-token entropy H(π_t) = -Σ_v π(v) log π(v), computed under no_grad (only used for thresholding, never backpropped) |
| nanorl/scripts/train.py | new --entropy-top-frac CLI flag, threaded into loss_fn |
| nanorl/runs/train.sh | new ENTROPY_TOP_FRAC env var, conditionally appended via ${ENTROPY_TOP_FRAC:+...} |
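
For readers unfamiliar with the entropy formula in the table, here is a minimal sketch of H(π_t) = -Σ_v π(v) log π(v) computed from one position's logits. It is plain Python for illustration only and is not the code in nanorl/rollout.py, which presumably operates on batched tensors under no_grad. The function names token_entropy and sequence_entropies are hypothetical.

```python
import math


def token_entropy(logits):
    """Entropy (in nats) of softmax(logits) for a single token position."""
    m = max(logits)
    # Log-sum-exp with the max subtracted for numerical stability.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # H = -sum_v p(v) * log p(v), with p(v) = exp(log p(v)).
    return -sum(math.exp(lp) * lp for lp in log_probs)


def sequence_entropies(logits_per_position):
    """One entropy value per position; hypothetical helper for illustration."""
    return [token_entropy(row) for row in logits_per_position]
```

A uniform distribution over V tokens gives the maximum value log V, while a sharply peaked distribution gives a value near zero. This is why high-entropy positions are read as "forking" points where the policy is undecided.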

Scope notes

  • gspo_loss / cispo_loss are not wired up — they were added to master after
    this work and use per-sequence (not per-token) aggregation, so a per-token
    entropy mask is ill-defined for them. Their **kwargs swallows the new
    arguments, so passing --entropy-top-frac while running gspo/cispo silently
    has no effect (known limitation; the flag is documented as DAPO-targeted).
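
The silent no-op described above can be reproduced with a small, self-contained sketch. The two loss functions below are stand-ins, not the real nanorl signatures, and call_loss with its supports_entropy_mask argument is a hypothetical guard that this PR does not add. It only illustrates one way the limitation could be surfaced to the user.

```python
import warnings


def dapo_like_loss(logprobs, mask, entropy=None, entropy_top_frac=None):
    """Stand-in for a per-token loss that understands the entropy arguments."""
    return {"uses_entropy_mask": entropy_top_frac is not None}


def gspo_like_loss(logprobs, mask, **kwargs):
    """Stand-in for a per-sequence loss: extra keywords land in kwargs, unused."""
    return {"uses_entropy_mask": False}


def call_loss(loss_fn, supports_entropy_mask, logprobs, mask,
              entropy=None, entropy_top_frac=None):
    """Hypothetical wrapper that warns instead of silently ignoring the flag."""
    if entropy_top_frac is not None and not supports_entropy_mask:
        warnings.warn(
            "entropy_top_frac was set but this loss ignores it",
            stacklevel=2,
        )
    return loss_fn(logprobs, mask, entropy=entropy,
                   entropy_top_frac=entropy_top_frac)
```

Calling gspo_like_loss directly with entropy_top_frac=0.2 raises no error and changes nothing, which matches the behavior this note documents.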

Usage

# Baseline DAPO (full mask)
bash nanorl/runs/train.sh my_run

# High-entropy DAPO (top 20% of response tokens by entropy)
ENTROPY_TOP_FRAC=0.2 bash nanorl/runs/train.sh my_run

Verification

End-to-end smoke test:

  • top_frac=None → identity passthrough ✓
  • top_frac=0.2 → restricts a synthetic 17-token mask to 4 positions (≈24%) ✓
  • dapo_loss(top_frac=0.2) is differentiable and produces gradient on only
    10/48 response positions vs 48/48 for the unmasked baseline ✓
  • bash -n nanorl/runs/train.sh passes ✓
  • --entropy-top-frac correctly exposed in python -m nanorl.scripts.train --help
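
The mask counts above can be checked with a plain-Python sketch of a top-fraction mask. It rests on two assumptions that the PR does not state: the number of kept positions is ceil(top_frac × number of response tokens), and ties are broken by position order. The ceiling rule is at least consistent with the reported counts (ceil(0.2 × 17) = 4 and ceil(0.2 × 48) = 10). The name apply_entropy_top_frac is illustrative and is not the actual _apply_entropy_top_frac helper.

```python
import math


def apply_entropy_top_frac(mask, entropy, top_frac):
    """Keep only the top-fraction highest-entropy positions of a batch mask.

    mask and entropy are lists of rows (batch x time); mask entries are 0/1.
    Selection is done over all masked positions in the batch at once.
    """
    if top_frac is None:
        return mask  # identity passthrough
    positions = [(b, t) for b, row in enumerate(mask)
                 for t, m in enumerate(row) if m]
    if not positions:
        return mask
    # Assumed rounding rule: keep ceil(top_frac * n) positions.
    k = math.ceil(top_frac * len(positions))
    ranked = sorted(positions, key=lambda bt: entropy[bt[0]][bt[1]],
                    reverse=True)
    keep = set(ranked[:k])
    return [[1 if (b, t) in keep else 0 for t in range(len(row))]
            for b, row in enumerate(mask)]
```

Positions outside the original mask (for example, prompt tokens) are never selected, so the result is always a subset of the input mask.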

Test plan

  • ENTROPY_TOP_FRAC=0.2 NUM_STEPS=20 bash nanorl/runs/train.sh smoke_high_entropy
    runs to completion on a 4+1 GPU split (4 train, 1 rollout) with Qwen2.5-1.5B
  • Confirm the saved checkpoint loads back into vLLM via the in-place reload
  • (Already validated previously) Full 500-step run reproduces the
    stage3_entropy_top20_500 results in .nanochat/rl/

🤖 Generated with Claude Code

Restrict the policy-gradient mask to the top-fraction of highest-entropy
response tokens per batch (paper 2506.01939). Previously sat in a stash
on rl-recipe-fix; resurrected and rebased onto current master (which
since added gspo/cispo and _masked_sequence_mean).

- nanorl/loss.py: _apply_entropy_top_frac helper; grpo/dapo/reinforce
  accept entropy + entropy_top_frac. gspo/cispo unchanged for now (their
  per-sequence aggregation makes per-token entropy masking ill-defined).
- nanorl/rollout.py: get_logprobs additionally returns per-token entropy
  computed under no_grad (only used for thresholding, never backpropped).
- nanorl/scripts/train.py: --entropy-top-frac CLI flag, threaded through
  to loss_fn.
- nanorl/runs/train.sh: ENTROPY_TOP_FRAC env var, conditionally appended.

Verified end-to-end: dapo_loss with top_frac=0.2 restricts gradient to
~20% of response tokens (10/48 in smoke test), passes through unchanged
when top_frac is None, and remains differentiable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>