Add entropy-gradient probing script for RL checkpoints by Upcccccc · Pull Request #4 · RiddleHe/nanochat

Upcccccc · 2026-04-21T00:48:41Z

Summary

Adds nanorl/scripts/probe_entropy_gradients.py — offline probe for which token positions drive gradient updates in RL checkpoints.

What it does

For each HF checkpoint:

Samples trajectories from a math RL prompt subset (uses build_rl_dataset, so RL_DATASET_PATH env var applies).
Per response token, computes entropy at the current checkpoint and buckets tokens by entropy quantile (default 5 bins, i.e. 20-percentile buckets).
Picks up to --max-positions-per-group positions per (entropy_bin × correctness) group.
For each position, computes ∂(token_logp)/∂W w.r.t. targeted rows of self_attn.k_proj / mlp.down_proj across configurable layers.
Writes per-token records + (ckpt × bin × correctness × param × row) summary: grad_norm, row_cosine, proj_energy_frac, effective_rank.
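
As a rough illustration of the entropy-bucketing step above, here is a minimal pure-Python sketch. It is not taken from probe_entropy_gradients.py: the helper names token_entropy and quantile_bins are hypothetical, the toy logits are made up, and rank-based equal-count binning is only one plausible reading of "entropy quantile".

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def quantile_bins(values, num_bins=5):
    """Assign each value a bin index in 0..num_bins-1 by rank (equal-count bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = min(rank * num_bins // len(values), num_bins - 1)
    return bins

# Toy logits for 10 response-token positions over a 3-token vocabulary.
# Position 0 is uniform (highest entropy); later positions are more peaked.
logits_per_position = [[float(k), 0.0, 0.0] for k in range(10)]
entropies = [token_entropy(l) for l in logits_per_position]
bins = quantile_bins(entropies, num_bins=5)
```

With 5 bins, each bin holds roughly 20% of the response tokens, so bin 4 collects the highest-entropy positions.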

Caveat (module docstring)

This probes per-token log-prob gradients under teacher forcing; it does not reconstruct the exact token-level contribution to the trainer's sample-aggregated DAPO/GRPO objective (trainer reduces response tokens to a sample-level masked-mean log-prob before loss).
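
To make the distinction concrete, the sketch below uses a toy linear-softmax model (logits = W @ x), which is an assumption for illustration only and stands in for neither the Qwen2 network nor the trainer's loss. It shows that the gradient of a masked-mean log-prob is the masked average of the per-token log-prob gradients; whatever sample-level factor the real objective applies on top of that mean is not modeled here.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # stabilize before exponentiating
    return z - np.log(np.exp(z).sum())

def token_logp(W, x, y):
    """log p(y | x) for a toy model with logits = W @ x."""
    return log_softmax(W @ x)[y]

def token_logp_grad(W, x, y):
    """Analytic d log p(y | x) / dW, which is (onehot(y) - p) x^T."""
    p = np.exp(log_softmax(W @ x))
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return np.outer(onehot - p, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))              # toy weight matrix: 4 vocab rows, 3 features
xs = rng.normal(size=(5, 3))             # 5 toy token positions
ys = np.array([0, 1, 2, 3, 1])           # target token at each position
mask = np.array([1.0, 1.0, 1.0, 0.0, 1.0])  # position 3 is masked out

# What a per-token probe looks at: one gradient per position.
per_token = np.stack([token_logp_grad(W, x, y) for x, y in zip(xs, ys)])

# Gradient of the masked-mean log-prob: masked average of the per-token gradients.
masked_mean_grad = (mask[:, None, None] * per_token).sum(axis=0) / mask.sum()
```

In this toy setting each token enters the aggregated gradient with the same 1 / (number of unmasked tokens) weight, so per-token gradient norms describe the direction and relative size of each contribution but not the final update.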

Usage

RL_DATASET_PATH=/path/to/rl.jsonl \
CUDA_VISIBLE_DEVICES=7 \
python -m nanorl.scripts.probe_entropy_gradients \
  --checkpoint-root <run>/checkpoints \
  --steps 20,80,160,240 --include-final \
  --num-prompts 24 --num-samples 2 \
  --max-new-tokens 2048 \
  --entropy-bins 5 --max-positions-per-group 12 \
  --layer-indices 0,14,27 --row-spec start,mid,last \
  --output-dir <out>

Outputs: summary.json, token_probe_records.jsonl, manifest.json.

Fixes vs earlier local drafts

pad-token trimming in sample_trajectories: HF generate(num_return_sequences>1) pads shorter sequences with pad_token_id (= eos_token_id here); those pad positions were polluting entropy/gradient stats. Now trim at first EOS in each response (keeping the EOS itself — its entropy is a real policy decision).
--max-new-tokens default 512 → 2048 (training uses 8192; 512 cut off most math responses).
Removed an unused loop variable.
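
The EOS-trimming fix can be sketched as follows. This is a minimal illustration, not the code in sample_trajectories: the function name trim_at_first_eos and the token ids are hypothetical.

```python
def trim_at_first_eos(response_ids, eos_token_id):
    """Return the response up to and including the first EOS token.

    Tokens after the first EOS are treated as padding and dropped.
    If no EOS is present (e.g. generation hit the length limit),
    the response is returned unchanged.
    """
    for i, tok in enumerate(response_ids):
        if tok == eos_token_id:
            return response_ids[: i + 1]
    return response_ids

EOS = 2  # hypothetical token id; here pad_token_id == eos_token_id
padded = [11, 12, 13, EOS, EOS, EOS]     # shorter sample padded out by generate
trimmed = trim_at_first_eos(padded, EOS)  # keeps the first EOS, drops the padding
```

Keeping the first EOS matches the rationale above: the decision to stop is itself a policy choice with a meaningful entropy, whereas the padding that follows is not.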

Test plan

Module imports cleanly
Smoke: 2 ckpts × 4 prompts × 512 max tokens → 1080 records, no NaN
Full: 5 ckpts × 24 prompts × 1536 max tokens → 10800 records; grad_norm scales monotonically with entropy across all checkpoints

Does not modify

Trainer / losses / rollout / data loader untouched.

Offline probe for answering: at which token positions do gradients live, and do high-entropy ("forking") positions have special gradient stats?

For each HF checkpoint:

1. Sample trajectories from a math RL prompt subset.
2. Compute per-token response entropy; bucket positions by entropy quantile (default 5 bins = 20-percentile buckets).
3. For selected positions per (bin x correctness) group, compute the gradient of that token's log-prob w.r.t. targeted parameter rows on Qwen2 k_proj / down_proj layers.
4. Summarize grad_norm, row_cosine (alignment with current weight row), proj_energy_frac (fraction of grad energy in current-row direction), and effective_rank of stacked gradient rows.

Caveat (called out in the module docstring): this is the per-token log-prob gradient under teacher forcing, not the exact token-level contribution to the current trainer's sample-aggregated DAPO objective.

Entry:

python -m nanorl.scripts.probe_entropy_gradients \
  --checkpoint-root <dir> [--steps 20,80,160,240] [--include-final] \
  --output-dir <out>
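
One plausible way to compute the summary metrics named above is sketched below with NumPy. The formulas are assumptions, not confirmed by the PR: row_cosine is taken as the cosine between a gradient row and the current weight row, proj_energy_frac as the squared norm of the gradient's projection onto the weight row divided by the gradient's squared norm, and effective_rank as the exponential of the Shannon entropy of the normalized singular values. The script may define any of these differently.

```python
import numpy as np

def row_cosine(grad_row, weight_row):
    """Cosine similarity between a gradient row and the current weight row."""
    g = np.asarray(grad_row, dtype=np.float64)
    w = np.asarray(weight_row, dtype=np.float64)
    denom = np.linalg.norm(g) * np.linalg.norm(w)
    return float(g @ w / denom) if denom > 0 else 0.0

def proj_energy_frac(grad_row, weight_row):
    """Fraction of the gradient's squared norm lying along the weight row."""
    g = np.asarray(grad_row, dtype=np.float64)
    w = np.asarray(weight_row, dtype=np.float64)
    gg, ww = g @ g, w @ w
    if gg == 0 or ww == 0:
        return 0.0
    return float((g @ w) ** 2 / (gg * ww))

def effective_rank(stacked_grads):
    """exp(entropy) of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(np.asarray(stacked_grads, dtype=np.float64), compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]                         # skip exact zeros to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy values: weight row along the first axis, gradient row at an angle to it.
weight_row = [1.0, 0.0]
grad_row = [3.0, 4.0]
cos = row_cosine(grad_row, weight_row)
frac = proj_energy_frac(grad_row, weight_row)
rank_full = effective_rank(np.eye(2))                # two equal singular values
rank_one = effective_rank([[1.0, 2.0], [2.0, 4.0]])  # rows are collinear
```

Under these definitions proj_energy_frac equals the square of row_cosine for a single row, so the two differ mainly in that the cosine keeps the sign of the alignment.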

Upcccccc wants to merge 1 commit into master from probe-entropy-gradients
