Add entropy-gradient probing script for RL checkpoints #4

Open
Upcccccc wants to merge 1 commit into master from probe-entropy-gradients

@Upcccccc

Collaborator

Summary

Adds nanorl/scripts/probe_entropy_gradients.py, an offline probe of which token positions drive gradient updates in RL checkpoints.

What it does

For each HF checkpoint:

  1. Samples trajectories from a math RL prompt subset (uses build_rl_dataset, so RL_DATASET_PATH env var applies).
  2. For each response token, computes the entropy under the current checkpoint and buckets tokens by entropy quantile (default: 5 bins, i.e. 20-percentile buckets).
  3. Picks up to --max-positions-per-group positions per (entropy_bin × correctness) group.
  4. For each position, computes ∂(token_logp)/∂W w.r.t. targeted rows of self_attn.k_proj / mlp.down_proj across configurable layers.
  5. Writes per-token records plus a summary keyed by (ckpt × bin × correctness × param × row) with grad_norm, row_cosine, proj_energy_frac, and effective_rank.
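
Steps 2 and 3 can be illustrated with a small stand-alone sketch. This is not the code in the script: the function names, the rank-based binning rule, and the random sampling within each group are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quantile_bins(values, num_bins=5):
    """Assign each value a bin in [0, num_bins) by rank (near-equal bin sizes)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * num_bins // len(values)
    return bins


def select_positions(bins, correct, max_per_group=12, seed=0):
    """Sample at most max_per_group positions per (entropy_bin, correctness)."""
    groups = defaultdict(list)
    for pos, key in enumerate(zip(bins, correct)):
        groups[key].append(pos)
    rng = random.Random(seed)  # assumption: uniform sampling within a group
    return {
        key: sorted(rng.sample(members, min(max_per_group, len(members))))
        for key, members in groups.items()
    }
```

With 5 bins, each bin holds roughly 20% of the response tokens, and the cap bounds the number of gradient computations per group regardless of how many tokens fall into it.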

Caveat (module docstring)

This probes per-token log-prob gradients under teacher forcing. It does not reconstruct the exact token-level contribution to the trainer's sample-aggregated DAPO/GRPO objective, because the trainer reduces response tokens to a sample-level masked-mean log-prob before computing the loss.
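
As a rough sketch of the gap, assume the per-sample trainer loss can be written as some differentiable function $f$ of the masked-mean log-prob. The symbols $\ell_t$, $m_t$, and $f$ are notation introduced here for illustration, not names from the trainer code. The chain rule then gives:

$$
\ell_t = \log \pi_W(y_t \mid x, y_{<t}), \qquad
\bar{\ell} = \frac{\sum_t m_t\, \ell_t}{\sum_t m_t}, \qquad
\frac{\partial \mathcal{L}}{\partial W}
  = f'(\bar{\ell})\, \frac{1}{\sum_t m_t} \sum_t m_t\, \frac{\partial \ell_t}{\partial W}
$$

The probe measures $\partial \ell_t / \partial W$ for one position $t$ at a time. Under this assumption, each token's contribution to the trainer gradient is that quantity rescaled by the per-sample factor $f'(\bar{\ell}) / \sum_t m_t$, which the probe does not compute.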

Usage

RL_DATASET_PATH=/path/to/rl.jsonl \
CUDA_VISIBLE_DEVICES=7 \
python -m nanorl.scripts.probe_entropy_gradients \
  --checkpoint-root <run>/checkpoints \
  --steps 20,80,160,240 --include-final \
  --num-prompts 24 --num-samples 2 \
  --max-new-tokens 2048 \
  --entropy-bins 5 --max-positions-per-group 12 \
  --layer-indices 0,14,27 --row-spec start,mid,last \
  --output-dir <out>

Outputs: summary.json, token_probe_records.jsonl, manifest.json.

Fixes vs earlier local drafts

  • pad-token trimming in sample_trajectories: HF generate(num_return_sequences>1) pads shorter sequences with pad_token_id (= eos_token_id here), and those pad positions were polluting the entropy/gradient stats. Each response is now trimmed at its first EOS; the EOS itself is kept, since its entropy reflects a real policy decision.
  • --max-new-tokens default 512 → 2048 (training uses 8192; 512 cut off most math responses).
  • Removed an unused loop variable.
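
The pad-token trimming described in the first fix can be sketched as follows. This is a minimal illustration, assuming responses are plain lists of token ids; the helper name `trim_at_first_eos` is hypothetical and not taken from the script.

```python
def trim_at_first_eos(response_ids, eos_token_id):
    """Return the response up to and including its first EOS token.

    Tokens after the first EOS are padding (pad_token_id == eos_token_id),
    so they are dropped. A response without EOS is returned unchanged.
    """
    for i, tok in enumerate(response_ids):
        if tok == eos_token_id:
            return list(response_ids[: i + 1])  # keep the EOS itself
    return list(response_ids)
```

Keeping the first EOS means the decision to stop still contributes one position to the entropy and gradient statistics, while the padding that follows contributes none.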

Test plan

  • Module imports cleanly
  • Smoke: 2 ckpts × 4 prompts × 512 max tokens → 1080 records, no NaN
  • Full: 5 ckpts × 24 prompts × 1536 max tokens → 10800 records; grad_norm scales monotonically with entropy across all checkpoints

Does not modify

Trainer / losses / rollout / data loader untouched.

Offline probe for answering: at which token positions do gradients live,
and do high-entropy ("forking") positions have special gradient stats?

For each HF checkpoint:
  1. Sample trajectories from a math RL prompt subset.
  2. Compute per-token response entropy; bucket positions by entropy quantile
     (default 5 bins = 20-percentile buckets).
  3. For selected positions per (bin x correctness) group, compute the gradient
     of that token's log-prob w.r.t. targeted parameter rows on
     Qwen2 k_proj / down_proj layers.
  4. Summarize grad_norm, row_cosine (alignment with current weight row),
     proj_energy_frac (fraction of grad energy in current-row direction),
     and effective_rank of stacked gradient rows.
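
The summary statistics in step 4 can be sketched in plain Python. The definitions below are assumptions: proj_energy_frac is read as the squared cosine between the gradient row and the weight row, and effective_rank is computed as a participation ratio, which is only one common definition and may differ from what the script uses. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def row_stats(grad_row, weight_row, eps=1e-12):
    """Per-row stats: gradient norm and alignment with the current weight row."""
    g_norm = math.sqrt(dot(grad_row, grad_row))
    w_norm = math.sqrt(dot(weight_row, weight_row))
    cosine = dot(grad_row, weight_row) / max(g_norm * w_norm, eps)
    return {
        "grad_norm": g_norm,
        "row_cosine": cosine,
        # Assumption: fraction of squared gradient norm along the weight row.
        "proj_energy_frac": cosine * cosine,
    }


def effective_rank(grad_rows, eps=1e-12):
    """Participation ratio (sum lam)^2 / sum lam^2 of Gram-matrix eigenvalues.

    Uses trace(G) = sum lam and ||G||_F^2 = sum lam^2 for the symmetric
    Gram matrix G of the stacked rows, so no eigendecomposition is needed.
    """
    gram = [[dot(a, b) for b in grad_rows] for a in grad_rows]
    trace = sum(gram[i][i] for i in range(len(gram)))
    frob_sq = sum(v * v for row in gram for v in row)
    return trace * trace / max(frob_sq, eps)
```

Under this definition, a stack of mutually orthogonal, equal-norm gradient rows has an effective rank equal to the number of rows, and a stack of collinear rows has an effective rank of 1.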

Caveat (called out in the module docstring): this is the per-token log-prob
gradient under teacher forcing, not the exact token-level contribution to
the current trainer's sample-aggregated DAPO objective.

Entry: python -m nanorl.scripts.probe_entropy_gradients \
         --checkpoint-root <dir> [--steps 20,80,160,240] [--include-final] \
         --output-dir <out>