Smooth-L0 (Geman-McClure) importance-minimality loss #852
Open
danbraunai-goodfire wants to merge 1 commit into
…ality loss

Add SmoothL0ImportanceMinimalityLoss, an alternative CI-sparsity penalty φ(c)=c²/(c²+γ²) to the L_p ImportanceMinimalityLoss: flat gradient at 0, bounded (~0.65/γ near c≈γ), so no gradient cliff as the threshold tightens (L_p's p·c^(p-1) blows up as c→0 for p<1). γ anneals like p. Self-contained (no edits to importance_minimality.py).

Register it loss-side (configs.py, dispatch.py) and both imp-min metrics eval-side (eval_metrics) so a run driven by one logs the other's sparsity proxy at eval.

Add unit tests, two cross-logging comparison configs (L_p control + smooth-L0), and a short docs/ note with run commands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
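The ~0.65/γ figure in the commit message can be recovered directly from φ. The short derivation below assumes only the φ(c) stated there. It also suggests that the peak sits at c = γ/√3 rather than exactly at c = γ, where the gradient is 1/(2γ).

```latex
\begin{align}
\varphi(c) &= \frac{c^2}{c^2+\gamma^2}, \\
\varphi'(c) &= \frac{2c\,\gamma^2}{(c^2+\gamma^2)^2}, \qquad \varphi'(0) = 0, \\
\varphi''(c) &= \frac{2\gamma^2(\gamma^2-3c^2)}{(c^2+\gamma^2)^3} = 0 \;\Rightarrow\; c^\ast = \frac{\gamma}{\sqrt{3}}, \\
\varphi'(c^\ast) &= \frac{3\sqrt{3}}{8\gamma} \approx \frac{0.65}{\gamma}, \qquad \varphi'(\gamma) = \frac{1}{2\gamma}, \\
\frac{d}{dc}\,c^{p} &= p\,c^{p-1} \to \infty \quad \text{as } c \to 0^{+},\ p<1.
\end{align}
```

So the smooth-L0 gradient is bounded by a constant over γ for every c, while the L_p gradient has no bound near c = 0 once p < 1.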
What
Adds `SmoothL0ImportanceMinimalityLoss`, a bounded smooth-L0 CI-sparsity penalty `φ(c) = c²/(c²+γ²)` (Geman–McClure), as an alternative to the existing `L_p` `ImportanceMinimalityLoss`. The `L_p` penalty's gradient `p·c^(p-1)` blows up as `c→0` for `p<1` (an infinite cliff at the accumulation point where most components sit). Smooth-L0 is:

- flat at 0 (`φ'(0)=0`): no cliff,
- bounded (the gradient peaks at `~0.65/γ`, at `c = γ/√3`, i.e. on the scale of the meaningful threshold `c≈γ`).

`γ` anneals like `p`. The implementation is self-contained: it does not modify `importance_minimality.py`.

Changes
- `param_decomp/metrics/smooth_l0_importance_minimality.py`: loss + config (`gamma`, `beta`, `gamma_final`, anneal fracs), mirroring the `L_p` structure (entropy term, DDP `world_size` handling, un-namespaced `{name}`/`{name}_no_beta` compute keys).
- Registered loss-side (`configs.py` `AnyLossMetricConfig`, `dispatch.py`) and both imp-min metrics eval-side (`param_decomp_lab/eval_metrics/__init__.py`), so a run driven by one logs the other's sparsity proxy as an eval-only metric.
- `param_decomp/tests/metrics/test_smooth_l0_importance_minimality_loss.py`: 15 tests, incl. the defining flat-finite-gradient-at-0 and bounded-gradient-peaks-at-γ checks.
- Two cross-logging comparison configs for `pile_llama_simple_mlp-4L.yaml`: `..._impmin_lp.yaml` (`L_p` control + smooth-L0 eval probe) and `..._impmin_smoothl0.yaml` (smooth-L0 driver + `L_p` eval probe).
- `docs/smooth_l0_importance_minimality.md`: short note covering the configs, run commands, the metrics to compare, and a local-data fallback for clusters where HF streaming is down.
- `param_decomp/metrics/CLAUDE.md`: variants note.

How to compare
Compare at 5k/10k: `eval/.../CI_L0` total (sparsity) vs `eval/ce_kl/kl_ci_masked` and `eval/loss/PGDReconLoss` (faithfulness). `batch_size` is global, so `--dp 8` and `--dp 16` follow the same trajectory. Sweep `coeff` on each to trace the full sparsity/faithfulness frontier (smooth-L0's loss scale ≈ active-count, so its `coeff` is not 1:1 with `L_p`'s).

Prior validation (sibling `feature/jax` branch, same loss math)

A 10-run sweep found smooth-L0 dominates the `L_p` sparsity/faithfulness trade-off at 5k/10k: ~0.1–0.15 lower KL at matched CI_L0 and lower 20-step adversarial PGD recon, including a fast-anneal cliff-regime (γ→0.05 / p→0.4 by step 8k) pair, training stably. Re-running this on `main` is the purpose of this PR.

Caveat: only early training (5k/10k) was tested; the long-horizon collapse-robustness story is not yet evaluated.
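To make the fast-anneal pair above concrete, here is a minimal sketch of how the gradient ceiling moves as γ and p tighten. It is not the PR's implementation: the linear schedule shape, the helper names (`linear_anneal`, `smooth_l0_peak_grad`, `lp_grad`), the start values (γ=1.0, p=2.0) and the 10k-step horizon are all assumptions chosen for illustration. Only the end points γ→0.05 and p→0.4 by step 8k come from the description.

```python
import math


def linear_anneal(step, total_steps, start, final, start_frac=0.0, end_frac=1.0):
    """Linearly interpolate from `start` to `final` between two fractions of training.

    Hypothetical helper: the PR only says gamma "anneals like p"; the linear
    shape and these parameter names are assumptions.
    """
    frac = step / total_steps
    if frac <= start_frac:
        return start
    if frac >= end_frac:
        return final
    t = (frac - start_frac) / (end_frac - start_frac)
    return start + t * (final - start)


def smooth_l0_peak_grad(gamma):
    """Max over c of phi'(c) = 2*c*gamma^2 / (c^2 + gamma^2)^2, attained at c = gamma/sqrt(3)."""
    return 3.0 * math.sqrt(3.0) / (8.0 * gamma)


def lp_grad(c, p):
    """Gradient p * c^(p-1) of the L_p penalty c^p, for c > 0."""
    return p * c ** (p - 1.0)


if __name__ == "__main__":
    total = 10_000  # illustrative horizon, not taken from the PR configs
    tiny_c = 1e-6   # a component sitting very close to zero
    for step in (0, 4_000, 8_000, 10_000):
        gamma = linear_anneal(step, total, start=1.0, final=0.05, end_frac=0.8)
        p = linear_anneal(step, total, start=2.0, final=0.4, end_frac=0.8)
        print(
            f"step={step:>6} gamma={gamma:.3f} p={p:.3f} "
            f"smooth-L0 peak grad={smooth_l0_peak_grad(gamma):.2f} "
            f"L_p grad at c={tiny_c:g}: {lp_grad(tiny_c, p):.3g}"
        )
```

Under these assumptions the smooth-L0 ceiling grows only like 1/γ as the threshold tightens, whereas the `L_p` gradient at a near-zero component grows without bound once `p` drops below 1.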
Test
`pytest param_decomp/tests/metrics/test_smooth_l0_importance_minimality_loss.py` (15 passed); basedpyright + ruff clean (pre-commit).
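For readers who want to check the two defining properties without the repository, the sketch below restates them as standalone pytest-style tests. These are not the PR's tests, which presumably exercise `SmoothL0ImportanceMinimalityLoss` itself. Here `phi` and `phi_grad` are hypothetical pure-Python stand-ins written from the formula in the description, and the γ values are arbitrary.

```python
import math


def phi(c, gamma):
    """Geman-McClure penalty phi(c) = c^2 / (c^2 + gamma^2)."""
    return c * c / (c * c + gamma * gamma)


def phi_grad(c, gamma):
    """Analytic derivative phi'(c) = 2 * c * gamma^2 / (c^2 + gamma^2)^2."""
    return 2.0 * c * gamma**2 / (c * c + gamma * gamma) ** 2


GAMMAS = (1.0, 0.3, 0.05)


def test_gradient_is_flat_and_finite_at_zero():
    for gamma in GAMMAS:
        g = phi_grad(0.0, gamma)
        assert math.isfinite(g)
        assert g == 0.0


def test_gradient_is_bounded_and_peaks_on_the_gamma_scale():
    for gamma in GAMMAS:
        # Scan c over [0, 5*gamma] on a fine grid and locate the largest gradient.
        grid = [i * gamma / 1000 for i in range(5001)]
        grads = [phi_grad(c, gamma) for c in grid]
        peak = max(grads)
        c_peak = grid[grads.index(peak)]
        # Peak height is 3*sqrt(3)/(8*gamma), roughly 0.65/gamma.
        assert abs(peak * gamma - 3 * math.sqrt(3) / 8) < 1e-4
        # Peak location is gamma/sqrt(3), roughly 0.58*gamma.
        assert abs(c_peak / gamma - 1 / math.sqrt(3)) < 1e-2


def test_analytic_gradient_matches_finite_differences():
    h = 1e-6
    for gamma in GAMMAS:
        for c in (0.1 * gamma, gamma, 3.0 * gamma):
            fd = (phi(c + h, gamma) - phi(c - h, gamma)) / (2 * h)
            assert abs(fd - phi_grad(c, gamma)) < 1e-5
```

Saved to a file, these run under plain `pytest`; a version against the real loss would swap the analytic `phi_grad` for an autograd gradient.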