test(jax): prove C49k checkpoint migration is value-exact #877

Merged: ocg-goodfire merged 1 commit into feature/jax from worktree-agent-ad952bba2d82a7deb on Jun 17, 2026
@ocg-goodfire

Collaborator

What

Adds param_decomp_jax/jax_single_pool/tools/verify_c49k_migration.py, a streaming leaf-by-leaf verifier that proves the C49k checkpoint migration (migrate_c49k_checkpoint.py, #870) is value-exact — i.e. the migrated checkpoint reproduces the original decomposition bit-for-bit, not just structurally.

This closes the #1 correctness gap flagged in MIGRATION_REVIEW.md: the migration was verified for structure (shapes, step==175000, finiteness) but not for value. A swapped V/U pair, a mis-mapped g/u/d→gate/up/down key, or a wrong squeeze axis would have passed every existing check and silently corrupted the 175k fine-tune base.

How it proves it

The migration is a pure copy + reshape (re-key + squeeze the legacy leading singleton) — no recompute — so every migrated leaf must equal its source leaf bit-for-bit under the remap. The verifier:

  • Builds the inverse of the migration's remap from that tool's own constants (KIND_TO_SITE_SUFFIX, SOURCE_STATE_KEY), so the mapping under test == the mapping applied.
  • Asserts the remap covers both trees 1:1 (every source leaf and every migrated leaf mapped exactly once — 144 each).
  • For each leaf: restores the source leaf and its migrated counterpart single-device on CPU, one pair at a time (all other leaves are PLACEHOLDER and never read from disk), squeezes the singleton on the 6 component V/U leaves, and compares them with np.array_equal. Peak RAM is ~2× one V/U leaf (~1.6 GB); the 47 GB trees never coexist (having both resident is what OOM-killed the migration).
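The coverage check and the per-leaf comparison described in the list above can be sketched roughly as follows. This is a minimal illustration on toy arrays, not the code in verify_c49k_migration.py: the key names (`legacy/g/V`, `new/gate_proj/0`) and the helpers `invert_remap`, `check_coverage`, and `leaves_match` are hypothetical, and real leaves would be restored from the checkpoints rather than built in memory.

```python
import numpy as np

def invert_remap(forward):
    """Invert a source->migrated key map, refusing maps that are not one-to-one."""
    inverse = {}
    for src_key, dst_key in forward.items():
        if dst_key in inverse:
            raise ValueError(
                f"{dst_key!r} has two sources: {inverse[dst_key]!r}, {src_key!r}"
            )
        inverse[dst_key] = src_key
    return inverse

def check_coverage(forward, source_keys, migrated_keys):
    """Require every source key and every migrated key to be mapped exactly once."""
    inverse = invert_remap(forward)
    # Symmetric differences catch both unmapped leaves and mappings to unknown leaves.
    stray = (set(source_keys) ^ set(forward)) | (set(migrated_keys) ^ set(inverse))
    if stray:
        raise AssertionError(f"remap is not 1:1 over both trees: {sorted(stray)}")
    return inverse

def leaves_match(source_leaf, migrated_leaf, squeeze_leading=False):
    """Compare one leaf pair by value, optionally dropping a leading singleton axis."""
    if squeeze_leading:
        # np.squeeze raises ValueError if axis 0 does not have length 1.
        source_leaf = np.squeeze(source_leaf, axis=0)
    # np.array_equal returns False on a shape mismatch instead of broadcasting.
    return (source_leaf.dtype == migrated_leaf.dtype
            and bool(np.array_equal(source_leaf, migrated_leaf)))

# Toy data with hypothetical key names; the legacy leaves carry a leading singleton.
forward = {"legacy/g/V": "new/gate_proj/0", "legacy/g/U": "new/gate_proj/1"}
rng = np.random.default_rng(0)
source = {
    "legacy/g/V": rng.standard_normal((1, 4, 3)).astype(np.float32),
    "legacy/g/U": rng.standard_normal((1, 3, 5)).astype(np.float32),
}
migrated = {forward[key]: leaf[0].copy() for key, leaf in source.items()}

inverse = check_coverage(forward, source, migrated)
results = {
    dst: leaves_match(source[src], migrated[dst], squeeze_leading=True)
    for dst, src in inverse.items()
}

# A V/U swap in the migrated tree fails the comparison for both leaves.
swapped = {
    "new/gate_proj/0": migrated["new/gate_proj/1"],
    "new/gate_proj/1": migrated["new/gate_proj/0"],
}
swapped_results = {
    dst: leaves_match(source[src], swapped[dst], squeeze_leading=True)
    for dst, src in inverse.items()
}
```

Note that np.array_equal compares values: by default it treats NaN as unequal to itself and +0.0 as equal to -0.0. A strictly byte-level check would compare the raw buffers instead.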

The g/u/d→gate/up/down mapping and V→[0]/U→[1] ordering are independently confirmed by the asymmetric down_proj shapes (d_in=14336 vs 4096 for gate/up), which would mismatch under any swap or mismap.
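A small sketch of why the shapes act as an independent check, using shape tuples only. It assumes V is laid out as (d_in, C) and U as (C, d_out), that gate/up map 4096→14336 while down maps 14336→4096, and a toy component count C = 8. These layouts and the helper names are assumptions for illustration, not taken from the tool.

```python
D_MODEL, D_MLP, C = 4096, 14336, 8  # C is a toy component count (assumption)

def expected_shapes(site):
    """Assumed V/U shapes for one MLP site: V is (d_in, C), U is (C, d_out)."""
    if site == "down_proj":
        d_in, d_out = D_MLP, D_MODEL
    else:  # gate_proj and up_proj
        d_in, d_out = D_MODEL, D_MLP
    return {"V": (d_in, C), "U": (C, d_out)}

def shape_compatible(site_a, kind_a, site_b, kind_b):
    """True when a leaf of (site_a, kind_a) could be stored as (site_b, kind_b)."""
    return expected_shapes(site_a)[kind_a] == expected_shapes(site_b)[kind_b]

vu_swap_fits = shape_compatible("down_proj", "V", "down_proj", "U")
gate_as_down_fits = shape_compatible("gate_proj", "V", "down_proj", "V")
gate_as_up_fits = shape_compatible("gate_proj", "V", "up_proj", "V")
```

Under these assumed layouts, a V/U swap or a mix-up between down and gate/up cannot fit by shape, whereas gate and up share a layout, so telling those two apart relies on the per-leaf value comparison.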

Verdict

per-group:
  [PASS] ci_fn                    37/37 leaves bit-identical
  [PASS] ci_fn_opt_state          76/76 leaves bit-identical
  [PASS] components               6/6 leaves bit-identical
  [PASS] components_opt_state     14/14 leaves bit-identical
  [PASS] sources                  3/3 leaves bit-identical
  [PASS] sources_opt_state        7/7 leaves bit-identical
  [PASS] step                     1/1 leaves bit-identical

VERDICT: PASS — all 144 leaves bit-identical under the remap.
The migrated 175k checkpoint is VALUE-EXACT.

The migrated 175k checkpoint is value-exact. The fine-tune base is proven sound.
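For illustration, a per-group tally of this kind could be produced by streaming leaf pairs through lazy loaders, so that only one pair is resident at a time. This is a sketch under assumptions: `tally` and `report` are hypothetical names, the loaders stand in for single-leaf checkpoint restores, and the line format only imitates the output above.

```python
from collections import defaultdict

import numpy as np

def tally(pairs):
    """Count identical leaves per group.

    `pairs` yields (group, load_source, load_migrated); each loader is a
    zero-argument callable, so a leaf is only materialized when it is compared.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [identical, total]
    for group, load_source, load_migrated in pairs:
        src, dst = load_source(), load_migrated()
        counts[group][1] += 1
        if np.array_equal(src, dst):
            counts[group][0] += 1
        del src, dst  # drop references before the next pair is loaded
    return dict(counts)

def report(counts):
    """Format one PASS/FAIL line per group, sorted by group name."""
    lines = []
    for group in sorted(counts):
        same, total = counts[group]
        status = "PASS" if same == total else "FAIL"
        lines.append(f"[{status}] {group:<24} {same}/{total} leaves bit-identical")
    return lines

# Toy loaders standing in for single-leaf restores.
a = np.arange(6, dtype=np.float32)
pairs = [
    ("ci_fn", lambda: a.copy(), lambda: a.copy()),
    ("ci_fn", lambda: a * 2, lambda: a * 2),
    ("sources", lambda: a.copy(), lambda: a + 1),  # deliberately corrupted leaf
]
counts = tally(pairs)
lines = report(counts)
```

Passing loaders rather than arrays keeps peak memory near the size of one pair, which matches the ~2× one-leaf bound described above.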

make check-jax clean on the new file. Read-only tool — touches neither checkpoint dir, goldens unaffected.

🤖 Generated with Claude Code

Adds tools/verify_c49k_migration.py, a streaming leaf-by-leaf verifier that
proves the migrated 175k checkpoint (p-bd3cd4d4) reproduces the frozen-clone
source (jax-l18-C49k-200k) bit-for-bit under migrate_c49k_checkpoint.py's
remap, closing the #1 correctness gap in MIGRATION_REVIEW.md (structure was
verified, value was not).

The comparison table is the inverse of the migration's remap, built from that
tool's own constants (KIND_TO_SITE_SUFFIX, SOURCE_STATE_KEY) so the mapping
under test and the mapping applied are the same. Each leaf pair is restored
single-device on CPU one at a time (all other leaves PLACEHOLDER, never read),
so peak RAM is ~2x one V/U leaf (~1.6 GB) and the 47 GB trees never coexist.
Asserts 1:1 coverage of both trees, then np.array_equal per leaf (squeezing
the legacy leading singleton on the 6 component V/U leaves).

Run verdict: all 144 leaves bit-identical -> migration VALUE-EXACT.
The g/u/d->gate/up/down mapping and V->[0]/U->[1] ordering are confirmed by
the asymmetric down_proj shapes (d_in=14336), which would mismatch under any
swap or mismap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ocg-goodfire force-pushed the worktree-agent-ad952bba2d82a7deb branch from 3af15a3 to 427f59e on June 17, 2026 17:39
ocg-goodfire merged commit 123d132 into feature/jax on Jun 17, 2026