test(jax): prove C49k checkpoint migration is value-exact #877
Merged
Adds tools/verify_c49k_migration.py, a streaming leaf-by-leaf verifier that proves the migrated 175k checkpoint (p-bd3cd4d4) reproduces the frozen-clone source (jax-l18-C49k-200k) bit-for-bit under migrate_c49k_checkpoint.py's remap, closing the #1 correctness gap in MIGRATION_REVIEW.md (structure was verified, value was not).

The comparison table is the inverse of the migration's remap, built from that tool's own constants (KIND_TO_SITE_SUFFIX, SOURCE_STATE_KEY) so the mapping under test and the mapping applied are the same.

Each leaf pair is restored single-device on CPU one at a time (all other leaves PLACEHOLDER, never read), so peak RAM is ~2x one V/U leaf (~1.6 GB) and the 47 GB trees never coexist. Asserts 1:1 coverage of both trees, then np.array_equal per leaf (squeezing the legacy leading singleton on the 6 component V/U leaves).

Run verdict: all 144 leaves bit-identical -> migration VALUE-EXACT. The g/u/d->gate/up/down mapping and V->[0]/U->[1] ordering are confirmed by the asymmetric down_proj shapes (d_in=14336), which would mismatch under any swap or mismap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Adds param_decomp_jax/jax_single_pool/tools/verify_c49k_migration.py, a streaming leaf-by-leaf verifier that proves the C49k checkpoint migration (migrate_c49k_checkpoint.py, #870) is value-exact: the migrated checkpoint reproduces the original decomposition bit-for-bit, not just structurally.
This closes the #1 correctness gap flagged in MIGRATION_REVIEW.md: the migration was verified for STRUCTURE (shapes / step==175000 / finiteness) but NOT VALUE. A swapped V↔U, a mis-mapped g/u/d→gate/up/down, or a wrong squeeze axis would have passed every existing check and silently corrupted the 175k fine-tune base.

How it proves it
The migration is a pure copy + reshape (re-key + squeeze the legacy leading singleton) with no recompute, so every migrated leaf must equal its source leaf bit-for-bit under the remap. The verifier:
- Builds the comparison table as the inverse of the migration's remap, from that tool's own constants (KIND_TO_SITE_SUFFIX, SOURCE_STATE_KEY), so the mapping under test == the mapping applied.
- Asserts 1:1 coverage of both trees.
- Restores each leaf pair single-device on CPU, one at a time (all other leaves PLACEHOLDER, never read from disk), squeezes the singleton on the 6 component V/U leaves, and compares them with np.array_equal. Peak RAM is ~2× one V/U leaf (~1.6 GB); the 47 GB trees never coexist (that OOM-killed the migration).
The g/u/d→gate/up/down mapping and V→[0]/U→[1] ordering are independently confirmed by the asymmetric down_proj shapes (d_in=14336 vs 4096 for gate/up), which would mismatch under any swap or mismap.

Verdict
The migrated 175k checkpoint is value-exact. The fine-tune base is proven sound.
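For readers who want the shape of the check, here is a minimal sketch of a leaf-by-leaf comparison under an inverted remap. It is not the code in verify_c49k_migration.py: the key names in `FORWARD_REMAP`, the `verify` helper, and the dict-backed loaders are all hypothetical stand-ins, and the rule "squeeze when the source has one extra leading axis of size 1" is an assumption standing in for the tool's handling of the 6 component V/U leaves.

```python
import numpy as np

# Hypothetical forward remap (source key -> migrated key). The real table is
# derived from the migration tool's own constants, whose values are not shown here.
FORWARD_REMAP = {
    "legacy/g/V": "gate_proj/0",
    "legacy/g/U": "gate_proj/1",
    "legacy/d/V": "down_proj/0",
    "legacy/d/U": "down_proj/1",
}


def invert(remap):
    """Build migrated key -> source key, refusing a remap that is not 1:1."""
    inverse = {dst: src for src, dst in remap.items()}
    if len(inverse) != len(remap):
        raise ValueError("remap is not 1:1")
    return inverse


def verify(load_source_leaf, load_migrated_leaf, source_keys, migrated_keys, remap):
    """Return the migrated keys whose values differ from their source leaf."""
    inverse = invert(remap)
    # 1:1 coverage: every leaf in both trees must appear in the table.
    if set(source_keys) != set(remap) or set(migrated_keys) != set(inverse):
        raise AssertionError("coverage mismatch between trees and remap")
    mismatched = []
    for dst in sorted(inverse):
        # Only one leaf pair is held in memory at a time.
        src_leaf = np.asarray(load_source_leaf(inverse[dst]))
        dst_leaf = np.asarray(load_migrated_leaf(dst))
        # Assumed rule: drop a legacy leading singleton axis on the source.
        if src_leaf.ndim == dst_leaf.ndim + 1 and src_leaf.shape[0] == 1:
            src_leaf = np.squeeze(src_leaf, axis=0)
        if not np.array_equal(src_leaf, dst_leaf):
            mismatched.append(dst)
        del src_leaf, dst_leaf
    return mismatched


# Toy trees: V is (d_in, C) and U is (C, d_out), with a leading singleton in the source.
rng = np.random.default_rng(0)
source = {
    "legacy/g/V": rng.standard_normal((1, 8, 4)).astype(np.float32),
    "legacy/g/U": rng.standard_normal((1, 4, 16)).astype(np.float32),
    "legacy/d/V": rng.standard_normal((1, 16, 4)).astype(np.float32),
    "legacy/d/U": rng.standard_normal((1, 4, 8)).astype(np.float32),
}
migrated = {FORWARD_REMAP[k]: v[0].copy() for k, v in source.items()}

ok = verify(source.__getitem__, migrated.__getitem__, source, migrated, FORWARD_REMAP)

# Simulate a V<->U swap on down_proj; the comparison flags both leaves.
swapped = dict(migrated)
swapped["down_proj/0"], swapped["down_proj/1"] = migrated["down_proj/1"], migrated["down_proj/0"]
bad = verify(source.__getitem__, swapped.__getitem__, source, swapped, FORWARD_REMAP)
```

One caveat on the comparison itself: np.array_equal compares by value, so +0.0 and -0.0 compare equal and NaN never compares equal by default. For finite checkpoints the only gap from a byte-level comparison is the sign of zero.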
make check-jax clean on the new file. Read-only tool: touches neither checkpoint dir, goldens unaffected.
🤖 Generated with Claude Code
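The shape argument from "How it proves it" can be illustrated with a small sketch. Only d_in=14336 for down_proj and d_in=4096 for gate/up come from the description above; the V as (d_in, C) and U as (C, d_out) layout, the d_out values, and the tiny component count `C` are assumptions chosen for illustration.

```python
from itertools import permutations

D_MODEL, D_FF = 4096, 14336  # d_in for gate/up and for down_proj, per the PR text
C = 8                        # illustrative component count, not the real one


def vu_shapes(d_in, d_out, c=C):
    """Assumed layout: V maps inputs to components, U maps components to outputs."""
    return {"V": (d_in, c), "U": (c, d_out)}


SITES = {
    "gate_proj": vu_shapes(D_MODEL, D_FF),
    "up_proj": vu_shapes(D_MODEL, D_FF),
    "down_proj": vu_shapes(D_FF, D_MODEL),
}


def shape_detects(perm):
    """True if placing site perm[i] where site i belongs changes some leaf shape."""
    return any(SITES[expected] != SITES[actual] for expected, actual in zip(SITES, perm))


# Every non-identity assignment of sites, and whether shapes alone expose it.
site_swaps = {
    perm: shape_detects(perm)
    for perm in permutations(SITES)
    if perm != tuple(SITES)
}

# A V<->U swap inside any site changes both leaf shapes under this layout.
vu_swap_detected = all(s["V"] != s["U"] for s in SITES.values())
```

Under this assumed layout, any mis-mapping that moves down_proj, and any V↔U swap, fails on shape before values are even compared. A pure gate↔up exchange has identical shapes, so that case relies on the per-leaf value comparison rather than on shapes.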