Fix ANKH tokenizer to load from checkpoint hub id #31

Open
avivko wants to merge 1 commit into Synthyra:main from avivko:fix/ankh-tokenizer-from-checkpoint

avivko commented May 18, 2026

FAST_ANKH_ENCODER always attached the ElnaggarLab/ankh-base tokenizer, so ANKH3 checkpoints (256-token vocab) were tokenized with the wrong token ids, which produced NaNs in the embeddings. This PR loads the tokenizer from config._name_or_path (same pattern as DPLM) and falls back to ankh-base for bare configs.

Add slow tests that fast tokenizer ids match the official repo for ANKH_base, ANKH3_large, and ANKH3_xl.

Summary

  • FAST_ANKH_ENCODER no longer hardcodes ElnaggarLab/ankh-base; it loads the tokenizer from config._name_or_path (e.g. Synthyra/ANKH3_large), matching the checkpoint vocab.
  • Adds slow tests comparing fast vs official token ids for ANKH_base, ANKH3_large, and ANKH3_xl.
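
The tokenizer resolution described in the first bullet could be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the helper names `resolve_tokenizer_id` and `load_ankh_tokenizer` are hypothetical, and the use of `AutoTokenizer` is an assumption about how the tokenizer is loaded. Only `config._name_or_path` and the `ElnaggarLab/ankh-base` fallback come from the description above.

```python
from types import SimpleNamespace

# Fallback id taken from the PR description; the constant name is hypothetical.
ANKH_FALLBACK_TOKENIZER = "ElnaggarLab/ankh-base"


def resolve_tokenizer_id(config, fallback=ANKH_FALLBACK_TOKENIZER):
    """Return the hub id (or local path) the tokenizer should be loaded from."""
    name_or_path = getattr(config, "_name_or_path", None)
    # Bare configs have an empty or missing _name_or_path, so use the fallback.
    return name_or_path if name_or_path else fallback


def load_ankh_tokenizer(config):
    """Load the tokenizer matching the checkpoint (needs network or a local cache)."""
    # Assumption: AutoTokenizer is used here; the PR may load a specific class.
    from transformers import AutoTokenizer  # lazy import keeps the resolver dependency-free

    return AutoTokenizer.from_pretrained(resolve_tokenizer_id(config))


# Stand-ins for real config objects, used only to exercise the resolver.
checkpoint_config = SimpleNamespace(_name_or_path="Synthyra/ANKH3_large")
bare_config = SimpleNamespace(_name_or_path="")

print(resolve_tokenizer_id(checkpoint_config))
print(resolve_tokenizer_id(bare_config))
```

Keeping the id resolution separate from the actual download makes the fallback behaviour easy to check without network access.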

Hub follow-up

After merge, re-push modeling_ankh.py to Synthyra ANKH checkpoints (e.g. via get_weights.py) so trust_remote_code=True users pick up the fix without a local checkout.

Test plan

  • pytest testing/test_ankh_tokenizer.py -v (Docker, 6 passed)
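
For illustration, the kind of token-id comparison these tests perform might look like the sketch below. It is not the content of testing/test_ankh_tokenizer.py: the helper `find_token_id_mismatches` and the `ToyTokenizer` stand-in are hypothetical, and the sketch only assumes that a tokenizer is callable and returns a mapping with an `input_ids` entry, as Hugging Face tokenizers do. A real test would pass the fast tokenizer and the official one instead of the toy objects.

```python
def find_token_id_mismatches(fast_tokenizer, reference_tokenizer, sequences):
    """Return (sequence, fast_ids, reference_ids) for every sequence whose ids differ."""
    mismatches = []
    for sequence in sequences:
        fast_ids = list(fast_tokenizer(sequence)["input_ids"])
        reference_ids = list(reference_tokenizer(sequence)["input_ids"])
        if fast_ids != reference_ids:
            mismatches.append((sequence, fast_ids, reference_ids))
    return mismatches


class ToyTokenizer:
    """Stand-in for a real tokenizer: maps each residue to its index in a vocab string."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, sequence):
        return {"input_ids": [self.vocab.index(residue) for residue in sequence]}


official = ToyTokenizer("ACDE")
matching = ToyTokenizer("ACDE")
wrong_vocab = ToyTokenizer("EDCA")  # same residues, different ids

# A tokenizer with the matching vocab yields no mismatches.
print(find_token_id_mismatches(matching, official, ["ACDE", "AACC"]))
# A tokenizer with a different vocab is reported for the sequence it mis-encodes.
print(find_token_id_mismatches(wrong_vocab, official, ["ACDE"]))
```

Comparing token ids directly, rather than downstream embeddings, catches a vocab mismatch even when both sides of a model parity test happen to share the same tokenizer.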

FAST_ANKH_ENCODER always attached ElnaggarLab/ankh-base, so ANKH3
checkpoints (256-token vocab) were tokenized with the wrong ids in
embed.py. Load from config._name_or_path (same pattern as DPLM) with
ankh-base fallback for bare configs.

Add slow tests that fast tokenizer ids match the official repo for
ANKH_base, ANKH3_large, and ANKH3_xl.

Co-authored-by: Cursor <cursoragent@cursor.com>

avivko commented May 28, 2026

@lhallee An agent wrote this PR, sorry that it is a bit messy. But it fixes an important issue: I was getting NaNs in the embeddings when using ANKH3 and ANKH3XL. These models passed your original parity tests because the parity test uses the native tokenizer for both models being compared, so the tokenizer mismatch never shows up there.
