TokSuite is a collection of models and a benchmark designed for studying how tokenizer choice affects language model behavior. By training multiple 1B-parameter models with identical architectures, data, and training budgets, varying only the tokenizer, TokSuite enables clean scientific ablations that isolate tokenization effects from confounding variables.
- Controlled by design: Same architecture, dataset, training budget, and initialization — only the tokenizer changes.
- Broad coverage: 14 tokenizers evaluated, ranging from character-level and byte-level to subword tokenizers from major model families.
- New robustness benchmark: A custom multilingual evaluation dataset testing model sensitivity to real-world text perturbations that affect tokenization (orthographic noise, diacritics, OCR artifacts, Unicode variants, and more).
- Multilingual focus: The models are trained on English, Chinese, Turkish, Italian, and Farsi, and the parallel benchmark captures real-world perturbations across all five languages by applying them to the same canonical questions translated into each target language.
- Fully open: Code, models, datasets, and paper are all publicly released.
See our paper for details: https://arxiv.org/abs/2512.20757
- Models
- Datasets
- Spaces
- Set-up
- Usage
- Training
- Converting Supertoken Models
- Plotting and Reproducibility
- Citation
- License
We release 14 controlled 1B-parameter models, each trained with a different tokenizer under identical conditions (Llama-3.2-1B architecture, ~100B token training budget). Browse evaluation results on the leaderboard.
| Tokenizer | Method | Vocab. Size | Languages | HuggingFace |
|---|---|---|---|---|
| ByT5 | Bytes | 259 | Language-agnostic | toksuite/google-byt5-small |
| TokenMonster | Custom | 32,000 | English-only | toksuite/tokenmonster-englishcode-32000-consistent-v1 |
| Phi-3 | BPE | 32,064 | Multilingual | toksuite/microsoft-Phi-3-mini-4k-instruct |
| GPT-2 | BPE | 50,257 | English-only | toksuite/gpt2 |
| Comma | BPE | 64,000 | Multilingual | toksuite/common-pile-comma-v0.1 |
| mBERT | WordPiece | 110,000 | Multilingual | toksuite/google-bert-bert-base-multilingual-cased |
| Llama-3.2 | BPE | 128,256 | Multilingual | toksuite/meta-llama-Llama-3.2-1B |
| Tekken | BPE | 130,000 | Multilingual | toksuite/mistralai-tekken |
| Qwen-3 | BPE | 151,646 | Multilingual | toksuite/Qwen-Qwen3-8B |
| GPT-4o | BPE | 200,000 | Multilingual | toksuite/tiktoken-gpt-4o |
| BLOOM | BPE | 250,680 | Multilingual | toksuite/bigscience-bloom |
| Aya | BPE | 255,029 | Multilingual | toksuite/CohereLabs-aya-expanse-8b |
| XGLM | Unigram | 256,008 | Multilingual | toksuite/facebook-xglm-564M |
| Gemma-2 | Unigram | 256,128 | Multilingual | toksuite/google-gemma-2-2b |
All models share the same initialization via a super-vocabulary approach, ensuring fair comparison.
A multilingual corpus of ~100B tokens used to train all suite models:
- 40B tokens from FineWeb-Edu (English)
- 60B tokens distributed across Chinese, Turkish, Italian, and Farsi
Available at: toksuite/toksuite-pretraining-data
A parallel collection of multiple-choice text-completion questions paired with a wide range of real-world surface-form perturbations that are known to interact strongly with tokenization. It covers English, Farsi, Turkish, Italian, and Chinese, as well as STEM and math domains.
Available at: toksuite/toksuite-robustness
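To make concrete why surface-form perturbations interact with tokenization, here is a minimal sketch using only Python's standard library. It is an illustration, not code from the benchmark: the example word and the helper name `strip_diacritics` are assumptions. It shows that a visually identical word can consist of different code points and UTF-8 bytes depending on Unicode normalization, and that removing diacritics changes the string again.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same visible word in two Unicode normalization forms.
composed = unicodedata.normalize("NFC", "café")    # "é" is a single code point
decomposed = unicodedata.normalize("NFD", "café")  # "e" followed by a combining accent

variants = {
    "NFC": composed,
    "NFD": decomposed,
    "no diacritics": strip_diacritics("café"),
}

for name, text in variants.items():
    # A byte-level tokenizer would see exactly these UTF-8 bytes.
    print(f"{name:>14}: {len(text)} code points, {len(text.encode('utf-8'))} bytes")
```

Because subword vocabularies are usually learned from one dominant surface form, variants like these can be segmented very differently even though a human reads them as the same word.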
Explore evaluation results across all 14 TokSuite models and tasks: toksuite/leaderboard
Interactively visualize how different tokenizers segment any text: toksuite/tokenizer-comparison
We recommend using uv (if it is not already available, install it with pip install uv or from https://astral.sh/uv/install.sh). Use the r-three fork of lm-eval until this PR is merged into lm-eval.
On the Killarney cluster (Compute Canada), you need to first load the following modules:
module load StdEnv/2023 gcc/13.3 openmpi/5.0.3 cuda/12.6 python/3.10.13
The first time you run the code, you need to install the packages to the system:
curl -LsSf https://astral.sh/uv/install.sh | sh
# If you don't have a virtual environment already, you can either
# 1. Install the packages to the system (though we don't recommend this)
uv pip install -e . --system
# 2. Create a venv with uv
# make sure to load cuda (locally built with cuda-12.4)
uv venv --python 3.10
source .venv/bin/activate
## First run
uv sync --extra build
uv sync --all-extras
# on machines w/o cuda
uv sync --all-extras --all-groups --no-install-package flash-attn
If you have another uv venv, you can add this package to the original project's pyproject.toml as shown below and run uv sync --extra tokenizers in the main directory:
[project.optional-dependencies]
tokenizers = ["toksuite"]
[tool.uv.sources]
toksuite = { path = "../toksuite", editable = true }
Compute fertility, parity, proportion of continued words (PCW), and vocabulary overlap across tokenizers and languages:
# Run all analyses for all 14 TokSuite tokenizers across all 5 languages
python -m toksuite.scripts.calculate_intrinsic_tokenizer_metrics \
--tokenizers all \
--languages all \
--analyses all
# Run specific analyses for a subset of tokenizers
python -m toksuite.scripts.calculate_intrinsic_tokenizer_metrics \
--tokenizers "GPT-2,Llama-3.2,BLOOM" \
--languages all \
--analyses fertility,parity,pcw
# Use custom tokenizers not in the default list (JSON mapping)
python -m toksuite.scripts.calculate_intrinsic_tokenizer_metrics \
--tokenizers '{"My tokenizer": "org/my-model"}' \
--languages all \
--analyses fertility
--tokenizers: all to use all 14 TokSuite tokenizers, a comma-separated list of shortnames, or a JSON dict mapping display name → HuggingFace path. Supported shortnames: Comma, Llama-3.2, Phi-3, GPT-2, GPT-4o, BLOOM, XGLM, Tekken, ByT5, mBERT, Qwen-3, TokenMonster, Gemma-2, Aya.
--languages: all for all 5 languages (English, Chinese, Turkish, Farsi, Italian), or a comma-separated list of Flores-200 column keys (sentence_eng_Latn, sentence_zho_Hans, sentence_tur_Latn, sentence_pes_Arab, sentence_ita_Latn).
--analyses: all, or a comma-separated subset of: vocab_sizes, vocab_overlap, fertility, parity, pcw, example_tokenizations.
--dataset_name: HuggingFace dataset to use for text-based analyses (default: Muennighoff/flores200).
--sample_size: number of examples to sample (default: 10000).
--sample_sentence: sentence used for example_tokenizations (default: "Hello World").
--dataset_path: local path to a pre-saved Arrow dataset (see note below).
Outputs are saved as CSV files and plots (.png) in the current directory.
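For readers new to these metrics, the following is a minimal sketch of how fertility, PCW, and parity are commonly defined. It is not the implementation in calculate_intrinsic_tokenizer_metrics: the toy tokenizer, the function names, and the example sentences are assumptions, and whitespace word splitting is a simplification that does not suit languages such as Chinese.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split each word into chunks of up to 3 characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens


def fertility(tokenize: Tokenizer, sentences: List[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def pcw(tokenize: Tokenizer, sentences: List[str]) -> float:
    """Proportion of words that are split into more than one token."""
    words = [w for s in sentences for w in s.split()]
    continued = sum(1 for w in words if len(tokenize(w)) > 1)
    return continued / len(words)


def parity(tokenize: Tokenizer, target: List[str], reference: List[str]) -> float:
    """Token-count ratio between parallel sentences (target vs. reference language)."""
    n_target = sum(len(tokenize(s)) for s in target)
    n_reference = sum(len(tokenize(s)) for s in reference)
    return n_target / n_reference


english = ["the cat sat"]
turkish = ["kedi oturdu"]  # parallel translation, for illustration only

print("fertility:", fertility(toy_tokenize, english))
print("pcw:", pcw(toy_tokenize, english))
print("parity (tr/en):", parity(toy_tokenize, turkish, english))
```

A parity above 1 means the tokenizer spends more tokens on the target language than on the reference for the same content.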
Note — Flores-200 compatibility:
datasets >= 3.0 dropped support for Python-based loading scripts, but Muennighoff/flores200 uses one. If you are running with datasets >= 3.0, loading the dataset will fail with RuntimeError: Dataset scripts are no longer supported. To work around this, save the dataset to disk once using an older version, then pass the path via --dataset_path:
pip install "datasets==2.21.0"
python -c "
from datasets import load_dataset
ds = load_dataset('Muennighoff/flores200', 'all', split='dev', trust_remote_code=True)
ds.save_to_disk('/path/to/flores200_dev')
"
pip install "datasets==3.6.0"  # restore your version
Then pass --dataset_path /path/to/flores200_dev when running the script. The calculate_intrinsic_tokenizer_metrics.sh convenience script handles this automatically on first run.
TokSuite tasks are available in lm-evaluation-harness. We provide sample scripts to run evaluation under slurm_scripts/.
You can override any config field from the command line, or create your own YAML config pointing to any HuggingFace model.
Note that you need the most recent lm-eval to run evaluation for tokenmonster, tiktoken, and tekken TokSuite models.
Convenience SLURM scripts are provided for batch evaluation on an HPC cluster:
# Evaluate a single model across all tasks (interactive-style; edit flags inside script)
sbatch slurm_scripts/eval_all_toksuite_models.sh
# Run the common-benchmarks suite across all TokSuite models
sbatch slurm_scripts/eval_toksuite_on_common_benchmarks.sh
Before submitting, update the paths, account name, and GPU partition at the top of each script. The defaults target the Killarney cluster.
Note on tokenizer backends:
- The SLURM scripts auto-detect special tokenizer runtimes and pass a tokenizer_backend value to the evaluation harness via --model_args.
- tokenmonster refers to the TokenMonster tokenizer (a custom implementation) and is handled by passing tokenizer_backend=tokenmonster so the harness uses the TokenMonster runtime.
- tekken (the mistralai-tekken tokenizer in our models) is part of the Mistral family and is handled via the mistral backend (tokenizer_backend=mistral).
If you add other non-standard tokenizers, update the detection logic in slurm_scripts/eval_toksuite_on_common_benchmarks.sh to set the correct tokenizer_backend.
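The detection step described in these notes can be sketched in Python as follows. This is a hypothetical re-expression, not the contents of the SLURM script: the function names and the substring matching rule are assumptions, and only the two backend values (tokenmonster and mistral) come from the notes above.

```python
from typing import Optional


def detect_tokenizer_backend(model_id: str) -> Optional[str]:
    """Guess a tokenizer_backend value from a model ID (hypothetical matching rule)."""
    name = model_id.lower()
    if "tokenmonster" in name:
        return "tokenmonster"
    if "tekken" in name:
        # Tekken belongs to the Mistral family, so it uses the mistral backend.
        return "mistral"
    return None  # standard tokenizers need no special backend


def build_model_args(model_id: str) -> str:
    """Assemble the string passed to the harness via --model_args."""
    args = [f"pretrained={model_id}"]
    backend = detect_tokenizer_backend(model_id)
    if backend is not None:
        args.append(f"tokenizer_backend={backend}")
    return ",".join(args)


for model_id in (
    "toksuite/tokenmonster-englishcode-32000-consistent-v1",
    "toksuite/mistralai-tekken",
    "toksuite/gpt2",
):
    print(build_model_args(model_id))
```

Supporting another non-standard tokenizer would mean adding one more branch to the detection function, mirroring the change you would make in the shell script.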
Analyze and visualize how different tokenizers segment text interactively on the Tokenizer Comparison Space, or run locally:
token-alysis \
--tokenizers meta-llama/Llama-3.2-1B Qwen/Qwen3-8B \
--text "Your input text here"
We use the Lingua framework to train our models; please refer to r-three/lingua for more information on training.
In this repository, we provide auxiliary files for Lingua.
For tiktoken-based tokenizers (gpt-4o, gpt-4), Lingua — the tokenizer backend used during evaluation — requires a local .tiktoken file. Generate one before running evaluation on those models:
python -m toksuite.scripts.create_tiktoken gpt-4o \
--output vocabs/tiktoken-gpt-4o/gpt-4o.tiktoken
| Model / alias | Encoding |
|---|---|
| gpt-4o, gpt-4o-mini | o200k_base |
| gpt-4, gpt-3.5-turbo | cl100k_base |
| gpt-3, gpt-2 | r50k_base |
HuggingFace-backed tokenizers (Llama, Mistral, BLOOM, etc.) do not need this step.
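For orientation, a .tiktoken file as read by tiktoken's BPE loader is plain text with one entry per line: the base64-encoded token bytes, a space, and the token's integer rank. The sketch below writes and re-reads such a file for a toy three-token vocabulary using only the standard library. It illustrates the file format and is not the implementation of create_tiktoken; the helper names and the toy vocabulary are made up for this example.

```python
import base64
import os
import tempfile
from typing import Dict


def write_tiktoken_file(ranks: Dict[bytes, int], path: str) -> None:
    """Write one 'base64(token) rank' line per token, ordered by rank."""
    with open(path, "wb") as f:
        for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
            f.write(base64.b64encode(token) + b" " + str(rank).encode("ascii") + b"\n")


def read_tiktoken_file(path: str) -> Dict[bytes, int]:
    """Parse the same format back into a token-bytes -> rank mapping."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f.read().splitlines():
            if line:
                token_b64, rank = line.split()
                ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


toy_ranks = {b"a": 0, b"b": 1, b"ab": 2}  # toy vocabulary, not a real encoding

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.tiktoken")
    write_tiktoken_file(toy_ranks, path)
    restored = read_tiktoken_file(path)

print(restored == toy_ranks)
```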
To reproduce the TokSuite models, you first need to build the super vocabulary described in Section 3.2 of the paper. The super vocabulary is the union of all 14 tokenizer vocabularies (normalized to UTF-8 bytes), along with per-tokenizer alignment mappings used to initialize shared embedding weights across models. For convenience, we provide initial checkpoints for every model used in the paper at toksuite/initializations. Note that this super vocabulary contains 19 models (5 more than the models used in the paper), but the corresponding initialization for each model is consistent.
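As a rough illustration of the idea (not the repository's super_vocab implementation), the sketch below builds a super vocabulary as the union of two toy vocabularies after encoding each token as UTF-8 bytes, and derives a per-tokenizer mapping from original token ID to super-vocabulary ID. The toy vocabularies, the function name, and the sorted index assignment are assumptions; real tokenizers also need tokenizer-specific normalization (for example, of space markers) that is omitted here.

```python
from typing import Dict, Tuple


def build_super_vocab(
    vocabs: Dict[str, Dict[str, int]],
) -> Tuple[Dict[bytes, int], Dict[str, Dict[int, int]]]:
    """Return (super_vocab, mappings) for a set of token -> ID vocabularies."""
    # Normalize every token to UTF-8 bytes and take the union across tokenizers.
    all_tokens = {tok.encode("utf-8") for vocab in vocabs.values() for tok in vocab}
    # Sorting gives a deterministic index assignment (an arbitrary choice here).
    super_vocab = {tok: idx for idx, tok in enumerate(sorted(all_tokens))}
    # Per-tokenizer alignment: original token ID -> super-vocab ID.
    mappings = {
        name: {tok_id: super_vocab[tok.encode("utf-8")] for tok, tok_id in vocab.items()}
        for name, vocab in vocabs.items()
    }
    return super_vocab, mappings


toy_vocabs = {
    "tok_a": {"the": 0, "cat": 1, "s": 2},
    "tok_b": {"cat": 0, "the": 1, "dog": 2},
}
super_vocab, mappings = build_super_vocab(toy_vocabs)

print(len(super_vocab))  # size of the union
# "cat" has a different ID in each tokenizer but a single super-vocab ID.
print(mappings["tok_a"][1] == mappings["tok_b"][0])
```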
Run the script, which handles the tiktoken extraction and vocab build in one step:
bash toksuite/scripts/build_super_vocab.sh
Before running, update the SCRATCH path at the top of build_super_vocab.sh to point to your own scratch or cache directory. This keeps model downloads out of your home directory.
To use a custom set of tokenizers instead, invoke the Python module directly:
python -m toksuite.scripts.super_vocab \
--tokenizers \
google/byt5-small \
toksuite/tokenmonster-englishcode-32000-consistent-v1 \
microsoft/Phi-3-mini-4k-instruct \
openai-community/gpt2 \
nikandish/common-pile-comma-v0.1 \
google-bert/bert-base-multilingual-cased \
meta-llama/Llama-3.2-1B \
mistralai/Mistral-7B-v0.3 \
Qwen/Qwen3-8B \
vocabs/tiktoken-gpt-4o/gpt-4o.tiktoken \
bigscience/bloom \
CohereLabs/aya-expanse-8b \
facebook/xglm-564M \
google/gemma-2-2b \
--output_dir vocabs/
Outputs in vocabs/:
| File | Description |
|---|---|
| super_vocab.json | Master vocabulary mapping token string → super-vocab index |
| {tokenizer}_super_mapping.json | Per-tokenizer alignment: original token ID → super-vocab ID |
| {tokenizer}_vocab.json | Original vocabulary for each tokenizer |
| {tokenizer}.yaml | Tokenizer metadata |
The super_vocab.json and *_super_mapping.json files are then used as the embedding initialization for model training (see Section 3.2 of the paper).
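One way to picture how such mappings can yield a shared initialization: sample a single embedding table over the super vocabulary, then let each model copy the rows its mapping points to, so that tokens shared between tokenizers start from identical vectors. The sketch below shows this with the standard library only. It is an assumption about the general mechanism rather than the training code, and all names, sizes, and the Gaussian initialization are illustrative.

```python
import random
from typing import Dict, List


def init_super_embeddings(super_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Sample one embedding row per super-vocab entry (illustrative Gaussian init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(super_size)]


def init_model_embeddings(
    super_embeddings: List[List[float]], mapping: Dict[int, int]
) -> List[List[float]]:
    """Gather rows so that row i holds the super-vocab vector for original token ID i."""
    return [list(super_embeddings[mapping[tok_id]]) for tok_id in range(len(mapping))]


# Toy mappings (original token ID -> super-vocab ID); both contain super IDs 0 and 3.
mapping_a = {0: 3, 1: 0, 2: 2}
mapping_b = {0: 0, 1: 3, 2: 1}

super_embeddings = init_super_embeddings(super_size=4, dim=8)
emb_a = init_model_embeddings(super_embeddings, mapping_a)
emb_b = init_model_embeddings(super_embeddings, mapping_b)

# The token with super ID 0 is ID 1 in model A and ID 0 in model B.
print(emb_a[1] == emb_b[0])
```

The gather assumes each mapping covers contiguous token IDs starting at 0, which holds for the toy data here.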
model="gpt2"
tokenizer="gpt2"
model_name="craffel/supertoken_models"
model_path="$model_name/$model/"
tokenizer_name="blester125/supervocab-$tokenizer"
hf_model_path="$PROJECT/models/$model_name"
tokenizer_path="$PROJECT/tokenizers/$tokenizer"
hf_out_path="gsaltintas/supertoken_models-llama_$model"
# Create directories
mkdir -p "$hf_model_path"
mkdir -p "$tokenizer_path"
huggingface-cli download $model_name --local-dir=$hf_model_path
huggingface-cli download $tokenizer_name --local-dir=$tokenizer_path
# Convert LLaMA weights to HuggingFace format
echo "Converting model weights to HuggingFace format..."
python -m xarch_tokenizers.scripts.convert_supertoken_models \
--input_dir "$hf_model_path/$model" \
--model_size 1B \
--output_dir "$hf_model_path" \
--llama_version 3 --tokenizer_version 3 \
--tokenizer_path "$tokenizer_path" \
--push_to_hub --output_dir $hf_out_path \
--only_model --public
# Run lm_eval with converted model
## TODO: clean
echo "Running lm_eval..."
lm_eval \
--model hf --model_args "pretrained=$hf_out_path,tokenizer=$tokenizer" \
--device cuda \
--tasks toksuite \
--log_samples \
--verbosity DEBUG \
--output_path "results/tokenization_robustness/v102-cleaned/supertoken/$model"
Here we list ways to reproduce the figures from the paper:
- Figure 3-4-5: Run notebooks/intrinsic-metrics-plots.ipynb
- Table 1:
- Figure 7:
- Figure 8 and Table 6 (Canonical Accuracy): It's fairly easy to reproduce the tables from the paper using toksuite utils on the lm-eval repo
If you use TokSuite in your work, please cite the paper below. A BibTeX entry is provided for convenience.
@inproceedings{altintas2026toksuite,
author = {G{\"u}l Sena Altınta\c{s} and Malikeh Ehghaghi and Brian Lester and Fengyuan Liu and Wanru Zhao and Marco Ciccone and Colin Raffel},
title = {{TokSuite}: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026},
eprint = {2512.20757},
archivePrefix= {arXiv},
url = {https://arxiv.org/abs/2512.20757}
}
This project is licensed under the MIT License; see the LICENSE file for details.
