TokSuite

A controlled suite for measuring the impact of tokenizer choice on language model behavior.

TokSuite is a collection of models and a benchmark designed for studying how tokenizer choice affects language model behavior. By training multiple 1B-parameter models with identical architectures, data, and training budgets, varying only the tokenizer, TokSuite enables clean scientific ablations that isolate tokenization effects from confounding variables.

Controlled by design: Same architecture, dataset, training budget, and initialization — only the tokenizer changes.
Broad coverage: 14 tokenizers evaluated, ranging from character-level and byte-level to subword tokenizers from major model families.
New robustness benchmark: A custom multilingual evaluation dataset testing model sensitivity to real-world text perturbations that affect tokenization (orthographic noise, diacritics, OCR artifacts, Unicode variants, and more).
Multilingual focus: The models are trained on English, Chinese, Turkish, Italian, and Farsi, and the parallel benchmark captures real-world perturbations across all five languages by applying them to the same canonical questions translated into each target language.
Fully open: Code, models, datasets, and paper are all publicly released.

See our paper for details: https://arxiv.org/abs/2512.20757

Models

We release 14 controlled 1B-parameter models, each trained with a different tokenizer under identical conditions (Llama-3.2-1B architecture, ~100B token training budget). Browse evaluation results on the leaderboard.

Tokenizer	Method	Vocab. Size	Languages	HuggingFace
ByT5	Bytes	259	Language-agnostic	toksuite/google-byt5-small
TokenMonster	Custom	32,000	English-only	toksuite/tokenmonster-englishcode-32000-consistent-v1
Phi-3	BPE	32,064	Multilingual	toksuite/microsoft-Phi-3-mini-4k-instruct
GPT-2	BPE	50,257	English-only	toksuite/gpt2
Comma	BPE	64,000	Multilingual	toksuite/common-pile-comma-v0.1
mBERT	WordPiece	110,000	Multilingual	toksuite/google-bert-bert-base-multilingual-cased
Llama-3.2	BPE	128,256	Multilingual	toksuite/meta-llama-Llama-3.2-1B
Tekken	BPE	130,000	Multilingual	toksuite/mistralai-tekken
Qwen-3	BPE	151,646	Multilingual	toksuite/Qwen-Qwen3-8B
GPT-4o	BPE	200,000	Multilingual	toksuite/tiktoken-gpt-4o
BLOOM	BPE	250,680	Multilingual	toksuite/bigscience-bloom
Aya	BPE	255,029	Multilingual	toksuite/CohereLabs-aya-expanse-8b
XGLM	Unigram	256,008	Multilingual	toksuite/facebook-xglm-564M
Gemma-2	Unigram	256,128	Multilingual	toksuite/google-gemma-2-2b

All models share the same initialization via a super-vocabulary approach, ensuring fair comparison.

Datasets

TokSuite Pretraining Data

A multilingual corpus of ~100B tokens used to train all suite models:

40B tokens from FineWeb-Edu (English)
60B tokens distributed across Chinese, Turkish, Italian, and Farsi

Available at: toksuite/toksuite-pretraining-data

TokSuite Robustness Benchmark

A parallel collection of multiple-choice text-completion questions, each paired with a wide range of real-world surface-form perturbations known to interact strongly with tokenization. It covers English, Farsi, Turkish, Italian, and Chinese, as well as STEM and math domains.

Available at: toksuite/toksuite-robustness
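
To give a feel for why such perturbations matter, the sketch below applies two of the perturbation families mentioned above (diacritics and Unicode variants) to a short string and reports the UTF-8 length a byte-level tokenizer would see. It is only an illustration: the helper names are hypothetical and this is not the code used to build the benchmark.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_fullwidth(text: str) -> str:
    """Map printable ASCII characters to their full-width Unicode variants."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")            # ideographic space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # U+FF01..U+FF5E
        else:
            out.append(ch)
    return "".join(out)


def utf8_len(text: str) -> int:
    """Sequence length a purely byte-level tokenizer would see."""
    return len(text.encode("utf-8"))


canonical = "caf\u00e9"  # "café" with a precomposed e-acute
variants = {
    "canonical": canonical,
    "no_diacritics": strip_diacritics(canonical),
    "fullwidth": to_fullwidth("cafe"),
}
for name, text in variants.items():
    print(f"{name:14s} {text!r:10} bytes={utf8_len(text)}")
```

The three strings read the same to a person, but they have different byte lengths and will generally segment differently under a subword tokenizer.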

Spaces

Leaderboard

Explore evaluation results across all 14 TokSuite models and tasks: toksuite/leaderboard

Tokenizer Comparison

Interactively visualize how different tokenizers segment any text: toksuite/tokenizer-comparison

Set-up

We recommend using uv (if it is not already available, install it with pip install uv or with the installer at https://astral.sh/uv/install.sh). Use the r-three lm-eval fork until this PR is merged into lm-eval.

On Killarney

On the Killarney cluster (Compute Canada), you need to first load the following modules:

module load StdEnv/2023  gcc/13.3  openmpi/5.0.3 cuda/12.6 python/3.10.13

The first time you run the code, you also need to install uv and the packages:

curl -LsSf https://astral.sh/uv/install.sh | sh

# If you don't have a virtual environment already, you can either
# 1. Install the packages to the system (though we don't recommend this)
uv pip install -e . --system

# 2. Create a venv with uv
# make sure to load cuda (locally built with cuda-12.4)
uv venv --python 3.10
source .venv/bin/activate
## First run
uv sync --extra build
uv sync --all-extras
# on machines w/o cuda
uv sync --all-extras --all-groups  --no-install-package flash-attn

If you already have another uv venv, you can add this package to that project's pyproject.toml as shown below and run uv sync --extra tokenizers in its main directory:

[project.optional-dependencies]
tokenizers = ["toksuite"]

[tool.uv.sources]
toksuite = { path = "../toksuite", editable = true }

Usage

Computing Intrinsic Tokenizer Metrics

Compute fertility, parity, proportion of continued words (PCW), and vocabulary overlap across tokenizers and languages:

# Run all analyses for all 14 TokSuite tokenizers across all 5 languages
python -m toksuite.scripts.calculate_intrinsic_tokenizer_metrics \
  --tokenizers all \
  --languages all \
  --analyses all

# Run specific analyses for a subset of tokenizers
python -m toksuite.scripts.calculate_intrinsic_tokenizer_metrics \
  --tokenizers "GPT-2,Llama-3.2,BLOOM" \
  --languages all \
  --analyses fertility,parity,pcw

# Use custom tokenizers not in the default list (JSON mapping)
python -m toksuite.scripts.calculate_intrinsic_tokenizer_metrics \
  --tokenizers '{"My tokenizer": "org/my-model"}' \
  --languages all \
  --analyses fertility

--tokenizers: all to use all 14 TokSuite tokenizers, a comma-separated list of shortnames, or a JSON dict mapping display name → HuggingFace path. Supported shortnames: Comma, Llama-3.2, Phi-3, GPT-2, GPT-4o, BLOOM, XGLM, Tekken, ByT5, mBERT, Qwen-3, TokenMonster, Gemma-2, Aya.

--languages: all for all 5 languages (English, Chinese, Turkish, Farsi, Italian), or a comma-separated list of Flores-200 column keys (sentence_eng_Latn, sentence_zho_Hans, sentence_tur_Latn, sentence_pes_Arab, sentence_ita_Latn).

--analyses: all, or a comma-separated subset of: vocab_sizes, vocab_overlap, fertility, parity, pcw, example_tokenizations.

--dataset_name: HuggingFace dataset to use for text-based analyses (default: Muennighoff/flores200).

--sample_size: number of examples to sample (default: 10000).

--sample_sentence: sentence used for example_tokenizations (default: "Hello World").

--dataset_path: local path to a pre-saved Arrow dataset (see note below).
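
The three accepted forms of --tokenizers described above (all, a comma-separated list of shortnames, or a JSON mapping) can be told apart with a few lines of parsing. The sketch below is a hypothetical helper, not the script's actual argument handling; the function name and the convention of returning None for built-in shortnames are assumptions made for illustration.

```python
import json

# Shortnames as listed in this README.
SHORTNAMES = [
    "Comma", "Llama-3.2", "Phi-3", "GPT-2", "GPT-4o", "BLOOM", "XGLM",
    "Tekken", "ByT5", "mBERT", "Qwen-3", "TokenMonster", "Gemma-2", "Aya",
]


def parse_tokenizers_arg(value: str) -> dict:
    """Hypothetical helper: normalise the three accepted forms.

    Returns {display name: HuggingFace path or None}. None means the
    shortname would be resolved from the suite's built-in table, which
    this sketch does not reproduce.
    """
    value = value.strip()
    if value == "all":
        return {name: None for name in SHORTNAMES}
    if value.startswith("{"):
        mapping = json.loads(value)
        if not all(isinstance(v, str) for v in mapping.values()):
            raise ValueError("JSON form must map display name -> path string")
        return mapping
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [n for n in names if n not in SHORTNAMES]
    if unknown:
        raise ValueError(f"unknown tokenizer shortnames: {unknown}")
    return {name: None for name in names}


print(parse_tokenizers_arg("GPT-2,Llama-3.2,BLOOM"))
print(parse_tokenizers_arg('{"My tokenizer": "org/my-model"}'))
```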

Outputs are saved as CSV files and plots (.png) in the current directory.
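
For readers unfamiliar with the metrics, the sketch below shows one common way to define fertility, PCW, and parity, using a toy tokenizer so that it runs without downloads. The definitions follow common usage in the tokenization literature and may differ in detail from what the script computes. Splitting words on whitespace is an assumption that does not hold for languages such as Chinese.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def chunk_tokenizer(size: int) -> Tokenizer:
    """Toy tokenizer: split each whitespace word into fixed-size pieces."""
    def tokenize(text: str) -> List[str]:
        return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
    return tokenize


def fertility(tokenize: Tokenizer, sentences: List[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def pcw(tokenize: Tokenizer, sentences: List[str]) -> float:
    """Proportion of words that are split into two or more tokens."""
    words = [w for s in sentences for w in s.split()]
    continued = sum(1 for w in words if len(tokenize(w)) > 1)
    return continued / len(words)


def parity(tokenize: Tokenizer, source: List[str], reference: List[str]) -> float:
    """Mean token-count ratio over parallel sentence pairs (source / reference)."""
    ratios = [len(tokenize(s)) / len(tokenize(r)) for s, r in zip(source, reference)]
    return sum(ratios) / len(ratios)


tok = chunk_tokenizer(4)
english = ["the cat sleeps"]
italian = ["il gatto dorme profondamente"]
print("fertility (en):", fertility(tok, english))
print("pcw (en):", pcw(tok, english))
print("parity (it/en):", parity(tok, italian, english))
```

A parity above 1 means the tokenizer needs more tokens for the source language than for the reference language on the same content.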

Note (Flores-200 compatibility): datasets >= 3.0 dropped support for Python-based loading scripts, but Muennighoff/flores200 uses one. If you are running with datasets >= 3.0, loading the dataset will fail with RuntimeError: Dataset scripts are no longer supported. To work around this, save the dataset to disk once using an older version, then pass the path via --dataset_path:

pip install "datasets==2.21.0"
python -c "
from datasets import load_dataset
ds = load_dataset('Muennighoff/flores200', 'all', split='dev', trust_remote_code=True)
ds.save_to_disk('/path/to/flores200_dev')
"
pip install "datasets==3.6.0"   # restore your version

Then pass --dataset_path /path/to/flores200_dev when running the script. The calculate_intrinsic_tokenizer_metrics.sh convenience script handles this automatically on first run.

Running Evaluation

TokSuite tasks are available in lm-evaluation-harness. We provide sample evaluation scripts under slurm_scripts/.

You can override any config field from the command line, or create your own YAML config pointing to any HuggingFace model.

Note that you need the most recent lm-eval to run evaluation for tokenmonster, tiktoken, and tekken TokSuite models.

Running on an HPC Cluster (SLURM)

Convenience SLURM scripts are provided for batch evaluation on an HPC cluster:

# Evaluate a single model across all tasks (interactive-style; edit flags inside script)
sbatch slurm_scripts/eval_all_toksuite_models.sh

# Run the common-benchmarks suite across all TokSuite models
sbatch slurm_scripts/eval_toksuite_on_common_benchmarks.sh

Before submitting, update the paths, account name, and GPU partition at the top of each script. The defaults target the Killarney cluster.

Note on tokenizer backends:

The SLURM scripts auto-detect special tokenizer runtimes and pass a tokenizer_backend value to the evaluation harness via --model_args.
tokenmonster refers to the TokenMonster tokenizer (a custom implementation) and is handled by passing tokenizer_backend=tokenmonster so the harness uses the TokenMonster runtime.
tekken (the mistralai-tekken tokenizer in our models) is part of the Mistral family and is handled via the mistral backend (tokenizer_backend=mistral).

If you add other non-standard tokenizers, update the detection logic in slurm_scripts/eval_toksuite_on_common_benchmarks.sh to set the correct tokenizer_backend.
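
The detection logic itself lives in the shell script. As a rough illustration of the rule described in the note, the Python sketch below maps a model id to a tokenizer_backend value and assembles a --model_args string. Matching on substrings of the model id, the function names, and the pretrained= key layout are assumptions; only the two backend values come from the note above.

```python
from typing import Optional

# Substring of the model id -> tokenizer_backend value.
# The two entries come from the note above; substring matching is assumed.
BACKEND_RULES = {
    "tokenmonster": "tokenmonster",
    "tekken": "mistral",
}


def detect_tokenizer_backend(model_id: str) -> Optional[str]:
    """Return the backend for special tokenizers, or None for the default."""
    lowered = model_id.lower()
    for needle, backend in BACKEND_RULES.items():
        if needle in lowered:
            return backend
    return None


def build_model_args(model_id: str) -> str:
    """Assemble a comma-separated --model_args value (hypothetical layout)."""
    args = [f"pretrained={model_id}"]
    backend = detect_tokenizer_backend(model_id)
    if backend is not None:
        args.append(f"tokenizer_backend={backend}")
    return ",".join(args)


for model in ("toksuite/gpt2", "toksuite/mistralai-tekken"):
    print(build_model_args(model))
```

Adding another non-standard tokenizer would then amount to one more entry in the rule table.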

Comparing Tokenizers

Analyze and visualize how different tokenizers segment text interactively on the Tokenizer Comparison Space, or run locally:

token-alysis \
  --tokenizers meta-llama/Llama-3.2-1B Qwen/Qwen3-8B \
  --text "Your input text here"

Training

We train our models with the Lingua framework; please refer to r-three/lingua for more information on training.

In this repository, we provide auxiliary files for Lingua.

Extracting tiktoken Vocabulary Files

For tiktoken-based tokenizers (gpt-4o, gpt-4), Lingua — the tokenizer backend used during evaluation — requires a local .tiktoken file. Generate one before running evaluation on those models:

python -m toksuite.scripts.create_tiktoken gpt-4o \
    --output vocabs/tiktoken-gpt-4o/gpt-4o.tiktoken

Model / alias	Encoding
`gpt-4o`, `gpt-4o-mini`	`o200k_base`
`gpt-4`, `gpt-3.5-turbo`	`cl100k_base`
`gpt-3`, `gpt-2`	`r50k_base`

HuggingFace-backed tokenizers (Llama, Mistral, BLOOM, etc.) do not need this step.
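
For orientation, a .tiktoken file is a plain-text list with one base64-encoded token and its integer rank per line. The sketch below writes and reads that layout for a toy vocabulary and includes the alias table above as a lookup. It is not the create_tiktoken script. The function names are hypothetical, and in practice the ranks would come from the tiktoken encoding named in the table.

```python
import base64
import tempfile
from pathlib import Path
from typing import Dict

# Alias -> encoding, copied from the table above.
MODEL_TO_ENCODING = {
    "gpt-4o": "o200k_base", "gpt-4o-mini": "o200k_base",
    "gpt-4": "cl100k_base", "gpt-3.5-turbo": "cl100k_base",
    "gpt-3": "r50k_base", "gpt-2": "r50k_base",
}


def write_tiktoken_file(ranks: Dict[bytes, int], path: Path) -> None:
    """Write one '<base64 token> <rank>' line per entry, ordered by rank."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
            f.write(base64.b64encode(token) + b" " + str(rank).encode() + b"\n")


def read_tiktoken_file(path: Path) -> Dict[bytes, int]:
    """Parse the file back into a {token bytes: rank} mapping."""
    ranks: Dict[bytes, int] = {}
    for line in path.read_bytes().splitlines():
        if line:
            token, rank = line.split()
            ranks[base64.b64decode(token)] = int(rank)
    return ranks


toy_ranks = {b"a": 0, b"b": 1, b"ab": 2}  # toy data, not a real vocabulary
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "vocabs" / "toy.tiktoken"
    write_tiktoken_file(toy_ranks, out)
    print(out.read_text())
    print(read_tiktoken_file(out) == toy_ranks)
```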

Building the Super-Vocabulary

To reproduce the TokSuite models, you first need to build the super vocabulary described in Section 3.2 of the paper. The super vocabulary is the union of all 14 tokenizer vocabularies (normalized to UTF-8 bytes), along with per-tokenizer alignment mappings used to initialize shared embedding weights across models. For convenience, we provide initial checkpoints for every model used in the paper at toksuite/initializations. Note that this super vocabulary contains 19 models (5 more than were used in the paper), but the corresponding initialization for each model is consistent.

Run the script, which handles the tiktoken extraction and vocab build in one step:

bash toksuite/scripts/build_super_vocab.sh

Before running, update the SCRATCH path at the top of build_super_vocab.sh to point to your own scratch or cache directory. This keeps model downloads out of your home directory.

To use a custom set of tokenizers instead, invoke the Python module directly:

python -m toksuite.scripts.super_vocab \
  --tokenizers \
    google/byt5-small \
    toksuite/tokenmonster-englishcode-32000-consistent-v1 \
    microsoft/Phi-3-mini-4k-instruct \
    openai-community/gpt2 \
    nikandish/common-pile-comma-v0.1 \
    google-bert/bert-base-multilingual-cased \
    meta-llama/Llama-3.2-1B \
    mistralai/Mistral-7B-v0.3 \
    Qwen/Qwen3-8B \
    vocabs/tiktoken-gpt-4o/gpt-4o.tiktoken \
    bigscience/bloom \
    CohereLabs/aya-expanse-8b \
    facebook/xglm-564M \
    google/gemma-2-2b \
  --output_dir vocabs/

Outputs in vocabs/:

File	Description
`super_vocab.json`	Master vocabulary mapping token string → super-vocab index
`{tokenizer}_super_mapping.json`	Per-tokenizer alignment: original token ID → super-vocab ID
`{tokenizer}_vocab.json`	Original vocabulary for each tokenizer
`{tokenizer}.yaml`	Tokenizer metadata

The super_vocab.json and *_super_mapping.json files are then used as the embedding initialization for model training (see Section 3.2 of the paper).
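
The idea behind these files can be sketched in a few lines: take the union of all vocabularies, record where each tokenizer's IDs land in that union, and gather each model's embedding rows from one shared table so that identical tokens start from identical vectors. The code below is a toy illustration under simplifying assumptions. Tokens are normalized with a plain UTF-8 encode, embeddings are small random lists, and the function names are hypothetical, so this is not the repository's super_vocab implementation.

```python
import random
from typing import Dict, List, Tuple

Vocab = Dict[str, int]  # token string -> original token ID


def build_super_vocab(
    vocabs: Dict[str, Vocab],
) -> Tuple[Dict[bytes, int], Dict[str, Dict[int, int]]]:
    """Union the vocabularies and record original ID -> super-vocab ID."""
    super_vocab: Dict[bytes, int] = {}
    mappings: Dict[str, Dict[int, int]] = {}
    for name, vocab in vocabs.items():
        mapping: Dict[int, int] = {}
        for token, token_id in sorted(vocab.items(), key=lambda kv: kv[1]):
            key = token.encode("utf-8")  # assumed normalisation step
            # New tokens get the next free index; known tokens keep theirs.
            mapping[token_id] = super_vocab.setdefault(key, len(super_vocab))
        mappings[name] = mapping
    return super_vocab, mappings


def init_embeddings(
    mapping: Dict[int, int], super_embeddings: List[List[float]]
) -> List[List[float]]:
    """Gather one row per original token ID from the shared table."""
    return [super_embeddings[mapping[i]] for i in sorted(mapping)]


vocabs = {
    "A": {"the": 0, " cat": 1, "s": 2},
    "B": {"s": 0, "the": 1, "dog": 2},
}
super_vocab, mappings = build_super_vocab(vocabs)

rng = random.Random(0)
shared = [[rng.gauss(0.0, 0.02) for _ in range(4)] for _ in super_vocab]
emb_a = init_embeddings(mappings["A"], shared)
emb_b = init_embeddings(mappings["B"], shared)

# "the" is ID 0 in A and ID 1 in B, yet both start from the same vector.
print(emb_a[0] == emb_b[1])
```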

Converting Lingua Models

model="gpt2"
tokenizer="gpt2"
model_name="craffel/supertoken_models"
model_path="$model_name/$model/"
tokenizer_name="blester125/supervocab-$tokenizer"
hf_model_path="$PROJECT/models/$model_name"
tokenizer_path="$PROJECT/tokenizers/$tokenizer"
hf_out_path="gsaltintas/supertoken_models-llama_$model"

# Create directories
mkdir -p "$hf_model_path"
mkdir -p "$tokenizer_path"

huggingface-cli download "$model_name" --local-dir="$hf_model_path"
huggingface-cli download "$tokenizer_name" --local-dir="$tokenizer_path"
# Convert LLaMA weights to HuggingFace format
echo "Converting model weights to HuggingFace format..."
python -m xarch_tokenizers.scripts.convert_supertoken_models \
    --input_dir "$hf_model_path/$model" \
    --model_size 1B \
    --output_dir "$hf_model_path" \
    --llama_version 3 --tokenizer_version 3 \
    --tokenizer_path "$tokenizer_path" \
    --push_to_hub --output_dir $hf_out_path \
    --only_model --public

# Run lm_eval with converted model
## TODO: clean
echo "Running lm_eval..."
lm_eval \
--model hf --model_args "pretrained=$hf_out_path,tokenizer=$tokenizer" \
--device cuda \
--tasks toksuite \
--log_samples \
--verbosity DEBUG \
--output_path "results/tokenization_robustness/v102-cleaned/supertoken/$model"

Plotting and Reproducibility

Here we list ways to reproduce the figures from the paper:

Figures 3, 4, and 5: Run notebooks/intrinsic-metrics-plots.ipynb
Table 1:
Figure 7:
Figure 8 and Table 6 (Canonical Accuracy): These can be reproduced using the toksuite utils in the lm-eval repo.

Citation

If you use TokSuite in your work, please cite the paper below. A BibTeX entry is provided for convenience.

@inproceedings{altintas2026toksuite,
  author       = {G{\"u}l Sena Altınta\c{s} and Malikeh Ehghaghi and Brian Lester and Fengyuan Liu and Wanru Zhao and Marco Ciccone and Colin Raffel},
  title        = {{TokSuite}: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
  booktitle    = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year         = {2026},
  eprint       = {2512.20757},
  archivePrefix= {arXiv},
  url          = {https://arxiv.org/abs/2512.20757}
}

License

This project is licensed under the MIT License — see the LICENSE file for details.
