Train byte-level BPE tokenizer (MULTI-1379) by RobbieMcKinstry · Pull Request #4 · wack/wubbie

RobbieMcKinstry · 2026-06-29T17:24:27Z

Add a GPT-2-style byte-level BPE tokenizer, trained via the tokenizers
crate, plus a wubbie tokenizer subcommand to train it on a corpus slice.

Locked here (both feed downstream phases and can't change without
retraining):

Vocab size: 32_000 (tokenizer::VOCAB_SIZE), the value the model config
and the loss-at-init ~ ln(vocab) check read from.
Special-token inventory (tokenizer::SPECIAL_TOKENS): pad/bos/eos and the
ChatML-style turn markers, reserved as atomic tokens at fixed low ids
(pad pinned to id 0). The chat-template rendering format is finalized
later; the tokens themselves exist now.

The tokenizer is wired byte-level end to end (pre-tokenizer, decoder,
post-processor) with add_prefix_space=false so decode(encode(text))
is an exact round-trip. wubbie tokenizer trains, writes tokenizer.json,
then runs the DoD checks against a corpus sample: exact round-trip, atomic
special tokens (hard failures), and the ~3.5-4 chars/token compression
ratio (a miss warns).

No corpus is committed yet (the CommonPile slice is still being selected);
unit tests train in-memory on a small synthetic corpus.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

Add a GPT-2-style byte-level BPE tokenizer, trained via the `tokenizers` crate, plus a `wubbie tokenizer` subcommand to train it on a corpus slice. Locked here (both feed downstream phases and can't change without retraining): - Vocab size: 32_000 (`tokenizer::VOCAB_SIZE`), the value the model config and the loss-at-init ~ ln(vocab) check read from. - Special-token inventory (`tokenizer::SPECIAL_TOKENS`): pad/bos/eos and the ChatML-style turn markers, reserved as atomic tokens at fixed low ids (pad pinned to id 0). The chat-template rendering format is finalized later; the tokens themselves exist now. The tokenizer is wired byte-level end to end (pre-tokenizer, decoder, post-processor) with `add_prefix_space=false` so `decode(encode(text))` is an exact round-trip. `wubbie tokenizer` trains, writes tokenizer.json, then runs the DoD checks against a corpus sample: exact round-trip, atomic special tokens (hard failures), and the ~3.5-4 chars/token compression ratio (a miss warns). No corpus is committed yet (the CommonPile slice is still being selected); unit tests train in-memory on a small synthetic corpus. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

…8/1379) Adapt the tokenizer to the revised corpus storage model: the filtered CommonPile slice stays on Hugging Face and is pulled on demand via the pure-Rust `hf-hub` client (pinned by repo + revision), rather than being mirrored to object storage as local `.txt`. - Add `corpus` module: resolves a source (local path or pinned HF dataset) to local shard paths and streams trainable text records out of them. Reads JSON Lines (`.jsonl`/`.jsonl.gz`, document text under a configurable field) and plain text (`.txt`/`.txt.gz`); `.gz` is decompressed transparently. Parsing/IO faults are logged and skipped, not fatal. - Tokenizer training is now format-agnostic: `train_from_sequences` consumes the extracted-text stream (replacing the raw-line `train_from_files`). - `wubbie tokenizer` gains `--hf-repo`/`--hf-revision`/`--hf-file` (XOR with `--input`) and `--text-field`. The HF fetch resolves to cached local paths, trains, and runs the same DoD checks against a corpus sample. - Lock VOCAB_SIZE at 16k (small end of the ~16-32k band): this round of pre-training runs on CPU (iMac), where a smaller vocab keeps the embedding/softmax cheap. Deps: add `hf-hub` (ureq + rustls, pure-Rust TLS) and `flate2` (miniz_oxide, pure-Rust gzip); promote `serde_json` to a normal dependency for JSONL. No corpus is committed yet; HF access itself is verified under MULTI-1378. Unit tests cover format detection, JSONL/gz record extraction, and training. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

Make the trained tokenizer the source of truth for vocabulary size instead of the hardcoded constant. The embedding table and LM head must match the tokenizer exactly, so the model now sizes itself from the actual tokenizer.json rather than defaulting blindly to VOCAB_SIZE. - Add `tokenizer::vocab_size_from_file`: reads the actual vocab size (incl. special tokens) from a tokenizer.json. - `wubbie train --tokenizer <file>`: when given, overrides the resolved `vocab_size` with the tokenizer's actual size — authoritative over every config layer, including an explicit `--vocab-size` (warns on mismatch). - Reframe `VOCAB_SIZE` as the default/target used only until a tokenizer exists; document the derivation on `ModelConfig::vocab_size` and in the model.rs stub so the model-assembly ticket (MULTI-1383) sizes the embeddings from the resolved config, not the constant. The transformer itself is still stubbed; this only wires the config seam (which runs and is tested) so the hint is in place when the stub is built. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

Add a TODO(MULTI-1383) on the optional `--tokenizer` arg: it is only `Option` so the stubbed command and config tests can run without a tokenizer artifact, and should be made required once the training loop exists (a model whose vocab_size doesn't match its tokenizer is always a bug). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

Pulling the (hundreds-of-GB) corpus is now an explicit, resumable step instead of a side effect of tokenizer training. - New `wubbie download` subcommand: fetches a pinned HF dataset's corpus shards into the local HF cache via hf-hub, reporting progress (periodic log lines with per-file %, cumulative bytes, and throughput) and skipping already-cached files so an interrupted run resumes. `--cache-dir` targets a large volume. - `wubbie tokenizer` HF path is now cache-only: it reads shards already in the cache and never bulk-downloads; a missing shard errors with a message to run `download` first. Added `--cache-dir` for symmetry. - corpus: add `download_hf` + a `DownloadReporter` trait (keeps presentation in the CLI layer), factor out HF cache/repo/listing helpers, and thread a `cache_dir` override through `HfSource`. Unit tests cover the byte formatter, the progress reporter's accounting, and the offline file-listing path; the network fetch itself is exercised under MULTI-1378. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

claude added 5 commits June 29, 2026 15:27

RobbieMcKinstry merged commit 672a8f0 into trunk Jun 29, 2026
3 checks passed

RobbieMcKinstry deleted the claude/byte-level-bpe-tokenizer-v0d3v7 branch June 29, 2026 18:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Train byte-level BPE tokenizer (MULTI-1379)#4

Train byte-level BPE tokenizer (MULTI-1379)#4
RobbieMcKinstry merged 5 commits into
trunkfrom
claude/byte-level-bpe-tokenizer-v0d3v7

RobbieMcKinstry commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

RobbieMcKinstry commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants