Skip to content

Train byte-level BPE tokenizer (MULTI-1379)#4

Merged
RobbieMcKinstry merged 5 commits into
trunkfrom
claude/byte-level-bpe-tokenizer-v0d3v7
Jun 29, 2026
Merged

Train byte-level BPE tokenizer (MULTI-1379)#4
RobbieMcKinstry merged 5 commits into
trunkfrom
claude/byte-level-bpe-tokenizer-v0d3v7

Conversation

@RobbieMcKinstry

Copy link
Copy Markdown
Contributor

Add a GPT-2-style byte-level BPE tokenizer, trained via the tokenizers
crate, plus a wubbie tokenizer subcommand to train it on a corpus slice.

Locked here (both feed downstream phases and can't change without
retraining):

  • Vocab size: 32_000 (tokenizer::VOCAB_SIZE), the value the model config
    and the loss-at-init ~ ln(vocab) check read from.
  • Special-token inventory (tokenizer::SPECIAL_TOKENS): pad/bos/eos and the
    ChatML-style turn markers, reserved as atomic tokens at fixed low ids
    (pad pinned to id 0). The chat-template rendering format is finalized
    later; the tokens themselves exist now.

The tokenizer is wired byte-level end to end (pre-tokenizer, decoder,
post-processor) with add_prefix_space=false so decode(encode(text))
is an exact round-trip. wubbie tokenizer trains, writes tokenizer.json,
then runs the DoD checks against a corpus sample: exact round-trip, atomic
special tokens (hard failures), and the ~3.5-4 chars/token compression
ratio (a miss warns).

No corpus is committed yet (the CommonPile slice is still being selected);
unit tests train in-memory on a small synthetic corpus.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD

claude added 5 commits June 29, 2026 15:27
Add a GPT-2-style byte-level BPE tokenizer, trained via the `tokenizers`
crate, plus a `wubbie tokenizer` subcommand to train it on a corpus slice.

Locked here (both feed downstream phases and can't change without
retraining):

- Vocab size: 32_000 (`tokenizer::VOCAB_SIZE`), the value the model config
  and the loss-at-init ~ ln(vocab) check read from.
- Special-token inventory (`tokenizer::SPECIAL_TOKENS`): pad/bos/eos and the
  ChatML-style turn markers, reserved as atomic tokens at fixed low ids
  (pad pinned to id 0). The chat-template rendering format is finalized
  later; the tokens themselves exist now.

The tokenizer is wired byte-level end to end (pre-tokenizer, decoder,
post-processor) with `add_prefix_space=false` so `decode(encode(text))`
is an exact round-trip. `wubbie tokenizer` trains, writes tokenizer.json,
then runs the DoD checks against a corpus sample: exact round-trip, atomic
special tokens (hard failures), and the ~3.5-4 chars/token compression
ratio (a miss warns).

No corpus is committed yet (the CommonPile slice is still being selected);
unit tests train in-memory on a small synthetic corpus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
…8/1379)

Adapt the tokenizer to the revised corpus storage model: the filtered
CommonPile slice stays on Hugging Face and is pulled on demand via the
pure-Rust `hf-hub` client (pinned by repo + revision), rather than being
mirrored to object storage as local `.txt`.

- Add `corpus` module: resolves a source (local path or pinned HF dataset)
  to local shard paths and streams trainable text records out of them.
  Reads JSON Lines (`.jsonl`/`.jsonl.gz`, document text under a configurable
  field) and plain text (`.txt`/`.txt.gz`); `.gz` is decompressed
  transparently. Parsing/IO faults are logged and skipped, not fatal.
- Tokenizer training is now format-agnostic: `train_from_sequences` consumes
  the extracted-text stream (replacing the raw-line `train_from_files`).
- `wubbie tokenizer` gains `--hf-repo`/`--hf-revision`/`--hf-file` (XOR with
  `--input`) and `--text-field`. The HF fetch resolves to cached local paths,
  trains, and runs the same DoD checks against a corpus sample.
- Lock VOCAB_SIZE at 16k (small end of the ~16-32k band): this round of
  pre-training runs on CPU (iMac), where a smaller vocab keeps the
  embedding/softmax cheap.

Deps: add `hf-hub` (ureq + rustls, pure-Rust TLS) and `flate2` (miniz_oxide,
pure-Rust gzip); promote `serde_json` to a normal dependency for JSONL.

No corpus is committed yet; HF access itself is verified under MULTI-1378.
Unit tests cover format detection, JSONL/gz record extraction, and training.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
Make the trained tokenizer the source of truth for vocabulary size instead
of the hardcoded constant. The embedding table and LM head must match the
tokenizer exactly, so the model now sizes itself from the actual
tokenizer.json rather than defaulting blindly to VOCAB_SIZE.

- Add `tokenizer::vocab_size_from_file`: reads the actual vocab size
  (incl. special tokens) from a tokenizer.json.
- `wubbie train --tokenizer <file>`: when given, overrides the resolved
  `vocab_size` with the tokenizer's actual size — authoritative over every
  config layer, including an explicit `--vocab-size` (warns on mismatch).
- Reframe `VOCAB_SIZE` as the default/target used only until a tokenizer
  exists; document the derivation on `ModelConfig::vocab_size` and in the
  model.rs stub so the model-assembly ticket (MULTI-1383) sizes the
  embeddings from the resolved config, not the constant.

The transformer itself is still stubbed; this only wires the config seam
(which runs and is tested) so the hint is in place when the stub is built.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
Add a TODO(MULTI-1383) on the optional `--tokenizer` arg: it is only
`Option` so the stubbed command and config tests can run without a
tokenizer artifact, and should be made required once the training loop
exists (a model whose vocab_size doesn't match its tokenizer is always a
bug).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
Pulling the (hundreds-of-GB) corpus is now an explicit, resumable step
instead of a side effect of tokenizer training.

- New `wubbie download` subcommand: fetches a pinned HF dataset's corpus
  shards into the local HF cache via hf-hub, reporting progress (periodic
  log lines with per-file %, cumulative bytes, and throughput) and skipping
  already-cached files so an interrupted run resumes. `--cache-dir` targets a
  large volume.
- `wubbie tokenizer` HF path is now cache-only: it reads shards already in
  the cache and never bulk-downloads; a missing shard errors with a message
  to run `download` first. Added `--cache-dir` for symmetry.
- corpus: add `download_hf` + a `DownloadReporter` trait (keeps presentation
  in the CLI layer), factor out HF cache/repo/listing helpers, and thread a
  `cache_dir` override through `HfSource`.

Unit tests cover the byte formatter, the progress reporter's accounting, and
the offline file-listing path; the network fetch itself is exercised under
MULTI-1378.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
@RobbieMcKinstry RobbieMcKinstry merged commit 672a8f0 into trunk Jun 29, 2026
3 checks passed
@RobbieMcKinstry RobbieMcKinstry deleted the claude/byte-level-bpe-tokenizer-v0d3v7 branch June 29, 2026 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants