Train byte-level BPE tokenizer (MULTI-1379) #4
Merged
Add a GPT-2-style byte-level BPE tokenizer, trained via the `tokenizers` crate, plus a `wubbie tokenizer` subcommand to train it on a corpus slice.

Locked here (both feed downstream phases and can't change without retraining):

- Vocab size: 32_000 (`tokenizer::VOCAB_SIZE`), the value the model config and the loss-at-init ~ ln(vocab) check read from.
- Special-token inventory (`tokenizer::SPECIAL_TOKENS`): pad/bos/eos and the ChatML-style turn markers, reserved as atomic tokens at fixed low ids (pad pinned to id 0). The chat-template rendering format is finalized later; the tokens themselves exist now.

The tokenizer is wired byte-level end to end (pre-tokenizer, decoder, post-processor) with `add_prefix_space=false` so `decode(encode(text))` is an exact round-trip.

`wubbie tokenizer` trains, writes tokenizer.json, then runs the DoD checks against a corpus sample: exact round-trip, atomic special tokens (hard failures), and the ~3.5-4 chars/token compression ratio (a miss warns).

No corpus is committed yet (the CommonPile slice is still being selected); unit tests train in-memory on a small synthetic corpus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
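The two numeric checks mentioned above (loss-at-init ~ ln(vocab) and the chars/token band) can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the function names, the `RatioCheck` enum, and the choice to treat the 3.5-4 band as inclusive are all assumptions.

```rust
/// Expected cross-entropy (in nats) of a model that predicts uniformly
/// over the vocabulary: ln(vocab_size).
fn expected_init_loss(vocab_size: usize) -> f64 {
    (vocab_size as f64).ln()
}

/// Average characters per token over a sample; `None` when no tokens were produced.
fn chars_per_token(total_chars: usize, total_tokens: usize) -> Option<f64> {
    if total_tokens == 0 {
        None
    } else {
        Some(total_chars as f64 / total_tokens as f64)
    }
}

/// Outcome of the compression-ratio check. A miss is a warning, not a failure.
#[derive(Debug, PartialEq)]
enum RatioCheck {
    Ok,
    Warn,
}

/// Hypothetical band check: inside [3.5, 4.0] passes, anything else warns.
fn check_ratio(ratio: f64) -> RatioCheck {
    if (3.5..=4.0).contains(&ratio) {
        RatioCheck::Ok
    } else {
        RatioCheck::Warn
    }
}
```

For a 32_000-entry vocabulary the expected initial loss is about 10.37 nats, so a freshly initialized model reporting a loss far from that value points to a wiring problem rather than a training problem.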
…8/1379)

Adapt the tokenizer to the revised corpus storage model: the filtered CommonPile slice stays on Hugging Face and is pulled on demand via the pure-Rust `hf-hub` client (pinned by repo + revision), rather than being mirrored to object storage as local `.txt`.

- Add `corpus` module: resolves a source (local path or pinned HF dataset) to local shard paths and streams trainable text records out of them. Reads JSON Lines (`.jsonl`/`.jsonl.gz`, document text under a configurable field) and plain text (`.txt`/`.txt.gz`); `.gz` is decompressed transparently. Parsing/IO faults are logged and skipped, not fatal.
- Tokenizer training is now format-agnostic: `train_from_sequences` consumes the extracted-text stream (replacing the raw-line `train_from_files`).
- `wubbie tokenizer` gains `--hf-repo`/`--hf-revision`/`--hf-file` (XOR with `--input`) and `--text-field`. The HF fetch resolves to cached local paths, trains, and runs the same DoD checks against a corpus sample.
- Lock VOCAB_SIZE at 16k (small end of the ~16-32k band): this round of pre-training runs on CPU (iMac), where a smaller vocab keeps the embedding/softmax cheap.

Deps: add `hf-hub` (ureq + rustls, pure-Rust TLS) and `flate2` (miniz_oxide, pure-Rust gzip); promote `serde_json` to a normal dependency for JSONL.

No corpus is committed yet; HF access itself is verified under MULTI-1378. Unit tests cover format detection, JSONL/gz record extraction, and training.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
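The extension-based format detection described above might look roughly like the following. This is a sketch under assumptions: `ShardFormat` and `detect_format` are hypothetical names, matching is made case-insensitive here for convenience, and the real `corpus` module may decide differently.

```rust
use std::path::Path;

/// Record layout of a corpus shard.
#[derive(Debug, PartialEq)]
enum ShardFormat {
    JsonLines,
    PlainText,
}

/// Returns `(format, is_gzipped)` for a shard path, or `None` when the
/// extension is not one of `.jsonl`, `.jsonl.gz`, `.txt`, `.txt.gz`.
fn detect_format(path: &Path) -> Option<(ShardFormat, bool)> {
    let name = path.file_name()?.to_str()?.to_ascii_lowercase();

    // Peel off an optional trailing `.gz` first, then look at what is left.
    let (stem, gzipped) = match name.strip_suffix(".gz") {
        Some(stem) => (stem, true),
        None => (name.as_str(), false),
    };

    if stem.ends_with(".jsonl") {
        Some((ShardFormat::JsonLines, gzipped))
    } else if stem.ends_with(".txt") {
        Some((ShardFormat::PlainText, gzipped))
    } else {
        None
    }
}
```

Returning `None` for unknown extensions lets the caller log and skip the shard, in line with the "logged and skipped, not fatal" policy for parsing and IO faults.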
Make the trained tokenizer the source of truth for vocabulary size instead of the hardcoded constant. The embedding table and LM head must match the tokenizer exactly, so the model now sizes itself from the actual tokenizer.json rather than defaulting blindly to VOCAB_SIZE.

- Add `tokenizer::vocab_size_from_file`: reads the actual vocab size (incl. special tokens) from a tokenizer.json.
- `wubbie train --tokenizer <file>`: when given, overrides the resolved `vocab_size` with the tokenizer's actual size; this is authoritative over every config layer, including an explicit `--vocab-size` (warns on mismatch).
- Reframe `VOCAB_SIZE` as the default/target used only until a tokenizer exists; document the derivation on `ModelConfig::vocab_size` and in the model.rs stub so the model-assembly ticket (MULTI-1383) sizes the embeddings from the resolved config, not the constant.

The transformer itself is still stubbed; this only wires the config seam (which runs and is tested) so the hint is in place when the stub is built.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
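The precedence rule above (tokenizer size wins, with a warning on mismatch) can be illustrated as a small pure function. This is a hedged sketch: `ResolvedVocab` and `resolve_vocab_size` are hypothetical names, and the config layers are collapsed to just a default and an optional explicit `--vocab-size` value.

```rust
/// Result of resolving the vocabulary size, plus an optional mismatch warning.
#[derive(Debug, PartialEq)]
struct ResolvedVocab {
    size: usize,
    warning: Option<String>,
}

/// Resolves the vocabulary size. `default_size` stands in for `VOCAB_SIZE`,
/// `cli_vocab_size` for an explicit `--vocab-size`, and `tokenizer_vocab_size`
/// for the size read from tokenizer.json when `--tokenizer` is given.
fn resolve_vocab_size(
    default_size: usize,
    cli_vocab_size: Option<usize>,
    tokenizer_vocab_size: Option<usize>,
) -> ResolvedVocab {
    // What the config layers alone would have produced.
    let configured = cli_vocab_size.unwrap_or(default_size);

    match tokenizer_vocab_size {
        // The tokenizer is authoritative; report the disagreement but do not fail.
        Some(actual) if actual != configured => ResolvedVocab {
            size: actual,
            warning: Some(format!(
                "configured vocab_size {configured} overridden by tokenizer size {actual}"
            )),
        },
        Some(actual) => ResolvedVocab { size: actual, warning: None },
        // No tokenizer artifact yet: fall back to the configured value.
        None => ResolvedVocab { size: configured, warning: None },
    }
}
```

Keeping the resolution in a pure function like this makes the seam testable without a tokenizer artifact, which matches the intent of wiring the config path before the transformer exists.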
Add a TODO(MULTI-1383) on the optional `--tokenizer` arg: it is only `Option` so the stubbed command and config tests can run without a tokenizer artifact, and should be made required once the training loop exists (a model whose vocab_size doesn't match its tokenizer is always a bug).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
Pulling the (hundreds-of-GB) corpus is now an explicit, resumable step instead of a side effect of tokenizer training.

- New `wubbie download` subcommand: fetches a pinned HF dataset's corpus shards into the local HF cache via hf-hub, reporting progress (periodic log lines with per-file %, cumulative bytes, and throughput) and skipping already-cached files so an interrupted run resumes. `--cache-dir` targets a large volume.
- `wubbie tokenizer` HF path is now cache-only: it reads shards already in the cache and never bulk-downloads; a missing shard errors with a message to run `download` first. Added `--cache-dir` for symmetry.
- corpus: add `download_hf` + a `DownloadReporter` trait (keeps presentation in the CLI layer), factor out HF cache/repo/listing helpers, and thread a `cache_dir` override through `HfSource`.

Unit tests cover the byte formatter, the progress reporter's accounting, and the offline file-listing path; the network fetch itself is exercised under MULTI-1378.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FoooM7zWfSo6SLKvUFt4LD
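The byte formatter and progress accounting mentioned above could be sketched as follows. Everything here is illustrative: `format_bytes`, `Progress`, and its methods are hypothetical names, binary units (KiB, MiB, ...) are an assumed presentation choice, and this is not the `DownloadReporter` trait from the PR.

```rust
/// Formats a byte count with binary units, e.g. 1536 -> "1.5 KiB".
fn format_bytes(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KiB", "MiB", "GiB", "TiB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{bytes} B")
    } else {
        format!("{value:.1} {}", UNITS[unit])
    }
}

/// Cumulative download accounting across all shards.
struct Progress {
    total_bytes: u64,
    fetched_bytes: u64, // bytes actually pulled over the network in this run
    cached_bytes: u64,  // bytes skipped because the file was already cached
}

impl Progress {
    fn new(total_bytes: u64) -> Self {
        Progress { total_bytes, fetched_bytes: 0, cached_bytes: 0 }
    }

    /// Records a freshly downloaded chunk.
    fn record_chunk(&mut self, n: u64) {
        self.fetched_bytes += n;
    }

    /// Records a file that was already in the cache (this is what makes a rerun resume).
    fn record_cached(&mut self, size: u64) {
        self.cached_bytes += size;
    }

    /// Overall completion in percent; 0.0 when the total is unknown (zero).
    fn percent(&self) -> f64 {
        if self.total_bytes == 0 {
            return 0.0;
        }
        (self.fetched_bytes + self.cached_bytes) as f64 / self.total_bytes as f64 * 100.0
    }

    /// Network throughput in bytes/second; cached bytes are excluded so a
    /// resumed run does not report an inflated rate.
    fn throughput(&self, elapsed_secs: f64) -> f64 {
        if elapsed_secs <= 0.0 {
            0.0
        } else {
            self.fetched_bytes as f64 / elapsed_secs
        }
    }
}
```

Tracking cached and fetched bytes separately keeps the completion percentage honest after a resume while leaving the throughput figure tied to real network traffic.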