Walk the local corpus directory recursively #8
Open
RobbieMcKinstry wants to merge 1 commit into
The local corpus loader swept the --input directory non-recursively, so a corpus laid out one subdirectory per source (the CommonPile layout: <source>/<source>.chunk.NN.jsonl.gz) yielded zero shards and the tokenizer run aborted. Replace the single-level read_dir with an iterative depth-first walk that finds shards at any depth.

- Sorts the full result so shard order is deterministic regardless of filesystem return order (no-flake testing policy).
- Recurses via file_type() (does not follow symlinks) to stay free of symlink cycles, while file detection still follows symlinked shards.
- Updates the CLI help / doc strings that described the old non-recursive sweep, and adds a recursive-discovery test.

Also gitignore the local ./corpus cache so the downloaded pretraining data is never accidentally committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

What
The local corpus loader (collect_local_files in corpus.rs) swept the --input directory non-recursively (a single-level read_dir). The downloaded CommonPile slice is laid out one subdirectory per source (<source>/<source>.chunk.NN.jsonl.gz), so pointing wubbie tokenizer --input ./corpus at it found zero shards and aborted with "no corpus files … found in directory". The previous workaround was to flatten the tree into a directory of symlinks; this removes that need.

This replaces the single-level sweep with an iterative depth-first walk that discovers shards at any depth.
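For reviewers who want the shape of the walk without opening the diff, here is a minimal sketch of an iterative depth-first walk of this kind. It is not the code in corpus.rs: the names collect_shards and looks_like_shard are hypothetical, and the extension list is an assumption covering only the suffixes named in this PR. Directories are detected with DirEntry::file_type(), which does not traverse symlinks, while files are detected with Path::is_file(), which does.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical suffix filter. The real loader accepts more extensions
/// than the three listed here; this list is an assumption for illustration.
fn looks_like_shard(path: &Path) -> bool {
    let name = match path.file_name().and_then(|n| n.to_str()) {
        Some(n) => n,
        None => return false,
    };
    [".jsonl", ".jsonl.gz", ".txt"]
        .iter()
        .any(|suffix| name.ends_with(suffix))
}

/// Iterative depth-first walk: an explicit stack of directories replaces
/// recursion, and the result is sorted so the order does not depend on
/// what the filesystem returns.
fn collect_shards(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut stack = vec![root.to_path_buf()];
    let mut shards = Vec::new();

    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            // DirEntry::file_type does not traverse symlinks, so a symlinked
            // directory is reported as a symlink and is never pushed here.
            if entry.file_type()?.is_dir() {
                stack.push(path);
            } else if path.is_file() && looks_like_shard(&path) {
                // Path::is_file follows symlinks, so a symlinked shard counts.
                shards.push(path);
            }
        }
    }

    shards.sort();
    Ok(shards)
}
```

Calling collect_shards(Path::new("./corpus")) on a one-subdirectory-per-source layout would return the nested shards in sorted order, where a single-level read_dir sees only the source directories.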
Details
- Finds .jsonl / .jsonl.gz / .txt / … shards nested at any depth.
- Recurses via entry.file_type() (does not follow symlinks), so symlinked directories are skipped and the walk can't follow a symlink cycle. File detection still uses is_file(), so a symlinked shard is still picked up.
- Updates the --input help and the doc strings that described the old non-recursive sweep.
- Adds the test collects_corpus_files_recursively_and_sorted covering nested shards + ignored non-corpus files.
- Gitignores the local ./corpus cache so the downloaded pretraining data is never accidentally committed.

Testing
- cargo fmt --all ✅
- cargo clippy --all-targets --workspace --locked -- -D warnings ✅
- cargo nextest run --workspace: corpus/tokenizer tests pass, including the new one.
- Run against ./corpus (a .jsonl one level down, a .jsonl.gz two levels down, plus README.md / manifest.json to ignore). It discovered both shards, skipped the non-corpus files, trained, round-tripped, and wrote the tokenizer (exit 0).

🤖 Generated with Claude Code