Walk the local corpus directory recursively by RobbieMcKinstry · Pull Request #8 · wack/wubbie

RobbieMcKinstry · 2026-06-30T23:47:25Z

What

The local corpus loader (collect_local_files in corpus.rs) swept the --input directory non-recursively, with a single-level read_dir. The downloaded CommonPile slice is laid out with one subdirectory per source (<source>/<source>.chunk.NN.jsonl.gz), so pointing wubbie tokenizer --input ./corpus at it found zero shards and aborted with "no corpus files … found in directory". The previous workaround was to flatten the tree into a directory of symlinks; this change removes that need.

This replaces the single-level sweep with an iterative depth-first walk that discovers shards at any depth.

Details

Recursive walk: iterative DFS over a stack of directories; finds .jsonl/.jsonl.gz/.txt/… shards nested at any depth.
Deterministic order: the full result is sorted, so shard order is independent of filesystem return order (the repo's no-flake testing policy).
Symlink-safe: recursion decisions use entry.file_type() (does not follow symlinks), so symlinked directories are skipped and the walk can't follow a symlink cycle. File detection still uses is_file(), so a symlinked shard is still picked up.
Updates the CLI --input help and the doc strings that described the old non-recursive sweep.
Adds collects_corpus_files_recursively_and_sorted covering nested shards + ignored non-corpus files.
Gitignores the local ./corpus cache so the downloaded pretraining data is never accidentally committed.
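
The walk described above can be sketched roughly as follows. This is a minimal illustration, not the code in corpus.rs: the function names collect_corpus_files and is_corpus_file are hypothetical, and the extension list covers only the three suffixes named in this PR (the real list is longer and is elided above). It uses only std::fs and std::path.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical suffix check. Only the suffixes named in the PR text are
/// listed; the real loader accepts more (the PR elides the rest).
fn is_corpus_file(path: &Path) -> bool {
    let name = match path.file_name().and_then(|n| n.to_str()) {
        Some(name) => name,
        None => return false,
    };
    [".jsonl", ".jsonl.gz", ".txt"]
        .iter()
        .any(|suffix| name.ends_with(suffix))
}

/// Iterative depth-first walk over a stack of directories.
/// Returns every corpus file under `root`, sorted by path.
fn collect_corpus_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut stack = vec![root.to_path_buf()];

    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            // DirEntry::file_type() does not follow symlinks, so a symlinked
            // directory reports as a symlink and is never pushed on the stack.
            // That keeps the walk out of symlink cycles.
            if entry.file_type()?.is_dir() {
                stack.push(path);
            } else if path.is_file() && is_corpus_file(&path) {
                // Path::is_file() does follow symlinks, so a symlinked shard
                // is still collected.
                found.push(path);
            }
        }
    }

    // Sort the full result so the order does not depend on the order in
    // which the filesystem returns directory entries.
    found.sort();
    Ok(found)
}
```

Calling collect_corpus_files(Path::new("./corpus")) on a one-subdirectory-per-source layout would return the shards from every subdirectory in sorted path order. Sorting once at the end, rather than per directory, is what makes the result independent of traversal order.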

Testing

cargo fmt --all ✅
cargo clippy --all-targets --workspace --locked -- -D warnings ✅
cargo nextest run --workspace — corpus/tokenizer tests pass, including the new one.
End-to-end: ran the release binary against a tree mirroring ./corpus (a .jsonl one level down, a .jsonl.gz two levels down, plus README.md/manifest.json to ignore). It discovered both shards, skipped the non-corpus files, trained, round-tripped, and wrote the tokenizer (exit 0).

🤖 Generated with Claude Code

The local corpus loader swept the --input directory non-recursively, so a corpus laid out one subdirectory per source (the CommonPile layout: <source>/<source>.chunk.NN.jsonl.gz) yielded zero shards and the tokenizer run aborted.

Replace the single-level read_dir with an iterative depth-first walk that finds shards at any depth.

- Sorts the full result so shard order is deterministic regardless of filesystem return order (no-flake testing policy).
- Recurses via file_type() (does not follow symlinks) to stay free of symlink cycles, while file detection still follows symlinked shards.
- Updates the CLI help / doc strings that described the old non-recursive sweep, and adds a recursive-discovery test.

Also gitignore the local ./corpus cache so the downloaded pretraining data is never accidentally committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RobbieMcKinstry · 2026-06-30T23:48:59Z

Walk the local corpus directory recursively #8 👈 (View in Graphite)
trunk

This stack of pull requests is managed by Graphite. Learn more about stacking.

RobbieMcKinstry wants to merge 1 commit into trunk from recursive-corpus-walk
