Walk the local corpus directory recursively #8

Open
RobbieMcKinstry wants to merge 1 commit into trunk from recursive-corpus-walk

@RobbieMcKinstry

Contributor

What

The local corpus loader (collect_local_files in corpus.rs) swept the --input directory non-recursively, using a single-level read_dir. The downloaded CommonPile slice is laid out with one subdirectory per source (<source>/<source>.chunk.NN.jsonl.gz), so running wubbie tokenizer --input ./corpus against it found zero shards and aborted with the error "no corpus files … found in directory". The previous workaround was to flatten the tree into a directory of symlinks; this change removes that need.

This replaces the single-level sweep with an iterative depth-first walk that discovers shards at any depth.
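For reviewers who want the shape of the walk without opening the diff, here is a minimal sketch using only the standard library. It is not the code in corpus.rs: walk_corpus_dir, looks_like_shard, and SHARD_SUFFIXES are hypothetical names, and the suffix list covers only the three extensions named in this description (the real loader accepts more).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Assumed suffix list for illustration; the real loader accepts more.
const SHARD_SUFFIXES: &[&str] = &[".jsonl", ".jsonl.gz", ".txt"];

/// True when the file name ends with one of the assumed shard suffixes.
fn looks_like_shard(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|name| SHARD_SUFFIXES.iter().any(|suffix| name.ends_with(*suffix)))
        .unwrap_or(false)
}

/// Iterative depth-first walk: directories still to visit sit on an
/// explicit stack instead of the call stack.
fn walk_corpus_dir(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut pending = vec![root.to_path_buf()];
    let mut shards = Vec::new();

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            if entry.file_type()?.is_dir() {
                // Real directory: visit it later.
                pending.push(path);
            } else if path.is_file() && looks_like_shard(&path) {
                shards.push(path);
            }
        }
    }

    // Sort the full result so the order does not depend on read_dir order.
    shards.sort();
    Ok(shards)
}
```

Calling walk_corpus_dir(Path::new("./corpus")) on a one-subdirectory-per-source tree should return every nested shard in sorted path order, while files such as README.md or manifest.json are ignored.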

Details

  • Recursive walk: iterative DFS over a stack of directories; finds .jsonl/.jsonl.gz/.txt/… shards nested at any depth.
  • Deterministic order: the full result is sorted, so shard order is independent of filesystem return order, as the repo's no-flake testing policy requires.
  • Symlink-safe: recursion decisions use entry.file_type(), which does not follow symlinks, so symlinked directories are skipped and the walk can't follow a symlink cycle. File detection still uses is_file(), which does follow symlinks, so a symlinked shard is still picked up.
  • Updates the CLI --input help and the doc strings that described the old non-recursive sweep.
  • Adds collects_corpus_files_recursively_and_sorted covering nested shards + ignored non-corpus files.
  • Gitignores the local ./corpus cache so the downloaded pretraining data is never accidentally committed.
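The symlink behaviour in the list above comes down to which standard-library call makes each decision. The sketch below isolates that per-entry choice; the Action enum and the classify and classify_all helpers are hypothetical names for illustration and do not appear in the change itself.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What a walk would do with one directory entry (illustrative only).
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Descend,
    ConsiderAsFile,
    Skip,
}

fn classify(entry: &fs::DirEntry) -> io::Result<Action> {
    // DirEntry::file_type() describes the entry itself, so a symlink that
    // points at a directory is reported as a symlink, not as a directory.
    if entry.file_type()?.is_dir() {
        return Ok(Action::Descend);
    }
    // Path::is_file() resolves symlinks, so a link to a regular file passes.
    if entry.path().is_file() {
        return Ok(Action::ConsiderAsFile);
    }
    Ok(Action::Skip)
}

/// Classifies every entry in `dir`, sorted by name for a stable result.
fn classify_all(dir: &Path) -> io::Result<Vec<(String, Action)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        out.push((name, classify(&entry)?));
    }
    out.sort_by(|a, b| a.0.cmp(&b.0));
    Ok(out)
}
```

Under these assumptions a symlinked directory is classified as Skip, which is what keeps a symlink cycle from being entered, while a symlinked shard is still classified as ConsiderAsFile.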

Testing

  • cargo fmt --all
  • cargo clippy --all-targets --workspace --locked -- -D warnings
  • cargo nextest run --workspace — corpus/tokenizer tests pass, including the new one.
  • End-to-end: ran the release binary against a tree mirroring ./corpus (a .jsonl one level down, a .jsonl.gz two levels down, plus README.md/manifest.json to ignore). It discovered both shards, skipped the non-corpus files, trained, round-tripped, and wrote the tokenizer (exit 0).

🤖 Generated with Claude Code

The local corpus loader swept the --input directory non-recursively, so a
corpus laid out one subdirectory per source (the CommonPile layout:
<source>/<source>.chunk.NN.jsonl.gz) yielded zero shards and the tokenizer
run aborted. Replace the single-level read_dir with an iterative depth-first
walk that finds shards at any depth.

- Sorts the full result so shard order is deterministic regardless of
  filesystem return order (no-flake testing policy).
- Recurses via file_type() (does not follow symlinks) to stay free of
  symlink cycles, while file detection still follows symlinked shards.
- Updates the CLI help / doc strings that described the old non-recursive
  sweep, and adds a recursive-discovery test.

Also gitignore the local ./corpus cache so the downloaded pretraining data
is never accidentally committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.
