Add datu split command by aisrael · Pull Request #58 · aisrael/datu

aisrael · 2026-07-01T02:05:32Z

Summary

Adds datu split, the inverse of concat: splits a single input file into multiple output files of at most --split rows each (default 100000), with an optional --limit on total rows processed (default 0 = unlimited).
Streams the input in a single pass using the existing record-batch reader/writer layer (the same primitives concat, diff, and the ORC pipeline already use), so it doesn't require a second read pass or materializing the whole file in memory.
--split also accepts a byte size instead of a row count: kb/mb/gb/tb (decimal) or kib/mib/gib/tib (binary), case-insensitive (e.g. 64mb, 1.5GiB), for sizing partitions by approximate output size.
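
For readers who want a concrete picture of the `--split` grammar described above, here is a minimal sketch of a parser. It is not the code in this PR: the function name `parse_split_size`, the error strings, and the choice to reject a zero row count are assumptions made for illustration. Only the `Rows`/`Bytes` distinction, the unit list, case-insensitivity, and the zero-byte-size error come from the description.

```rust
/// Illustrative stand-in for the PR's SplitSize type.
#[derive(Debug, PartialEq)]
enum SplitSize {
    Rows(usize),
    Bytes(u64),
}

/// Hypothetical parser: a plain integer means rows, a number followed by a
/// unit means bytes. Units are matched case-insensitively.
fn parse_split_size(input: &str) -> Result<SplitSize, String> {
    let s = input.trim().to_ascii_lowercase();

    // A plain integer is a row count (rejecting 0 here is an assumption).
    if let Ok(rows) = s.parse::<usize>() {
        if rows == 0 {
            return Err(format!("split size must be positive: {input}"));
        }
        return Ok(SplitSize::Rows(rows));
    }

    // Otherwise split the numeric prefix from the unit suffix.
    let idx = s
        .find(|c: char| c.is_ascii_alphabetic())
        .ok_or_else(|| format!("invalid split size: {input}"))?;
    let (num, unit) = s.split_at(idx);
    let value: f64 = num
        .trim()
        .parse()
        .map_err(|_| format!("invalid number in split size: {input}"))?;

    let multiplier: u64 = match unit {
        // Decimal units
        "kb" => 1_000,
        "mb" => 1_000_000,
        "gb" => 1_000_000_000,
        "tb" => 1_000_000_000_000,
        // Binary units
        "kib" => 1 << 10,
        "mib" => 1 << 20,
        "gib" => 1 << 30,
        "tib" => 1 << 40,
        other => return Err(format!("unknown unit in split size: {other}")),
    };

    // Fractional values such as 1.5GiB are truncated to whole bytes.
    let bytes = (value * multiplier as f64) as u64;
    if bytes == 0 {
        return Err(format!("byte size must be positive: {input}"));
    }
    Ok(SplitSize::Bytes(bytes))
}
```

Under these assumptions, `parse_split_size("64mb")` yields `Bytes(64_000_000)` and `parse_split_size("1.5GiB")` yields `Bytes(1_610_612_736)`.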

Changes

src/pipeline/split.rs (new): core split_file() — validates input/output formats, opens a lazy row-by-row reader (JSON is materialized via DataFusion, matching diff's existing approach), and slices RecordBatches across partition boundaries via a small PartitionReader wrapper.
src/bin/datu/commands/split.rs (new): CLI wrapper — datu split <INPUT> [OUTPUT] [OPTIONS] with -I/-O, --split, --limit, --sparse, --json-pretty, and --input-headers, following the same conventions as concat/convert. OUTPUT is an optional second positional (like head/tail/sample) that defaults to the input path.
Partition files are named by inserting a zero-padded .partNNNNN segment (1-based) before the extension, e.g. large-file.avro → large-file.part00001.avro, large-file.part00002.avro, ...
Wired into src/pipeline.rs, src/bin/datu/commands/mod.rs, and src/bin/datu/main.rs.
README.md: new ### split section plus updated "Supported Formats" bullets.
features/cli/cli.feature: updated the hardcoded --help/help output docstrings to include the new split subcommand.
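
The partition naming rule above can be sketched with the standard library alone. The helper name `partition_path` is hypothetical, and the handling of paths without an extension is an assumption; the PR only documents the `.partNNNNN` segment, its zero padding, and the 1-based index.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: insert a zero-padded `.partNNNNN` segment (1-based)
/// before the extension of `output`.
fn partition_path(output: &Path, index: usize) -> PathBuf {
    let stem = output.file_stem().and_then(|s| s.to_str()).unwrap_or("");
    let name = match output.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{stem}.part{index:05}.{ext}"),
        // Assumption: with no extension, the segment is simply appended.
        None => format!("{stem}.part{index:05}"),
    };
    // Keep the parent directory, replace only the file name.
    output.with_file_name(name)
}
```

With this sketch, `large-file.avro` and index 1 give `large-file.part00001.avro`, matching the example in the description.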

`--split` byte-size support

New SplitSize type (Rows(usize) / Bytes(u64)) parses --split: a plain integer means rows (unchanged default behavior); a number followed by a unit means bytes. Byte sizes are estimated from each RecordBatch's in-memory Arrow size (get_array_memory_size()), since the exact on-disk/compressed size can't be known before writing — documented as approximate in --help and the README.
PartitionReader's row-count budget generalized to a RemainingBudget enum (Rows/Bytes) so the same slicing logic drives both modes.
--limit (total row cap) is now applied once up front via the existing apply_offset_limit helper (src/pipeline/record_batch.rs) instead of being threaded through the per-partition loop — that per-partition composition doesn't generalize to byte-sized partitions, and this is a net simplification of the original row-only logic too.
Bug fix found via testing: a partition boundary landing exactly at the end of a source batch could stash a zero-row leftover batch in the reader's pending slot, which falsely signaled "more data available" and produced a phantom empty/corrupt trailing output file (reproduced with small byte budgets against a 1000-row Avro fixture, 69 partitions, last one corrupt). Fixed by only stashing a leftover when it actually contains rows.
features/cli/split.feature: 5 new scenarios (decimal unit, case-insensitivity, binary unit, unknown-unit error, zero-byte-size error) added to the 6 from the original PR.
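
To make the zero-row leftover fix easier to follow, here is a simplified model of the slicing loop. It is a sketch under stated assumptions, not the PR's `PartitionReader`: batches are modeled as bare row counts instead of Arrow `RecordBatch` values, only a row budget is shown (no `RemainingBudget::Bytes`), and the method names `has_more`, `start_partition`, and `next_slice` are invented for illustration.

```rust
use std::collections::VecDeque;

/// Simplified, hypothetical reader: each batch is just a row count.
struct PartitionReader {
    source: VecDeque<usize>, // remaining source batches
    pending: Option<usize>,  // leftover rows carried into the next partition
    remaining: usize,        // rows still allowed in the current partition
}

impl PartitionReader {
    fn new(batches: &[usize]) -> Self {
        Self {
            source: batches.iter().copied().collect(),
            pending: None,
            remaining: 0,
        }
    }

    /// True when another partition still has data to write.
    fn has_more(&self) -> bool {
        self.pending.is_some() || !self.source.is_empty()
    }

    fn start_partition(&mut self, budget: usize) {
        self.remaining = budget;
    }

    /// Next slice (as a row count) for the current partition, or None once
    /// the budget is spent or the input is exhausted.
    fn next_slice(&mut self) -> Option<usize> {
        if self.remaining == 0 {
            return None;
        }
        let batch = self.pending.take().or_else(|| self.source.pop_front())?;
        let take = batch.min(self.remaining);
        self.remaining -= take;
        let leftover = batch - take;
        // The fix: stash a leftover only when it actually contains rows.
        // Stashing a zero-row leftover would make has_more() report true
        // and start a phantom empty partition.
        if leftover > 0 {
            self.pending = Some(leftover);
        }
        Some(take)
    }
}

/// Drive the reader and return the row count written to each partition.
fn partition_row_counts(batches: &[usize], rows_per_partition: usize) -> Vec<usize> {
    assert!(rows_per_partition > 0, "partition budget must be positive");
    let mut reader = PartitionReader::new(batches);
    let mut partitions = Vec::new();
    while reader.has_more() {
        reader.start_partition(rows_per_partition);
        let mut rows = 0;
        while let Some(n) = reader.next_slice() {
            rows += n;
        }
        partitions.push(rows);
    }
    partitions
}
```

In this model, two 300-row batches with a 300-row budget produce exactly two partitions, because the boundary that lands on the end of the first batch leaves nothing in the pending slot.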

Testing

cargo build / cargo clippy --all-targets — clean
cargo fmt --check — clean
cargo test --lib — 186 passed, including 19 pipeline::split tests (SplitSize parsing + row/byte-based split_file)
cargo test --bins — 38 passed, including 4 commands::split tests
cargo test --test cli — 134/134 Cucumber scenarios passed (11 split scenarios total)
cargo test --test repl — 80/80 Cucumber scenarios passed (regression check)
Manual smoke test: datu split fixtures/userdata5.avro u.avro --split 300 on a 1000-row Avro file produced 4 partitions (300/300/300/100 rows), row counts verified to sum back to 1000.
Manual smoke test: datu split fixtures/userdata5.avro u.avro --split 20kb produced 63 partitions; summed per-partition row counts equal 1000.

Checklist

Tests pass
Documentation updated
No breaking changes

🤖 Generated with Claude Code

Adds `datu split`, the inverse of `concat`: splits a single input file into multiple output files of at most `--split` rows each (default 100000), with an optional `--limit` on total rows processed. Streams the input in a single pass via the existing record-batch reader/writer layer rather than materializing the whole file, so it scales to large inputs. Partition files are named by inserting a zero-padded `.partNNNNN` segment before the extension of the (optional) output path, which defaults to the input path.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

--split now also takes kb/mb/gb/tb (decimal) or kib/mib/gib/tib (binary), case-insensitive, to size partitions by approximate output size instead of row count. Byte sizes are estimated from each RecordBatch's in-memory Arrow size, since exact on-disk/compressed size can't be known before writing. Also refactors --limit to be applied once via the existing apply_offset_limit helper instead of being threaded through the per-partition loop, since that composition doesn't generalize to byte-sized partitions. Fixes a bug found while testing byte-based splitting: a partition boundary landing exactly at a batch's end could stash a zero-row leftover in the reader's pending slot, which falsely signaled more data was available and produced a phantom empty/corrupt output file.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

aisrael and others added 2 commits June 30, 2026 22:05

aisrael merged commit a43c38a into main Jul 1, 2026
5 checks passed

aisrael merged 2 commits into main from feat/add-split-command