
Add datu split command#58

Merged
aisrael merged 2 commits into main from feat/add-split-command
Jul 1, 2026

aisrael commented Jul 1, 2026

Summary

  • Adds datu split, the inverse of concat: splits a single input file into multiple output files of at most --split rows each (default 100000), with an optional --limit on total rows processed (default 0 = unlimited).
  • Streams the input in a single pass using the existing record-batch reader/writer layer (the same primitives concat, diff, and the ORC pipeline already use), so it doesn't require a second read pass or materializing the whole file in memory.
  • --split also accepts a byte size instead of a row count: kb/mb/gb/tb (decimal) or kib/mib/gib/tib (binary), case-insensitive (e.g. 64mb, 1.5GiB), for sizing partitions by approximate output size.
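
The `--split` parsing rule described above (plain integer means rows, number plus unit means bytes) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `parse_split_size`, the error strings, and the rejection of a zero row count are assumptions; only the `Rows`/`Bytes` variants and the unit list come from the description.

```rust
/// Hypothetical mirror of the PR's `SplitSize` type; the real definition may differ.
#[derive(Debug, PartialEq)]
pub enum SplitSize {
    Rows(usize),
    Bytes(u64),
}

/// Parse a `--split` value. `parse_split_size` is an illustrative name, not the PR's API.
pub fn parse_split_size(input: &str) -> Result<SplitSize, String> {
    let s = input.trim();

    // A plain integer is interpreted as a row count.
    if let Ok(rows) = s.parse::<usize>() {
        // Assumption: a zero row count is rejected, like a zero byte size.
        if rows == 0 {
            return Err("split size must be greater than zero".to_string());
        }
        return Ok(SplitSize::Rows(rows));
    }

    // Otherwise split the numeric prefix from the unit suffix.
    let idx = s
        .find(|c: char| c.is_ascii_alphabetic())
        .ok_or_else(|| format!("missing unit in split size: {input}"))?;
    let (number, unit) = s.split_at(idx);
    let value: f64 = number
        .trim()
        .parse()
        .map_err(|_| format!("invalid number in split size: {input}"))?;
    if !value.is_finite() || value < 0.0 {
        return Err(format!("invalid number in split size: {input}"));
    }

    // Units are matched case-insensitively: decimal (kb..tb) or binary (kib..tib).
    let multiplier: u64 = match unit.to_ascii_lowercase().as_str() {
        "kb" => 1_000,
        "mb" => 1_000_000,
        "gb" => 1_000_000_000,
        "tb" => 1_000_000_000_000,
        "kib" => 1u64 << 10,
        "mib" => 1u64 << 20,
        "gib" => 1u64 << 30,
        "tib" => 1u64 << 40,
        other => return Err(format!("unknown unit in split size: {other}")),
    };

    // Fractional values such as 1.5GiB are truncated to whole bytes.
    let bytes = (value * multiplier as f64) as u64;
    if bytes == 0 {
        return Err("split size must be greater than zero".to_string());
    }
    Ok(SplitSize::Bytes(bytes))
}
```

Under these assumptions, `300` parses as a row count, `64mb` as 64,000,000 bytes, and `1.5GiB` as 1,610,612,736 bytes, while `64xb` and `0kb` are rejected.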

Changes

  • src/pipeline/split.rs (new): core split_file() — validates input/output formats, opens a lazy row-by-row reader (JSON is materialized via DataFusion, matching diff's existing approach), and slices RecordBatches across partition boundaries via a small PartitionReader wrapper.
  • src/bin/datu/commands/split.rs (new): CLI wrapper — datu split <INPUT> [OUTPUT] [OPTIONS] with -I/-O, --split, --limit, --sparse, --json-pretty, and --input-headers, following the same conventions as concat/convert. OUTPUT is an optional second positional (like head/tail/sample) that defaults to the input path.
  • Partition files are named by inserting a zero-padded .partNNNNN segment (1-based) before the extension, e.g. large-file.avro → large-file.part00001.avro, large-file.part00002.avro, ...
  • Wired into src/pipeline.rs, src/bin/datu/commands/mod.rs, and src/bin/datu/main.rs.
  • README.md: new ### split section plus updated "Supported Formats" bullets.
  • features/cli/cli.feature: updated the hardcoded --help/help output docstrings to include the new split subcommand.
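
The partition naming scheme listed above can be illustrated with a short sketch. `partition_path` is a hypothetical helper written for this example, not necessarily how `split.rs` implements it; the handling of paths without an extension is also an assumption.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: insert a zero-padded `.partNNNNN` segment (1-based index)
/// before the extension of `output`.
pub fn partition_path(output: &Path, index: usize) -> PathBuf {
    let stem = output
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_default();
    let name = match output.extension() {
        // "large-file.avro" with index 1 becomes "large-file.part00001.avro".
        Some(ext) => format!("{stem}.part{index:05}.{}", ext.to_string_lossy()),
        // Assumption: with no extension, the segment is simply appended.
        None => format!("{stem}.part{index:05}"),
    };
    // Keep the original parent directory and replace only the file name.
    output.with_file_name(name)
}
```

Zero-padding to five digits keeps partition files in order under a plain lexicographic sort for up to 99,999 partitions.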

--split byte-size support

  • New SplitSize type (Rows(usize) / Bytes(u64)) parses --split: a plain integer means rows (unchanged default behavior); a number followed by a unit means bytes. Byte sizes are estimated from each RecordBatch's in-memory Arrow size (get_array_memory_size()), since the exact on-disk/compressed size can't be known before writing — documented as approximate in --help and the README.
  • PartitionReader's row-count budget generalized to a RemainingBudget enum (Rows/Bytes) so the same slicing logic drives both modes.
  • --limit (total row cap) is now applied once up front via the existing apply_offset_limit helper (src/pipeline/record_batch.rs) instead of being threaded through the per-partition loop — that per-partition composition doesn't generalize to byte-sized partitions, and this is a net simplification of the original row-only logic too.
  • Bug fix found via testing: a partition boundary landing exactly at the end of a source batch could stash a zero-row leftover batch in the reader's pending slot, which falsely signaled "more data available" and produced a phantom empty/corrupt trailing output file (reproduced with small byte budgets against a 1000-row Avro fixture, 69 partitions, last one corrupt). Fixed by only stashing a leftover when it actually contains rows.
  • features/cli/split.feature: 5 new scenarios (decimal unit, case-insensitivity, binary unit, unknown-unit error, zero-byte-size error) added to the 6 from the original PR.
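
The zero-row leftover bug described above can be shown with a simplified model. This sketch uses a `Vec<u32>` as a stand-in for an Arrow RecordBatch and handles only the row-count budget. `PartitionSource`, `has_more`, `next_partition`, and `split_rows` are hypothetical names, and source batches are assumed to be non-empty. It is not the PR's `PartitionReader`.

```rust
use std::iter::Peekable;

/// Stand-in for an Arrow RecordBatch: just a vector of row values.
type Batch = Vec<u32>;

/// Simplified, hypothetical model of a partition reader with a pending slot.
struct PartitionSource<I: Iterator<Item = Batch>> {
    inner: Peekable<I>,
    pending: Option<Batch>,
}

impl<I: Iterator<Item = Batch>> PartitionSource<I> {
    fn new(inner: I) -> Self {
        Self { inner: inner.peekable(), pending: None }
    }

    /// True while a stashed leftover or an unread source batch remains.
    fn has_more(&mut self) -> bool {
        self.pending.is_some() || self.inner.peek().is_some()
    }

    /// Collect up to `max_rows` rows, slicing a batch at the partition boundary.
    fn next_partition(&mut self, max_rows: usize) -> Batch {
        let mut out = Batch::new();
        while out.len() < max_rows {
            let mut batch = match self.pending.take().or_else(|| self.inner.next()) {
                Some(batch) => batch,
                None => break,
            };
            let room = max_rows - out.len();
            let leftover = batch.split_off(room.min(batch.len()));
            out.extend(batch);
            // The guard: stash a leftover only when it actually contains rows.
            // Without it, a boundary landing exactly at a batch's end would leave
            // an empty batch in `pending`, `has_more` would report true, and the
            // driver below would emit a phantom empty partition.
            if !leftover.is_empty() {
                self.pending = Some(leftover);
            }
        }
        out
    }
}

/// Drive the source until it is exhausted, returning one Vec per partition.
fn split_rows(batches: Vec<Batch>, max_rows: usize) -> Vec<Batch> {
    assert!(max_rows > 0, "partition size must be greater than zero");
    let mut source = PartitionSource::new(batches.into_iter());
    let mut partitions = Vec::new();
    while source.has_more() {
        partitions.push(source.next_partition(max_rows));
    }
    partitions
}
```

With a single 1000-row batch and `max_rows` of 300, this model yields partitions of 300, 300, 300, and 100 rows, and a 4-row batch with `max_rows` of 4 yields exactly one partition rather than one plus an empty trailer.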

Testing

  • cargo build / cargo clippy --all-targets — clean
  • cargo fmt --check — clean
  • cargo test --lib — 186 passed, including 19 pipeline::split tests (SplitSize parsing + row/byte-based split_file)
  • cargo test --bins — 38 passed, including 4 commands::split tests
  • cargo test --test cli — 134/134 Cucumber scenarios passed (11 split scenarios total)
  • cargo test --test repl — 80/80 Cucumber scenarios passed (regression check)
  • Manual smoke test: datu split fixtures/userdata5.avro u.avro --split 300 on a 1000-row Avro file produced 4 partitions (300/300/300/100 rows), row counts verified to sum back to 1000.
  • Manual smoke test: datu split fixtures/userdata5.avro u.avro --split 20kb produced 63 partitions; summed per-partition row counts equal 1000.

Checklist

  • Tests pass
  • Documentation updated
  • No breaking changes

🤖 Generated with Claude Code

aisrael and others added 2 commits June 30, 2026 22:05
Adds `datu split`, the inverse of `concat`: splits a single input file
into multiple output files of at most `--split` rows each (default
100000), with an optional `--limit` on total rows processed. Streams
the input in a single pass via the existing record-batch reader/writer
layer rather than materializing the whole file, so it scales to large
inputs. Partition files are named by inserting a zero-padded
`.partNNNNN` segment before the extension of the (optional) output
path, which defaults to the input path.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

--split now also takes kb/mb/gb/tb (decimal) or kib/mib/gib/tib
(binary), case-insensitive, to size partitions by approximate output
size instead of row count. Byte sizes are estimated from each
RecordBatch's in-memory Arrow size, since exact on-disk/compressed
size can't be known before writing.

Also refactors --limit to be applied once via the existing
apply_offset_limit helper instead of being threaded through the
per-partition loop, since that composition doesn't generalize to
byte-sized partitions.

Fixes a bug found while testing byte-based splitting: a partition
boundary landing exactly at a batch's end could stash a zero-row
leftover in the reader's pending slot, which falsely signaled more
data was available and produced a phantom empty/corrupt output file.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@aisrael aisrael merged commit a43c38a into main Jul 1, 2026
5 checks passed