Description
Dataset.zip requires all inputs to have the same number of rows; otherwise
ZipOperator raises ValueError("Cannot zip datasets of different number of rows").
There's a long-standing TODO(Clark) to support user-directed handling of mismatched
lengths instead of erroring.
Proposed solution
Add an on_mismatch option to Dataset.zip (default "error", preserving current behavior):
"error" — raise, as today.
"drop" — truncate all inputs to the shortest input's row count.
"pad" — extend shorter inputs to the longest, filling missing values with nulls.
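The three modes can be illustrated on plain Python lists of row dicts. This is only a sketch of the intended semantics, not Ray code: zip_rows is a hypothetical helper, rows are assumed to be dicts, and column names are assumed not to collide across inputs.

```python
from itertools import zip_longest


def zip_rows(*inputs, on_mismatch="error"):
    """Hypothetical helper: zip lists of row dicts column-wise."""
    if on_mismatch not in ("error", "drop", "pad"):
        raise ValueError(f"Unknown on_mismatch: {on_mismatch!r}")
    if on_mismatch == "error" and len({len(rows) for rows in inputs}) > 1:
        raise ValueError("Cannot zip datasets of different number of rows")

    if on_mismatch == "pad":
        # One all-null row per input, used once that input runs out.
        # Column names are taken from the input's first row (assumption).
        null_rows = [dict.fromkeys(rows[0]) if rows else {} for rows in inputs]
        missing = object()
        groups = (
            [null_rows[i] if row is missing else row for i, row in enumerate(group)]
            for group in zip_longest(*inputs, fillvalue=missing)
        )
    else:
        # zip() stops at the shortest input, which is the "drop" behavior;
        # under "error" the lengths were already checked to be equal.
        groups = zip(*inputs)

    zipped = []
    for group in groups:
        merged = {}
        for row in group:
            merged.update(row)
        zipped.append(merged)
    return zipped


left = [{"a": 1}, {"a": 2}, {"a": 3}]
right = [{"b": "x"}, {"b": "y"}]

dropped = zip_rows(left, right, on_mismatch="drop")
padded = zip_rows(left, right, on_mismatch="pad")
```

Here dropped has two rows, and padded has three, with "b" set to None in the last one.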
Thread the option through the Zip logical operator, the planner (plan_zip_op), and
ZipOperator. The streaming alignment already aligns block fronts to min_rows, so
"drop" only needs to stop once any input is exhausted, and "pad" only needs to
synthesize null rows for exhausted inputs.
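As a rough illustration of how the option could hook into front alignment, here is a toy model. It is not the actual ZipOperator code: aligned_chunks is a hypothetical name, a block is assumed to be a plain list of rows, and a null row is represented by None.

```python
def aligned_chunks(streams, on_mismatch="error"):
    """Toy model: yield tuples of equal-length row lists, one list per input.

    `streams` is a list of iterables of blocks; a block is a list of rows.
    """
    if on_mismatch not in ("error", "drop", "pad"):
        raise ValueError(f"Unknown on_mismatch: {on_mismatch!r}")
    iters = [iter(s) for s in streams]
    fronts = [[] for _ in iters]  # rows buffered at the front of each input
    done = [False] * len(iters)

    def refill(i):
        # Pull blocks until input i has buffered rows or is exhausted.
        while not fronts[i] and not done[i]:
            block = next(iters[i], None)
            if block is None:
                done[i] = True
            else:
                fronts[i] = list(block)

    while True:
        for i in range(len(iters)):
            refill(i)
        exhausted = [done[i] and not fronts[i] for i in range(len(iters))]
        if all(exhausted):
            return
        if any(exhausted):
            if on_mismatch == "error":
                raise ValueError("Cannot zip datasets of different number of rows")
            if on_mismatch == "drop":
                return  # stop as soon as any input runs out
        # Align all live fronts to the smallest buffered row count.
        min_rows = min(len(f) for f, ex in zip(fronts, exhausted) if not ex)
        chunk = []
        for i in range(len(iters)):
            if exhausted[i]:
                chunk.append([None] * min_rows)  # "pad": synthesize null rows
            else:
                chunk.append(fronts[i][:min_rows])
                fronts[i] = fronts[i][min_rows:]
        yield tuple(chunk)


a = [[1, 2, 3], [4]]      # 4 rows across 2 blocks
b = [[10], [20, 30]]      # 3 rows across 2 blocks

drop_chunks = list(aligned_chunks([a, b], on_mismatch="drop"))
pad_chunks = list(aligned_chunks([a, b], on_mismatch="pad"))
```

One thing this model makes visible: in a streaming setting, "error" can only fire once an input actually runs out, so earlier chunks may already have been emitted by then.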
Relevant issue
related to #56300
Use case
No response