[Data] Support zipping datasets with different numbers of rows #64396

Description

@Hyunoh-Yeo

Dataset.zip requires all inputs to have the same number of rows; otherwise
ZipOperator raises ValueError("Cannot zip datasets of different number of rows").
There's a long-standing TODO(Clark) to support user-directed handling of mismatched
lengths instead of erroring.

Proposed solution

Add an on_mismatch option to Dataset.zip (default "error", preserving current behavior):

  • "error" — raise, as today.
  • "drop" — truncate all inputs to the shortest input's row count.
  • "pad" — extend shorter inputs to the longest, filling missing values with nulls.
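The three modes can be illustrated with a toy model on plain Python lists of row dicts. This is only a sketch of the intended semantics, not Ray code: `zip_rows` is a hypothetical helper, and it assumes the two inputs have disjoint column names and that the column names of an input can be read from its first row.

```python
from itertools import zip_longest


def zip_rows(left, right, on_mismatch="error"):
    """Toy model of the proposed on_mismatch semantics (hypothetical helper)."""
    if on_mismatch not in ("error", "drop", "pad"):
        raise ValueError(f"Unknown on_mismatch: {on_mismatch!r}")

    if on_mismatch == "error" and len(left) != len(right):
        # Same message as the current ZipOperator error quoted in this issue.
        raise ValueError("Cannot zip datasets of different number of rows")

    if on_mismatch == "drop":
        # Built-in zip stops at the shortest input, i.e. truncation.
        return [{**l, **r} for l, r in zip(left, right)]

    # "pad" (or "error" with equal lengths): extend to the longest input.
    # Assumption: column names are taken from the first row of each input;
    # a real implementation would use the dataset schema instead.
    left_nulls = {k: None for k in (left[0] if left else {})}
    right_nulls = {k: None for k in (right[0] if right else {})}
    out = []
    for l, r in zip_longest(left, right):
        l = left_nulls if l is None else l
        r = right_nulls if r is None else r
        out.append({**l, **r})
    return out


left = [{"a": 1}, {"a": 2}, {"a": 3}]
right = [{"b": "x"}, {"b": "y"}]

print(zip_rows(left, right, on_mismatch="drop"))
print(zip_rows(left, right, on_mismatch="pad"))
```

With these inputs, "drop" yields two rows and "pad" yields three, the last with `b` set to `None`; the default "error" mode raises.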

Thread the option through the Zip logical operator, the planner (plan_zip_op), and
ZipOperator. The streaming alignment already aligns block fronts to min_rows, so
"drop" only needs to stop once any input is exhausted, and "pad" only needs to
synthesize null rows for exhausted inputs.
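A minimal sketch of how the option could fit into a streaming alignment loop, under stated assumptions: blocks are modeled as plain Python lists of rows, `zip_block_streams` is a hypothetical function rather than the actual ZipOperator code, and a padded row is represented by `None` on the exhausted side instead of per-column nulls.

```python
from collections import deque


def zip_block_streams(left_blocks, right_blocks, on_mismatch="error"):
    """Yield lists of (left_row, right_row) pairs from two block iterables."""
    if on_mismatch not in ("error", "drop", "pad"):
        raise ValueError(f"Unknown on_mismatch: {on_mismatch!r}")

    iters = [iter(left_blocks), iter(right_blocks)]
    fronts = [deque(), deque()]  # rows buffered per input, not yet emitted

    def refill(i):
        # Pull blocks until input i has buffered rows; False if exhausted.
        while not fronts[i]:
            block = next(iters[i], None)
            if block is None:
                return False
            fronts[i].extend(block)
        return True

    while True:
        live = [refill(0), refill(1)]
        if not any(live):
            return  # both inputs exhausted at the same point
        if not all(live):
            if on_mismatch == "error":
                # Raised lazily, once the mismatch is actually observed.
                raise ValueError("Cannot zip datasets of different number of rows")
            if on_mismatch == "drop":
                return  # stop as soon as any input is exhausted
            # "pad": pair the remaining rows with a null placeholder.
            i = 0 if live[0] else 1
            rows = list(fronts[i])
            fronts[i].clear()
            yield [(r, None) if i == 0 else (None, r) for r in rows]
            continue
        # Align both fronts to the smaller buffered row count.
        min_rows = min(len(fronts[0]), len(fronts[1]))
        yield [(fronts[0].popleft(), fronts[1].popleft()) for _ in range(min_rows)]


left = [[1, 2], [3, 4, 5]]
right = [[10], [20, 30]]
for block in zip_block_streams(left, right, on_mismatch="pad"):
    print(block)
```

In this sketch the "error" mode only fails after the aligned prefix has been emitted, because row counts are not known up front in a streaming setting; whether the real operator should check counts eagerly is a design question left open here.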

Relevant issue

Related to #56300

Use case

No response

Metadata

Labels

community-backlog
data: Ray Data-related issues
enhancement: Request for new feature and/or capability
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
usability
