fix(list): use valid empty offsets in ListArray::new_empty #12

Closed
JONBRWN wants to merge 1 commit into fmurphy/bugfix-records-parsing from ENG-6097/fix-list-array-new-empty-offsets

@JONBRWN JONBRWN commented Apr 28, 2026

Summary

ListArray::new_empty was using OffsetsBuffer::default(), which produces a zero-length offsets buffer. The Arrow spec requires N+1 offsets for N elements, so an empty list array (0 elements) must have a single [0] offset (4 bytes with 32-bit offsets), not an empty buffer.
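
The N+1 rule can be checked mechanically. The following is a minimal sketch, not code from this repository: `offsets_valid_for` is a hypothetical helper that works on a plain `i32` slice rather than the crate's `OffsetsBuffer`.

```rust
/// Hypothetical helper (not part of the crate): returns true when `offsets`
/// is a well-formed offsets buffer for a list array holding `len` elements.
fn offsets_valid_for(offsets: &[i32], len: usize) -> bool {
    // N elements need exactly N + 1 offsets, so even an empty array needs one.
    if offsets.len() != len + 1 {
        return false;
    }
    // Offsets must be non-negative and non-decreasing.
    offsets[0] >= 0 && offsets.windows(2).all(|w| w[0] <= w[1])
}
```

Under this check, `offsets_valid_for(&[0], 0)` holds while `offsets_valid_for(&[], 0)` does not, which is the distinction this PR is about.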

When new_null_array() is called on a nested list type (e.g. list<list<list<list<double>>>>), it recurses through each nesting level via new_empty_array(), which calls new_empty() at each level. Every child level ends up with a zero-length offset buffer, producing malformed Arrow IPC data.
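
To illustrate how one bad empty-offsets choice spreads to every level, here is a toy model of the recursion. It is a sketch under stated assumptions: `ToyType`, `ToyArray`, `toy_new_empty`, `offsets_lens`, and `four_deep` are all hypothetical names, and the `empty_offsets` parameter stands in for whatever the real `new_empty` uses to build its offsets.

```rust
/// Toy model of a nested list type (hypothetical, not the crate's DataType).
enum ToyType {
    Double,
    List(Box<ToyType>),
}

/// Toy model of an array; only the offsets buffers matter for this sketch.
enum ToyArray {
    Primitive,
    List { offsets: Vec<i32>, values: Box<ToyArray> },
}

/// Mirrors the recursion described above: each list level first builds an
/// empty child, then wraps it with its own (empty-array) offsets.
fn toy_new_empty(ty: &ToyType, empty_offsets: fn() -> Vec<i32>) -> ToyArray {
    match ty {
        ToyType::Double => ToyArray::Primitive,
        ToyType::List(child) => ToyArray::List {
            offsets: empty_offsets(),
            values: Box::new(toy_new_empty(child, empty_offsets)),
        },
    }
}

/// Collects the offsets length at every nesting level, outermost first.
fn offsets_lens(arr: &ToyArray) -> Vec<usize> {
    let mut out = Vec::new();
    let mut cur = arr;
    while let ToyArray::List { offsets, values } = cur {
        out.push(offsets.len());
        cur = &**values;
    }
    out
}

/// Builds the toy equivalent of list<list<list<list<double>>>>.
fn four_deep() -> ToyType {
    let mut ty = ToyType::Double;
    for _ in 0..4 {
        ty = ToyType::List(Box::new(ty));
    }
    ty
}
```

Passing `Vec::new` as `empty_offsets` yields a zero-length offsets buffer at all four levels, whereas passing `|| vec![0]` yields one offset per level.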

Root cause (ENG-6097)

The Wallaroo engine nulls in.tensor (a 4-deep nested list column) when inference log size exceeds the byte limit. The resulting Arrow IPC bytes contain zero-length child offset buffers at every nesting level. PyArrow 14 segfaults in _table_to_blocks when to_pandas() encounters these invalid buffers.

Confirmed by inspecting the raw buffers of real production data vs synthetically constructed tables — the only difference was size=0 vs size=4 on child offset buffers.

Fix

Replace OffsetsBuffer::default() with Offsets::new_zeroed(0).into(), which produces a valid single-element [0] offset buffer. This propagates correctly through all nesting levels when building null arrays.
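
The size=0 versus size=4 difference can be reproduced with a small stand-in. This sketch assumes, as the description above implies, that `Offsets::new_zeroed(length)` yields `length + 1` zero offsets; `new_zeroed_model` and `offsets_byte_len` are hypothetical helpers on plain `i32` data, not the crate's API.

```rust
/// Stand-in for `Offsets::new_zeroed(length)`, assumed to produce
/// `length + 1` zero offsets (hypothetical model, not the crate's code).
fn new_zeroed_model(length: usize) -> Vec<i32> {
    vec![0; length + 1]
}

/// Size in bytes of an offsets buffer holding 32-bit offsets.
fn offsets_byte_len(offsets: &[i32]) -> usize {
    offsets.len() * std::mem::size_of::<i32>()
}
```

With this model, `new_zeroed_model(0)` is a 4-byte buffer holding `[0]`, while an empty slice is 0 bytes. A regression test along these lines, written against the real types, might be a reasonable guard for this change.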

🤖 Generated with Claude Code

OffsetsBuffer::default() produces a zero-length offsets buffer, which
violates the Arrow spec (N elements require N+1 offsets). When
new_null_array() recurses through nested list types via new_empty_array(),
every child level gets an empty offset buffer. PyArrow 14 segfaults in
_table_to_blocks when it encounters these zero-length child offset buffers
during to_pandas() conversion.

Fixes ENG-6097

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Updates ListArray::new_empty to construct offsets via Offsets::new_zeroed(0) (converted into an OffsetsBuffer) instead of using OffsetsBuffer::default(), intended to ensure empty list arrays have a valid single-element [0] offsets buffer.

Changes:

  • Change ListArray::new_empty to use Offsets::new_zeroed(0).into() for offsets initialization.

Comment thread src/array/list/mod.rs
Comment on lines 90 to 93
  pub fn new_empty(data_type: DataType) -> Self {
      let values = new_empty_array(Self::get_child_type(&data_type).clone());
-     Self::new(data_type, OffsetsBuffer::default(), values, None)
+     Self::new(data_type, Offsets::new_zeroed(0).into(), values, None)
  }

JONBRWN commented Apr 28, 2026

Closing in favor of an in-engine fix.

@JONBRWN JONBRWN closed this Apr 28, 2026