fix: preserve zero-length buffers in binary copy compaction #6992
zhangyue19921010 wants to merge 1 commit into
Codecov Report❌ Patch coverage is
@claude review
LGTM — targeted fix with a regression test.
Overview
This PR fixes a bug in binary copy compaction where zero-length buffers (e.g., a column of all empty strings) caused failures during compaction and subsequent scalar index creation. The change touches rust/lance/src/dataset/optimize/binary_copy.rs in two analogous places (page-level buffer loop and column-level buffer loop) and adds a regression test in tests/binary_copy.rs.
Security risks
None apparent. This is internal data-file rewriting logic with no external input boundary, no auth/crypto/permissions, and no new I/O surface.
Level of scrutiny
Moderate. This is on the compaction write path (not a user-facing API), but the change is small (~48 lines) and the logic is symmetric in both spots. On the read side, zero-size entries are filtered out before submit_request so the scheduler isn't asked for 0-byte ranges. On the write side, the code iterates the original buffer_offsets_and_sizes and emits (start, 0) placeholders for zero-size entries without consuming from bytes_iter. This preserves buffer count and ordering in the resulting page/column metadata.
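To make the filter-then-placeholder pattern concrete, here is a minimal, self-contained sketch. It is not the code from binary_copy.rs: the function names `ranges_to_read` and `rewrite_buffers`, the `OffsetAndSize` alias, the `String` error type, and the in-memory `Vec<u8>` output are all hypothetical stand-ins chosen for illustration, and the real implementation reads through the I/O scheduler rather than taking pre-fetched bytes.

```rust
use std::ops::Range;

// Hypothetical stand-in for the (offset, size) pairs kept in page/column metadata.
type OffsetAndSize = (u64, u64);

/// Read side: build byte ranges only for non-empty buffers, so that no
/// 0-byte range is ever requested.
fn ranges_to_read(buffers: &[OffsetAndSize]) -> Vec<Range<u64>> {
    buffers
        .iter()
        .filter(|(_, size)| *size > 0)
        .map(|(offset, size)| *offset..*offset + *size)
        .collect()
}

/// Write side: walk the ORIGINAL buffer list. Zero-size entries get a
/// `(start, 0)` placeholder and do not consume from `bytes_iter`; every
/// other entry consumes exactly one fetched buffer.
fn rewrite_buffers(
    buffers: &[OffsetAndSize],
    fetched: Vec<Vec<u8>>,
    out: &mut Vec<u8>,
) -> Result<Vec<OffsetAndSize>, String> {
    let mut bytes_iter = fetched.into_iter();
    let mut new_meta = Vec::with_capacity(buffers.len());
    for &(_, size) in buffers {
        // New offset is the current write position in the output.
        let start = out.len() as u64;
        if size == 0 {
            new_meta.push((start, 0));
            continue;
        }
        // Defensive branch: only reachable if the read side returned fewer
        // buffers than there are non-empty entries.
        let bytes = bytes_iter
            .next()
            .ok_or_else(|| "fewer fetched buffers than non-empty entries".to_string())?;
        out.extend_from_slice(&bytes);
        new_meta.push((start, bytes.len() as u64));
    }
    Ok(new_meta)
}
```

In this sketch the output metadata always has the same length and order as the input list, which is the property the review relies on. The error branch corresponds to the kind of defensive mismatch check that should not trigger when the read and write sides apply the same zero-size filter.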
Other factors
A regression test was added that exercises the original failure mode (4000 empty strings → ForceBinaryCopy compaction → create_index(Scalar)), run across all non-legacy file versions. Codecov flagged low patch coverage (32%) but the missing lines appear to be the defensive Error::execution branches for the bytes_iter.next() mismatch — those are unreachable in normal operation by construction. The PR has already been merged as 6bd78b5.
closes: #6991