Skip to content

fix: preserve zero-length buffers in binary copy compaction#6992

Open
zhangyue19921010 wants to merge 1 commit into
lance-format:mainfrom
zhangyue19921010:fix-binary-copy-zero-buffer
Open

fix: preserve zero-length buffers in binary copy compaction#6992
zhangyue19921010 wants to merge 1 commit into
lance-format:mainfrom
zhangyue19921010:fix-binary-copy-zero-buffer

Conversation

@zhangyue19921010
Copy link
Copy Markdown
Contributor

closes: #6991

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 32.14286% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/optimize/binary_copy.rs 32.14% 17 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@yanghua
Copy link
Copy Markdown
Collaborator

yanghua commented May 31, 2026

@claude review

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — targeted fix with a regression test.

Extended reasoning...

Overview

This PR fixes a bug in binary copy compaction where zero-length buffers (e.g., a column of all empty strings) caused failures during compaction and subsequent scalar index creation. The change touches rust/lance/src/dataset/optimize/binary_copy.rs in two analogous places (page-level buffer loop and column-level buffer loop) and adds a regression test in tests/binary_copy.rs.

Security risks

None apparent. This is internal data-file rewriting logic with no external input boundary, no auth/crypto/permissions, and no new I/O surface.

Level of scrutiny

Moderate — this is on the compaction write path (not user-facing API), but the change is small (~48 lines) and the logic is symmetric in both spots: filter zero-size entries before submit_request so the scheduler isn't asked for 0-byte ranges, then on the write side iterate the original buffer_offsets_and_sizes and emit (start, 0) placeholders for zero-size entries without consuming from bytes_iter. This preserves buffer count and ordering in the resulting page/column metadata.

Other factors

A regression test was added that exercises the original failure mode (4000 empty strings → ForceBinaryCopy compaction → create_index(Scalar)), run across all non-legacy file versions. Codecov flagged low patch coverage (32%) but the missing lines appear to be the defensive Error::execution branches for the bytes_iter.next() mismatch — those are unreachable in normal operation by construction. The PR has already been merged as 6bd78b5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary Copy Compaction Drops Zero-Length Buffers In V2.0

2 participants