feat(blobv2): support all BlobKind types in blob v2 compact_files#7017

Open
yyzhao2025 wants to merge 1 commit into
lance-format:mainfrom
yyzhao2025:yyzhao2025/compact_files_blobv2

yyzhao2025 commented May 31, 2026

feat(blobv2): support all BlobKind types in blob v2 compact_files

compact_files did not support blob v2 columns. Running compaction on a dataset with blob v2 columns would corrupt blob data for all BlobKind types (Inline, Packed, Dedicated, and External). This PR adds full blob v2 support to the compaction path so that all 4 BlobKind types are correctly preserved after compact_files.

Why compaction failed for blob v2

Blob v2 uses a multi-file storage layout where blob payloads live in sidecar .blob files and are referenced through a 5-field descriptor struct (kind, position, size, blob_id, blob_uri). The existing compaction path was designed for blob v1's single-file layout and did not account for this:

  1. Binary copy treated blob v2 as opaque bytes. can_use_binary_copy copied raw page data without understanding the descriptor layout, out-of-line buffers, or sidecar file references. This lost the relationship between descriptors and their .blob files for Packed/Dedicated blobs, produced size=0 for Inline blobs, and broke External blob URI resolution since base_id references to manifest base_paths were not preserved in the new fragment context.

  2. Re-encode path received only descriptors, no actual blob data. prepare_reader used the default BlobsDescriptions mode, which unloads blob v2 columns into their descriptor view. Without actual blob bytes, BlobPreprocessor re-encoding produced size=0 for all non-External kinds, and External blob URIs could not be correctly resolved without the original base_id-to-base_paths mapping.

  3. Descriptor-format pass-through branches were incorrect. BlobPreprocessor and BlobV2StructuralEncoder had branches intended to handle descriptor-only data, but these produced wrong output without backing .blob files.
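To make the layout concrete, here is a minimal sketch of the five-field descriptor described above. Only the field names and the four kind names come from this PR description; the type names `BlobKind` and `BlobDescriptor`, the field types, and the `needs_sidecar` helper are illustrative assumptions and are not the actual Lance definitions.

```rust
/// Hypothetical model of the four blob kinds named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlobKind {
    Inline,
    Packed,
    Dedicated,
    External,
}

/// Hypothetical model of the 5-field descriptor.
/// Field names follow the PR description; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BlobDescriptor {
    kind: BlobKind,
    position: u64,
    size: u64,
    blob_id: u32,
    blob_uri: Option<String>,
}

impl BlobDescriptor {
    /// Returns true for Packed and Dedicated blobs, which the description
    /// above says lose their .blob file relationship under a raw page copy.
    fn needs_sidecar(&self) -> bool {
        matches!(self.kind, BlobKind::Packed | BlobKind::Dedicated)
    }
}
```

A descriptor like this carries only a reference to the payload, which is why copying or re-encoding the descriptor alone cannot reproduce the blob bytes.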

Changes

  • Disable binary copy for all blob columns in can_use_binary_copy, preventing raw page-level copying that ignores blob v2's multi-file layout.
  • Add with_row_addr() in prepare_reader for blob v2 datasets, enabling row-address-based access to actual blob data via take_blobs_by_addresses.
  • Add transform_blob_v2_batch to read real blob data per-row and convert descriptor columns to the Struct<data, uri, position, size> user-view format expected by BlobPreprocessor before re-encoding. This function:
    • Classifies each row as Null, External, or DataBlob (Inline/Packed/Dedicated) via classify_rows
    • Reads actual blob bytes for DataBlob rows using take_blobs_by_addresses (lazy, per-row loading to avoid OOM)
    • Reconstructs absolute URIs for External blobs from base_id and manifest base_paths
    • Assembles the user-view struct with correct null propagation
  • Remove incorrect descriptor-format handling from BlobPreprocessor
  • Remove incorrect descriptor-format pass-through from BlobV2StructuralEncoder
  • Resolve External blob URIs by reconstructing absolute paths from base_id and manifest base_paths during compaction
  • Add shared BLOB_V2_USER_FIELDS / BLOB_V2_USER_TYPE constants in lance-core::datatypes to eliminate schema hardcoding between the transform and write paths
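The row classification and External URI reconstruction steps listed above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code in this PR: `RowClass`, `classify_row`, and `resolve_external_uri` are hypothetical names, base_paths is modeled as a plain `HashMap<u32, String>`, and the rule that a missing base_id means the URI is already absolute is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical model of the four blob kinds named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlobKind {
    Inline,
    Packed,
    Dedicated,
    External,
}

/// The three row classes described in the change list.
#[derive(Debug, PartialEq, Eq)]
enum RowClass {
    Null,
    External,
    DataBlob,
}

/// Classify one row from its nullable descriptor kind.
/// Inline, Packed, and Dedicated all need their bytes read back.
fn classify_row(kind: Option<BlobKind>) -> RowClass {
    match kind {
        None => RowClass::Null,
        Some(BlobKind::External) => RowClass::External,
        Some(BlobKind::Inline | BlobKind::Packed | BlobKind::Dedicated) => RowClass::DataBlob,
    }
}

/// Rebuild an absolute URI from a base_id and a base_paths lookup table.
/// Returns None when the base_id is not present in the table.
fn resolve_external_uri(
    base_paths: &HashMap<u32, String>,
    base_id: Option<u32>,
    blob_uri: &str,
) -> Option<String> {
    match base_id {
        // Assumption: without a base_id the stored URI is already absolute.
        None => Some(blob_uri.to_string()),
        Some(id) => {
            let base = base_paths.get(&id)?;
            Some(format!(
                "{}/{}",
                base.trim_end_matches('/'),
                blob_uri.trim_start_matches('/')
            ))
        }
    }
}
```

Classifying first keeps the expensive per-row byte reads limited to DataBlob rows, while Null and External rows are handled from descriptor metadata alone.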

Testing

12 new blob v2 compact tests cover all BlobKind types and edge cases:

| Scenario | Test |
| --- | --- |
| External + Inline mixed | test_compact_blob_v2_preserves_external_references |
| Inline + Packed + Dedicated | test_compact_blob_v2_packed_and_dedicated |
| NULL rows | test_compact_blob_v2_with_null_rows |
| Deleted rows not resurrected | test_compact_blob_v2_deleted_rows_not_resurrected |
| External + DataBlob mixed | test_compact_blob_v2_external_and_data_blob_mixed |
| Multiple blob v2 columns | test_compact_blob_v2_multiple_blob_columns |
| External + NULL mixed | test_compact_blob_v2_external_and_null_mixed |
| All-NULL / All-External fragments | test_compact_blob_v2_all_null_and_all_external_fragments |
| Multiple base_ids | test_compact_blob_v2_external_with_multiple_base_ids |
| Large blob lazy loading | test_compact_blob_v2_large_blobs |
| BlobKind reclassification | test_compact_blob_v2_blob_kind_reclassification |
| Multi-batch processing | test_compact_blob_v2_multi_batch |

All tests verify actual blob bytes via take_blobs, not just descriptor metadata. 83 optimize tests and 50 blob tests pass with zero clippy warnings on changed files.

Related Issue

Closes #6938

Blob v2 compact_files previously corrupted data for Inline, Packed, and
Dedicated BlobKind types. Only External blobs survived compaction
correctly.

Root causes:
- Binary copy copied raw page bytes without understanding blob v2's
  packed struct descriptor layout, out-of-line buffers, and sidecar
  file references, causing size=0 for Inline blobs and missing .blob
  files for Packed/Dedicated blobs
- Reencode path received only descriptors (no actual blob data) via
  BlobsDescriptions mode, resulting in size=0 after re-encoding
- BlobPreprocessor and BlobV2StructuralEncoder had buggy descriptor-
  format pass-through branches that produced incorrect output

Fixes:
- Disable binary copy for all blob columns in can_use_binary_copy
- Add with_row_addr() in prepare_reader for blob v2 datasets to enable
  reading actual blob data via take_blobs_by_addresses
- Add transform_blob_v2_batch to read real blob data and convert
  descriptor columns to Struct<data, uri> format before re-encoding
- Remove buggy descriptor-format handling from BlobPreprocessor
- Remove buggy descriptor-format pass-through from BlobV2StructuralEncoder
- Resolve External blob URIs by reconstructing absolute paths from
  base_id and manifest base_paths

All 4 BlobKind types (Inline, Packed, Dedicated, External) now work
correctly after compact_files. 83 optimize tests and 50 blob tests
pass with zero clippy warnings.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 31, 2026

Development

Successfully merging this pull request may close these issues.

feat(blobv2):support for compaction

1 participant