feat(blobv2): support all BlobKind types in blob v2 compact_files#7017
Open
yyzhao2025 wants to merge 1 commit into
Open
feat(blobv2): support all BlobKind types in blob v2 compact_files#7017yyzhao2025 wants to merge 1 commit into
yyzhao2025 wants to merge 1 commit into
Conversation
Blob v2 compact_files previously corrupted data for Inline, Packed, and Dedicated BlobKind types. Only External blobs survived compaction correctly. Root causes: - Binary copy copied raw page bytes without understanding blob v2's packed struct descriptor layout, out-of-line buffers, and sidecar file references, causing size=0 for Inline blobs and missing .blob files for Packed/Dedicated blobs - Reencode path received only descriptors (no actual blob data) via BlobsDescriptions mode, resulting in size=0 after re-encoding - BlobPreprocessor and BlobV2StructuralEncoder had buggy descriptor- format pass-through branches that produced incorrect output Fixes: - Disable binary copy for all blob columns in can_use_binary_copy - Add with_row_addr() in prepare_reader for blob v2 datasets to enable reading actual blob data via take_blobs_by_addresses - Add transform_blob_v2_batch to read real blob data and convert descriptor columns to Struct<data, uri> format before re-encoding - Remove buggy descriptor-format handling from BlobPreprocessor - Remove buggy descriptor-format pass-through from BlobV2StructuralEncoder - Resolve External blob URIs by reconstructing absolute paths from base_id and manifest base_paths All 4 BlobKind types (Inline, Packed, Dedicated, External) now work correctly after compact_files. 83 optimize tests and 50 blob tests pass with zero clippy warnings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
feat(blobv2): support all BlobKind types in blob v2 compact_files
compact_filesdid not support blob v2 columns. Running compaction on a dataset with blob v2 columns would corrupt blob data for allBlobKindtypes (Inline, Packed, Dedicated, and External). This PR adds full blob v2 support to the compaction path so that all 4BlobKindtypes are correctly preserved aftercompact_files.Why compaction failed for blob v2
Blob v2 uses a multi-file storage layout where blob payloads live in sidecar
.blobfiles and are referenced through a 5-field descriptor struct (kind, position, size, blob_id, blob_uri). The existing compaction path was designed for blob v1's single-file layout and did not account for this:Binary copy treated blob v2 as opaque bytes.
can_use_binary_copycopied raw page data without understanding the descriptor layout, out-of-line buffers, or sidecar file references. This lost the relationship between descriptors and their.blobfiles for Packed/Dedicated blobs, producedsize=0for Inline blobs, and broke External blob URI resolution sincebase_idreferences to manifestbase_pathswere not preserved in the new fragment context.Re-encode path received only descriptors, no actual blob data.
prepare_readerused the defaultBlobsDescriptionsmode, which unloads blob v2 columns into their descriptor view. Without actual blob bytes,BlobPreprocessorre-encoding producedsize=0for all non-External kinds, and External blob URIs could not be correctly resolved without the originalbase_id→base_pathsmapping.Descriptor-format pass-through branches were incorrect.
BlobPreprocessorandBlobV2StructuralEncoderhad branches intended to handle descriptor-only data, but these produced wrong output without backing.blobfiles.Changes
can_use_binary_copy, preventing raw page-level copying that ignores blob v2's multi-file layout.with_row_addr()inprepare_readerfor blob v2 datasets, enabling row-address-based access to actual blob data viatake_blobs_by_addresses.transform_blob_v2_batchto read real blob data per-row and convert descriptor columns to theStruct<data, uri, position, size>user-view format expected byBlobPreprocessorbefore re-encoding. This function:classify_rowstake_blobs_by_addresses(lazy, per-row loading to avoid OOM)base_idand manifestbase_pathsBlobPreprocessorBlobV2StructuralEncoderbase_idand manifestbase_pathsduring compactionBLOB_V2_USER_FIELDS/BLOB_V2_USER_TYPEconstants inlance-core::datatypesto eliminate schema hardcoding between the transform and write pathsTesting
12 new blob v2 compact tests cover all
BlobKindtypes and edge cases:test_compact_blob_v2_preserves_external_referencestest_compact_blob_v2_packed_and_dedicatedtest_compact_blob_v2_with_null_rowstest_compact_blob_v2_deleted_rows_not_resurrectedtest_compact_blob_v2_external_and_data_blob_mixedtest_compact_blob_v2_multiple_blob_columnstest_compact_blob_v2_external_and_null_mixedtest_compact_blob_v2_all_null_and_all_external_fragmentstest_compact_blob_v2_external_with_multiple_base_idstest_compact_blob_v2_large_blobstest_compact_blob_v2_blob_kind_reclassificationtest_compact_blob_v2_multi_batchAll tests verify actual blob bytes via
take_blobs, not just descriptor metadata. 83 optimize tests and 50 blob tests pass with zero clippy warnings on changed files.Related Issue
Closes #6938