feat(blobv2): support for compaction #6938

@yyzhao2025

Description

Motivation

Datasets that contain blob v2 columns, including columns that hold external references, cannot currently be compacted: calling compact_files() on such a dataset raises a decoder error. Even if the decoder error were bypassed, re-encoding would incorrectly downgrade external references to inline/packed storage.
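
To make the second problem concrete, here is a minimal, purely illustrative sketch of what "downgrading" an external reference during a rewrite means. It is not Lance's implementation: `BlobValue`, `compact`, `preserve_external`, and `read_external` are hypothetical names invented for this example, and the model ignores the on-disk format entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class BlobValue:
    """Hypothetical stand-in for one blob cell: inline bytes or an external reference."""
    data: Optional[bytes] = None   # set when the bytes are stored inline
    uri: Optional[str] = None      # set when the cell points at external storage
    position: int = 0
    size: int = 0

    @property
    def is_external(self) -> bool:
        return self.uri is not None


def compact(
    fragments: List[List[BlobValue]],
    preserve_external: bool = True,
    read_external: Optional[Callable[[str, int, int], bytes]] = None,
) -> List[BlobValue]:
    """Merge many small fragments into one (toy model of compaction).

    With preserve_external=False, external references are materialized and
    stored inline, which is the "downgrade" described above.
    """
    merged: List[BlobValue] = []
    for fragment in fragments:
        for value in fragment:
            if value.is_external and not preserve_external:
                # Assumption: read_external fetches the referenced byte range.
                payload = read_external(value.uri, value.position, value.size)
                value = BlobValue(data=payload, size=len(payload))
            merged.append(value)
    return merged
```

In this toy model, the desired behavior corresponds to `preserve_external=True`: the merged output still carries the original URI, position, and size instead of a copy of the bytes.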

Reproduction

Test code

import os
import tempfile

import lance
import pyarrow as pa
from lance import Blob

with tempfile.TemporaryDirectory() as tmp_dir:
    local_file_path = os.path.join(tmp_dir, "external_blob.bin")
    with open(local_file_path, "wb") as f:
        f.write(b"hello world")

    file_uri = f"file://{local_file_path}"
    blob_ref = Blob.from_uri(file_uri, position=0, size=11)

    schema = pa.schema([
        pa.field("id", pa.int32()),
        lance.blob_field("blob")
    ])

    table = pa.table({
        "id": [1, 2, 3],
        "blob": lance.blob_array([blob_ref, blob_ref, blob_ref])
    }, schema=schema)

    ds = lance.write_dataset(
        table, os.path.join(tmp_dir, "test_dataset"),
        schema=schema,
        max_rows_per_file=1,
        data_storage_version="2.2",
        allow_external_blob_outside_bases=True
    )

    ds.optimize.compact_files(num_threads=1)  # 💥 Error

Error

Invalid user input: there were more fields in the schema than provided column indices / infos, /Users/bytedance/RustProject/community/lance/rust/lance-encoding/src/decoder.rs:454:13
