Motivation
Datasets containing blob v2 columns (including external reference types) cannot be compacted. Calling compact_files() on such datasets raises a decoder error. Additionally, even if the decoder error were bypassed, external references would be incorrectly downgraded to inline/packed storage during re-encoding.
Reproduction
test code
import os
import tempfile

import lance
import pyarrow as pa
from lance import Blob

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small local file to act as the external blob target.
    local_file_path = os.path.join(tmp_dir, "external_blob.bin")
    with open(local_file_path, "wb") as f:
        f.write(b"hello world")

    # Reference the file by URI instead of embedding its bytes.
    file_uri = f"file://{local_file_path}"
    blob_ref = Blob.from_uri(file_uri, position=0, size=11)

    schema = pa.schema([
        pa.field("id", pa.int32()),
        lance.blob_field("blob"),
    ])
    table = pa.table({
        "id": [1, 2, 3],
        "blob": lance.blob_array([blob_ref, blob_ref, blob_ref]),
    }, schema=schema)

    # max_rows_per_file=1 splits the three rows across separate files,
    # so compaction has something to merge.
    ds = lance.write_dataset(
        table, os.path.join(tmp_dir, "test_dataset"),
        schema=schema,
        max_rows_per_file=1,
        data_storage_version="2.2",
        allow_external_blob_outside_bases=True,
    )

    ds.optimize.compact_files(num_threads=1)  # 💥 Error
error
Invalid user input: there were more fields in the schema than provided column indices / infos, /Users/bytedance/RustProject/community/lance/rust/lance-encoding/src/decoder.rs:454:13