feat: blob v2 descriptor read support by geruh · Pull Request #548 · lance-format/lance-spark

geruh · 2026-05-21T00:51:32Z

related to #539 and #505

Adds read support for blob v2 columns. Instead of hiding blob metadata behind virtual columns, we surface the raw descriptor struct directly to Spark, as the original issue proposed:

struct<kind: short, position: long, size: long, blob_id: long, blob_uri: string>

Querying blob metadata is just a column projection now, no byte fetch:

SELECT id, payload.size, payload.kind FROM lance.ns.tbl;
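For code that consumes these rows outside SQL, the descriptor can be modeled as a small record. This is only an illustrative sketch: `BlobKind`, `BlobDescriptor`, and `from_row` are hypothetical names, the kind codes follow the mapping used in the local verification script in this description (0 = Inline, 1 = Packed, 2 = Dedicated, 3 = External), and treating an empty `blob_uri` as "no URI" is an assumption based on how the Spark output renders it.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class BlobKind(IntEnum):
    # Codes as they appear in this PR's verification output; names are illustrative.
    INLINE = 0
    PACKED = 1
    DEDICATED = 2
    EXTERNAL = 3


@dataclass(frozen=True)
class BlobDescriptor:
    """Mirrors struct<kind, position, size, blob_id, blob_uri>."""
    kind: int
    position: int
    size: int
    blob_id: int
    blob_uri: Optional[str]

    @property
    def kind_name(self) -> str:
        # "INLINE" -> "Inline", matching the tier names used in the description.
        return BlobKind(self.kind).name.title()

    @property
    def is_external(self) -> bool:
        return self.kind == BlobKind.EXTERNAL


def from_row(row: dict) -> BlobDescriptor:
    # `row` stands in for a Spark Row converted to a dict (hypothetical helper).
    # Assumption: an empty blob_uri string means "no URI".
    return BlobDescriptor(
        kind=row["kind"],
        position=row["position"],
        size=row["size"],
        blob_id=row["blob_id"],
        blob_uri=row.get("blob_uri") or None,
    )
```

Because only the descriptor fields are touched, nothing in this sketch needs the blob bytes themselves.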

A column is treated as blob v2 when either of these holds:

the Arrow extension name lance.blob.v2 is set in Lance
the metadata key lance-encoding:blob-v2 is true
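The detection rule above can be sketched as a predicate over a field's metadata map. This is not the code in `BlobUtils`; the function name is hypothetical, and it assumes the extension name is exposed through the standard Arrow field-metadata key `ARROW:extension:name` and that the encoding flag is stored as the string `"true"`.

```python
ARROW_EXTENSION_NAME_KEY = "ARROW:extension:name"  # standard Arrow field-metadata key
BLOB_V2_EXTENSION_NAME = "lance.blob.v2"
BLOB_V2_ENCODING_KEY = "lance-encoding:blob-v2"


def is_blob_v2_field(metadata: dict) -> bool:
    """Return True if either marker from the PR description is present."""
    if metadata.get(ARROW_EXTENSION_NAME_KEY) == BLOB_V2_EXTENSION_NAME:
        return True
    # Assumption: the flag is a string and is compared case-insensitively.
    return str(metadata.get(BLOB_V2_ENCODING_KEY, "")).lower() == "true"
```

A field with neither marker, or with the flag set to anything other than `true`, is left as an ordinary column.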

Schema rewrite lives in BlobUtils.applyBlobV2DescriptorSchema(...), called from LanceDataset.schema(), LanceDataSource.inferSchema(), and the LanceScanBuilder constructor.

Filter pushdown

Suppressed for any table with a v2 column. The previous per-predicate gate had to be reverted because calc_eager_projection panics on filters that never reference blob columns. Zonemap fragment pruning still runs. This needs more investigation.

Testing

Added some tests for BlobUtils
Added some tests for LanceArrowColumnVector

No end-to-end pylance interop tests here. The connector can't write v2 blobs yet, so producing test data requires pylance as an external dep. E2E coverage lands naturally once the write path exists, extending BaseBlobCreateTableTest's pattern.

Local verification

Wrote a v2 dataset with pylance hitting all four BlobKind tiers (Inline, Packed, Dedicated, External) plus a null row, then read it back through Spark on this branch:

write_blob_v2.py

import lance, pyarrow as pa

with open("/tmp/blob_v2_external.bin", "wb") as f:
    f.write(b"external blob payload " * 64)

values = [
    b"hi",                                    # 2 B → Inline (kind=0, ≤ 64KB)
    b"x" * (200 * 1024),                      # 200KB → Packed (kind=1, > 64KB, ≤ 4MB)
    b"y" * (5 * 1024 * 1024),                 # 5 MB → Dedicated (kind=2, > 4MB)
    "file:///tmp/blob_v2_external.bin",        # str → External (kind=3)
    None,                                      # null
]

table = pa.table({
    "id": pa.array([0, 1, 2, 3, 4], type=pa.int32()),
    "label": pa.array(["inline", "packed", "dedicated", "external", "null"]),
    "payload": lance.blob_array(values),
})

lance.write_dataset(table, "/tmp/blob_v2_kinds.lance",
    data_storage_version="2.2",
    allow_external_blob_outside_bases=True)
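The size tiers noted in the script's comments can be written down as a small classifier, which is handy for predicting the `kind` column before reading the dataset back. This is a sketch only: `expected_kind` is a hypothetical helper, and the 64 KB and 4 MB thresholds are taken from the comments above rather than checked against the Lance source.

```python
from typing import Optional, Union

KIB = 1024
MIB = 1024 * KIB
INLINE_MAX = 64 * KIB  # threshold from the script's comments (assumption)
PACKED_MAX = 4 * MIB   # threshold from the script's comments (assumption)


def expected_kind(value: Union[bytes, str, None]) -> Optional[int]:
    """Predict the descriptor kind for one input value of lance.blob_array."""
    if value is None:
        return None          # null row; no blob is stored
    if isinstance(value, str):
        return 3             # a URI string becomes an External blob
    size = len(value)
    if size <= INLINE_MAX:
        return 0             # Inline
    if size <= PACKED_MAX:
        return 1             # Packed
    return 2                 # Dedicated
```

Note that the null row comes back from Spark with `kind = 0` in the output below, so a `None` prediction here does not translate into a null `kind` field.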

Spark output

> printSchema
root
 |-- id: integer (nullable = true)
 |-- label: string (nullable = true)
 |-- payload: struct (nullable = true)
 | |-- kind: short (nullable = true)
 | |-- position: long (nullable = true)
 | |-- size: long (nullable = true)
 | |-- blob_id: long (nullable = true)
 | |-- blob_uri: string (nullable = true)
 
 
> SELECT *
+---+---------+----------------------------------------------+
|id |label    |payload                                       |
+---+---------+----------------------------------------------+
|0  |inline   |{0, 0, 2, 0, }                                |
|1  |packed   |{1, 0, 204800, 1, }                           |
|2  |dedicated|{2, 0, 5242880, 2, }                          |
|3  |external |{3, 0, 0, 0, file:///tmp/blob_v2_external.bin}|
|4  |null     |{0, 0, 0, 0, }                                |
+---+---------+----------------------------------------------+

> projected fields
+---+---------+----+-------+-------+--------+--------------------------------+
|id |label    |kind|size   |blob_id|position|blob_uri                        |
+---+---------+----+-------+-------+--------+--------------------------------+
|0  |inline   |0   |2      |0      |0       |                                |
|1  |packed   |1   |204800 |1      |0       |                                |
|2  |dedicated|2   |5242880|2      |0       |                                |
|3  |external |3   |0      |0      |0       |file:///tmp/blob_v2_external.bin|
|4  |null     |0   |0      |0      |0       |                                |
+---+---------+----+-------+-------+--------+--------------------------------+

> filter id >= 2
+---+---------+----+
|id |label    |kind|
+---+---------+----+
|2  |dedicated|2   |
|3  |external |3   |
|4  |null     |0   |
+---+---------+----+

hamersaw · 2026-05-28T15:25:06Z

Great PR! I think this opens a few questions. In #355 we merged a similar concept for v1 blobs but took quite a different approach, so maybe it's worth stating the goals here to understand the advantages and disadvantages.

That work chose to encode the metadata into a binary format and then transparently retrieve the data when writing, which gives us:
(1) native binary types in the schema: Able to call getBinary internally to grab the binary data.
(2) write side coalesced reads: If the BlobReference type is found we can batch together take operations when writing so that we don't incur 1 IOP per row.

Alternatively, this approach returns a struct with metadata, which makes it super easy to parse information about the blob, but IMO may make actually retrieving the data a little less ergonomic and more expensive. I'm interested in your thoughts here.

SDK-wise I think it's a tradeoff between the primary use cases: with either approach, to retrieve both the metadata and the actual binary data we would probably need to add Spark functions. For example:

-- encoded as binary data
SELECT x, lance_blob_size(x) FROM y

-- encoded as struct
SELECT lance_blob(x), x.size FROM y

But I think the performance of our use-case should be the top concern here.

geruh · 2026-05-28T18:59:24Z

Thanks for the review @hamersaw! Yeah, I agree that this has a different goal from #355. For this PR, I really just wanted to scope v2 reads to the descriptor model, to follow what people might be used to when interacting with LanceDB or the raw Lance APIs, which default to a descriptor read mode.

In this mode, materialization is cheap because it's just the descriptor column value. That's intentionally different from the v1 binary blob reference path, which optimizes copying through Spark and write-side preservation by keeping getBinary semantics and batching blob fetches.

For fetching the actual bytes, we can add an explicit follow-up read mode similar to the blob handling modes defined on the LanceDB side (i.e. lazy, materialize). That follow-up would have the same performance story we have in v1, with the additional gains that stem from v2 on the format side.

Ultimately, my intent here is to keep things simple: descriptor first, then follow up with read modes.

hamersaw · 2026-05-29T14:15:28Z

For posterity - we addressed the conversation ^^^ offline. The consensus is that materializing the blob metadata as a Struct field in the Spark schema is a reasonable approach as long as it's well documented.

The only other comment I have is about what the PR description says on filter pushdown:

Suppressed for any table with a v2 column. The previous per-predicate gate had to be reverted because calc_eager_projection panics on filters that never reference blob columns. Zonemap fragment pruning still runs. This needs more investigation.

I'm not seeing this logic anywhere in the code. It looks like the hasBlobV2Fields function was originally meant for this, but it doesn't seem to be used anywhere. I think it's OK to allow filter pushdowns as long as they're not on the blob v2 column specifically, right?
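To make the suggested policy concrete, here is a minimal sketch of a per-predicate split: filters that reference a blob v2 column stay in Spark, everything else is pushed down. The filter model (a description plus the set of referenced column names) and the function name are hypothetical and are not the connector's API. It only illustrates the policy; the PR description reports that an earlier per-predicate gate hit a panic in calc_eager_projection, so this is not a claim that the split is safe as-is.

```python
from typing import Iterable, List, Set, Tuple

# A filter is modeled as (description, referenced column names); hypothetical shape.
Filter = Tuple[str, Set[str]]


def split_pushdown(
    filters: Iterable[Filter], blob_v2_columns: Set[str]
) -> Tuple[List[str], List[str]]:
    """Return (pushed, retained) filter descriptions."""
    pushed: List[str] = []
    retained: List[str] = []
    for description, columns in filters:
        # Nested references such as "payload.size" count as touching "payload".
        top_level = {column.split(".", 1)[0] for column in columns}
        if top_level & blob_v2_columns:
            retained.append(description)  # evaluated by Spark after the scan
        else:
            pushed.append(description)    # safe candidate for pushdown
    return pushed, retained
```

With the verification dataset above, `id >= 2` would be pushed while a predicate on `payload.size` would be retained.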

hamersaw · 2026-05-29T15:30:37Z

+          .add("kind", DataTypes.ShortType)
+          .add("position", DataTypes.LongType)
+          .add("size", DataTypes.LongType)
+          .add("blob_id", DataTypes.LongType)


Does it make sense to store position / size, or would it be better to store the _rowaddr? When we read this, the former means we need to open an object store reader and request the bytes; with _rowaddr we can use the Lance-native tools, e.g. take_blobs I think, which handle coalesced reads, etc.
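For context on what "coalesced reads" buys, here is a generic sketch of merging (position, size) ranges into fewer reads. It is not how take_blobs is implemented; the function name and the `max_gap` parameter are hypothetical, and it assumes all ranges refer to the same underlying file.

```python
from typing import List, Tuple


def coalesce_ranges(
    ranges: List[Tuple[int, int]], max_gap: int = 0
) -> List[Tuple[int, int]]:
    """Merge (position, size) ranges into (start, end) reads.

    Two ranges are merged when the gap between them is at most max_gap bytes,
    trading a few wasted bytes for fewer I/O requests.
    """
    merged: List[Tuple[int, int]] = []
    for position, size in sorted(ranges):
        end = position + size
        if merged and position - merged[-1][1] <= max_gap:
            # Extend the previous read; max() handles overlapping ranges.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((position, end))
    return merged
```

With position / size in the descriptor, the connector would have to do this kind of planning itself; the point of the question above is that a row-address-based path could delegate it to Lance.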

github-actions Bot added the enhancement New feature or request label May 21, 2026

geruh force-pushed the v2read-descriptor branch 2 times, most recently from eae03ac to 1f819d6 Compare May 26, 2026 01:57

geruh mentioned this pull request May 26, 2026

feat: blob v2 write support #560

Open

feat: blob v2 descriptor read support

41b3682

geruh force-pushed the v2read-descriptor branch from 1f819d6 to 41b3682 Compare May 27, 2026 00:39

hamersaw reviewed May 29, 2026

geruh wants to merge 1 commit into lance-format:main from geruh:v2read-descriptor
