
feat: blob v2 descriptor read support#548

Open
geruh wants to merge 1 commit into lance-format:main from geruh:v2read-descriptor


@geruh
Collaborator

@geruh geruh commented May 21, 2026

related to #539 and #505

Adds read support for blob v2 columns. Instead of hiding blob metadata behind virtual columns, we surface the raw descriptor struct directly to Spark, as the original issue proposed.

struct<kind: short, position: long, size: long, blob_id: long, blob_uri: string>

Querying blob metadata is just a column projection now, no byte fetch:

SELECT id, payload.size, payload.kind FROM lance.ns.tbl;

A column is blob v2 when any of these hold:

  • Arrow extension name lance.blob.v2 is set in lance
  • metadata key lance-encoding:blob-v2 = true
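
As a rough illustration of the rule above, the check can be sketched in Python over a field's metadata map. This is a hedged sketch, not the connector's Java code: `is_blob_v2` is a hypothetical helper, it assumes the extension name is carried under Arrow's standard `ARROW:extension:name` field-metadata key, and it assumes the encoding flag is stored as a string.

```python
# Hypothetical sketch of the blob v2 detection rule; not the connector's code.
ARROW_EXTENSION_NAME_KEY = "ARROW:extension:name"  # standard Arrow field-metadata key
BLOB_V2_EXTENSION_NAME = "lance.blob.v2"
BLOB_V2_ENCODING_KEY = "lance-encoding:blob-v2"


def is_blob_v2(field_metadata):
    """Return True when either marker from the PR description is present."""
    metadata = field_metadata or {}
    if metadata.get(ARROW_EXTENSION_NAME_KEY) == BLOB_V2_EXTENSION_NAME:
        return True
    # Assumption: the flag is a string and is compared case-insensitively.
    return str(metadata.get(BLOB_V2_ENCODING_KEY, "")).lower() == "true"
```

A field with neither marker, or with no metadata at all, is treated as a regular column.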

Schema rewrite lives in BlobUtils.applyBlobV2DescriptorSchema(...), called from LanceDataset.schema(), LanceDataSource.inferSchema(), and the LanceScanBuilder constructor.
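
To make the rewrite concrete, here is a minimal Python sketch of what swapping a blob v2 field for the descriptor struct could look like. The `Field` tuple, `apply_descriptor_schema`, and the `is_blob_v2` predicate argument are all hypothetical stand-ins; the real logic is the Java method `BlobUtils.applyBlobV2DescriptorSchema(...)` and may differ in detail.

```python
# Hypothetical sketch of the descriptor schema rewrite; not the connector's code.
from typing import Callable, List, NamedTuple, Optional

# Descriptor struct as given in the PR description.
DESCRIPTOR_TYPE = (
    "struct<kind: short, position: long, size: long, "
    "blob_id: long, blob_uri: string>"
)


class Field(NamedTuple):
    name: str
    data_type: str
    metadata: Optional[dict] = None


def apply_descriptor_schema(
    fields: List[Field], is_blob_v2: Callable[[Field], bool]
) -> List[Field]:
    """Replace the declared type of each blob v2 field with the descriptor struct.

    Non-blob fields pass through unchanged, so the rewrite is a no-op for
    tables without blob v2 columns.
    """
    return [
        f._replace(data_type=DESCRIPTOR_TYPE) if is_blob_v2(f) else f
        for f in fields
    ]
```

Taking the predicate as an argument keeps the sketch independent of how detection is implemented, and calling the same function from every schema entry point mirrors the three call sites listed above.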

Filter pushdown

Suppressed for any table with a v2 column. The previous per-predicate gate had to be reverted because calc_eager_projection panics on filters that never reference blob columns. Zonemap fragment pruning still runs. Need more investigation here

Testing

  • Added some tests for BlobUtils
  • Added some tests for LanceArrowColumnVector

No end-to-end pylance interop tests here. The connector can't write v2 blobs yet, so producing test data requires pylance as an external dep. E2E coverage lands naturally once the write path exists, extending BaseBlobCreateTableTest's pattern.

Local verification

Wrote a v2 dataset with pylance hitting all four BlobKind tiers (Inline, Packed, Dedicated, External) plus a null row, then read it back through Spark on this branch:

write_blob_v2.py
import lance, pyarrow as pa

with open("/tmp/blob_v2_external.bin", "wb") as f:
    f.write(b"external blob payload " * 64)

values = [
    b"hi",                                    # 2 B → Inline (kind=0, ≤ 64KB)
    b"x" * (200 * 1024),                      # 200KB → Packed (kind=1, > 64KB, ≤ 4MB)
    b"y" * (5 * 1024 * 1024),                 # 5 MB → Dedicated (kind=2, > 4MB)
    "file:///tmp/blob_v2_external.bin",        # str → External (kind=3)
    None,                                      # null
]

table = pa.table({
    "id": pa.array([0, 1, 2, 3, 4], type=pa.int32()),
    "label": pa.array(["inline", "packed", "dedicated", "external", "null"]),
    "payload": lance.blob_array(values),
})

lance.write_dataset(table, "/tmp/blob_v2_kinds.lance",
    data_storage_version="2.2",
    allow_external_blob_outside_bases=True)

Spark output
> printSchema
root
 |-- id: integer (nullable = true)
 |-- label: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- kind: short (nullable = true)
 |    |-- position: long (nullable = true)
 |    |-- size: long (nullable = true)
 |    |-- blob_id: long (nullable = true)
 |    |-- blob_uri: string (nullable = true)
 
 
> SELECT *
+---+---------+----------------------------------------------+
|id |label    |payload                                       |
+---+---------+----------------------------------------------+
|0  |inline   |{0, 0, 2, 0, }                               |
|1  |packed   |{1, 0, 204800, 1, }                          |
|2  |dedicated|{2, 0, 5242880, 2, }                         |
|3  |external |{3, 0, 0, 0, file:///tmp/blob_v2_external.bin}|
|4  |null     |{0, 0, 0, 0, }                               |
+---+---------+----------------------------------------------+

> projected fields
+---+---------+----+-------+-------+--------+--------------------------------+
|id |label    |kind|size   |blob_id|position|blob_uri                        |
+---+---------+----+-------+-------+--------+--------------------------------+
|0  |inline   |0   |2      |0      |0       |                                |
|1  |packed   |1   |204800 |1      |0       |                                |
|2  |dedicated|2   |5242880|2      |0       |                                |
|3  |external |3   |0      |0      |0       |file:///tmp/blob_v2_external.bin|
|4  |null     |0   |0      |0      |0       |                                |
+---+---------+----+-------+-------+--------+--------------------------------+

> filter id >= 2
+---+---------+----+
|id |label    |kind|
+---+---------+----+
|2  |dedicated|2   |
|3  |external |3   |
|4  |null     |0   |
+---+---------+----+
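
The tier boundaries noted in the script's comments can be restated as a small classifier and checked against the `kind` values in the output above. This is a hedged sketch: `classify_blob` is a hypothetical helper, the 64 KB and 4 MB thresholds come from the comments in write_blob_v2.py rather than from the Lance format specification, and null handling is left out.

```python
# Hypothetical restatement of the BlobKind tiers used in write_blob_v2.py.
INLINE_MAX = 64 * 1024          # assumed Inline ceiling, from the script comments
PACKED_MAX = 4 * 1024 * 1024    # assumed Packed ceiling, from the script comments

KIND_NAMES = {0: "Inline", 1: "Packed", 2: "Dedicated", 3: "External"}


def classify_blob(value):
    """Return the kind code for a bytes payload or an external URI string."""
    if isinstance(value, str):
        return 3  # a string is treated as a URI pointing at an external blob
    size = len(value)
    if size <= INLINE_MAX:
        return 0
    if size <= PACKED_MAX:
        return 1
    return 2
```

Applied to the four non-null values in the script, this yields kinds 0, 1, 2, and 3, matching rows 0 through 3 of the Spark output.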

@github-actions github-actions Bot added the enhancement New feature or request label May 21, 2026
@geruh geruh force-pushed the v2read-descriptor branch 2 times, most recently from eae03ac to 1f819d6 Compare May 26, 2026 01:57
@geruh geruh force-pushed the v2read-descriptor branch from 1f819d6 to 41b3682 Compare May 27, 2026 00:39
@hamersaw
Collaborator

Great PR! I think this opens a few questions. In #355 we merged a similar concept for v1 blobs but took quite a different approach, so maybe it's worth stating goals here to understand the advantages and disadvantages.

That work chose to encode metadata into a binary format and then transparently retrieve when writing which gives us:
(1) native binary types in the schema: Able to call getBinary internally to grab the binary data.
(2) write side coalesced reads: If the BlobReference type is found we can batch together take operations when writing so that we don't incur 1 IOP per row.

Alternatively, this approach returns a struct with metadata, which makes it super easy to parse information about the blob but IMO may make actually retrieving the data a little less ergonomic and more expensive. I'm interested in your thoughts here.

SDK-wise I think it's a tradeoff of the root uses: with either approach, to retrieve both the metadata and the actual binary data we would probably need to add Spark functions. For example:

# encoding as binary data
SELECT x, lance_blob_size(x) FROM y

# encoded as struct
SELECT lance_blob(x), x.size FROM y

But I think the performance of our use-case should be the top concern here.

@geruh
Collaborator Author

geruh commented May 28, 2026

Thanks for the review @hamersaw! Yeah, I agree that this has a different goal from #355. For this PR, I really just wanted to scope v2 reads to the descriptor model, to follow what people might be used to when interacting with LanceDB or the raw Lance APIs, which is defaulting to a descriptor read mode.

In this mode, materialization is cheap because it's just the descriptor column value. That's intentionally different from the v1 binary blob reference path, which optimizes copying through Spark and right-side preservation by keeping getBinary semantics and batching blob fetches.

For fetching the actual bytes, we can add an explicit follow-up read mode similar to the blob handling modes defined on the LanceDB side (i.e. lazy, materialize). That follow-up would have the same performance story we have in v1, with the additional gains that stem from v2 on the format side.

Ultimately, my intent here is to keep things simple with descriptor first, follow up with read modes.

@hamersaw
Collaborator

For posterity - we addressed the conversation ^^^ offline. The consensus is that materializing the blob metadata as a Struct field in the Spark schema is a reasonable approach as long as it's well documented.

The only other comment I have is about what the PR description says on filter pushdown:

Suppressed for any table with a v2 column. The previous per-predicate gate had to be reverted because calc_eager_projection panics on filters that never reference blob columns. Zonemap fragment pruning still runs. Need more investigation here

I'm not seeing this logic anywhere in the code? It looks like the hasBlobV2Fields function was originally meant for this, but it doesn't seem to be used anywhere. I think it's OK to allow filter pushdowns as long as they're not on the blobv2 column specifically right?
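
The per-predicate gate suggested here can be sketched in Python as a partition of filters by the columns they reference. This is only an illustration of the intended split: `split_filters` and its input shape are hypothetical, and the PR description reports that an earlier per-predicate gate hit a panic in calc_eager_projection, which this sketch does not address.

```python
# Hypothetical per-predicate pushdown gate; not the connector's code.
def split_filters(filters, blob_v2_columns):
    """Partition filters into (pushed, post_scan).

    filters: list of (expression, referenced_columns) pairs, where nested
             references such as "payload.size" name their top-level column first.
    blob_v2_columns: set of top-level blob v2 column names.
    """
    pushed, post_scan = [], []
    for expression, columns in filters:
        top_level = {column.split(".", 1)[0] for column in columns}
        if top_level & blob_v2_columns:
            # Touches a blob v2 column: evaluate in Spark after the scan.
            post_scan.append(expression)
        else:
            pushed.append(expression)
    return pushed, post_scan
```

With this split, a filter like `id >= 2` would still be pushed down on a table that has a blob v2 column, while `payload.size > 0` would be evaluated by Spark.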

.add("kind", DataTypes.ShortType)
.add("position", DataTypes.LongType)
.add("size", DataTypes.LongType)
.add("blob_id", DataTypes.LongType)
Collaborator


Does it make sense to store position / size, or would it be better to store the _rowaddr? With the former, reading a blob means opening an object store reader and requesting the bytes ourselves; with _rowaddr we could use the Lance-native tools, such as take_blobs, which I think handle coalesced reads, etc.
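
To illustrate what a reader working only from position / size would have to do itself, here is a minimal Python sketch of range coalescing. It is not a description of how take_blobs works internally: `coalesce_ranges` and `max_gap` are hypothetical, and the sketch assumes all descriptors point into the same file.

```python
# Hypothetical range coalescing over (position, size) descriptors.
def coalesce_ranges(descriptors, max_gap=0):
    """Merge byte ranges so nearby blobs are fetched in one request.

    descriptors: iterable of (position, size) pairs within a single file.
    max_gap: largest number of unused bytes tolerated between two ranges
             before they are fetched separately.
    Returns a list of (start, end) tuples with end exclusive.
    """
    spans = sorted((pos, pos + size) for pos, size in descriptors if size > 0)
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing another request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]
```

Each merged range corresponds to one object store request, so the number of requests drops from one per row to one per run of nearby blobs.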

