feat: blob v2 descriptor read support#548
Conversation
eae03ac to
1f819d6
Compare
|
Great PR! I think this opens a few questions. In #355 we merged a similar concept for v1 blobs but take a quite different approach, maybe it's worth stating goals here to understand advantages / disadvantages. That work chose to encode metadata into a binary format and then transparently retrieve when writing which gives us: Alternatively, this approach returns a struct with metadata, which makes it super easy to parse information about the blob but IMO may make actually retrieving the data a little less ergonomic and more expensive. I'm interested on your thoughts here? SDK-wise I think it's a tradeoff of the root uses, with either approach to retreive both metadata and actual binary data we would probably need to add Spark fucntoins. For example: But I think the performance of our use-case should be the top concern here. |
|
Thanks for the review @hamersaw! Yeah, I agree that this is kind of a different goal from 355. For this PR, I really just wanted to scope v2 reads to the descriptor model to follow what people might be used to when interacting with LanceDB or Lance raw APIs. And that is defaulting to a descriptor read mode. In this mode, materialization is cheap because it's just the descriptor column value, that's intentionally different from the v1 binary blob reference path, which optimizes copy through Spark and right-side preservation by keeping For fetching the actual bytes, we can add an explicit follow-up read mode similar to the blob handling modes defined on the LanceDB side(i.e. Ultimately, my intent here is to keep things simple with descriptor first, follow up with read modes. |
|
For posterity - we addressed the conversation ^^^ offline. The consensus is that materializing the blob metadata as a The only other comment I have is that the PR description suggests that filter pushdowns: I'm not seeing this logic anywhere in the code? It looks like the |
| .add("kind", DataTypes.ShortType) | ||
| .add("position", DataTypes.LongType) | ||
| .add("size", DataTypes.LongType) | ||
| .add("blob_id", DataTypes.LongType) |
There was a problem hiding this comment.
Does it make sense to store position / size or would it be better to store the _rowaddr? When we read this the former means we need to open an objectstore reader and request the bytes, if we use _rowaddr then we can use the lance-native tools to call like take_blobs I think which handles coalesced reads / etc.
related to #539 and #505
Adds read support for blob v2 columns. Instead of hiding blob metadata behind virtual columns, we surface the raw descriptor struct directly to Spark like what the original issue was stating.
Querying blob metadata is just a column projection now, no byte fetch:
A column is blob v2 when any of these hold:
lance.blob.v2is set in lancelance-encoding:blob-v2 = trueSchema rewrite lives in
BlobUtils.applyBlobV2DescriptorSchema(...), called fromLanceDataset.schema(),LanceDataSource.inferSchema(), and theLanceScanBuilderconstructor.Filter pushdown
Suppressed for any table with a v2 column. The previous per-predicate gate had to be reverted because
calc_eager_projectionpanics on filters that never reference blob columns. Zonemap fragment pruning still runs. Need more investigation hereTesting
No end-to-end pylance interop tests here. The connector can't write v2 blobs yet, so producing test data requires pylance as an external dep. E2E coverage lands naturally once the write path exists, extending
BaseBlobCreateTableTest's pattern.Local verification
Wrote a v2 dataset with pylance hitting all four
BlobKindtiers (Inline, Packed, Dedicated, External) plus a null row, then read it back through Spark on this branch:write_blob_v2.py
Spark output