Skip to content

fix: support multivector IVF centroids in segment builds#6995

Merged
yanghua merged 6 commits into
lance-format:mainfrom
ddupg:codex/ddu-67-multivector
May 31, 2026
Merged

fix: support multivector IVF centroids in segment builds#6995
yanghua merged 6 commits into
lance-format:mainfrom
ddupg:codex/ddu-67-multivector

Conversation

@ddupg
Copy link
Copy Markdown
Contributor

@ddupg ddupg commented May 29, 2026

Distributed IVF index builds failed for multivector columns when using precomputed centroids:

Failed to cast centroids to column type: Cast error: Casting from Float32 to FixedSizeList(4 x Float32) not supported

This fixes the centroid type handling for list<fixed_size_list<...>> columns and allows IndicesBuilder to infer the inner vector type for distributed builds.

@github-actions github-actions Bot added bug Something isn't working python labels May 29, 2026
@ddupg ddupg marked this pull request as ready for review May 29, 2026 09:30
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@yanghua
Copy link
Copy Markdown
Collaborator

yanghua commented May 30, 2026

@claude review

@yanghua yanghua self-requested a review May 30, 2026 02:14
Comment on lines 599 to +611

def _vector_dimension(self, data_type):
if pa.types.is_fixed_size_list(data_type):
return data_type.list_size
if pa.types.is_list(data_type) and pa.types.is_fixed_size_list(
data_type.value_type
):
return data_type.value_type.list_size
if isinstance(data_type, pa.FixedShapeTensorType) and len(data_type.shape) == 1:
return data_type.shape[0]
raise TypeError(
"Vector column must be FixedSizeListArray "
f"1-dimensional FixedShapeTensorArray, got {data_type}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Three call sites in IndicesBuilder still use self.dataset.schema.field(self.column[0]).type.list_size directly instead of the new _vector_dimension helper, so calling train_pq (line 204), transform_vectors (line 364), or load_shuffled_vectors (line 457) on a multivector column raises AttributeError: ListType object has no attribute list_size. These are exactly the distributed-build steps the PR description says it enables for multivectors — the new test only exercises create_index_uncommitted (Rust path), so CI doesn't catch the regression. Fix: replace each with self.dimension (already set correctly via _vector_dimension in __init__) and delete the dead expression statement at line 204.

Extended reasoning...

What's wrong

This PR adds a _vector_dimension helper (builder.py:599-611) to support multivector columns whose type is pa.list_(pa.list_(pa.float32(), N)). It's correctly invoked from __init__ to set self.dimension. However, three other places in IndicesBuilder still reach into the raw column type with the pre-PR pattern self.dataset.schema.field(self.column[0]).type.list_size:

  • builder.py:204 in train_pq — a bare expression whose value is discarded (dead code), but the attribute access still executes.
  • builder.py:364 in transform_vectors — assigned to dimension and passed to indices.transform_vectors.
  • builder.py:457 in load_shuffled_vectors — assigned to pq_dimension and passed to indices.load_shuffled_vectors.

For a multivector column the outer arrow type is a variable-size ListType (not FixedSizeListType). ListType has no list_size attribute — only the inner FixedSizeListType does — so this raises AttributeError.

Why CI doesn't catch it

The new test_ivf_centroids_multivector_fragment_ids only constructs an IndicesBuilder (to check builder.dimension == 4) and then calls ds.create_index_uncommitted(...), which goes through the Rust path prepare_vector_index_params rather than the Python distributed-build helpers. None of train_pq, transform_vectors, or load_shuffled_vectors are exercised on a multivector column anywhere in the test suite.

Impact

The PR description explicitly states that the change "allows IndicesBuilder to infer the inner vector type for distributed builds." But transform_vectors → shuffle_transformed_vectors → load_shuffled_vectors is the distributed build pipeline, and train_pq is the PQ training step for it. With these three sites still using the broken pattern, the PR's stated goal is only half-achieved — the Rust-side centroid cast is fixed, but a Python user invoking the distributed-build helpers on a multivector dataset will crash before reaching the Rust code.

Step-by-step proof

import pyarrow as pa

# Multivector column type (matches the new test):
mv_type = pa.list_(pa.list_(pa.float32(), 4))
print(type(mv_type).__name__)   # 'ListType'
mv_type.list_size               # AttributeError: 'pyarrow.lib.ListType' object has no attribute 'list_size'

So when a user runs the documented distributed-build flow on a multivector dataset:

builder = IndicesBuilder(ds, "embeddings")   # ok — uses _vector_dimension
ivf = builder.train_ivf(num_partitions=2)
pq = builder.train_pq(ivf)                     # AttributeError at builder.py:204

…the call fails at the very first distributed-build step. Even if line 204 is deleted, transform_vectors and load_shuffled_vectors would hit the same error at lines 364 and 457.

Fix

Each of the three sites should use self.dimension, which __init__ already populates correctly via the new _vector_dimension helper. The line at builder.py:204 is a vestigial expression statement (its result is unused) and should simply be deleted.

🔬 also observed by verify-runtime

Copy link
Copy Markdown
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two comments.

Comment thread python/python/lance/indices/builder.py Outdated
f"1-dimensional FixedShapeTensorArray, got {data_type}"
)

def _vector_value_type(self, data_type):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two method bodies have the same judgment logic. Maybe we can extract them into a common place, e.g. _describe_vector_type(data_type) -> Optional[(dim, scalar_type)].

Comment thread python/python/lance/indices/builder.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there still be any problems here?

Copy link
Copy Markdown
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment left, otherwise LGTM.

Comment thread python/python/lance/indices/builder.py Outdated
Comment on lines +605 to +606
"Vector column must be FixedSizeListArray "
f"1-dimensional FixedShapeTensorArray, got {data_type}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this? Vector column must be FixedSizeListArray, list<FixedSizeList> (multivector), or 1-dimensional FixedShapeTensorArray, got {data_type}?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing, fixed this.

Comment thread python/src/dataset.rs
}

let centroid_type = match column_type {
DataType::List(field)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can also support LargeList<FixedSizeList>. But not must have it in this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I checked the Rust and Python vector index paths, and LargeList is not supported end-to-end today. Most multivector code only handles List.

This PR keeps the scope to fixing the existing List distributed/segment index build issue. LargeList support should be handled separately.

Copy link
Copy Markdown
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yanghua yanghua merged commit db21c37 into lance-format:main May 31, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants