Lance is making distributed indexing a first-class capability by unifying all index
families on a global segment model: a logical index is a set of physical segments, and
build, commit, query, append, optimize, and GC all operate on segment sets.
Distributed indexing should not be a special path; it should be the natural
consequence of a segment-native index system.
Target interface — the contract we are building toward
Every index family must converge on the same segment-native interface. Contributions that bypass it — by adding a per-type distributed path, or by repurposing an identifier or directory as coordination state — are built on the wrong model even when they pass tests. Please align new work with the contract below.
Data model. A logical index is a set of physical segments. Each segment has its own physical UUID, lives at _indices/<segment_uuid>/, is self-contained and immutable once built, and covers a disjoint set of fragments (segments in one logical index never overlap). build, commit, query, append, optimize, and GC all operate on segment sets, so distributed indexing is just "build N segments on N workers, then commit them as one logical index" — not a special code path.
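The data model above can be sketched in a few lines. This is a minimal illustration, not Lance's implementation: the `Segment` class, `new_segment`, and `validate_disjoint` are hypothetical names introduced here, and only the `_indices/<segment_uuid>/` layout and the disjoint-fragment rule come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)  # frozen mirrors "immutable once built"
class Segment:
    """Hypothetical stand-in for one physical index segment."""

    uuid: str
    fragment_ids: frozenset[int]

    @property
    def path(self) -> str:
        # Layout taken from the text: _indices/<segment_uuid>/
        return f"_indices/{self.uuid}/"


def new_segment(fragment_ids) -> Segment:
    """Create a segment with its own fresh physical UUID."""
    return Segment(uuid=str(uuid4()), fragment_ids=frozenset(fragment_ids))


def validate_disjoint(segments: list[Segment]) -> None:
    """Raise ValueError if any fragment is covered by more than one segment."""
    owner: dict[int, str] = {}
    for seg in segments:
        for frag in seg.fragment_ids:
            if frag in owner:
                raise ValueError(
                    f"fragment {frag} is covered by both {owner[frag]} and {seg.uuid}"
                )
            owner[frag] = seg.uuid
```

Under this sketch a logical index is just a list of `Segment` values that passes `validate_disjoint`; nothing in it acts as a shared directory or job handle.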
Canonical build workflow (identical for every index type):
1. create_index_uncommitted(column, index_type, fragment_ids=...): each worker builds one uncommitted segment from its assigned fragment subset; it returns segment metadata and does not touch the manifest.
2. merge_existing_index_segments(segments): build the staged segments into the final physical segment set.
3. commit_existing_index_segments(index_name, column, segments): publish the segment set as one logical index in a single transaction.
The four per-family capabilities tracked below map onto this workflow: distributed build (step 1) · distributed merge (inside step 2 / the primitive) · commit (step 3) · distributed multi-segment query.
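The shape of the three-step workflow can be illustrated with an in-memory mock. The functions below only borrow the names from the workflow; they are stand-ins written for this sketch and do not call the Lance library. The dict-based segment metadata, the `MANIFEST` dict, and `build_distributed` are all assumptions, as is the choice to have the mock merge rewrite everything into a single new segment.

```python
from concurrent.futures import ThreadPoolExecutor
from uuid import uuid4

# In-memory stand-in for the dataset manifest: index_name -> committed segments.
MANIFEST: dict[str, list[dict]] = {}


def create_index_uncommitted(column, index_type, fragment_ids):
    """Step 1 (mock): build one segment for a fragment subset; no manifest write."""
    return {
        "uuid": str(uuid4()),
        "column": column,
        "index_type": index_type,
        "fragment_ids": sorted(fragment_ids),
    }


def merge_existing_index_segments(segments):
    """Step 2 (mock): rewrite the staged segments into one new segment.

    A fresh UUID is minted, so the result is a new self-contained segment
    rather than a directory shared with the staged ones.
    """
    first = segments[0]
    merged = sorted(f for seg in segments for f in seg["fragment_ids"])
    return [{
        "uuid": str(uuid4()),
        "column": first["column"],
        "index_type": first["index_type"],
        "fragment_ids": merged,
    }]


def commit_existing_index_segments(index_name, column, segments):
    """Step 3 (mock): publish the segment set as one logical index."""
    covered = [f for seg in segments for f in seg["fragment_ids"]]
    if len(covered) != len(set(covered)):
        raise ValueError("segments overlap on at least one fragment")
    # A single assignment stands in for a single transaction.
    MANIFEST[index_name] = list(segments)


def build_distributed(index_name, column, index_type, fragment_groups):
    """Run steps 1-3; each thread plays the role of one worker."""
    with ThreadPoolExecutor(max_workers=len(fragment_groups)) as pool:
        staged = list(pool.map(
            lambda frags: create_index_uncommitted(column, index_type, frags),
            fragment_groups,
        ))
    final = merge_existing_index_segments(staged)
    commit_existing_index_segments(index_name, column, final)
    return final


final_segments = build_distributed("vec_idx", "vector", "IVF_PQ", [[0, 1], [2, 3], [4]])
```

The point of the sketch is that the driver code is the same for every index type: only the bodies behind the three calls differ per family.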
Anti-patterns — do not build on these:
- ❌ Treating index_uuid as a shared merge directory or job ID (the legacy Bitmap merge_index_metadata(index_uuid, "BITMAP") + shard_id model). A physical segment is identified by its own UUID, not by a coordination handle.
- ❌ Adding a bespoke distributed or merge path for a single index type instead of implementing the hooks behind the canonical API above.
- ❌ Merging by stitching metadata or sharing a directory across workers, instead of rewriting into a new self-contained segment.
Adding a new index family means implementing these hooks — not inventing a parallel path.
Cross-cutting foundation (index-type agnostic)
The segment model, transaction semantics, commit/validation, and the multi-segment
query plumbing are shared across all families.
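As a rough illustration of multi-segment query plumbing, the sketch below fans a top-k query out to every segment and merges the partial results. It is a toy under stated assumptions: segments are modeled as plain dicts from row ID to a single float, distance is the absolute difference, and `search_segment` and `search_logical_index` are hypothetical names, not Lance functions.

```python
import heapq


def search_segment(segment, query, k):
    """Hypothetical per-segment search: up to k (distance, row_id) pairs."""
    scored = [(abs(value - query), row_id) for row_id, value in segment.items()]
    return heapq.nsmallest(k, scored)


def search_logical_index(segments, query, k):
    """Query every segment, then keep the k globally closest rows."""
    partials = (search_segment(seg, query, k) for seg in segments)
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


# Two segments covering disjoint rows of the same logical index.
toy_segments = [{0: 1.0, 1: 5.0}, {2: 2.0, 3: 9.0}]
top_hits = search_logical_index(toy_segments, query=2.2, k=2)
```

Merging per-segment top-k lists gives the exact global top-k here because the segments are disjoint: every row lives in exactly one segment, so no candidate can be missed or counted twice.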
Vector (IVF_FLAT / IVF_SQ / IVF_PQ / IVF_RQ / IVF_HNSW_*)
The IVF family (FLAT/SQ/PQ/RQ) has all four capabilities. HNSW variants support build, commit,
and query, but merge discards the graph (the result degrades to brute-force IVF), so HNSW relies
on the non-shared-centroid build+commit+query path instead of merge.
FTS / Inverted
Segmented build, merge, commit, and multi-segment search with a shared global BM25 scorer
have all landed (#6305). Remaining work is the efficiency layer for multi-segment search.
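To see why the scorer has to be shared, consider the sketch below, which aggregates per-segment statistics into global ones before scoring. It is an assumption-laden toy, not the code from #6305: it uses the common Lucene-style BM25 formula, models each segment as a dict from document ID to a token list, and all function names are hypothetical.

```python
import math
from collections import Counter


def segment_stats(docs):
    """Per-segment statistics: doc count, token count, document frequencies."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    n_tokens = sum(len(tokens) for tokens in docs.values())
    return {"n_docs": len(docs), "n_tokens": n_tokens, "df": df}


def global_stats(all_stats):
    """Sum the per-segment statistics into corpus-wide ones (non-empty corpus)."""
    df = Counter()
    for stats in all_stats:
        df.update(stats["df"])
    n_docs = sum(stats["n_docs"] for stats in all_stats)
    n_tokens = sum(stats["n_tokens"] for stats in all_stats)
    return {"n_docs": n_docs, "avgdl": n_tokens / n_docs, "df": df}


def bm25(query_terms, tokens, stats, k1=1.2, b=0.75):
    """BM25 score of one document against global statistics."""
    tf = Counter(tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = stats["df"][term]
        idf = math.log(1 + (stats["n_docs"] - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(tokens) / stats["avgdl"])
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def search(segments, query_terms, k):
    """Score every segment's documents with the same global statistics."""
    stats = global_stats([segment_stats(docs) for docs in segments])
    hits = [
        (bm25(query_terms, tokens, stats), doc_id)
        for docs in segments
        for doc_id, tokens in docs.items()
    ]
    return sorted((hit for hit in hits if hit[0] > 0), reverse=True)[:k]


seg_a = {1: ["lance", "index"], 2: ["vector", "index", "index"]}
seg_b = {3: ["lance", "segment"], 4: ["query"]}
split_hits = search([seg_a, seg_b], ["lance", "index"], k=3)
single_hits = search([{**seg_a, **seg_b}], ["lance", "index"], k=3)
```

With global statistics, splitting the corpus into segments does not change the ranking or the scores; `split_hits` and `single_hits` are identical. Scoring each segment with only its local document frequencies and average length would make scores incomparable across segments.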
Scalar (BTree / Bitmap / LabelList / NGram / ZoneMap / BloomFilter / RTree)
Commit and multi-segment query are index-type-agnostic and already work for all scalar
families. The gap is distributed build (only BTree/Bitmap today) and merge (none yet —
merge_existing_index_segments accepts only vector/inverted).
Tracked items by family:
- Cross-cutting: fragment_bitmap disjoint coverage at commit time
- Vector: non-shared-centroid the only HNSW distributed path
- Scalar: range-partitioned build (feat(btree): support distributed range-partitioned BTree index build #6859)
- Scalar: distributed build + merge (today they reject fragment-scoped training)
Cross-family decisions
References