Tracking: Distributed Indexes Search #6309

@Xuanwo

Lance is making distributed indexing a first-class capability by unifying all index
families on a global segment model: a logical index is a set of physical segments, and
build, commit, query, append, optimize, and GC all operate on segment sets.

Distributed indexing should not be a special path; it should be the natural
consequence of a segment-native index system.

Target interface — the contract we are building toward

Every index family must converge on the same segment-native interface. Contributions that bypass it — by adding a per-type distributed path, or by repurposing an identifier or directory as coordination state — are built on the wrong model even when they pass tests. Please align new work with the contract below.

Data model. A logical index is a set of physical segments. Each segment has its own physical UUID, lives at _indices/<segment_uuid>/, is self-contained and immutable once built, and covers a disjoint set of fragments (segments in one logical index never overlap). build, commit, query, append, optimize, and GC all operate on segment sets, so distributed indexing is just "build N segments on N workers, then commit them as one logical index" — not a special code path.
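The data model above can be sketched in a few lines. This is an illustration only: `IndexSegment` and `check_disjoint` are hypothetical names, not Lance types, and the real commit-time check works on fragment bitmaps rather than Python sets.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass(frozen=True)
class IndexSegment:
    """Hypothetical stand-in for one immutable physical index segment."""

    fragment_ids: frozenset
    uuid: str = field(default_factory=lambda: str(uuid4()))

    @property
    def path(self) -> str:
        # Each segment lives in its own directory, keyed by its own UUID.
        return f"_indices/{self.uuid}/"


def check_disjoint(segments):
    """Return {fragment_id: segment_uuid}; raise if two segments overlap."""
    owner = {}
    for seg in segments:
        for frag in seg.fragment_ids:
            if frag in owner:
                raise ValueError(
                    f"fragment {frag} is covered by {owner[frag]} and {seg.uuid}"
                )
            owner[frag] = seg.uuid
    return owner


# A logical index is just a set of segments with disjoint fragment coverage.
a = IndexSegment(frozenset({0, 1}))
b = IndexSegment(frozenset({2, 3}))
coverage = check_disjoint([a, b])
```

Marking the dataclass `frozen=True` mirrors the "immutable once built" rule: a segment is replaced by a new segment with a new UUID, never edited in place.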

Canonical build workflow (identical for every index type):

  1. create_index_uncommitted(column, index_type, fragment_ids=...) — each worker builds one uncommitted segment from its assigned fragment subset; returns segment metadata, does not touch the manifest.
  2. merge_existing_index_segments(segments) — build the staged segments into the final physical segment set.
  3. commit_existing_index_segments(index_name, column, segments) — publish the segment set as one logical index in a single transaction.

The four per-family capabilities tracked below map onto this workflow: distributed build (step 1) · distributed merge (inside step 2 / the primitive) · commit (step 3) · distributed multi-segment query.
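As a rough illustration of the three-step workflow, here is a toy in-memory simulation. It is not the Lance API: `ToyDataset` and `Segment` are hypothetical, the method signatures and return types are assumptions based only on the names listed above, and the merge policy (rewrite everything into one new segment) is just one possible mode.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class Segment:
    uuid: str
    fragment_ids: frozenset


class ToyDataset:
    """In-memory stand-in used only to show the control flow."""

    def __init__(self):
        self.manifest = {}  # index_name -> (column, tuple of segments)

    def create_index_uncommitted(self, column, index_type, fragment_ids):
        # Step 1: build one segment for a fragment subset; manifest untouched.
        return Segment(str(uuid4()), frozenset(fragment_ids))

    def merge_existing_index_segments(self, segments):
        # Step 2 (toy policy): rewrite the staged segments into a single new
        # self-contained segment with its own UUID.
        merged = frozenset().union(*(s.fragment_ids for s in segments))
        return [Segment(str(uuid4()), merged)]

    def commit_existing_index_segments(self, index_name, column, segments):
        # Step 3: validate disjoint coverage, then publish the whole set at once.
        covered = set()
        for seg in segments:
            if covered & seg.fragment_ids:
                raise ValueError("segments overlap")
            covered |= seg.fragment_ids
        self.manifest[index_name] = (column, tuple(segments))


ds = ToyDataset()
assignments = [[0, 1], [2, 3], [4, 5]]  # one fragment subset per worker

# Threads stand in for workers; each one builds its own uncommitted segment.
with ThreadPoolExecutor(max_workers=3) as pool:
    staged = list(
        pool.map(
            lambda frags: ds.create_index_uncommitted("vec", "IVF_PQ", frags),
            assignments,
        )
    )

manifest_before_commit = dict(ds.manifest)  # still empty at this point
final = ds.merge_existing_index_segments(staged)
ds.commit_existing_index_segments("vec_idx", "vec", final)
```

Nothing in the driver code depends on the index type, which is the point of the contract: only the hooks behind the three calls vary per family.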

Anti-patterns — do not build on these:

  • ❌ Treating index_uuid as a shared merge directory or job id (the legacy Bitmap merge_index_metadata(index_uuid, "BITMAP") + shard_id model). A physical segment is identified by its own UUID, not by a coordination handle.
  • ❌ Adding a bespoke distributed or merge path for a single index type instead of implementing the hooks behind the canonical API above.
  • ❌ Merging by stitching metadata or sharing a directory across workers, instead of rewriting into a new self-contained segment.

Adding a new index family means implementing these hooks — not inventing a parallel path.

Cross-cutting foundation (index-type agnostic)

The segment model, transaction semantics, commit/validation, and the multi-segment
query plumbing are shared across all families.

  • Clarify logical indices vs physical segments — #6270
  • Index segment commit API — #6209; preserve create-index transaction semantics — #6204
  • Align distributed index build around segments — #6313
  • Preserve index details in index metadata — #6279
  • Enforce fragment_bitmap disjoint coverage at commit time
  • Make append and optimize segment-set-native — #6398
  • Index reader for segmented index — #6399
  • Crash / partial-failure recovery for in-flight distributed builds
  • Ownership + cleanup sweep for stale uncommitted index artifacts (GC)

Vector (IVF_FLAT / IVF_SQ / IVF_PQ / IVF_RQ / IVF_HNSW_*)

The IVF family (FLAT/SQ/PQ/RQ) has all four capabilities. HNSW variants support build,
commit, and query, but merge discards the graph and degrades to brute-force IVF, so HNSW
relies on the non-shared-centroid build+commit+query path instead of merge.

  • Distributed vector segment build — #6220
  • Remove staging from distributed vector indexing — #6269
  • Non-shared centroid (per-segment) builds — #6296
  • Distributed IVF_RQ segment builds — #6359
  • Vector query pruning by index segments — #6376
  • HNSW: preserve or rebuild the graph on merge, or formally make
    non-shared-centroid the only HNSW distributed path

FTS / Inverted

Segmented build, merge, commit, and multi-segment search with a shared global BM25 scorer
have all landed (#6305). The remaining work is the efficiency layer for multi-segment search.
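A small sketch of why the scorer has to be shared: if each segment computed IDF from its own statistics, the same term would get different weights in different segments and the scores could not be merged into one ranking. The code below assumes a common BM25 IDF form, ln(1 + (N - df + 0.5) / (df + 0.5)); the exact formula and the `SegmentStats` type are assumptions, not taken from the Lance implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Hypothetical per-segment corpus statistics."""

    num_docs: int
    doc_freq: Counter  # term -> number of docs in this segment containing it


def idf(term, segments):
    # Aggregate document counts over the given segments, then apply the
    # assumed BM25 IDF form. A missing term counts as df = 0.
    n_docs = sum(s.num_docs for s in segments)
    df = sum(s.doc_freq[term] for s in segments)
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))


seg_a = SegmentStats(num_docs=1000, doc_freq=Counter({"lance": 500}))
seg_b = SegmentStats(num_docs=1000, doc_freq=Counter({"lance": 5}))

local_a = idf("lance", [seg_a])        # term looks common in segment A
local_b = idf("lance", [seg_b])        # term looks rare in segment B
shared = idf("lance", [seg_a, seg_b])  # one weight for the logical index
```

With local statistics the two segments disagree on how rare "lance" is; the shared value sits between them and applies to every segment, so per-segment top-k results can be merged directly.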

Scalar (BTree / Bitmap / LabelList / NGram / ZoneMap / BloomFilter / RTree)

Commit and multi-segment query are index-type-agnostic and already work for all scalar
families. The gap is distributed build (only BTree/Bitmap today) and merge (none yet —
merge_existing_index_segments accepts only vector/inverted).

Cross-family decisions

  • Define merge modes per family
  • Decide which families need multi-segment query and distributed build support
