feat: add distributed zonemap index build with configurable segments by beinan · Pull Request #516 · lance-format/lance-spark

beinan · 2026-05-11T18:23:47Z

Summary

- Add zonemap as a new index type in CREATE INDEX DDL with distributed build support
- Batch fragments into configurable segments via the num_segments option (defaults to spark.default.parallelism; treated as an upper bound and clamped to the fragment count)
- Build each segment in parallel on Spark executors and commit the segments as a logical index on the driver
- Zonemap indexes currently support a single column only

What Changed

- AddIndexExec.scala: zonemap-specific path with ZonemapIndexJob/ZonemapIndexTask and commitIndexSegments
- create-index.md: document the zonemap index type, options, and usage
- Tests: unit tests for segment creation/validation and an integration test

Notes

- Rebased cleanly onto current main
- Depends on lance-core 7.0.0-beta.10 or newer, which includes zonemap segment support
- Supersedes #473 (feat: add distributed zonemap index build) and the closed #466 (feat: support zonemap indexes in ALTER TABLE CREATE INDEX)

Test plan

- CI passes (lint, unit tests, integration tests across all Spark/Scala versions)
- Zonemap index creation with default segment count
- Zonemap index creation with explicit num_segments
- Repeated zonemap index creation replaces existing segments
- Query correctness after zonemap index creation

🤖 Generated with Claude Code

hamersaw

This looks pretty good. Thanks for the PR! A few things that we should tighten up IMO.

hamersaw · 2026-05-12T16:00:16Z

```diff
-        Option[Map[String, String]]) = catalog match {
+      column: String,
+      segments: Seq[Index]): Unit = {
+    val dataset = Utils.openDatasetBuilder(readOptions).build()
```


Is there a specific reason we are opening two datasets within this function?

The two opens are needed because commitExistingIndexSegments creates a new dataset version, so we need a fresh dataset handle to read the updated index state for the cleanup transaction. I've added a comment in the code explaining this. The first dataset is used for the segment commit, and the second reads the post-commit state to remove old segments.

I am pretty concerned that this can leave us in a poor state if we commit the new index segments and then fail before removing the old ones. I think results should still be correct, but there will be quite a bit of overhead until the indexing process is rerun.

It doesn't look like there is a solution for this in the lance core SDK when committing segmented indices, maybe we need to devise a solution for transactionally replacing segmented indices?

Great catch. After investigating Lance core's commit_existing_index_segments (index.rs:1065-1164), it turns out the core API already handles atomic replacement — it finds existing segments whose fragments overlap with incoming ones and removes them in the same CreateIndex transaction. The Spark-side manual cleanup (second dataset open + removal transaction) was redundant and is what introduced the race.

Fixed in the latest push by simplifying commitIndexSegments to just call dataset.commitExistingIndexSegments() and let Lance core handle the atomic add+remove. The method went from ~50 lines to ~10.
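To make the replacement semantics concrete, here is a toy in-memory model of the behavior described above for commit_existing_index_segments. It is a sketch under stated assumptions, not Lance's code: `Segment` and `replace` are hypothetical names, and the overlap rule (drop existing segments whose fragments intersect the incoming ones) is taken from the description in this thread. The point it illustrates is that removal and addition happen in one state transition, so there is no intermediate state containing both the old and the new segments.

```scala
// Toy model only: not Lance's Index type or its commit implementation.
object SegmentReplacementModel {
  // Hypothetical segment: an id plus the fragment ids it covers.
  final case class Segment(uuid: String, fragments: Set[Int])

  // One atomic "add + remove" step: drop every existing segment that shares
  // a fragment with any incoming segment, then append the incoming segments.
  def replace(existing: Seq[Segment], incoming: Seq[Segment]): Seq[Segment] = {
    val covered: Set[Int] = incoming.flatMap(_.fragments).toSet
    existing.filterNot(_.fragments.exists(covered.contains)) ++ incoming
  }

  def main(args: Array[String]): Unit = {
    val old = Seq(Segment("a", Set(0, 1)), Segment("b", Set(2, 3)), Segment("c", Set(4)))
    val fresh = Seq(Segment("x", Set(0, 1, 2)), Segment("y", Set(3)))
    // "a" and "b" overlap the new segments and are removed; "c" survives.
    println(replace(old, fresh).map(_.uuid)) // List(c, x, y)
  }
}
```

Because the model returns the next state in a single call, a failure can only happen before or after the whole step, which is the property the simplified commitIndexSegments relies on.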

beinan · 2026-05-12T21:58:30Z

Thanks for the thorough review! All feedback has been addressed in the latest push (force-pushed as a single clean commit on latest main):

Scan-side changes removed entirely:

Removed useScalarIndex, forcePostScanFiltering, and shouldForcePostScanFiltering — zonemap fragment pruning works via the existing ZonemapFragmentPruner path without needing special scan flags. No scan-side files are modified in this PR anymore.

Index creation fixes:

- Race in commitIndexSegments: Now captures pre-commit UUIDs and only removes indexes with those UUIDs, so concurrent writers' segments are never deleted.
- batchFragments accuracy: Switched from ceil(N/K) to index-based slicing (slice(i*N/K, (i+1)*N/K)) to guarantee the requested segment count.
- num_segments validation: Bounds check for <= 0 and type validation, passed through constructor (no duplicate extraction).
- Segment failure handling: try/catch with clear error message about Lance GC cleanup.
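A quick way to see why ceil(N/K) batching can undershoot the requested segment count while index-based slicing does not: the sketch below runs both strategies on plain `Int` fragment ids. `ceilBatches` and `sliceBatches` are hypothetical names for illustration, not the PR's actual `batchFragments`.

```scala
// Compares the two batching strategies on plain fragment ids.
object BatchingComparison {
  // Fixed batch size of ceil(N / K): can yield fewer than K batches.
  def ceilBatches(ids: Seq[Int], k: Int): List[Seq[Int]] = {
    val size = (ids.size + k - 1) / k
    ids.grouped(size).toList
  }

  // Index-based slicing: yields exactly K non-empty batches when 0 < K <= N.
  def sliceBatches(ids: Seq[Int], k: Int): List[Seq[Int]] = {
    val n = ids.size
    (0 until k).map(i => ids.slice(i * n / k, (i + 1) * n / k)).toList
  }

  def main(args: Array[String]): Unit = {
    val ids = 0 until 10
    println(ceilBatches(ids, 6).map(_.size))  // List(2, 2, 2, 2, 2): only 5 batches
    println(sliceBatches(ids, 6).map(_.size)) // List(1, 2, 2, 1, 2, 2): 6 batches
  }
}
```

With 10 fragments and K = 6, the ceiling gives a batch size of 2 and therefore only 5 batches; slicing spreads the remainder and produces the 6 requested.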

beinan · 2026-05-17T02:04:51Z

@hamersaw All review feedback has been addressed — the key change since your last review is simplifying commitIndexSegments to rely on Lance core's built-in atomic replacement (single transaction for add+remove). Would you mind taking another look when you get a chance? Thanks!

hamersaw · 2026-05-18T10:50:06Z

@LuciferYang do you mind making a pass here? Specifically, I'm interested in how this compares to your proposal (#513) to support building distributed ZoneMap indexes.

LuciferYang · 2026-05-19T05:07:34Z

> @LuciferYang do you mind making a pass here? Specifically, I'm interested in how this compares to your proposal (#513) to support building distributed ZoneMap indexes.

I'll give feedback later today.

LuciferYang · 2026-05-19T12:40:08Z

@hamersaw @beinan Had a closer look. After the May 12 force-push on #516, the two PRs are adjacent rather than overlapping.

| | #516 | #513 distributed | #513 consolidated (opt-in) |
|---|---|---|---|
| Tasks | 1 per `num_segments` batch | 1 per fragment | 1 per fragment (compute), driver write |
| Segments on disk | `num_segments` (default = `min(fragments, defaultParallelism)`) | fragment count | 1 |
| Commit API | `commitExistingIndexSegments` | manual `AddIndexOperation` + `Transaction` + `CommitBuilder` | same as distributed |
| Upstream blocker | none (project's already on lance-core `7.0.0-beta.10`) | none | lance-format/lance#6779 + #6780, both unmerged |

The main difference between the two distributed paths is that num_segments on #516 is one knob doing two jobs: parallelism and segment count are the same lever. At num_segments=1 you get one segment, but the work serialises onto a single executor — not the same as #513's consolidated path, which keeps compute parallel and only centralises the write. That decoupled corner only opens up once #6779/#6780 land.

For reference, sf=100 store_sales (234 fragments, ~288M rows) under #513: distributed = 15.0 s / 234 segments / 1.1 MB; consolidated = 28.1 s / 1 segment / 138 KB. The ~8× footprint drop is per-file header amortisation.

What I'd want to inherit from #516:

- It uses commitExistingIndexSegments. That's the right API; #513 reimplements the add+remove transaction by hand and should be rebased onto it.
- Dropping the scan-side useScalarIndex / forcePostScanFiltering machinery in the May 12 force-push was the right call: pruning works through ZonemapFragmentPruner without it.

Suggested path:

1. Land #516 as-is.
2. After it lands, I'll rebase #513's distributed path onto commitExistingIndexSegments. The fragment-id exception wrapping in ZonemapFragmentTask is worth porting onto ZonemapIndexTask while we're there. After that, #513 only carries the consolidated path.
3. Consolidated lands as an opt-in (spark.lance.zonemap.consolidate.enabled) once #6779/#6780 are released. It stays off by default: the driver allocator holds every per-fragment Arrow batch until writeZonemapIndexFromBatches consumes them, so it can regress at very high fragment counts.

One nit on #516, will leave inline: targetTasks = math.min(fragmentIds.size, n) silently clamps num_segments=1000 to fragment count. The doc string reads like num_segments is a target, not an upper bound. Either log when clamping or reword the doc.

beinan · 2026-05-19T22:22:45Z

@LuciferYang Thanks for the thorough comparison — the side-by-side table is really helpful.

You're right about num_segments doing double duty. The latest push addresses the nit:

- num_segments doc now clarifies it's an upper bound clamped to fragment count
- Added a log message when clamping occurs so it's not silent
- Switched batchFragments to index-based slicing to guarantee the requested segment count

Agreed on the suggested path — happy to land #516 as the distributed foundation, then have #513 rebase its distributed path onto commitExistingIndexSegments and carry the consolidated path as an opt-in once #6779/#6780 land. The fragment-id exception wrapping from ZonemapFragmentTask is a good addition to port over as well.

LuciferYang

Does the PR description also need to be updated?

LuciferYang · 2026-05-21T14:14:51Z

```diff
+    val validatedNumSegments: Option[Int] = numSegmentsOpt.map { arg =>
+      val value =
+        try {
+          arg.value.asInstanceOf[Number].intValue()
```


Scala's null.asInstanceOf[Number] returns null instead of throwing ClassCastException, so a WITH (num_segments = null) argument bypasses the friendly error path and dies with an opaque NullPointerException at .intValue().

Conversely, if the parser delivers a java.lang.Long (e.g., for an out-of-range literal), .intValue() silently truncates instead of rejecting it. Some negative Longs below Int.MinValue truncate to positive Ints (for example, -4294967295L becomes 1) and slip past the value <= 0 check on line 111. Validate the Long bounds before narrowing.

It can be revised as follows:

```scala
val value = arg.value match {
  case null =>
    throw new IllegalArgumentException("num_segments must be a positive integer, got: null")
  case n: Number =>
    val asLong = n.longValue()
    if (asLong < 1L || asLong > Int.MaxValue)
      throw new IllegalArgumentException(
        s"num_segments must be a positive integer that fits in Int, got: $asLong")
    asLong.toInt
  case other =>
    throw new IllegalArgumentException(
      s"num_segments must be a positive integer, got: $other")
}
```

With this in place the redundant if (value <= 0) throw … block on line 111 can be removed.

Fixed — switched to pattern match handling null, Long bounds, and non-Number types explicitly. Removed the redundant <= 0 check.
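The two failure modes described in this review can be reproduced in isolation. `naiveExtract` is a hypothetical stand-in for the original one-line extraction (`arg.value.asInstanceOf[Number].intValue()`), not code taken from the PR.

```scala
// Demonstrates the two pitfalls of cast-then-narrow extraction.
object NarrowingPitfalls {
  // Mirrors the original extraction: cast to Number, then narrow to Int.
  def naiveExtract(raw: Any): Int = raw.asInstanceOf[Number].intValue()

  def main(args: Array[String]): Unit = {
    // 1. A Long outside the Int range keeps only its low 32 bits, and the
    //    result can be positive, so a "<= 0" check does not catch it.
    println(naiveExtract(java.lang.Long.valueOf(-4294967295L))) // 1
    println(naiveExtract(java.lang.Long.valueOf(5000000000L)))  // 705032704

    // 2. Casting null to a reference type succeeds, so the failure surfaces
    //    as a NullPointerException at .intValue(), not a ClassCastException.
    val failure =
      try { naiveExtract(null); "no error" }
      catch { case _: NullPointerException => "NullPointerException" }
    println(failure) // NullPointerException
  }
}
```

Checking `longValue()` against `1L` and `Int.MaxValue` before calling `.toInt`, as the suggested pattern match does, closes both gaps.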

LuciferYang · 2026-05-21T14:16:11Z

```diff
-        Option[Map[String, String]]) = catalog match {
+      column: String,
+      segments: Seq[Index]): Unit = {
+    val dataset = Utils.openDatasetBuilder(readOptions).build()
```


The driver opens once on line 72 to enumerate fragmentIds, closes it, then commitIndexSegments opens a fresh one on line 214 to call commitExistingIndexSegments. Both go through the same Utils.openDatasetBuilder(readOptions). Consider consolidating to a single driver-side open scoped to the entire zonemap branch — derive fragmentIds from it and pass that same handle into commitIndexSegments.

Caveat worth verifying first: if commitExistingIndexSegments requires a handle at the latest dataset version (not the version captured at line 72), reusing the older handle could fail the commit on a version mismatch. If the Lance core contract requires a fresh handle, leave commitIndexSegments as-is and only optimize the line 72 enumeration (e.g., enumerate via lanceDataset if it already exposes fragment IDs).

Good point. Left as-is for now since commitExistingIndexSegments may require a handle at the latest version. Worth consolidating in a follow-up if we confirm the version contract.

LuciferYang · 2026-05-21T14:18:18Z

```diff
+          math.min(n, addIndexExec.session.sparkContext.defaultParallelism))
+    }
+    (0 until k).map { i =>
+      fragmentIds.slice(i * n / k, (i + 1) * n / k)
```


i * n (both Int) overflows once the product exceeds Int.MaxValue ≈ 2.1×10^9. Triggering requires a deliberately large num_segments and a fragment count where i*n crosses the boundary — e.g., 200k fragments with num_segments near n puts i*n near 4×10^10. Not a hot-path concern, but cheap to make overflow-safe. Promote one operand to Long:

```scala
fragmentIds.slice((i.toLong * n / k).toInt, ((i.toLong + 1) * n / k).toInt)
```

Equivalent and overflow-safe.

Fixed — promoted to i.toLong * n / k to avoid overflow.
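Putting the overflow fix together with the clamp-and-log behavior discussed earlier in the thread, a self-contained version of the batching step might look like the sketch below. It is hypothetical: the signature, the use of `IndexedSeq[Int]` instead of `List[Integer]`, and the `log` callback (standing in for whatever logger AddIndexExec actually uses) are assumptions, not the PR's code.

```scala
// Hypothetical sketch; names and the `log` callback are illustrative only.
object BatchFragmentsSketch {
  // Splits fragment ids into contiguous batches. `requested` is an upper
  // bound: it is clamped to the fragment count so that no batch is empty.
  def batchFragments(
      fragmentIds: IndexedSeq[Int],
      requested: Int,
      log: String => Unit): IndexedSeq[IndexedSeq[Int]] = {
    require(requested > 0, s"num_segments must be positive, got: $requested")
    val n = fragmentIds.size
    if (n == 0) IndexedSeq.empty
    else {
      val k = math.min(n, requested)
      if (k < requested) log(s"num_segments=$requested clamped to fragment count $n")
      (0 until k).map { i =>
        // Long arithmetic: i * n can exceed Int.MaxValue for large inputs.
        val start = (i.toLong * n / k).toInt
        val end = ((i.toLong + 1) * n / k).toInt
        fragmentIds.slice(start, end)
      }
    }
  }
}
```

With 200,000 fragments and `requested = 200000`, the last batch index times `n` is about 4×10^10, which would wrap in `Int` arithmetic; the `Long` version still yields 200,000 single-fragment batches.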

LuciferYang · 2026-05-21T14:23:13Z

```diff
-  private def createIndexJob(
-      dataset: Dataset,
-      lanceDataset: LanceDataset,
+  // Lance core's commitExistingIndexSegments handles atomic replacement:
```


nit: Drop a one-line comment at the call site (line 132) — e.g., // atomic add+remove via Lance core; see commitIndexSegments — so the replacement semantics are visible without jumping definitions.

LuciferYang · 2026-05-21T14:24:24Z

```diff
+    }
+
+    // Zonemap uses logical segment commit path
+    if (useLogicalSegmentCommit) {
```


nit: This is only a cosmetic micro-allocation, but it incorrectly signals that every branch needs the UUID. Move val uuid = UUID.randomUUID() below the if (useLogicalSegmentCommit) { … return … } block so it's only generated for the merge-metadata branch.

Done — moved val uuid below the zonemap early return.

LuciferYang · 2026-05-21T14:26:16Z

```diff
+
+  private def batchFragments(
+      fragmentIds: List[Integer],
+      numSegments: Option[Int] = None): Seq[List[Integer]] = {
```


batchFragments is private and called from a single site that always passes the argument. Drop = None to avoid signaling an extension point that doesn't exist.

Dropped the = None default.

LuciferYang · 2026-05-21T14:30:03Z

```diff
+      case e: Exception =>
+        throw new RuntimeException(
+          "Zonemap segment build failed. Uncommitted segments (if any) " +
+            "will be cleaned up by Lance's garbage collection.",
```


Will it really be cleaned up automatically?

Good question — updated the message. Uncommitted segments are not visible to readers and do not affect query correctness. They are orphaned artifacts that occupy storage but have no semantic impact.

LuciferYang · 2026-05-21T14:33:20Z

```diff
  }

+  @Test
+  public void testCreateZonemapIndex() {
```


nit: there are no negative test cases, such as:

- multi-column zonemap
- num_segments on btree/fts
- num_segments = 0 or negative values

Added negative tests for: multi-column zonemap, num_segments on btree, and zero/negative num_segments.
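The three rules those negative tests exercise can be expressed as one pure check, which makes the expected rejections easy to enumerate. This is a hypothetical sketch, not the PR's validation code: `ZonemapOptionRules`, `validate`, and the error strings are illustrative, and it assumes (as the added btree test implies) that num_segments is rejected for non-zonemap index types.

```scala
// Hypothetical option validator mirroring the rules discussed in this PR.
object ZonemapOptionRules {
  def validate(
      indexType: String,
      columns: Seq[String],
      options: Map[String, Any]): Either[String, Unit] = {
    val isZonemap = indexType.equalsIgnoreCase("zonemap")
    if (isZonemap && columns.size != 1)
      Left(s"zonemap indexes support a single column, got ${columns.size}")
    else if (!isZonemap && options.contains("num_segments"))
      Left(s"num_segments is only supported for zonemap indexes, got $indexType")
    else options.get("num_segments") match {
      // Bounds are checked on the Long value before any narrowing to Int.
      case Some(n: Number) if n.longValue() < 1L || n.longValue() > Int.MaxValue =>
        Left(s"num_segments must be a positive integer that fits in Int, got: $n")
      case Some(_: Number) | None => Right(())
      // Covers null and non-numeric values.
      case Some(other) => Left(s"num_segments must be a positive integer, got: $other")
    }
  }
}
```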

LuciferYang · 2026-05-21T14:34:33Z

```diff
+  }
+
+  @Test
+  public void testRepeatedCreateZonemapIndexReplacesExistingSegments() {
```


The test runs the SQL twice and asserts segment count stays at expectedSegmentCount. That catches the duplication failure mode (second run adds instead of replacing) but not the no-op failure mode (second run silently does nothing). Capture the segment UUIDs (or createdAt) after the first run and assert they differ after the second. This assumes Lance's createIndex mints fresh UUIDs per call — if UUIDs are content-addressed or otherwise stable across rebuilds, fall back to comparing createdAt.

Fixed — now captures segment UUIDs after the first run and asserts they differ after the second.
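The assertion logic the reviewer asked for can be separated from the Lance APIs entirely: given the segment UUIDs observed after each run, one check catches duplication (count grows) and another catches a silent no-op (UUIDs unchanged). `ReplacementCheck` and `checkReplaced` are hypothetical names, and the sketch assumes, as the review notes, that each createIndex call mints fresh UUIDs; if it does not, the same shape works with createdAt values instead.

```scala
// Hypothetical check that compares segment UUIDs before and after a rebuild.
object ReplacementCheck {
  def checkReplaced(
      before: Seq[String],
      after: Seq[String],
      expectedCount: Int): Either[String, Unit] =
    if (after.size != expectedCount)
      Left(s"expected $expectedCount segments, found ${after.size} (duplication?)")
    else if (before.toSet.intersect(after.toSet).nonEmpty)
      Left("some segment UUIDs survived the rebuild (no-op?)")
    else Right(())
}
```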

LuciferYang · 2026-05-21T14:35:55Z

```diff
+
+| Option          | Type | Description                                  |
+|-----------------|------|----------------------------------------------|
+| `rows_per_zone` | Long | The approximate number of rows per zonemap zone. |
```


If both are passed through IndexUtils.toJson the same way, label them the same way. zone_size in the btree section below should match too for cross-section consistency.

Fixed table alignment for consistency.

LuciferYang

The "What Changed" section still references LanceScanBuilder.java, LanceScan.java, LanceInputPartition.java, LanceFragmentScanner.java, and LanceCountStarPartitionReader.java, plus the bullet "Add segmented zonemap scan support with Spark-side post-scan filtering fallback" in Summary. None of this is in the current diff.

Everything else LGTM.

Add zonemap as a new index type in CREATE INDEX DDL with distributed build support. Each segment is built in parallel on Spark executors and committed as a logical index on the driver.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

beinan · 2026-05-27T21:30:48Z

> The "What Changed" section still references LanceScanBuilder.java, LanceScan.java, LanceInputPartition.java, LanceFragmentScanner.java, and LanceCountStarPartitionReader.java, plus the bullet "Add segmented zonemap scan support with Spark-side post-scan filtering fallback" in Summary. None of this is in the current diff.
>
> Everything else LGTM.

Sorry for the delay; I've just updated it. Can we merge this PR? @LuciferYang @hamersaw

beinan mentioned this pull request May 11, 2026

feat: add distributed zonemap index build #473 (closed)

github-actions bot added the enhancement (New feature or request) label May 11, 2026

beinan marked this pull request as ready for review May 11, 2026 20:24

This comment was marked as low quality.

hamersaw reviewed May 12, 2026

beinan force-pushed the user/beinan/zonemap-distributed-v2 branch from 416231b to fbe05fb May 12, 2026 21:55

beinan force-pushed the user/beinan/zonemap-distributed-v2 branch from 29ef8a5 to 78bb5ea May 15, 2026 22:39

LuciferYang reviewed May 21, 2026

LuciferYang approved these changes May 22, 2026

beinan force-pushed the user/beinan/zonemap-distributed-v2 branch from 804c1b9 to 5c18049 May 27, 2026 21:23
