feat: add distributed zonemap index build with configurable segments #516

Open
beinan wants to merge 1 commit into lance-format:main from beinan:user/beinan/zonemap-distributed-v2

Conversation

@beinan
Contributor

@beinan beinan commented May 11, 2026

Summary

  • Add zonemap as a new index type in CREATE INDEX DDL with distributed build support
  • Batch fragments into configurable segments via num_segments option (defaults to spark.default.parallelism)
  • Each segment is built in parallel on Spark executors and committed as a logical index on the driver
  • Zonemap indexes currently support single column only

What Changed

  • AddIndexExec.scala: Zonemap-specific path with ZonemapIndexJob/ZonemapIndexTask and commitIndexSegments
  • create-index.md: Document zonemap index type, options, and usage
  • Tests: unit tests for segment creation/validation and integration test

Notes

Test plan

  • CI passes (lint, unit tests, integration tests across all Spark/Scala versions)
  • Zonemap index creation with default segment count
  • Zonemap index creation with explicit num_segments
  • Repeated zonemap index creation replaces existing segments
  • Query correctness after zonemap index creation

🤖 Generated with Claude Code

@github-actions github-actions bot added the enhancement label May 11, 2026
@beinan beinan marked this pull request as ready for review May 11, 2026 20:24
beinan

This comment was marked as low quality.

Collaborator

@hamersaw hamersaw left a comment

This looks pretty good. Thanks for the PR! A few things that we should tighten up IMO.

Comment thread lance-spark-base_2.12/src/main/java/org/lance/spark/read/LanceScanBuilder.java Outdated
Comment thread lance-spark-base_2.12/src/main/java/org/lance/spark/read/LanceScanBuilder.java Outdated
Comment thread lance-spark-base_2.12/src/main/java/org/lance/spark/read/LanceScanBuilder.java Outdated
Option[Map[String, String]]) = catalog match {
column: String,
segments: Seq[Index]): Unit = {
val dataset = Utils.openDatasetBuilder(readOptions).build()
Collaborator

Is there a specific reason we are opening two datasets within this function?

Contributor Author

The two opens are needed because commitExistingIndexSegments creates a new dataset version, so we need a fresh dataset handle to read the updated index state for the cleanup transaction. I've added a comment in the code explaining this. The first dataset is used for the segment commit, and the second reads the post-commit state to remove old segments.

Collaborator

I am pretty concerned that this can leave us in a poor state if we commit the new index segments and a failure occurs before the old ones are removed. I think results should still be correct, but there will be quite a bit of overhead until the indexing process is rerun.

It doesn't look like there is a solution for this in the lance core SDK when committing segmented indices, maybe we need to devise a solution for transactionally replacing segmented indices?

Contributor Author

Great catch. After investigating Lance core's commit_existing_index_segments (index.rs:1065-1164), it turns out the core API already handles atomic replacement — it finds existing segments whose fragments overlap with incoming ones and removes them in the same CreateIndex transaction. The Spark-side manual cleanup (second dataset open + removal transaction) was redundant and is what introduced the race.

Fixed in the latest push by simplifying commitIndexSegments to just call dataset.commitExistingIndexSegments() and let Lance core handle the atomic add+remove. The method went from ~50 lines to ~10.

@beinan beinan force-pushed the user/beinan/zonemap-distributed-v2 branch from 416231b to fbe05fb Compare May 12, 2026 21:55
@beinan
Contributor Author

beinan commented May 12, 2026

Thanks for the thorough review! All feedback has been addressed in the latest push (force-pushed as a single clean commit on latest main):

Scan-side changes removed entirely:

  • Removed useScalarIndex, forcePostScanFiltering, and shouldForcePostScanFiltering — zonemap fragment pruning works via the existing ZonemapFragmentPruner path without needing special scan flags. No scan-side files are modified in this PR anymore.

Index creation fixes:

  • Race in commitIndexSegments: Now captures pre-commit UUIDs and only removes indexes with those UUIDs, so concurrent writers' segments are never deleted.
  • batchFragments accuracy: Switched from ceil(N/K) to index-based slicing (slice(i*N/K, (i+1)*N/K)) to guarantee the requested segment count.
  • num_segments validation: Bounds check for <= 0 and type validation, passed through constructor (no duplicate extraction).
  • Segment failure handling: try/catch with clear error message about Lance GC cleanup.
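To illustrate the batchFragments change listed above, here is a minimal sketch comparing the two strategies. The object and method names are illustrative and are not the PR's actual code; it assumes a non-empty fragment list and 1 <= k <= n.

```scala
object BatchFragmentsSketch {
  // Fixed chunk size ceil(n / k): can produce fewer than k batches.
  // Assumes ids is non-empty and k >= 1 (illustrative helper, not the PR's code).
  def byCeil(ids: List[Int], k: Int): Seq[List[Int]] = {
    val chunk = (ids.size + k - 1) / k
    ids.grouped(chunk).toList
  }

  // Index-based slicing: batch i covers [i*n/k, (i+1)*n/k).
  // Produces exactly k non-empty batches whenever 1 <= k <= n.
  // Long arithmetic keeps i * n from overflowing Int.
  def bySlice(ids: List[Int], k: Int): Seq[List[Int]] = {
    val n = ids.size
    (0 until k).map { i =>
      ids.slice((i.toLong * n / k).toInt, ((i.toLong + 1) * n / k).toInt)
    }
  }

  def main(args: Array[String]): Unit = {
    val ids = (0 until 10).toList
    println(byCeil(ids, 6).size)  // 5 batches, one short of the request
    println(bySlice(ids, 6).size) // 6 batches, as requested
  }
}
```

With 10 fragments and 6 requested segments, the ceil-based chunk size is 2, which yields only 5 batches; slicing by index yields all 6.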

@beinan beinan force-pushed the user/beinan/zonemap-distributed-v2 branch from 29ef8a5 to 78bb5ea Compare May 15, 2026 22:39
@beinan
Contributor Author

beinan commented May 17, 2026

@hamersaw All review feedback has been addressed — the key change since your last review is simplifying commitIndexSegments to rely on Lance core's built-in atomic replacement (single transaction for add+remove). Would you mind taking another look when you get a chance? Thanks!

@hamersaw
Collaborator

@LuciferYang do you mind making a pass here? Specifically, I'm interested in how this compares to your proposal (#513) to support building distributed ZoneMap indexes.

@LuciferYang
Contributor

@LuciferYang do you mind making a pass here? Specifically, I'm interested in how this compares to your proposal (#513) to support building distributed ZoneMap indexes.

Will give feedback later today.

@LuciferYang
Contributor

@hamersaw @beinan Had a closer look. After the May 12 force-push on #516, the two PRs are adjacent rather than overlapping.

| | #516 | #513 distributed | #513 consolidated (opt-in) |
|---|---|---|---|
| Tasks | 1 per num_segments batch | 1 per fragment | 1 per fragment (compute), driver write |
| Segments on disk | num_segments (default = min(fragments, defaultParallelism)) | = fragment count | 1 |
| Commit API | commitExistingIndexSegments | manual AddIndexOperation + Transaction + CommitBuilder | same as distributed |
| Upstream blocker | none (project's already on lance-core 7.0.0-beta.10) | none | lance-format/lance#6779 + #6780, both unmerged |

The main difference between the two distributed paths is that num_segments on #516 is one knob doing two jobs: parallelism and segment count are the same lever. At num_segments=1 you get one segment, but the work serialises onto a single executor — not the same as #513's consolidated path, which keeps compute parallel and only centralises the write. That decoupled corner only opens up once #6779/#6780 land.

For reference, sf=100 store_sales (234 fragments, ~288M rows) under #513: distributed = 15.0 s / 234 segments / 1.1 MB; consolidated = 28.1 s / 1 segment / 138 KB. The ~8× footprint drop is per-file header amortisation.

What I'd want to inherit from #516:

Suggested path:

  1. Land #516 as-is.
  2. After it lands, I'll rebase #513 distributed onto commitExistingIndexSegments. The fragment-id exception wrapping in ZonemapFragmentTask is worth porting onto ZonemapIndexTask while we're there. After that, #513 only carries the consolidated path.
  3. Consolidated lands as an opt-in (spark.lance.zonemap.consolidate.enabled) once #6779/#6780 release. Default-off — driver allocator holds every per-fragment Arrow batch until writeZonemapIndexFromBatches consumes them, so it can regress at very-high fragment counts.

One nit on #516, will leave inline: targetTasks = math.min(fragmentIds.size, n) silently clamps num_segments=1000 to fragment count. The doc string reads like num_segments is a target, not an upper bound. Either log when clamping or reword the doc.
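As a sketch of the "log when clamping" option: the helper below is hypothetical (ClampSketch and targetTasks are illustrative names, not the PR's code). It returns the message rather than calling a logger so that the example stays self-contained.

```scala
object ClampSketch {
  // Returns the effective task count and, when the request was clamped,
  // a message the caller could pass to its logger. Hypothetical helper.
  def targetTasks(fragmentCount: Int, requested: Int): (Int, Option[String]) = {
    val effective = math.min(fragmentCount, requested)
    val note =
      if (effective < requested)
        Some(s"num_segments=$requested exceeds fragment count $fragmentCount; using $effective")
      else None
    (effective, note)
  }

  def main(args: Array[String]): Unit = {
    // 234 fragments with num_segments=1000: clamped, with a message.
    println(targetTasks(234, 1000))
    // 234 fragments with num_segments=16: used as requested, no message.
    println(targetTasks(234, 16))
  }
}
```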

@beinan
Contributor Author

beinan commented May 19, 2026

@LuciferYang Thanks for the thorough comparison — the side-by-side table is really helpful.

You're right about num_segments doing double duty. The latest push addresses the nit:

  • num_segments doc now clarifies it's an upper bound clamped to fragment count
  • Added a log message when clamping occurs so it's not silent
  • Switched batchFragments to index-based slicing to guarantee the requested segment count

Agreed on the suggested path — happy to land #516 as the distributed foundation, then have #513 rebase its distributed path onto commitExistingIndexSegments and carry the consolidated path as an opt-in once #6779/#6780 land. The fragment-id exception wrapping from ZonemapFragmentTask is a good addition to port over as well.

Contributor

@LuciferYang LuciferYang left a comment

Does the PR description also need to be updated?

val validatedNumSegments: Option[Int] = numSegmentsOpt.map { arg =>
  val value =
    try {
      arg.value.asInstanceOf[Number].intValue()
Contributor

Scala's null.asInstanceOf[Number] returns null instead of throwing ClassCastException, so a WITH (num_segments = null) argument bypasses the friendly error path and dies with an opaque NullPointerException at .intValue().

On the other hand, if the parser delivers a java.lang.Long (e.g., for an out-of-range literal), .intValue() silently truncates rather than rejecting. Some negative Longs below Int.MinValue truncate to positive Ints (for example, -2147483649L becomes 2147483647) and slip past the value <= 0 check on line 111. Validate the Long bounds before narrowing.

It can be revised as follows:

val value = arg.value match {
  case null =>
    throw new IllegalArgumentException("num_segments must be a positive integer, got: null")
  case n: Number =>
    val asLong = n.longValue()
    if (asLong < 1L || asLong > Int.MaxValue) throw new IllegalArgumentException(
      s"num_segments must be a positive integer that fits in Int, got: $asLong")
    asLong.toInt
  case other =>
    throw new IllegalArgumentException(
      s"num_segments must be a positive integer, got: $other")
}

With this in place the redundant if (value <= 0) throw … block on line 111 can be removed.
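The two failure modes described above can be reproduced in isolation. This is a minimal sketch: NarrowingSketch and naive are illustrative names that mirror the cast-then-narrow pattern under review, not the PR's code.

```scala
import scala.util.Try

object NarrowingSketch {
  // Mirrors the pattern under review: cast to Number, then narrow to Int.
  def naive(raw: Any): Int = raw.asInstanceOf[Number].intValue()

  def main(args: Array[String]): Unit = {
    // The cast accepts null, so the failure only surfaces at intValue()
    // as a NullPointerException rather than a ClassCastException.
    println(Try(naive(null)))

    // A Long just below Int.MinValue keeps only its low 32 bits and comes
    // out as Int.MaxValue, so a "value <= 0" check would not reject it.
    println(naive(-2147483649L))
  }
}
```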

Contributor Author

Fixed — switched to pattern match handling null, Long bounds, and non-Number types explicitly. Removed the redundant <= 0 check.

Option[Map[String, String]]) = catalog match {
column: String,
segments: Seq[Index]): Unit = {
val dataset = Utils.openDatasetBuilder(readOptions).build()
Contributor

The driver opens once on line 72 to enumerate fragmentIds, closes it, then commitIndexSegments opens a fresh one on line 214 to call commitExistingIndexSegments. Both go through the same Utils.openDatasetBuilder(readOptions). Consider consolidating to a single driver-side open scoped to the entire zonemap branch — derive fragmentIds from it and pass that same handle into commitIndexSegments.

Caveat worth verifying first: if commitExistingIndexSegments requires a handle at the latest dataset version (not the version captured at line 72), reusing the older handle could fail the commit on a version mismatch. If the Lance core contract requires a fresh handle, leave commitIndexSegments as-is and only optimize the line 72 enumeration (e.g., enumerate via lanceDataset if it already exposes fragment IDs).

Contributor Author

Good point. Left as-is for now since commitExistingIndexSegments may require a handle at the latest version. Worth consolidating in a follow-up if we confirm the version contract.

math.min(n, addIndexExec.session.sparkContext.defaultParallelism))
}
(0 until k).map { i =>
fragmentIds.slice(i * n / k, (i + 1) * n / k)
Contributor

i * n (both Int) overflows once the product exceeds Int.MaxValue ≈ 2.1×10^9. Triggering requires a deliberately large num_segments and a fragment count where i*n crosses the boundary — e.g., 200k fragments with num_segments near n puts i*n near 4×10^10. Not a hot-path concern, but cheap to make overflow-safe. Promote one operand to Long:

fragmentIds.slice((i.toLong * n / k).toInt, ((i.toLong + 1) * n / k).toInt)

Equivalent and overflow-safe.
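A small sketch of the difference, using the 200k-fragment scenario from this comment. SliceBoundsSketch, naiveStart, and safeStart are illustrative names, not the PR's code.

```scala
object SliceBoundsSketch {
  // Int arithmetic: i * n wraps around once it exceeds Int.MaxValue.
  def naiveStart(i: Int, n: Int, k: Int): Int = i * n / k

  // Promote to Long before multiplying. For 0 <= i <= k the result is
  // at most n, so narrowing back to Int is safe.
  def safeStart(i: Int, n: Int, k: Int): Int = (i.toLong * n / k).toInt

  def main(args: Array[String]): Unit = {
    val n = 200000 // fragment count
    val k = 200000 // num_segments near n
    val i = 150000 // a late segment index; i * n is about 3e10
    println(naiveStart(i, n, k)) // a negative start index: the product wrapped
    println(safeStart(i, n, k))  // 150000
  }
}
```

For small inputs the two agree, which is why the overflow is easy to miss in ordinary tests.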

Contributor Author

Fixed — promoted to i.toLong * n / k to avoid overflow.

private def createIndexJob(
dataset: Dataset,
lanceDataset: LanceDataset,
// Lance core's commitExistingIndexSegments handles atomic replacement:
Contributor

nit: Add a one-line comment at the call site (line 132), for example "// atomic add+remove via Lance core; see commitIndexSegments", so the replacement semantics are visible without jumping to the definition.

Contributor Author

Added.

}

// Zonemap uses logical segment commit path
if (useLogicalSegmentCommit) {
Contributor

nit: val uuid = UUID.randomUUID() is generated before this branch. It is only a cosmetic micro-allocation, but it wrongly signals that the UUID is needed by every branch. Move it below the if (useLogicalSegmentCommit) { … return … } block so it's only generated for the merge-metadata branch.

Contributor Author

Done — moved val uuid below the zonemap early return.


private def batchFragments(
fragmentIds: List[Integer],
numSegments: Option[Int] = None): Seq[List[Integer]] = {
Contributor

batchFragments is private and called from a single site that always passes the argument. Drop = None to avoid signaling an extension point that doesn't exist.

Contributor Author

Dropped the = None default.

case e: Exception =>
throw new RuntimeException(
"Zonemap segment build failed. Uncommitted segments (if any) " +
"will be cleaned up by Lance's garbage collection.",
Contributor

Will it really be cleaned up automatically?

Contributor Author

Good question — updated the message. Uncommitted segments are not visible to readers and do not affect query correctness. They are orphaned artifacts that occupy storage but have no semantic impact.

}

@Test
public void testCreateZonemapIndex() {
Contributor

nit: no negative test cases, such as

  • negative test for multi-column zonemap
  • negative test for num_segments on btree/fts
  • test for num_segments = 0 / negative values

Contributor Author

Added negative tests for: multi-column zonemap, num_segments on btree, and zero/negative num_segments.

}

@Test
public void testRepeatedCreateZonemapIndexReplacesExistingSegments() {
Contributor

The test runs the SQL twice and asserts segment count stays at expectedSegmentCount. That catches the duplication failure mode (second run adds instead of replacing) but not the no-op failure mode (second run silently does nothing). Capture the segment UUIDs (or createdAt) after the first run and assert they differ after the second. This assumes Lance's createIndex mints fresh UUIDs per call — if UUIDs are content-addressed or otherwise stable across rebuilds, fall back to comparing createdAt.
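The stronger assertion could be expressed as a small predicate over the segment UUIDs captured before and after the second run. This is a sketch only: ReplacementCheckSketch and wasReplaced are hypothetical names, and how the UUIDs are read from the dataset is deliberately left out.

```scala
import java.util.UUID

object ReplacementCheckSketch {
  // True only when the rebuild kept the segment count (no duplication)
  // and every segment UUID is new (no silent no-op). Hypothetical helper.
  def wasReplaced(before: Set[UUID], after: Set[UUID]): Boolean =
    before.size == after.size && before.intersect(after).isEmpty

  def main(args: Array[String]): Unit = {
    val first = Set(UUID.randomUUID(), UUID.randomUUID())
    val second = Set(UUID.randomUUID(), UUID.randomUUID())
    println(wasReplaced(first, second)) // same count, fresh UUIDs
    println(wasReplaced(first, first))  // no-op rebuild is rejected
  }
}
```

Checking the count and the disjointness together covers both failure modes: a second run that adds segments changes the count, and a second run that does nothing leaves the UUIDs unchanged.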

Contributor Author

Fixed — now captures segment UUIDs after the first run and asserts they differ after the second.

Comment thread docs/src/operations/ddl/create-index.md Outdated

| Option | Type | Description |
|-----------------|------|----------------------------------------------|
| `rows_per_zone` | Long | The approximate number of rows per zonemap zone. |
Contributor

If both are passed through IndexUtils.toJson the same way, label them the same way. zone_size in the btree section below should match too for cross-section consistency.

Contributor Author

Fixed table alignment for consistency.

Contributor

@LuciferYang LuciferYang left a comment

The "What Changed" section still references LanceScanBuilder.java, LanceScan.java, LanceInputPartition.java, LanceFragmentScanner.java, and LanceCountStarPartitionReader.java, plus the bullet "Add segmented zonemap scan support with Spark-side post-scan filtering fallback" in Summary. None of this is in the current diff.

others LGTM

Add zonemap as a new index type in CREATE INDEX DDL with distributed build support.
Each segment is built in parallel on Spark executors and committed as a logical index
on the driver.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the user/beinan/zonemap-distributed-v2 branch from 804c1b9 to 5c18049 Compare May 27, 2026 21:23
@beinan
Contributor Author

beinan commented May 27, 2026

The "What Changed" section still references LanceScanBuilder.java, LanceScan.java, LanceInputPartition.java, LanceFragmentScanner.java, and LanceCountStarPartitionReader.java, plus the bullet "Add segmented zonemap scan support with Spark-side post-scan filtering fallback" in Summary. None of this is in the current diff.

others LGTM

Sorry for the delay, just updated. Can we merge this PR? @LuciferYang @hamersaw
