feat(dir): shard directory catalog manifest#7008

Open
jackye1995 wants to merge 11 commits into lance-format:main from jackye1995:jack/sharded-dir-catalog

@jackye1995
Contributor

Stacked on #6794. This PR contains only the DirectoryNamespace sharded-catalog portion: manifest_shard_count configuration, _manifest_shard* routing, default single-__manifest compatibility, and focused sharded catalog tests.

Validated with cargo fmt, focused sharded tests, cargo check all-targets, clippy with warnings denied, and diff whitespace checks.
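The routing described above could be sketched roughly as follows. This is a minimal illustration and not the PR's code: the function name `manifest_table_for`, the FNV-1a hash, and the `_manifest_shard{N}` suffix format are all assumptions; the description only names the `manifest_shard_count` setting, the `_manifest_shard*` prefix, and the single `__manifest` default.

```rust
/// FNV-1a 64-bit hash. Chosen here only because it is stable across
/// processes and Rust versions; the hash the PR uses is not stated.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical helper: choose the manifest table that holds `object_id`.
/// `shard_count` stands in for the `manifest_shard_count` setting.
fn manifest_table_for(object_id: &str, shard_count: u32) -> String {
    // Zero or one shard keeps the single `__manifest` table, which is the
    // backward-compatible default described in the PR.
    if shard_count <= 1 {
        return "__manifest".to_string();
    }
    let shard = fnv1a(object_id.as_bytes()) % u64::from(shard_count);
    // Assumed suffix format; only the `_manifest_shard*` prefix is given.
    format!("_manifest_shard{shard}")
}
```

Under these assumptions, every reader and writer that agrees on the shard count resolves the same object id to the same table, so a lookup touches one shard instead of the whole catalog.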

jackye1995 added 11 commits May 15, 2026 00:43
Replace merge-insert/delete manifest mutations with always copy-on-write
full rewrites. Each mutation scans the latest __manifest dataset, streams
transformed rows into a replacement data file, and commits a new version
with replacement scalar indices built inline.

- Migrate metadata column from Utf8 to Lance JSON (LargeBinary)
- Remove base_objects column and LabelList index
- Build BTree (object_id), Bitmap (object_type), and FTS (metadata)
  indices during each streaming rewrite
- Add overwrite-with-replacement-indices commit support in Lance
- Handle concurrency via strict overwrite with full-rewrite retry
- Backward compatible: old schema datasets (Utf8 metadata, base_objects)
  are read correctly and migrated on first write
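The copy-on-write flow with strict overwrite and full-rewrite retry can be sketched with an in-memory stand-in. All types and names below (`Manifest`, `Store`, `commit_overwrite`, `mutate_with_retry`) are hypothetical and do not reflect the Lance API; the sketch only shows the shape of the loop: read the latest version, produce a full replacement, and commit only if nobody else committed in between.

```rust
/// In-memory stand-in for a versioned manifest table (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Manifest {
    version: u64,
    rows: Vec<String>,
}

/// Returned when another writer committed first.
#[derive(Debug, PartialEq)]
struct Conflict;

struct Store {
    latest: Manifest,
}

impl Store {
    /// Strict overwrite: succeed only if `read_version` is still the latest.
    fn commit_overwrite(&mut self, read_version: u64, rows: Vec<String>) -> Result<u64, Conflict> {
        if self.latest.version != read_version {
            return Err(Conflict);
        }
        self.latest = Manifest { version: read_version + 1, rows };
        Ok(self.latest.version)
    }
}

/// Copy-on-write mutation: scan the latest rows, build a full replacement,
/// and redo the entire rewrite if the commit conflicts.
fn mutate_with_retry<F>(store: &mut Store, max_attempts: usize, transform: F) -> Result<u64, Conflict>
where
    F: Fn(&[String]) -> Vec<String>,
{
    for _ in 0..max_attempts {
        let snapshot = store.latest.clone();
        let rewritten = transform(&snapshot.rows);
        match store.commit_overwrite(snapshot.version, rewritten) {
            Ok(version) => return Ok(version),
            Err(Conflict) => continue, // stale read: rewrite from scratch
        }
    }
    Err(Conflict)
}
```

Because each attempt rewrites from a fresh snapshot, a retried mutation never merges with a concurrent one; it simply reapplies the transform on top of whatever won the race.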

The binary measures read (list_namespaces, list_tables, describe_table) and
write (create_namespace, create_table) operations at configurable
concurrency levels. It supports --variant and --inline-optimization flags
to compare the baseline merge-insert and copy-on-write implementations.

Multi-process coordinator/worker architecture: the coordinator spawns N
child processes, each with an independent namespace instance (no shared
cache). Supports S3 root paths, cold-read (fresh namespace per op),
warm-read (cached), and write operations. A separate seed mode
populates manifests with a configurable entry count.
…nline cleanup

- Remove FTS background channel asymmetry: accumulate metadata in
  ManifestIndexAccumulator during streaming, build all 3 indices
  (BTree, Bitmap, FTS) uniformly after the stream completes.
- Replace CommitBuilder with direct manifest commit: expose
  write_manifest_file and ManifestWriteConfig as public API, add
  Dataset::commit_handler() accessor, construct Manifest via
  new_from_previous and commit directly.
- Remove inline cleanup on retry: drop cleanup_uncommitted_overwrite_files
  and cleanup_uncommitted_index_uuids, rely on offline GC for orphaned files.
- Add index verification tests: test_manifest_indices_are_complete_and_versioned
  checks all 3 indices are present, versioned, and have fragment bitmaps;
  test_manifest_reads_use_indexed_scans verifies explain plans show
  ScalarIndexQuery for BTree/Bitmap filters and MatchQuery for FTS.
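The accumulate-then-build pattern from the first bullet can be illustrated with plain standard-library maps. This is only a sketch under stated assumptions: `IndexAccumulator`, `Row`, and `Indices` are hypothetical names, and a `BTreeMap`, a `HashMap` of row lists, and a whitespace-style tokenizer stand in for Lance's BTree, Bitmap, and FTS indices.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical manifest row with the three indexed columns.
struct Row {
    object_id: String,
    object_type: String,
    metadata: String,
}

/// Buffers the indexed columns while rows stream through the rewrite.
#[derive(Default)]
struct IndexAccumulator {
    buffered: Vec<(u64, String, String, String)>,
}

/// Stand-ins for the BTree (object_id), Bitmap (object_type), and FTS
/// (metadata) indices.
#[derive(Default)]
struct Indices {
    by_id: BTreeMap<String, u64>,
    by_type: HashMap<String, Vec<u64>>,
    by_token: HashMap<String, Vec<u64>>,
}

impl IndexAccumulator {
    /// Called once per row during the streaming rewrite.
    fn observe(&mut self, row_id: u64, row: &Row) {
        self.buffered.push((
            row_id,
            row.object_id.clone(),
            row.object_type.clone(),
            row.metadata.clone(),
        ));
    }

    /// Called after the stream completes: all three indices are built in
    /// one uniform pass, with no separate background channel for FTS.
    fn finish(self) -> Indices {
        let mut indices = Indices::default();
        for (row_id, object_id, object_type, metadata) in self.buffered {
            indices.by_id.insert(object_id, row_id);
            indices.by_type.entry(object_type).or_default().push(row_id);
            let tokens = metadata
                .split(|c: char| !c.is_alphanumeric())
                .filter(|t| !t.is_empty());
            for token in tokens {
                let postings = indices.by_token.entry(token.to_lowercase()).or_default();
                // Skip a token repeated within the same row.
                if postings.last() != Some(&row_id) {
                    postings.push(row_id);
                }
            }
        }
        indices
    }
}
```

Treating all three indices the same way after the stream ends trades some peak memory for a simpler control flow, which matches the stated motivation of removing the FTS asymmetry.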

seed-large writes a __manifest Lance table directly with N rows,
bypassing the namespace API. It triggers one CoW rewrite to build
indices. Adds an --initial-entries flag to run mode for result tracking.

The mutation lock already serializes local writes. Use get_cached()
on first attempt (no I/O) and get_refreshed() only on retry after
conflict. Make checkout_version on success non-fatal so the return
path doesn't block on I/O if the cache promotion fails.
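A rough model of the cached-first, refresh-on-retry policy is shown below. The method names `get_cached` and `get_refreshed` come from the commit message, but their signatures here, the `Namespace` struct, and the integer version fields are illustrative assumptions; no real I/O is performed.

```rust
#[derive(Debug, PartialEq)]
struct Conflict;

/// Hypothetical model: `remote_version` is what storage holds,
/// `cached_version` is what this process last saw.
struct Namespace {
    remote_version: u64,
    cached_version: u64,
    refreshes: u32,
}

impl Namespace {
    /// No I/O: trust the local cache.
    fn get_cached(&self) -> u64 {
        self.cached_version
    }

    /// Simulated I/O: reload the latest version from storage.
    fn get_refreshed(&mut self) -> u64 {
        self.refreshes += 1;
        self.cached_version = self.remote_version;
        self.cached_version
    }

    /// Commit succeeds only when `base` is still the latest version.
    fn try_commit(&mut self, base: u64) -> Result<u64, Conflict> {
        if base != self.remote_version {
            return Err(Conflict);
        }
        self.remote_version += 1;
        Ok(self.remote_version)
    }

    fn mutate(&mut self, max_attempts: u32) -> Result<u64, Conflict> {
        for attempt in 0..max_attempts {
            // First attempt uses the cache; retries pay for a refresh.
            let base = if attempt == 0 {
                self.get_cached()
            } else {
                self.get_refreshed()
            };
            if let Ok(version) = self.try_commit(base) {
                // Cache promotion is best effort; in this model it cannot
                // fail, but a failure would not affect the returned result.
                self.cached_version = version;
                return Ok(version);
            }
        }
        Err(Conflict)
    }
}
```

In the common case, where the local mutation lock means no other local writer moved the manifest, the first attempt commits without any refresh; only an external writer forces the extra round trip.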

S3X no-index: 6.8/s create-ns, 6.2/s declare-table at 1K entries.
S3X with-index: 5.7/s create-ns, 5.1/s declare-table at 1K entries.
Indexed point lookup latency stays flat from 100K to 1M entries (9 ms warm on S3X).
CoW full rewrite + 3 indices at 1M entries: 2.0 s on S3X, 3.2 s on S3.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions (bot) added the enhancement and java labels on May 30, 2026