perf(mem_wal): faster MemTable point lookups (no-epoch skiplist + compact keys) by touch-of-grey · Pull Request #7011 · lance-format/lance

touch-of-grey · 2026-05-31T04:51:22Z

Summary

Speeds up MemWAL MemTable primary-key point lookups to match/beat RocksDB on single-thread latency and write throughput, and roughly doubles N-thread read scaling.

Index data structure

Replace the crossbeam-skiplist backing the in-memory BTree index with a purpose-built single-writer, lock-free-read arena skiplist with no epoch reclamation. The MemTable is append-only with one writer, so reads need no epoch pin — the dominant cost in profiling. Nodes are a single allocation with the key and forward-pointer tower laid out contiguously (RocksDB InlineSkipList style) for cache locality.
Compact small-node keys (the value is recovered/decoded at flush, not stored in the node): FixedKey (order-preserving u64) for Int8..Int64/UInt8..UInt64/Date32/64; BytesKey/InlineBytes (bytes inline for small values such as a 16-byte UUID and short string PKs) for Utf8/LargeUtf8/Binary/LargeBinary/FixedSizeBinary. Other scalar types fall back to the original OrderableScalarValue key.

Point-lookup read path

Seek-and-stop probe (get_newest_visible) honoring the MVCC visibility watermark.
LsmPointLookupPlanner::lookup / lookup_many (batch) plus scanner routing: pk = lit / pk IN (..) filters route to the fast path; everything else uses the existing union/dedup scan.
N-thread fix: stop cloning shared schema Arcs per lookup (borrow via schema_ref, borrow none_target) — that refcount cache line was ping-ponging across cores under concurrency.

Benchmark

mem_wal_kv_point_lookup: head-to-head Lance MemTable vs RocksDB, identical key/value/query sets. RocksDB arm is behind the off-by-default bench-rocksdb feature; --key-type int|uuid.

Results (1M rows, c7i.8xlarge)

Single-thread read: int 1.45µs / uuid 1.61µs, beating RocksDB on int (1.53µs) and matching it on uuid (1.61µs).
Write: 757k–908k rows/s — beats RocksDB (~694k).
Memory: ~1.4× RocksDB.
N-thread (32t): 0.50× RocksDB, up from 0.28× before the shared-Arc fix. The residual is structural (seek memory bandwidth + Lance's separate columnar value gather), not a contended lock.

Unified bench (mem_wal_kv_point_lookup) times the Lance MemTable point-lookup path and RocksDB in one process with identical keys/values/queries: write throughput, read latency/QPS (1t + Nt), CPU and peak RSS. RocksDB arm gated behind the optional bench-rocksdb feature. run_kv_compare.sh sweeps 100k/500k/1M.

--lance-read-mode fast resolves point lookups by probing the active MemTable BTree index and slicing one row from the BatchStore, bypassing DataFusion plan construction. Measures the lower bound for the current MemTable index vs the plan-based path.

LsmPointLookupPlanner::lookup() resolves single-column point lookups by probing the in-memory memtables' BTree index directly (newest generation first) and slicing the row from the BatchStore, bypassing per-lookup DataFusion plan construction. Falls back to the plan path for on-disk sources (flushed generations, base table), missing BTrees, multi-column keys, or projections with system columns. Benchmark shows this closes the ~300x point-lookup read gap vs RocksDB to within ~2x. Adds bench --lance-read-mode api to measure it.

- Fall back to the coercing plan path when the key's scalar type doesn't exactly match the PK column type (avoids OrderableScalarValue panic).
- Probe in O(log batches) via StoredBatch.row_offset instead of allocating the full visible-batch vector each lookup.
- Bench: create the RocksDB parent dir; drop the Lance writer (no blocking Drop) instead of leaking it, so --engine both does not inflate RocksDB RSS.

The fast-path eligibility check indexed pk_values[0] before the length was verified, panicking on lookup(&[]). Reorder so the length check short-circuits first; an empty slice now falls through to the plan path's invalid_input error. Regression test added. (Claude cross-review P1.)
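The ordering fix described above can be illustrated with a minimal sketch. The names `LookupRoute` and `choose_route`, and the "non-null key" condition, are hypothetical stand-ins and not the PR's actual eligibility check; the point is only that `&&` short-circuits, so the index expression is never evaluated for an empty slice.

```rust
/// Hypothetical routing decision, for illustration only.
#[derive(Debug, PartialEq)]
pub enum LookupRoute {
    FastPath,
    PlanPath,
}

/// Decide the route for a point lookup.
///
/// The length check comes first. Because `&&` short-circuits,
/// `pk_values[0]` is only evaluated when the slice has exactly one
/// element, so an empty slice falls through instead of panicking.
pub fn choose_route(pk_values: &[Option<i64>]) -> LookupRoute {
    // Assumption: a null key (None) is not eligible for the fast path.
    if pk_values.len() == 1 && pk_values[0].is_some() {
        LookupRoute::FastPath
    } else {
        LookupRoute::PlanPath
    }
}
```

With this shape, `choose_route(&[])` returns `PlanPath`, where the real code would surface its invalid_input error. Using `pk_values.first()` instead of indexing would rule out the panic by construction.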

Probe in-memory memtables by reference (no source Arc clones, no allocation in the common single-memtable case) instead of collect()-ing every source, and reuse a precomputed canonical schema for the no-projection case. Removes the per-lookup Arc-refcount and allocator contention that capped concurrent-reader QPS scaling. Adds LsmDataSourceCollector::{find_in_memory_newest_first, has_on_disk_sources}.

…ocksDB multi_get) --batch-get N measures fetching N keys per call: Lance does one BTree pass + a single vectorized Arrow take per source batch (concat to one RecordBatch); RocksDB uses multi_get. Sync on both sides (std threads) to compare the batch primitives. Reports keys/sec (1t + Nt) and per-batch latency.

Point lookups did an O(log n) skiplist range scan over an enum-keyed SkipMap — ~66% of single-thread lookup time (perf). Add a DashMap<ScalarValue, newest position> sidecar to BTreeMemIndex: get_eq() answers equality in O(1), falling back to the ordered range scan only when the newest write isn't visible yet (concurrent newer version). The skiplist stays for range queries. probe_memtable and the bench fast/batch paths use get_eq.

…f ScalarValue The equality hash keyed on the large ScalarValue enum: ~190MB for 1M keys and a wide-match compare (12.6% of single-thread, per perf). Introduce EqKey, a compact enum keying on the raw primitive (inline i64 etc.), with a ScalarValue fallback for exotic/null. Shrinks per-key memory and turns hash/compare into a typed op. An index is single-column/single-type so width variants never collide.

Lever #2 keyed the equality hash on a compact primitive instead of ScalarValue, expecting lower memory + a cheaper compare. EC2 re-profile showed neither: the 12% in the key compare is cache-miss latency loading the key from a random DashMap bucket (1M keys don't fit in cache), not comparison arithmetic or key size; and the +190MB is DashMap's sharded-RawTable structural overhead, not the key payload. Single-thread stayed 0.77us, RSS 528MB. Revert for simplicity; keep the ScalarValue-keyed hash from the prior commit (the real 2.9x win).

… the fast path Phase 1: LsmPointLookupPlanner::lookup_many(keys, projection) resolves N keys in one pass (probe in-memory memtables newest-first by reference, one vectorized Arrow take per source batch, plan fallback for on-disk/uncovered keys), plus plan_point_lookup() wrapping the result as a composable OneShotExec. Phase 2: LsmScanner::create_plan detects point-lookup filter shapes (pk = lit, pk IN (..)) and routes them to the direct BTree point-lookup node instead of the union/dedup scan, with limit applied on top. Any other shape falls through unchanged, so results are identical for non-point queries. This puts the fast path in the real read path, not just the benchmark. Tests: lookup_many (hits/misses/newest-dup/projection/empty), plan_point_lookup execution, scanner routing (pk=lit / pk IN route to OneShotStream, range filter does not), and a multi-level LSM test asserting the routed result == union path.
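The Phase 2 routing rule can be sketched roughly as follows. The `Filter` and `ScanRoute` types are hypothetical simplifications (the real scanner works on DataFusion expressions, which are not modelled here); the sketch only shows that the two point-lookup shapes on the primary key are recognized and everything else falls through unchanged.

```rust
/// Hypothetical, simplified filter shapes (integer literals only).
#[derive(Debug, Clone, PartialEq)]
pub enum Filter {
    Eq { column: String, value: i64 },
    In { column: String, values: Vec<i64> },
    Gt { column: String, value: i64 },
}

/// Hypothetical routing outcome.
#[derive(Debug, PartialEq)]
pub enum ScanRoute {
    /// Keys to resolve through the direct point-lookup node.
    PointLookup(Vec<i64>),
    /// Any other shape keeps the existing union/dedup scan.
    UnionScan,
}

/// Route `pk = lit` and `pk IN (..)` to the point lookup; all other
/// shapes, and filters on non-PK columns, use the union scan.
pub fn route_filter(filter: &Filter, pk: &str) -> ScanRoute {
    match filter {
        Filter::Eq { column, value } if column.as_str() == pk => {
            ScanRoute::PointLookup(vec![*value])
        }
        Filter::In { column, values } if column.as_str() == pk => {
            ScanRoute::PointLookup(values.clone())
        }
        _ => ScanRoute::UnionScan,
    }
}
```

Because the fallthrough arm leaves non-point shapes untouched, a range filter such as `pk > 5` takes the same path as before.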

Under --lance-read-mode api with --batch-get N, the batch read path now calls the production LsmPointLookupPlanner::lookup_many instead of the bench helper, so the batch numbers reflect shipped code.

The probe refactor routed the single-row materialization through gather_rows, which used take() (alloc + copy) instead of the original zero-copy slice, regressing single point lookups (0.77us -> 1.41us) and the common 1-hit-per-source-batch case. gather_rows now slices for a single row and only takes for true multi-row gathers.

A one-element lookup_many (e.g. a routed pk IN (x)) paid the batch grouping overhead (refs vec + hash map + pending) with nothing to amortize. Delegate n==1 to the cheaper single lookup path.

The O(1) equality hash was added to every BTreeMemIndex, so non-PK btree indexes (range filtering, never point-looked-up) paid the dual write + memory cost for a hash they never use. Make it Option-al and gate it on the primary key: btree_from_metadata detects the PK column from the schema's unenforced-primary-key metadata and sets is_primary_key; from_configs builds the hash only then. probe_position range-scans the ordered skiplist when no hash is present (correct, O(log n)). This mirrors the scanner's existing PK-only hashing for flushed-generation block-lists. Tests: PK vs non-PK hash gating, and a no-hash range-fallback point lookup.

…t lookup Drop the per-PK DashMap equality hash (the ~190MB memory cost) and instead make the point lookup a single seek-and-stop on the existing ordered skiplist: BTreeMemIndex::get_newest_visible(value, max_visible_row) does one upper_bound(<= (value, max_visible_row)) and, if the value matches, returns the newest visible position — no range collect, no allocation, no second structure. This mirrors RocksDB's memtable Get (seek to the key, take the newest visible version). Range queries still scan the skiplist. probe_position and the bench fast/batch paths use the new seek-and-stop; the hash, EqKey, and PK-gating are removed.
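The seek-and-stop probe can be sketched on a standard ordered set. This is an illustration under assumptions, not the PR's code: `OrderedIndex` is a hypothetical stand-in that uses `BTreeSet<(u64, u64)>` in place of the skiplist, with entries ordered by (key, row position).

```rust
use std::collections::BTreeSet;

/// Stand-in for the ordered index: entries sorted by (key, row position).
pub struct OrderedIndex {
    entries: BTreeSet<(u64, u64)>,
}

impl OrderedIndex {
    pub fn new() -> Self {
        Self { entries: BTreeSet::new() }
    }

    pub fn insert(&mut self, key: u64, position: u64) {
        self.entries.insert((key, position));
    }

    /// One seek to the greatest entry <= (key, max_visible_row).
    /// If that entry carries the requested key, it is the newest
    /// visible version; otherwise the key has no visible version.
    pub fn get_newest_visible(&self, key: u64, max_visible_row: u64) -> Option<u64> {
        let (found_key, position) = self
            .entries
            .range(..=(key, max_visible_row))
            .next_back()?;
        (*found_key == key).then_some(*position)
    }
}
```

The lookup touches a single entry: no range is collected and nothing is allocated, which is the property the commit message describes.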

The MemTable scalar index is append-only with a single writer and many concurrent readers, so it never needs the safe memory reclamation that crossbeam_skiplist pays for. Profiling the point-lookup hot path showed crossbeam's per-operation epoch pin (try_pin_loop) dominating, and worse, contending on the global epoch across threads — the N-thread read-scaling bottleneck. Add a purpose-built single-writer, lock-free-read skiplist (arena_skiplist): nodes are owned for the index generation's lifetime and freed only on drop (when no readers remain), so reads do only Acquire loads — no epoch pin, no fence. Mirrors RocksDB's InlineSkipList. BTreeMemIndex reads through the lock-free reader and serializes writes behind a Mutex that upholds the single-writer invariant (uncontended: the MemTable already serializes writes).
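The single-writer, lock-free-read idea can be shown on one level of a skiplist, i.e. a sorted linked list. This is a simplified sketch under assumptions: `SortedList` is hypothetical, uses one `Box` per node instead of an arena, and has no tower. It keeps the properties named above: writes are serialized by a `Mutex`, readers only perform Acquire loads, and nodes are freed only when the list is dropped.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Mutex;

struct Node {
    key: u64,
    next: AtomicPtr<Node>,
}

/// Sorted singly linked list: a single skiplist level, for illustration.
pub struct SortedList {
    head: AtomicPtr<Node>,
    /// Upholds the single-writer invariant.
    writer: Mutex<()>,
}

impl SortedList {
    pub fn new() -> Self {
        Self { head: AtomicPtr::new(ptr::null_mut()), writer: Mutex::new(()) }
    }

    /// Insert a key. Writers are serialized; readers are never blocked.
    pub fn insert(&self, key: u64) {
        let _guard = self.writer.lock().unwrap();
        let mut link: &AtomicPtr<Node> = &self.head;
        loop {
            let cur = link.load(Ordering::Acquire);
            // SAFETY: nodes are only freed in Drop, which needs &mut self.
            if cur.is_null() || unsafe { (*cur).key } >= key {
                let node = Box::into_raw(Box::new(Node { key, next: AtomicPtr::new(cur) }));
                // Release publishes the fully initialized node to readers.
                link.store(node, Ordering::Release);
                return;
            }
            link = unsafe { &(*cur).next };
        }
    }

    /// Smallest key >= `key`, using only Acquire loads (no epoch pin).
    pub fn seek_ge(&self, key: u64) -> Option<u64> {
        let mut cur = self.head.load(Ordering::Acquire);
        while !cur.is_null() {
            // SAFETY: same lifetime argument as in `insert`.
            let node = unsafe { &*cur };
            if node.key >= key {
                return Some(node.key);
            }
            cur = node.next.load(Ordering::Acquire);
        }
        None
    }
}

impl Drop for SortedList {
    fn drop(&mut self) {
        // &mut self guarantees no readers remain, so freeing is safe.
        let mut cur = *self.head.get_mut();
        while !cur.is_null() {
            let mut node = unsafe { Box::from_raw(cur) };
            cur = *node.next.get_mut();
        }
    }
}
```

Safe reclamation schemes exist to handle nodes that are unlinked while readers may still hold them. Since this structure never unlinks a node, deferring all frees to `Drop` is enough.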

The seek is cache-miss bound: each tower hop loads a node's key. The prior arena node used two heap allocations (node + a separate boxed tower), costing two cache misses per hop and measuring ~2x slower single-thread than the crossbeam baseline despite removing the epoch pin. Lay each node out as a single bump-arena allocation with the key and its forward-pointer tower contiguous ([key][AtomicPtr; height]) — RocksDB's InlineSkipList layout — so a hop is one cache miss, and nodes are contiguous in insertion order. Keys are dropped by walking level 0 before the arena frees its chunks. The traversal invariant (a node is linked at level L only if its height > L) keeps variable-length tower reads in bounds.

The seek is cache-miss bound and the residual gap to RocksDB is node size: our OrderableScalarValue key (~48B) makes a ~72B node, vs RocksDB's raw-byte key (~24B node), which packs better on the bottom-level walk. Add a FixedIntBackend for Int8..Int64/UInt8..UInt64/Date32/Date64 keyed by a compact FixedKey { order-preserving u64, position } (~24B node). The value is not stored in the node — it is decoded from the order-preserving encoding at flush (sign-flip for signed, identity for unsigned). Nulls are tracked in a side vector (off the hot path) and sort first. BTreeMemIndex picks the backend lazily by column type on first insert; non-fixed-int types keep the existing OrderableScalarValue backend.
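The order-preserving encoding can be sketched for the i64 case. The functions `encode_i64` / `decode_i64` and the field names of `FixedKey` below are illustrative assumptions, not the PR's definitions; the sketch only shows that flipping the sign bit makes unsigned comparison agree with signed order and that the value is recoverable from the key.

```rust
/// Map an i64 to a u64 whose unsigned order matches the signed order.
/// Flipping the sign bit moves all negatives below all non-negatives.
pub fn encode_i64(v: i64) -> u64 {
    (v as u64) ^ (1u64 << 63)
}

/// Inverse of `encode_i64`: recover the original value from the key.
pub fn decode_i64(k: u64) -> i64 {
    (k ^ (1u64 << 63)) as i64
}

/// Hypothetical compact key: encoded value plus row position.
/// The derived ordering compares `encoded` first, then `position`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FixedKey {
    pub encoded: u64,
    pub position: u64,
}
```

In this sketch the key is 16 bytes, and no separate copy of the value needs to live in the node. Unsigned types would use the identity mapping, as the commit message notes.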

Extend the small-node design to byte-typed columns (Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary — UUIDs are FixedSizeBinary(16)). The skiplist key is a BytesKey whose bytes live inline in the node for small values (UUID, short string PKs) via InlineBytes, spilling to the heap only for long keys — so the common cases get the single-allocation, small-node win and long keys still work. Lexicographic byte order is the natural order for these types, so the seek is a memcmp and the value is decoded from the key bytes at flush (from_utf8 for strings, raw bytes for binary). Nulls are tracked in a side vector and sort first. BTreeMemIndex routes byte types to this backend; exotic types still use the OrderableScalarValue fallback.
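The inline-or-spill idea can be sketched as a small enum. The layout below is an assumption for illustration: the capacity `INLINE_CAP` and the variant names are hypothetical, and the PR's InlineBytes may be laid out differently. It shows short values stored in place, long values on the heap, and ordering by lexicographic byte comparison.

```rust
use std::cmp::Ordering;

/// Hypothetical inline capacity; large enough for a 16-byte UUID.
const INLINE_CAP: usize = 22;

pub enum InlineBytes {
    /// Short values live directly inside the key (no extra allocation).
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// Long values spill to the heap.
    Heap(Box<[u8]>),
}

impl InlineBytes {
    pub fn new(bytes: &[u8]) -> Self {
        if bytes.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..bytes.len()].copy_from_slice(bytes);
            InlineBytes::Inline { len: bytes.len() as u8, buf }
        } else {
            InlineBytes::Heap(bytes.into())
        }
    }

    pub fn as_slice(&self) -> &[u8] {
        match self {
            InlineBytes::Inline { len, buf } => &buf[..*len as usize],
            InlineBytes::Heap(b) => &b[..],
        }
    }

    pub fn is_inline(&self) -> bool {
        matches!(self, InlineBytes::Inline { .. })
    }
}

impl PartialEq for InlineBytes {
    fn eq(&self, other: &Self) -> bool {
        self.as_slice() == other.as_slice()
    }
}
impl Eq for InlineBytes {}
impl PartialOrd for InlineBytes {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for InlineBytes {
    /// Lexicographic byte order, independent of where the bytes live.
    fn cmp(&self, other: &Self) -> Ordering {
        self.as_slice().cmp(other.as_slice())
    }
}
```

Comparing through `as_slice` means an inline key and a heap key order the same way, so the skiplist never needs to know which representation a node uses.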

Adds a UUID key type (FixedSizeBinary(16)) alongside the existing i64 key, so the byte/BytesKey backend can be benchmarked head-to-head with RocksDB. Logical keys stay i64; a deterministic 16-byte derivation (scrambled high half for UUID-like scatter, key in the low half for injectivity) maps each to the key column and the RocksDB key. Key construction is centralized in key_scalar / rocks_key so both engines and all read paths share one mapping.
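A derivation with these two properties might look like the sketch below. It is not the benchmark's key_scalar / rocks_key code: the function names and the choice of the splitmix64 finalizer as the scrambler are assumptions. The high half gives UUID-like scatter, and the low half carries the logical key verbatim, which makes the mapping injective.

```rust
/// Mixing step (splitmix64 finalizer). The choice of mixer is an
/// assumption; any deterministic scrambler would serve the same role.
fn scramble(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xBF58476D1CE4E5B9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94D049BB133111EB);
    x ^ (x >> 31)
}

/// Derive a deterministic 16-byte key from a logical i64 key.
/// High 8 bytes: scrambled key. Low 8 bytes: the key itself.
pub fn uuid_key(logical: i64) -> [u8; 16] {
    let k = logical as u64;
    let mut out = [0u8; 16];
    out[..8].copy_from_slice(&scramble(k).to_be_bytes());
    out[8..].copy_from_slice(&k.to_be_bytes());
    out
}

/// Recover the logical key from the low half.
pub fn logical_key(uuid: &[u8; 16]) -> i64 {
    let mut low = [0u8; 8];
    low.copy_from_slice(&uuid[8..]);
    u64::from_be_bytes(low) as i64
}
```

Since the low half alone identifies the logical key, two distinct keys can never map to the same 16 bytes, whatever the scrambler does.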

Profiling the 32-thread point-lookup read showed the scaling bottleneck is not a lock but per-lookup clones of shared schema Arcs: cloning/dropping a shared Arc is an atomic on one cache line that ping-pongs across cores under load. gather_rows called RecordBatch::schema() (which clones the shared schema Arc) once per output field; switch to schema_ref() (a borrow) hoisted out of the loop. lookup cloned none_target for the common no-projection case; borrow it instead and clone only once when building the result batch. This removes ~3 of the ~4 shared-schema-Arc clones per lookup.
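The clone-versus-borrow difference can be sketched with plain `std::sync::Arc`. `Schema`, `Batch`, and the two helper functions below are hypothetical stand-ins; only the pairing of a cloning `schema()` with a borrowing `schema_ref()` mirrors the accessors named above.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a schema shared across threads.
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical stand-in for a batch holding a shared schema.
pub struct Batch {
    schema: Arc<Schema>,
}

impl Batch {
    pub fn new(schema: Arc<Schema>) -> Self {
        Self { schema }
    }

    /// Owned handle: an atomic increment now and a decrement on drop,
    /// both on the refcount shared by every thread.
    pub fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }

    /// Borrow: no refcount traffic at all.
    pub fn schema_ref(&self) -> &Arc<Schema> {
        &self.schema
    }
}

/// Clones the Arc once per field (the pattern the commit removes).
pub fn field_names_cloning(batch: &Batch) -> Vec<String> {
    (0..batch.schema().fields.len())
        .map(|i| batch.schema().fields[i].clone())
        .collect()
}

/// Borrows the schema once, hoisted out of the loop.
pub fn field_names_borrowing(batch: &Batch) -> Vec<String> {
    let schema = batch.schema_ref();
    schema.fields.iter().cloned().collect()
}
```

Both helpers return the same names. The difference is that the borrowing version performs no atomic operations on the shared refcount, which is the cache line the profile showed moving between cores.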

…ature) Adds an opt-in scalable global allocator to A/B per-lookup alloc/free contention in the N-thread read path. Off by default; enable with --features bench-mimalloc.

…lice Experiment: a zero-copy slice shares the source column's buffer Arc, whose refcount churns under concurrency; take copies into owned buffers with no shared-buffer refcount. Measuring the single-thread vs N-thread tradeoff.

…o-copy slice" Copy-vs-slice measured a clear loss: take regressed single-thread point lookups ~30-38% (losing the win vs RocksDB) with no N-thread gain (int got worse, uuid +8%). This proves the sliced-buffer Arc is not the N-thread bottleneck — the materialization is non-contended CPU. Restore the zero-copy single-row slice. Also drop the now-unused bench-mimalloc allocator option: mimalloc gave no N-thread change vs glibc, confirming allocation is not the contended path.

…kup docs The O(1) equality hash index was replaced by the seek-and-stop skiplist in 27c7bd2 (EqKey, get_eq, DashMap, and PK-gating all removed). Two doc comments still described the deleted hash; update them to the seek-and-stop reality. No code change.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Public BTreeMemIndex doc linked to private Backend/FixedKey, failing rustdoc -D warnings (private_intra_doc_links).

…eatures CI CI builds with every package feature (ALL_FEATURES), which pulled bench-rocksdb and tried to compile bundled librocksdb C++ — the runners have no libclang, so all full-feature jobs failed. Comment out the mem_wal_kv_point_lookup [[bench]] target plus the rocksdb optional dependency and bench-rocksdb feature, and prune them from Cargo.lock. The benchmark source is kept; re-enable by uncommenting the three Cargo.toml blocks.

codecov · 2026-05-31T06:17:45Z

Codecov Report

❌ Patch coverage is 82.92683% with 287 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/mem_wal/index/btree.rs | 67.44% | 202 Missing and 8 partials ⚠️ |
| .../lance/src/dataset/mem_wal/scanner/point_lookup.rs | 90.40% | 38 Missing and 12 partials ⚠️ |
| ...ust/lance/src/dataset/mem_wal/scanner/collector.rs | 60.52% | 13 Missing and 2 partials ⚠️ |
| rust/lance/src/dataset/mem_wal/scanner/builder.rs | 92.30% | 7 Missing and 3 partials ⚠️ |
| .../lance/src/dataset/mem_wal/index/arena_skiplist.rs | 99.66% | 1 Missing ⚠️ |
| rust/lance/src/dataset/mem_wal/scanner/planner.rs | 97.82% | 0 Missing and 1 partial ⚠️ |

…sDB SST) Adds a flushed/on-disk read benchmark to mem_wal_kv_point_lookup: the Lance arm writes all rows as one Lance dataset with an on-disk BTree index (the artifact a MemTable flush emits) and point-looks-up via the indexed scan; the RocksDB arm flushes its memtable to a single SST (compaction off, bloom + block cache) and Gets. Bench remains disabled as a CI build target.

Adds --lance-read-mode fast for the flushed benchmark: search the on-disk BTree scalar index directly (ScalarIndex::search) + take the row, bypassing DataFusion plan construction. Diagnostic to isolate how much of the flushed point-lookup cost is the per-lookup DataFusion plan vs the index+take itself. plan mode keeps the scan().filter() path. Bench stays disabled as a CI build target.

Adds --cold for the flushed benchmark: assumes a larger-than-RAM dataset so reads miss the caches and hit NVMe. Caps the RocksDB write buffer (avoids OOM on big datasets), flushes + compacts to one SST, uses a small (128MB) block cache so data blocks miss, and drops the OS page cache before the read phase (both engines, via passwordless sudo). Bench stays disabled as a CI build target.

… RocksDB Splits rows into generations+1 parts: the first N are sealed+flushed to on-disk generations (Lance, force_seal_active) / separate L0 SSTs (RocksDB, flush per chunk, compaction off), and the last stays in the active MemTable. Lookups (api mode) traverse active -> gens newest-first, matching RocksDB memtable -> L0 SSTs. Combine with --cold to read the gens cold from NVMe. Bench stays disabled as a CI build target.

flushed_batch gathers a chunk of keys: direct mode searches the on-disk BTree index per key then issues one take_rows; plan mode runs one DataFusion scan with 'id IN (...)'. Lets --batch-get measure single-flushed batch lookups (RocksDB's single-SST multi_get path already exists). Bench stays disabled in CI.

touch-of-grey added 26 commits May 30, 2026 01:58

test(mem_wal): bench api batch-get drives production lookup_many

a852a8c

perf(mem_wal): short-circuit lookup_many for a single key

903dfc5

bench(mem_wal): optional mimalloc global allocator (bench-mimalloc fe…

08c1487

…ature) Adds an opt-in scalable global allocator to A/B per-lookup alloc/free contention in the N-thread read path. Off by default; enable with --features bench-mimalloc.

claude Bot reviewed May 31, 2026

github-actions Bot added the performance label May 31, 2026

touch-of-grey added 2 commits May 30, 2026 22:11

docs(mem_wal): use code spans not intra-doc links to private btree items

e38b4ac

touch-of-grey added 5 commits May 30, 2026 23:25

touch-of-grey wants to merge 33 commits into lance-format:main from touch-of-grey:LookupPerf