perf(mem_wal): faster MemTable point lookups (no-epoch skiplist + compact keys) #7011

Open
touch-of-grey wants to merge 33 commits into lance-format:main from touch-of-grey:LookupPerf

@touch-of-grey

Summary

Speeds up MemWAL MemTable primary-key point lookups to match/beat RocksDB on single-thread latency and write throughput, and roughly doubles N-thread read scaling.

Index data structure

  • Replace the crossbeam-skiplist backing the in-memory BTree index with a purpose-built single-writer, lock-free-read arena skiplist with no epoch reclamation. The MemTable is append-only with one writer, so reads need no epoch pin — the dominant cost in profiling. Nodes are a single allocation with the key and forward-pointer tower laid out contiguously (RocksDB InlineSkipList style) for cache locality.
  • Compact small-node keys (the value is recovered/decoded at flush, not stored in the node): FixedKey (order-preserving u64) for Int8..Int64/UInt8..UInt64/Date32/64; BytesKey/InlineBytes (bytes inline for small values such as a 16-byte UUID and short string PKs) for Utf8/LargeUtf8/Binary/LargeBinary/FixedSizeBinary. Other scalar types fall back to the original OrderableScalarValue key.

Point-lookup read path

  • Seek-and-stop probe (get_newest_visible) honoring the MVCC visibility watermark.
  • LsmPointLookupPlanner::lookup / lookup_many (batch) plus scanner routing: pk = lit / pk IN (..) filters route to the fast path; everything else uses the existing union/dedup scan.
  • N-thread fix: stop cloning shared schema Arcs per lookup (borrow via schema_ref, borrow none_target) — that refcount cache line was ping-ponging across cores under concurrency.

Benchmark

  • mem_wal_kv_point_lookup: head-to-head Lance MemTable vs RocksDB, identical key/value/query sets. RocksDB arm is behind the off-by-default bench-rocksdb feature; --key-type int|uuid.

Results (1M rows, c7i.8xlarge)

  • Single-thread read: int 1.45µs / uuid 1.61µs, beating RocksDB on int and matching it on uuid (1.53 / 1.61µs).
  • Write: 757k–908k rows/s — beats RocksDB (~694k).
  • Memory: ~1.4× RocksDB.
  • N-thread (32t): 0.50× RocksDB (up from 0.28× after the shared-Arc fix). The residual is structural (seek memory bandwidth + Lance's separate columnar value gather), not a contended lock.

Unified bench (mem_wal_kv_point_lookup) times the Lance MemTable point-lookup
path and RocksDB in one process with identical keys/values/queries: write
throughput, read latency/QPS (1t + Nt), CPU and peak RSS. RocksDB arm gated
behind the optional bench-rocksdb feature. run_kv_compare.sh sweeps 100k/500k/1M.

--lance-read-mode fast resolves point lookups by probing the active MemTable
BTree index and slicing one row from the BatchStore, bypassing DataFusion plan
construction. Measures the lower bound for the current MemTable index vs the
plan-based path.

LsmPointLookupPlanner::lookup() resolves single-column point lookups by probing
the in-memory memtables' BTree index directly (newest generation first) and
slicing the row from the BatchStore, bypassing per-lookup DataFusion plan
construction. Falls back to the plan path for on-disk sources (flushed
generations, base table), missing BTrees, multi-column keys, or projections
with system columns. Benchmark shows this closes the ~300x point-lookup read
gap vs RocksDB to within ~2x. Adds bench --lance-read-mode api to measure it.

- Fall back to the coercing plan path when the key's scalar type doesn't
  exactly match the PK column type (avoids OrderableScalarValue panic).
- Probe in O(log batches) via StoredBatch.row_offset instead of allocating
  the full visible-batch vector each lookup.
- Bench: create the RocksDB parent dir; drop the Lance writer (no blocking
  Drop) instead of leaking it, so --engine both does not inflate RocksDB RSS.
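
The O(log batches) probe in the list above can be pictured as a binary search over batch start offsets. The following is a minimal sketch, assuming batches are contiguous and sorted by `row_offset`; this `StoredBatch` is a hypothetical two-field stand-in, and `num_rows` and `locate_row` are illustrative names that do not come from the PR.

```rust
/// Hypothetical stand-in for a stored batch: only the fields the probe needs.
pub struct StoredBatch {
    /// Global row position of the first row in this batch.
    pub row_offset: u64,
    /// Number of rows held by this batch (assumed field).
    pub num_rows: u64,
}

/// Map a global row position to (batch index, row index inside that batch).
/// Assumes `batches` is sorted by `row_offset` with no gaps.
pub fn locate_row(batches: &[StoredBatch], position: u64) -> Option<(usize, usize)> {
    // Index of the first batch that starts after `position`;
    // the batch just before it is the only candidate.
    let next = batches.partition_point(|b| b.row_offset <= position);
    let batch_idx = next.checked_sub(1)?;
    let batch = &batches[batch_idx];
    let local = position - batch.row_offset;
    // Reject positions past the end of the last batch.
    (local < batch.num_rows).then_some((batch_idx, local as usize))
}
```

Because the search walks only the offsets, no per-lookup vector of visible batches has to be built.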
The fast-path eligibility check indexed pk_values[0] before the length was
verified, panicking on lookup(&[]). Reorder so the length check short-circuits
first; an empty slice now falls through to the plan path's invalid_input error.
Regression test added. (Claude cross-review P1.)
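
The reordering described above relies on `&&` short-circuiting. A minimal sketch of the pattern follows; `Key`, `Route`, `is_fast_path_type`, and `choose_route` are hypothetical names used only for illustration, not the planner's real types.

```rust
/// Hypothetical key type standing in for the scalar PK value.
#[derive(Debug, Clone, PartialEq)]
pub enum Key {
    Int(i64),
    Text(String),
}

#[derive(Debug, PartialEq)]
pub enum Route {
    FastPath,
    PlanPath,
}

/// Assumed eligibility rule for this sketch: only integer keys take the fast path.
fn is_fast_path_type(key: &Key) -> bool {
    matches!(key, Key::Int(_))
}

pub fn choose_route(pk_values: &[Key]) -> Route {
    // The length check comes first, so `pk_values[0]` is never evaluated
    // for an empty slice and cannot panic.
    if pk_values.len() == 1 && is_fast_path_type(&pk_values[0]) {
        Route::FastPath
    } else {
        Route::PlanPath
    }
}
```

With the operands in the opposite order, the index expression would run before the length test and an empty slice would panic instead of reaching the plan path.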
Probe in-memory memtables by reference (no source Arc clones, no allocation in
the common single-memtable case) instead of collect()-ing every source, and
reuse a precomputed canonical schema for the no-projection case. Removes the
per-lookup Arc-refcount and allocator contention that capped concurrent-reader
QPS scaling. Adds LsmDataSourceCollector::{find_in_memory_newest_first,
has_on_disk_sources}.

…ocksDB multi_get)

--batch-get N measures fetching N keys per call: Lance does one BTree pass +
a single vectorized Arrow take per source batch (concat to one RecordBatch);
RocksDB uses multi_get. Sync on both sides (std threads) to compare the batch
primitives. Reports keys/sec (1t + Nt) and per-batch latency.

Point lookups did an O(log n) skiplist range scan over an enum-keyed SkipMap,
~66% of single-thread lookup time (perf). Add a DashMap<ScalarValue, newest
position> sidecar to BTreeMemIndex: get_eq() answers equality in O(1), falling
back to the ordered range scan only when the newest write isn't visible yet
(concurrent newer version). The skiplist stays for range queries. probe_memtable
and the bench fast/batch paths use get_eq.

…f ScalarValue

The equality hash keyed on the large ScalarValue enum: ~190MB for 1M keys and a
wide-match compare (12.6% of single-thread, per perf). Introduce EqKey, a
compact enum keying on the raw primitive (inline i64 etc.), with a ScalarValue
fallback for exotic/null. Shrinks per-key memory and turns hash/compare into a
typed op. An index is single-column/single-type so width variants never collide.

Lever #2 keyed the equality hash on a compact primitive instead of ScalarValue,
expecting lower memory + a cheaper compare. EC2 re-profile showed neither: the
12% in the key compare is cache-miss latency loading the key from a random
DashMap bucket (1M keys don't fit in cache), not comparison arithmetic or key
size; and the +190MB is DashMap's sharded-RawTable structural overhead, not the
key payload. Single-thread stayed 0.77us, RSS 528MB. Revert for simplicity;
keep the ScalarValue-keyed hash from the prior commit (the real 2.9x win).

… the fast path

Phase 1: LsmPointLookupPlanner::lookup_many(keys, projection) resolves N keys in
one pass (probe in-memory memtables newest-first by reference, one vectorized
Arrow take per source batch, plan fallback for on-disk/uncovered keys), plus
plan_point_lookup() wrapping the result as a composable OneShotExec.

Phase 2: LsmScanner::create_plan detects point-lookup filter shapes (pk = lit,
pk IN (..)) and routes them to the direct BTree point-lookup node instead of the
union/dedup scan, with limit applied on top. Any other shape falls through
unchanged, so results are identical for non-point queries. This puts the fast
path in the real read path, not just the benchmark.
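
The filter-shape detection in Phase 2 can be sketched on a toy expression tree. This is not DataFusion's `Expr`; the enum below and the function name `point_lookup_keys` are assumptions made for illustration, and the sketch only recognizes the column-on-the-left form.

```rust
/// Toy expression tree (hypothetical, not the DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
    InList(Box<Expr>, Vec<Expr>),
    Gt(Box<Expr>, Box<Expr>),
}

/// Return the keys to probe if `filter` is `pk = lit` or `pk IN (lit, ..)`.
/// Any other shape returns None, meaning "use the union/dedup scan".
pub fn point_lookup_keys(filter: &Expr, pk: &str) -> Option<Vec<i64>> {
    match filter {
        Expr::Eq(left, right) => match (&**left, &**right) {
            (Expr::Column(c), Expr::Literal(v)) if c.as_str() == pk => Some(vec![*v]),
            _ => None,
        },
        Expr::InList(col, items) => match &**col {
            // Every list item must be a literal, otherwise collect() yields None.
            Expr::Column(c) if c.as_str() == pk => items
                .iter()
                .map(|e| match e {
                    Expr::Literal(v) => Some(*v),
                    _ => None,
                })
                .collect(),
            _ => None,
        },
        _ => None,
    }
}
```

Returning `None` for every unrecognized shape is what keeps non-point queries on the unchanged path.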

Tests: lookup_many (hits/misses/newest-dup/projection/empty), plan_point_lookup
execution, scanner routing (pk=lit / pk IN route to OneShotStream, range filter
does not), and a multi-level LSM test asserting the routed result == union path.

Under --lance-read-mode api with --batch-get N, the batch read path now calls
the production LsmPointLookupPlanner::lookup_many instead of the bench helper,
so the batch numbers reflect shipped code.

The probe refactor routed the single-row materialization through gather_rows,
which used take() (alloc + copy) instead of the original zero-copy slice,
regressing single point lookups (0.77us -> 1.41us) and the common 1-hit-per-
source-batch case. gather_rows now slices for a single row and only takes for
true multi-row gathers.

A one-element lookup_many (e.g. a routed pk IN (x)) paid the batch grouping
overhead (refs vec + hash map + pending) with nothing to amortize. Delegate
n==1 to the cheaper single lookup path.

The O(1) equality hash was added to every BTreeMemIndex, so non-PK btree indexes
(range filtering, never point-looked-up) paid the dual write + memory cost for a
hash they never use. Make it Option-al and gate it on the primary key:
btree_from_metadata detects the PK column from the schema's unenforced-primary-key
metadata and sets is_primary_key; from_configs builds the hash only then.
probe_position range-scans the ordered skiplist when no hash is present
(correct, O(log n)). This mirrors the scanner's existing PK-only hashing for
flushed-generation block-lists. Tests: PK vs non-PK hash gating, and a no-hash
range-fallback point lookup.

…t lookup

Drop the per-PK DashMap equality hash (the ~190MB memory cost) and instead make
the point lookup a single seek-and-stop on the existing ordered skiplist:
BTreeMemIndex::get_newest_visible(value, max_visible_row) does one
upper_bound(<= (value, max_visible_row)) and, if the value matches, returns the
newest visible position — no range collect, no allocation, no second structure.
This mirrors RocksDB's memtable Get (seek to the key, take the newest visible
version). Range queries still scan the skiplist. probe_position and the bench
fast/batch paths use the new seek-and-stop; the hash, EqKey, and PK-gating are
removed.
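
The seek-and-stop probe can be illustrated with an ordered set keyed by (value, position). This sketch uses `std::collections::BTreeSet` as a stand-in for the skiplist and fixes the value type to `i64`; `OrderedIndex` and its `insert` method are hypothetical, and only the `get_newest_visible` name is taken from the description above.

```rust
use std::collections::BTreeSet;

/// Ordered (value, row position) entries; a stand-in for the skiplist.
#[derive(Default)]
pub struct OrderedIndex {
    entries: BTreeSet<(i64, u64)>,
}

impl OrderedIndex {
    pub fn insert(&mut self, value: i64, position: u64) {
        self.entries.insert((value, position));
    }

    /// Newest position of `value` that is <= `max_visible_row`, if any.
    pub fn get_newest_visible(&self, value: i64, max_visible_row: u64) -> Option<u64> {
        // Seek: the greatest entry <= (value, max_visible_row).
        let (found, position) = self
            .entries
            .range(..=(value, max_visible_row))
            .next_back()?;
        // Stop: it is a hit only if the seek landed on the same value.
        (*found == value).then_some(*position)
    }
}
```

One bounded seek answers the lookup: if the entry it lands on carries a different value, no visible version of the key exists and nothing further is scanned.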
The MemTable scalar index is append-only with a single writer and many
concurrent readers, so it never needs the safe memory reclamation that
crossbeam_skiplist pays for. Profiling the point-lookup hot path showed
crossbeam's per-operation epoch pin (try_pin_loop) dominating, and worse,
contending on the global epoch across threads — the N-thread read-scaling
bottleneck.

Add a purpose-built single-writer, lock-free-read skiplist (arena_skiplist):
nodes are owned for the index generation's lifetime and freed only on drop
(when no readers remain), so reads do only Acquire loads — no epoch pin, no
fence. Mirrors RocksDB's InlineSkipList. BTreeMemIndex reads through the
lock-free reader and serializes writes behind a Mutex that upholds the
single-writer invariant (uncontended: the MemTable already serializes writes).

The seek is cache-miss bound: each tower hop loads a node's key. The prior
arena node used two heap allocations (node + a separate boxed tower), costing
two cache misses per hop and measuring ~2x slower single-thread than the
crossbeam baseline despite removing the epoch pin.

Lay each node out as a single bump-arena allocation with the key and its
forward-pointer tower contiguous ([key][AtomicPtr; height]) — RocksDB's
InlineSkipList layout — so a hop is one cache miss, and nodes are contiguous
in insertion order. Keys are dropped by walking level 0 before the arena frees
its chunks. The traversal invariant (a node is linked at level L only if its
height > L) keeps variable-length tower reads in bounds.

The seek is cache-miss bound and the residual gap to RocksDB is node size:
our OrderableScalarValue key (~48B) makes a ~72B node, vs RocksDB's raw-byte
key (~24B node), which packs better on the bottom-level walk.

Add a FixedIntBackend for Int8..Int64/UInt8..UInt64/Date32/Date64 keyed by a
compact FixedKey { order-preserving u64, position } (~24B node). The value is
not stored in the node — it is decoded from the order-preserving encoding at
flush (sign-flip for signed, identity for unsigned). Nulls are tracked in a
side vector (off the hot path) and sort first. BTreeMemIndex picks the backend
lazily by column type on first insert; non-fixed-int types keep the existing
OrderableScalarValue backend.
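
The order-preserving encoding mentioned above (sign-flip for signed, identity for unsigned) can be sketched as follows. The function names are illustrative, and this `FixedKey` only mirrors the two fields named in the description; narrower signed types are assumed to be widened to `i64` before encoding.

```rust
const SIGN_BIT: u64 = 1 << 63;

/// Signed values: flipping the sign bit makes unsigned order match signed order.
pub fn encode_i64(v: i64) -> u64 {
    (v as u64) ^ SIGN_BIT
}

/// Inverse of `encode_i64`, as would be applied when decoding at flush.
pub fn decode_i64(key: u64) -> i64 {
    (key ^ SIGN_BIT) as i64
}

/// Unsigned values are already in order: identity encoding.
pub fn encode_u64(v: u64) -> u64 {
    v
}

/// Sketch of a compact key: derived ordering compares `key`, then `position`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FixedKey {
    pub key: u64,
    pub position: u64,
}
```

Since the encoding is reversible, the node does not need to carry the original value next to the key.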
Extend the small-node design to byte-typed columns (Utf8, LargeUtf8, Binary,
LargeBinary, FixedSizeBinary — UUIDs are FixedSizeBinary(16)). The skiplist key
is a BytesKey whose bytes live inline in the node for small values (UUID, short
string PKs) via InlineBytes, spilling to the heap only for long keys — so the
common cases get the single-allocation, small-node win and long keys still work.

Lexicographic byte order is the natural order for these types, so the seek is a
memcmp and the value is decoded from the key bytes at flush (from_utf8 for
strings, raw bytes for binary). Nulls are tracked in a side vector and sort
first. BTreeMemIndex routes byte types to this backend; exotic types still use
the OrderableScalarValue fallback.
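
A minimal sketch of the inline-or-heap byte key idea follows. The 22-byte inline capacity and the enum layout are assumptions chosen for illustration; the PR does not state the real threshold or representation of `InlineBytes`.

```rust
use std::cmp::Ordering;

/// Assumed inline capacity; large enough for a 16-byte UUID.
const INLINE_CAP: usize = 22;

pub enum InlineBytes {
    /// Short keys live inside the value itself.
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// Long keys spill to a heap allocation.
    Heap(Box<[u8]>),
}

impl InlineBytes {
    pub fn new(bytes: &[u8]) -> Self {
        if bytes.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..bytes.len()].copy_from_slice(bytes);
            InlineBytes::Inline { len: bytes.len() as u8, buf }
        } else {
            InlineBytes::Heap(bytes.into())
        }
    }

    pub fn as_slice(&self) -> &[u8] {
        match self {
            InlineBytes::Inline { len, buf } => &buf[..*len as usize],
            InlineBytes::Heap(b) => &b[..],
        }
    }

    pub fn is_inline(&self) -> bool {
        matches!(self, InlineBytes::Inline { .. })
    }
}

// Ordering is plain lexicographic byte order, regardless of storage.
impl PartialEq for InlineBytes {
    fn eq(&self, other: &Self) -> bool {
        self.as_slice() == other.as_slice()
    }
}
impl Eq for InlineBytes {}
impl PartialOrd for InlineBytes {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for InlineBytes {
    fn cmp(&self, other: &Self) -> Ordering {
        self.as_slice().cmp(other.as_slice())
    }
}
```

Comparison always goes through `as_slice`, so an inline key and a heap key sort consistently against each other, and a string value can be recovered from the bytes with `std::str::from_utf8`.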
Adds a UUID key type (FixedSizeBinary(16)) alongside the existing i64 key, so
the byte/BytesKey backend can be benchmarked head-to-head with RocksDB. Logical
keys stay i64; a deterministic 16-byte derivation (scrambled high half for
UUID-like scatter, key in the low half for injectivity) maps each to the key
column and the RocksDB key. Key construction is centralized in key_scalar /
rocks_key so both engines and all read paths share one mapping.
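
One way such a derivation could look is sketched below. The scrambler here is the SplitMix64 finalizer, which is an assumption: the PR does not say which mixing function the benchmark uses, and `uuid_key` / `logical_key` are illustrative names rather than the benchmark's `key_scalar` / `rocks_key`.

```rust
/// SplitMix64 finalizer, used here as an assumed scrambler.
fn scramble(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Derive a 16-byte key from a logical i64 key.
/// High half: scrambled bits, so keys scatter like UUIDs.
/// Low half: the logical key itself, so the mapping is injective.
pub fn uuid_key(logical: i64) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[..8].copy_from_slice(&scramble(logical as u64).to_be_bytes());
    out[8..].copy_from_slice(&logical.to_be_bytes());
    out
}

/// Recover the logical key from the low half.
pub fn logical_key(key: &[u8; 16]) -> i64 {
    let mut low = [0u8; 8];
    low.copy_from_slice(&key[8..]);
    i64::from_be_bytes(low)
}
```

Keeping the logical key verbatim in the low half guarantees distinct inputs never collide, whatever the quality of the scrambler.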
Profiling the 32-thread point-lookup read showed the scaling bottleneck is not
a lock but per-lookup clones of shared schema Arcs: cloning/dropping a shared
Arc is an atomic on one cache line that ping-pongs across cores under load.

gather_rows called RecordBatch::schema() (which clones the shared schema Arc)
once per output field; switch to schema_ref() (a borrow) hoisted out of the
loop. lookup cloned none_target for the common no-projection case; borrow it
instead and clone only once when building the result batch. This removes ~3 of
the ~4 shared-schema-Arc clones per lookup.
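
The difference between cloning and borrowing the shared schema can be shown with plain `std::sync::Arc`. In this sketch `Schema` and `Batch` are simplified stand-ins for the Arrow types, and the two `field_names_*` functions are hypothetical; only the `schema()` versus `schema_ref()` distinction comes from the description above.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow schema.
pub struct Schema {
    pub fields: Vec<String>,
}

/// Simplified stand-in for a record batch that shares its schema.
pub struct Batch {
    schema: Arc<Schema>,
}

impl Batch {
    pub fn new(schema: Arc<Schema>) -> Self {
        Batch { schema }
    }

    /// Clones the Arc: an atomic increment now and a decrement on drop,
    /// both hitting the refcount shared by every reader thread.
    pub fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }

    /// Borrows the Arc: no atomic traffic at all.
    pub fn schema_ref(&self) -> &Arc<Schema> {
        &self.schema
    }
}

/// Before: one Arc clone per field, plus one for the length.
pub fn field_names_cloning(batch: &Batch) -> Vec<String> {
    (0..batch.schema().fields.len())
        .map(|i| batch.schema().fields[i].clone())
        .collect()
}

/// After: a single borrow hoisted out of the loop.
pub fn field_names_borrowing(batch: &Batch) -> Vec<String> {
    let schema = batch.schema_ref();
    schema.fields.iter().cloned().collect()
}
```

Both functions return the same names; the borrowing version simply never touches the shared reference count.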
…ature)

Adds an opt-in scalable global allocator to A/B per-lookup alloc/free contention
in the N-thread read path. Off by default; enable with --features bench-mimalloc.

…lice

Experiment: a zero-copy slice shares the source column's buffer Arc, whose
refcount churns under concurrency; take copies into owned buffers with no
shared-buffer refcount. Measuring the single-thread vs N-thread tradeoff.

…o-copy slice"

Copy-vs-slice measured a clear loss: take regressed single-thread point lookups
~30-38% (losing the win vs RocksDB) with no N-thread gain (int got worse, uuid
+8%). This proves the sliced-buffer Arc is not the N-thread bottleneck — the
materialization is non-contended CPU. Restore the zero-copy single-row slice.

Also drop the now-unused bench-mimalloc allocator option: mimalloc gave no
N-thread change vs glibc, confirming allocation is not the contended path.

…kup docs

The O(1) equality hash index was replaced by the seek-and-stop skiplist in
27c7bd2 (EqKey, get_eq, DashMap, and PK-gating all removed). Two doc comments
still described the deleted hash; update them to the seek-and-stop reality. No
code change.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Public BTreeMemIndex doc linked to private Backend/FixedKey, failing rustdoc
-D warnings (private_intra_doc_links).

…eatures CI

CI builds with every package feature (ALL_FEATURES), which pulled bench-rocksdb
and tried to compile bundled librocksdb C++ — the runners have no libclang, so
all full-feature jobs failed. Comment out the mem_wal_kv_point_lookup [[bench]]
target plus the rocksdb optional dependency and bench-rocksdb feature, and prune
them from Cargo.lock. The benchmark source is kept; re-enable by uncommenting the
three Cargo.toml blocks.

…sDB SST)

Adds a flushed/on-disk read benchmark to mem_wal_kv_point_lookup: the Lance arm
writes all rows as one Lance dataset with an on-disk BTree index (the artifact a
MemTable flush emits) and point-looks-up via the indexed scan; the RocksDB arm
flushes its memtable to a single SST (compaction off, bloom + block cache) and
Gets. Bench remains disabled as a CI build target.

Adds --lance-read-mode fast for the flushed benchmark: search the on-disk BTree
scalar index directly (ScalarIndex::search) + take the row, bypassing DataFusion
plan construction. Diagnostic to isolate how much of the flushed point-lookup
cost is the per-lookup DataFusion plan vs the index+take itself. plan mode keeps
the scan().filter() path. Bench stays disabled as a CI build target.

Adds --cold for the flushed benchmark: assumes a larger-than-RAM dataset so
reads miss the caches and hit NVMe. Caps the RocksDB write buffer (avoids OOM on
big datasets), flushes + compacts to one SST, uses a small (128MB) block cache
so data blocks miss, and drops the OS page cache before the read phase (both
engines, via passwordless sudo). Bench stays disabled as a CI build target.

… RocksDB

Splits rows into generations+1 parts: the first N are sealed+flushed to on-disk
generations (Lance, force_seal_active) / separate L0 SSTs (RocksDB, flush per
chunk, compaction off), and the last stays in the active MemTable. Lookups
(api mode) traverse active -> gens newest-first, matching RocksDB memtable -> L0
SSTs. Combine with --cold to read the gens cold from NVMe. Bench stays disabled
as a CI build target.

flushed_batch gathers a chunk of keys: direct mode searches the on-disk BTree
index per key then issues one take_rows; plan mode runs one DataFusion scan with
'id IN (...)'. Lets --batch-get measure single-flushed batch lookups (RocksDB's
single-SST multi_get path already exists). Bench stays disabled in CI.