perf(mem_wal): faster MemTable point lookups (no-epoch skiplist + compact keys) #7011
Open
touch-of-grey wants to merge 33 commits into
Unified bench (mem_wal_kv_point_lookup) times the Lance MemTable point-lookup path and RocksDB in one process with identical keys/values/queries: write throughput, read latency/QPS (1t + Nt), CPU and peak RSS. RocksDB arm gated behind the optional bench-rocksdb feature. run_kv_compare.sh sweeps 100k/500k/1M.
--lance-read-mode fast resolves point lookups by probing the active MemTable BTree index and slicing one row from the BatchStore, bypassing DataFusion plan construction. Measures the lower bound for the current MemTable index vs the plan-based path.
LsmPointLookupPlanner::lookup() resolves single-column point lookups by probing the in-memory memtables' BTree index directly (newest generation first) and slicing the row from the BatchStore, bypassing per-lookup DataFusion plan construction. Falls back to the plan path for on-disk sources (flushed generations, base table), missing BTrees, multi-column keys, or projections with system columns. Benchmark shows this closes the ~300x point-lookup read gap vs RocksDB to within ~2x. Adds bench --lance-read-mode api to measure it.
- Fall back to the coercing plan path when the key's scalar type doesn't exactly match the PK column type (avoids OrderableScalarValue panic).
- Probe in O(log batches) via StoredBatch.row_offset instead of allocating the full visible-batch vector each lookup.
- Bench: create the RocksDB parent dir; drop the Lance writer (no blocking Drop) instead of leaking it, so --engine both does not inflate RocksDB RSS.
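The O(log batches) probe can be pictured as a binary search over per-batch starting offsets. The sketch below is illustrative only: the `StoredBatch` shown here is a hypothetical stand-in with just two fields, `locate` is an invented name, and it assumes batches are sorted by `row_offset` and contiguous.

```rust
/// Hypothetical stand-in for a stored batch: only the fields the probe needs.
struct StoredBatch {
    /// Global position of this batch's first row.
    row_offset: usize,
    /// Number of rows held by this batch.
    num_rows: usize,
}

/// Map a global row position to (batch index, row within batch).
/// Assumes `batches` is sorted by `row_offset` (an assumption of this sketch).
fn locate(batches: &[StoredBatch], position: usize) -> Option<(usize, usize)> {
    // Index of the first batch that starts after `position`;
    // the batch just before it is the only one that can contain the row.
    let idx = batches.partition_point(|b| b.row_offset <= position);
    let batch_idx = idx.checked_sub(1)?;
    let batch = &batches[batch_idx];
    let local = position - batch.row_offset;
    (local < batch.num_rows).then_some((batch_idx, local))
}
```

No intermediate vector of visible batches is built; the search touches O(log n) batch headers.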
The fast-path eligibility check indexed pk_values[0] before the length was verified, panicking on lookup(&[]). Reorder so the length check short-circuits first; an empty slice now falls through to the plan path's invalid_input error. Regression test added. (Claude cross-review P1.)
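A minimal sketch of the reordered check, using a hypothetical `Key` enum in place of the real scalar type and an invented `fast_path_eligible` helper. The point is only that the length test sits on the left of `&&`, so the index expression is never evaluated for an empty slice.

```rust
/// Hypothetical key type standing in for the real scalar value.
enum Key {
    Int(i64),
    Text(String),
}

/// Eligible only for exactly one key of the expected type.
/// The length check comes first, so `&&` short-circuits before `pk_values[0]`
/// is evaluated and an empty slice can no longer panic.
fn fast_path_eligible(pk_values: &[Key]) -> bool {
    pk_values.len() == 1 && matches!(pk_values[0], Key::Int(_))
}
```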
Probe in-memory memtables by reference (no source Arc clones, no allocation in
the common single-memtable case) instead of collect()-ing every source, and
reuse a precomputed canonical schema for the no-projection case. Removes the
per-lookup Arc-refcount and allocator contention that capped concurrent-reader
QPS scaling. Adds LsmDataSourceCollector::{find_in_memory_newest_first,
has_on_disk_sources}.
…ocksDB multi_get)
--batch-get N measures fetching N keys per call: Lance does one BTree pass + a single vectorized Arrow take per source batch (concat to one RecordBatch); RocksDB uses multi_get. Sync on both sides (std threads) to compare the batch primitives. Reports keys/sec (1t + Nt) and per-batch latency.
Point lookups did an O(log n) skiplist range scan over an enum-keyed SkipMap — ~66% of single-thread lookup time (perf). Add a DashMap<ScalarValue, newest position> sidecar to BTreeMemIndex: get_eq() answers equality in O(1), falling back to the ordered range scan only when the newest write isn't visible yet (concurrent newer version). The skiplist stays for range queries. probe_memtable and the bench fast/batch paths use get_eq.
…f ScalarValue
The equality hash keyed on the large ScalarValue enum: ~190MB for 1M keys and a wide-match compare (12.6% of single-thread, per perf). Introduce EqKey, a compact enum keying on the raw primitive (inline i64 etc.), with a ScalarValue fallback for exotic/null. Shrinks per-key memory and turns hash/compare into a typed op. An index is single-column/single-type so width variants never collide.
Lever #2 keyed the equality hash on a compact primitive instead of ScalarValue, expecting lower memory + a cheaper compare. EC2 re-profile showed neither: the 12% in the key compare is cache-miss latency loading the key from a random DashMap bucket (1M keys don't fit in cache), not comparison arithmetic or key size; and the +190MB is DashMap's sharded-RawTable structural overhead, not the key payload. Single-thread stayed 0.77us, RSS 528MB. Revert for simplicity; keep the ScalarValue-keyed hash from the prior commit (the real 2.9x win).
… the fast path
Phase 1: LsmPointLookupPlanner::lookup_many(keys, projection) resolves N keys in one pass (probe in-memory memtables newest-first by reference, one vectorized Arrow take per source batch, plan fallback for on-disk/uncovered keys), plus plan_point_lookup() wrapping the result as a composable OneShotExec.
Phase 2: LsmScanner::create_plan detects point-lookup filter shapes (pk = lit, pk IN (..)) and routes them to the direct BTree point-lookup node instead of the union/dedup scan, with limit applied on top. Any other shape falls through unchanged, so results are identical for non-point queries. This puts the fast path in the real read path, not just the benchmark.
Tests: lookup_many (hits/misses/newest-dup/projection/empty), plan_point_lookup execution, scanner routing (pk=lit / pk IN route to OneShotStream, range filter does not), and a multi-level LSM test asserting the routed result == union path.
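The Phase 2 routing decision amounts to a pattern match on the filter shape. The following is a toy sketch, not the scanner's code: `Expr` is a hypothetical three-variant filter type (the real planner inspects DataFusion expressions), and `point_lookup_keys` is an invented helper that returns the keys to probe, or `None` when the query should fall through to the union/dedup scan.

```rust
/// Hypothetical, drastically simplified filter expression.
enum Expr {
    Eq { column: String, value: i64 },
    InList { column: String, values: Vec<i64> },
    Gt { column: String, value: i64 },
}

/// Returns the keys to probe when the filter is `pk = lit` or `pk IN (..)`,
/// and `None` for every other shape (which keeps using the existing scan).
fn point_lookup_keys(filter: &Expr, pk: &str) -> Option<Vec<i64>> {
    match filter {
        Expr::Eq { column, value } if column.as_str() == pk => Some(vec![*value]),
        Expr::InList { column, values } if column.as_str() == pk => Some(values.clone()),
        _ => None,
    }
}
```

Because unmatched shapes return `None`, non-point queries take exactly the path they took before.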
Under --lance-read-mode api with --batch-get N, the batch read path now calls the production LsmPointLookupPlanner::lookup_many instead of the bench helper, so the batch numbers reflect shipped code.
The probe refactor routed the single-row materialization through gather_rows, which used take() (alloc + copy) instead of the original zero-copy slice, regressing single point lookups (0.77us -> 1.41us) and the common 1-hit-per- source-batch case. gather_rows now slices for a single row and only takes for true multi-row gathers.
A one-element lookup_many (e.g. a routed pk IN (x)) paid the batch grouping overhead (refs vec + hash map + pending) with nothing to amortize. Delegate n==1 to the cheaper single lookup path.
The O(1) equality hash was added to every BTreeMemIndex, so non-PK btree indexes (range filtering, never point-looked-up) paid the dual write + memory cost for a hash they never use. Make it Option-al and gate it on the primary key: btree_from_metadata detects the PK column from the schema's unenforced-primary-key metadata and sets is_primary_key; from_configs builds the hash only then. probe_position range-scans the ordered skiplist when no hash is present (correct, O(log n)). This mirrors the scanner's existing PK-only hashing for flushed-generation block-lists. Tests: PK vs non-PK hash gating, and a no-hash range-fallback point lookup.
…t lookup
Drop the per-PK DashMap equality hash (the ~190MB memory cost) and instead make the point lookup a single seek-and-stop on the existing ordered skiplist: BTreeMemIndex::get_newest_visible(value, max_visible_row) does one upper_bound(<= (value, max_visible_row)) and, if the value matches, returns the newest visible position, with no range collect, no allocation, and no second structure. This mirrors RocksDB's memtable Get (seek to the key, take the newest visible version). Range queries still scan the skiplist. probe_position and the bench fast/batch paths use the new seek-and-stop; the hash, EqKey, and PK-gating are removed.
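The seek-and-stop idea can be shown on any ordered set of `(value, position)` pairs. This sketch substitutes a `BTreeSet` from the standard library for the skiplist and uses `i64` keys; only the function name and its two parameters come from the commit message, the rest is assumed for illustration.

```rust
use std::collections::BTreeSet;

/// Entries are ordered by value, then by row position, so the newest visible
/// version of `value` is the greatest entry <= (value, max_visible_row).
fn get_newest_visible(
    index: &BTreeSet<(i64, u64)>,
    value: i64,
    max_visible_row: u64,
) -> Option<u64> {
    // One seek: the last entry at or below the upper bound.
    let &(found, position) = index.range(..=(value, max_visible_row)).next_back()?;
    // Stop: it is a hit only if the seek landed on the requested value.
    (found == value).then_some(position)
}
```

If the seek lands on a smaller value, the key is absent or none of its versions is visible yet, and the lookup returns `None` without scanning further.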
The MemTable scalar index is append-only with a single writer and many concurrent readers, so it never needs the safe memory reclamation that crossbeam_skiplist pays for. Profiling the point-lookup hot path showed crossbeam's per-operation epoch pin (try_pin_loop) dominating, and worse, contending on the global epoch across threads — the N-thread read-scaling bottleneck. Add a purpose-built single-writer, lock-free-read skiplist (arena_skiplist): nodes are owned for the index generation's lifetime and freed only on drop (when no readers remain), so reads do only Acquire loads — no epoch pin, no fence. Mirrors RocksDB's InlineSkipList. BTreeMemIndex reads through the lock-free reader and serializes writes behind a Mutex that upholds the single-writer invariant (uncontended: the MemTable already serializes writes).
The seek is cache-miss bound: each tower hop loads a node's key. The prior arena node used two heap allocations (node + a separate boxed tower), costing two cache misses per hop and measuring ~2x slower single-thread than the crossbeam baseline despite removing the epoch pin. Lay each node out as a single bump-arena allocation with the key and its forward-pointer tower contiguous ([key][AtomicPtr; height]) — RocksDB's InlineSkipList layout — so a hop is one cache miss, and nodes are contiguous in insertion order. Keys are dropped by walking level 0 before the arena frees its chunks. The traversal invariant (a node is linked at level L only if its height > L) keeps variable-length tower reads in bounds.
The seek is cache-miss bound and the residual gap to RocksDB is node size:
our OrderableScalarValue key (~48B) makes a ~72B node, vs RocksDB's raw-byte
key (~24B node), which packs better on the bottom-level walk.
Add a FixedIntBackend for Int8..Int64/UInt8..UInt64/Date32/Date64 keyed by a
compact FixedKey { order-preserving u64, position } (~24B node). The value is
not stored in the node — it is decoded from the order-preserving encoding at
flush (sign-flip for signed, identity for unsigned). Nulls are tracked in a
side vector (off the hot path) and sort first. BTreeMemIndex picks the backend
lazily by column type on first insert; non-fixed-int types keep the existing
OrderableScalarValue backend.
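The order-preserving encoding for signed integers is a sign-bit flip, which maps `i64::MIN..=i64::MAX` onto `0..=u64::MAX` monotonically. A minimal sketch follows; the `FixedKey` field names and the helper names are assumptions, and narrower types such as `Int8` are assumed to be widened to `i64` before encoding.

```rust
const SIGN_BIT: u64 = 1 << 63;

/// Flip the sign bit so unsigned comparison matches signed order.
fn encode_i64(v: i64) -> u64 {
    (v as u64) ^ SIGN_BIT
}

/// Inverse of `encode_i64`, as would be applied when decoding at flush.
fn decode_i64(k: u64) -> i64 {
    (k ^ SIGN_BIT) as i64
}

/// Hypothetical compact key: derived `Ord` compares `encoded` first,
/// then `position`, so versions of the same value sort by row position.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct FixedKey {
    encoded: u64,
    position: u64,
}
```

Unsigned types need no transformation, since their natural order already matches `u64` order.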
Extend the small-node design to byte-typed columns (Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary — UUIDs are FixedSizeBinary(16)). The skiplist key is a BytesKey whose bytes live inline in the node for small values (UUID, short string PKs) via InlineBytes, spilling to the heap only for long keys — so the common cases get the single-allocation, small-node win and long keys still work. Lexicographic byte order is the natural order for these types, so the seek is a memcmp and the value is decoded from the key bytes at flush (from_utf8 for strings, raw bytes for binary). Nulls are tracked in a side vector and sort first. BTreeMemIndex routes byte types to this backend; exotic types still use the OrderableScalarValue fallback.
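A rough sketch of the inline-or-spill idea. The name `InlineBytes` comes from the commit, but the layout below, the 22-byte `INLINE_CAP`, and the method names are assumptions made for illustration; the actual type may differ.

```rust
use std::cmp::Ordering;

/// Hypothetical inline capacity: large enough for a 16-byte UUID.
const INLINE_CAP: usize = 22;

enum InlineBytes {
    /// Short keys live directly in the value (and therefore in the node).
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// Long keys spill to a separate heap allocation.
    Heap(Box<[u8]>),
}

impl InlineBytes {
    fn new(bytes: &[u8]) -> Self {
        if bytes.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..bytes.len()].copy_from_slice(bytes);
            InlineBytes::Inline { len: bytes.len() as u8, buf }
        } else {
            InlineBytes::Heap(bytes.into())
        }
    }

    fn as_slice(&self) -> &[u8] {
        match self {
            InlineBytes::Inline { len, buf } => &buf[..*len as usize],
            InlineBytes::Heap(b) => &b[..],
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, InlineBytes::Inline { .. })
    }
}

// Ordering is lexicographic over the bytes, regardless of where they live.
impl Ord for InlineBytes {
    fn cmp(&self, other: &Self) -> Ordering {
        self.as_slice().cmp(other.as_slice())
    }
}
impl PartialOrd for InlineBytes {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for InlineBytes {
    fn eq(&self, other: &Self) -> bool {
        self.as_slice() == other.as_slice()
    }
}
impl Eq for InlineBytes {}
```

Comparing through `as_slice` means an inline key and a spilled key still compare correctly against each other.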
Adds a UUID key type (FixedSizeBinary(16)) alongside the existing i64 key, so the byte/BytesKey backend can be benchmarked head-to-head with RocksDB. Logical keys stay i64; a deterministic 16-byte derivation (scrambled high half for UUID-like scatter, key in the low half for injectivity) maps each to the key column and the RocksDB key. Key construction is centralized in key_scalar / rocks_key so both engines and all read paths share one mapping.
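One way to build such a derivation is shown below. It is a sketch under assumptions: the scrambler is the splitmix64 finalizer, chosen here purely for illustration (the commit does not say which mixer the bench uses), and `uuid_key` / `logical_of` are invented names.

```rust
/// splitmix64 finalizer, used here as an example scrambler.
fn mix64(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
    x ^ (x >> 31)
}

/// Deterministic 16-byte key: scrambled high half for scatter,
/// the logical key verbatim in the low half so the mapping is injective.
fn uuid_key(logical: i64) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[..8].copy_from_slice(&mix64(logical as u64).to_be_bytes());
    out[8..].copy_from_slice(&logical.to_be_bytes());
    out
}

/// Recover the logical key from the low half.
fn logical_of(key: &[u8; 16]) -> i64 {
    let mut low = [0u8; 8];
    low.copy_from_slice(&key[8..]);
    i64::from_be_bytes(low)
}
```

Because the low half carries the logical key unchanged, two different logical keys can never map to the same 16 bytes, whatever the scrambler does.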
Profiling the 32-thread point-lookup read showed the scaling bottleneck is not a lock but per-lookup clones of shared schema Arcs: cloning/dropping a shared Arc is an atomic on one cache line that ping-pongs across cores under load. gather_rows called RecordBatch::schema() (which clones the shared schema Arc) once per output field; switch to schema_ref() (a borrow) hoisted out of the loop. lookup cloned none_target for the common no-projection case; borrow it instead and clone only once when building the result batch. This removes ~3 of the ~4 shared-schema-Arc clones per lookup.
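The difference between the two accessors can be modelled with plain `std::sync::Arc`. In this sketch `Schema` and `Batch` are hypothetical stand-ins for the Arrow types, and the two `field_names_*` functions are invented to contrast the before and after shapes of the loop.

```rust
use std::sync::Arc;

struct Schema {
    fields: Vec<String>,
}

struct Batch {
    schema: Arc<Schema>,
}

impl Batch {
    /// Owned handle: one atomic increment now, one atomic decrement on drop.
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }

    /// Borrow: no refcount traffic at all.
    fn schema_ref(&self) -> &Arc<Schema> {
        &self.schema
    }
}

/// Before: the shared Arc is cloned and dropped once per output field
/// (plus once for the length).
fn field_names_cloning(batch: &Batch) -> Vec<String> {
    let n = batch.schema().fields.len();
    (0..n).map(|i| batch.schema().fields[i].clone()).collect()
}

/// After: borrow once, hoisted out of the loop.
fn field_names_borrowing(batch: &Batch) -> Vec<String> {
    let schema = batch.schema_ref();
    schema.fields.iter().cloned().collect()
}
```

Both functions return the same result; only the second avoids touching the shared refcount, which is the cache line that bounces between cores when many readers run at once.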
…ature)
Adds an opt-in scalable global allocator to A/B per-lookup alloc/free contention in the N-thread read path. Off by default; enable with --features bench-mimalloc.
…lice
Experiment: a zero-copy slice shares the source column's buffer Arc, whose refcount churns under concurrency; take copies into owned buffers with no shared-buffer refcount. Measuring the single-thread vs N-thread tradeoff.
…o-copy slice"
Copy-vs-slice measured a clear loss: take regressed single-thread point lookups ~30-38% (losing the win vs RocksDB) with no N-thread gain (int got worse, uuid +8%). This proves the sliced-buffer Arc is not the N-thread bottleneck; the materialization is non-contended CPU. Restore the zero-copy single-row slice. Also drop the now-unused bench-mimalloc allocator option: mimalloc gave no N-thread change vs glibc, confirming allocation is not the contended path.
…kup docs
The O(1) equality hash index was replaced by the seek-and-stop skiplist in 27c7bd2 (EqKey, get_eq, DashMap, and PK-gating all removed). Two doc comments still described the deleted hash; update them to the seek-and-stop reality. No code change.
Public BTreeMemIndex doc linked to private Backend/FixedKey, failing rustdoc -D warnings (private_intra_doc_links).
…eatures CI
CI builds with every package feature (ALL_FEATURES), which pulled bench-rocksdb and tried to compile bundled librocksdb C++; the runners have no libclang, so all full-feature jobs failed. Comment out the mem_wal_kv_point_lookup [[bench]] target plus the rocksdb optional dependency and bench-rocksdb feature, and prune them from Cargo.lock. The benchmark source is kept; re-enable by uncommenting the three Cargo.toml blocks.
…sDB SST)
Adds a flushed/on-disk read benchmark to mem_wal_kv_point_lookup: the Lance arm writes all rows as one Lance dataset with an on-disk BTree index (the artifact a MemTable flush emits) and point-looks-up via the indexed scan; the RocksDB arm flushes its memtable to a single SST (compaction off, bloom + block cache) and Gets. Bench remains disabled as a CI build target.
Adds --lance-read-mode fast for the flushed benchmark: search the on-disk BTree scalar index directly (ScalarIndex::search) + take the row, bypassing DataFusion plan construction. Diagnostic to isolate how much of the flushed point-lookup cost is the per-lookup DataFusion plan vs the index+take itself. plan mode keeps the scan().filter() path. Bench stays disabled as a CI build target.
Adds --cold for the flushed benchmark: assumes a larger-than-RAM dataset so reads miss the caches and hit NVMe. Caps the RocksDB write buffer (avoids OOM on big datasets), flushes + compacts to one SST, uses a small (128MB) block cache so data blocks miss, and drops the OS page cache before the read phase (both engines, via passwordless sudo). Bench stays disabled as a CI build target.
… RocksDB
Splits rows into generations+1 parts: the first N are sealed+flushed to on-disk generations (Lance, force_seal_active) / separate L0 SSTs (RocksDB, flush per chunk, compaction off), and the last stays in the active MemTable. Lookups (api mode) traverse active -> gens newest-first, matching RocksDB memtable -> L0 SSTs. Combine with --cold to read the gens cold from NVMe. Bench stays disabled as a CI build target.
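The active -> gens newest-first traversal can be sketched with plain hash maps standing in for the MemTable and the flushed generations. Everything here is illustrative: `lookup_newest_first` is an invented name and the generations are assumed to be stored oldest first.

```rust
use std::collections::HashMap;

/// Probe the active table first, then flushed generations from newest to
/// oldest, returning the first hit so newer versions shadow older ones.
/// `gens_oldest_first` is assumed to be ordered oldest -> newest.
fn lookup_newest_first<'a>(
    active: &'a HashMap<i64, String>,
    gens_oldest_first: &'a [HashMap<i64, String>],
    key: i64,
) -> Option<&'a String> {
    active.get(&key).or_else(|| {
        gens_oldest_first
            .iter()
            .rev()
            .find_map(|generation| generation.get(&key))
    })
}
```

The traversal stops at the first source that contains the key, so a lookup only reaches older generations when the newer ones miss.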
flushed_batch gathers a chunk of keys: direct mode searches the on-disk BTree index per key then issues one take_rows; plan mode runs one DataFusion scan with 'id IN (...)'. Lets --batch-get measure single-flushed batch lookups (RocksDB's single-SST multi_get path already exists). Bench stays disabled in CI.
Summary
Speeds up MemWAL MemTable primary-key point lookups to match/beat RocksDB on single-thread latency and write throughput, and roughly doubles N-thread read scaling.
Index data structure
- […] `InlineSkipList` style) for cache locality.
- `FixedKey` (order-preserving `u64`) for `Int8..Int64`/`UInt8..UInt64`/`Date32`/`Date64`; `BytesKey`/`InlineBytes` (bytes inline for small values such as a 16-byte UUID and short string PKs) for `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`/`FixedSizeBinary`. Other scalar types fall back to the original `OrderableScalarValue` key.

Point-lookup read path
- […] `get_newest_visible`) honoring the MVCC visibility watermark.
- `LsmPointLookupPlanner::lookup` / `lookup_many` (batch) plus scanner routing: `pk = lit` / `pk IN (..)` filters route to the fast path; everything else uses the existing union/dedup scan.
- […] `Arc`s per lookup (borrow via `schema_ref`, borrow `none_target`); that refcount cache line was ping-ponging across cores under concurrency.

Benchmark
- `mem_wal_kv_point_lookup`: head-to-head Lance MemTable vs RocksDB, identical key/value/query sets. RocksDB arm is behind the off-by-default `bench-rocksdb` feature; `--key-type int|uuid`.

Results (1M rows, c7i.8xlarge)
- […] `Arc` fix). The residual is structural (seek memory bandwidth + Lance's separate columnar value gather), not a contended lock.