perf(mem_wal): match hnswlib throughput via runtime AVX-512 f32 distance #7009
Merged
jackye1995 merged 2 commits on May 30, 2026
The shipped binary targets target-cpu=haswell, so the autovectorized f32 L2/dot in lance-linalg only ever emit AVX2 even on AVX-512 CPUs, while a -march=native HNSW competitor uses AVX-512. Add runtime-dispatched l2_f32/dot_f32 (target_feature avx512f 16-wide kernels gated by SIMD_SUPPORT, AVX2 fallback via the existing autovectorized path) and route the in-memory MemWAL HNSW distance through them. Brings the MemWAL HNSW to parity with hnswlib on insert and search on AVX-512 hardware, with comparable recall and ~44% lower memory, keeping the AVX2 path for other CPUs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…L HNSW Add a parity-suite driver (Lance HNSW primitive vs hnswlib across 100k/500k/1M, capturing throughput and peak RSS) and a --query-repeats option so the query phase runs long enough to measure reliably. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jackye1995 ping for review. Summary: the MemWAL HNSW was never algorithmically slower than hnswlib (equal-or-better recall, ~44% less memory); the only gap was that the shipped …
jackye1995 pushed a commit that referenced this pull request on May 31, 2026
…nch (#7010)

## Summary

Adds **sustained-duration measurement** to the MemWAL HNSW parity bench so it reports steady-state throughput under continuous load rather than a short burst. This follows up the AVX-512 distance work in #7009.

Changes (bench-only, no library changes):

- `--insert-seconds` / `--query-seconds`: run the write (graph build) and read (query) workloads in a loop for a fixed wall-clock duration; report aggregate throughput over all passes (`insert_passes` / `query_passes`).
- `insert_core` breakdown: times the insertion itself separately from per-build graph allocation + teardown.
- Both knobs added to the Lance bench and the hnswlib reference bench; `run_parity_suite.sh` gains `INSERT_SECONDS` / `QUERY_SECONDS`.

Motivation: a sub-second query window gave noisy/optimistic numbers and hid AVX-512 frequency throttling. Measuring 30 s of continuous load makes read/write parity (and where it doesn't hold) reproducible.

## Latest perf results (merged main, c7i.12xlarge, 48 threads, dim=1024, m=12, ef=64, k=10)

Sustained 30 s read + 30 s write per size; AVX-512 throttles 3.78 GHz → ~2.5 GHz under all-core load (affects both impls).

Read (`query_qps`), Lance / hnswlib:

| rows | ratio |
|------|------|
| 100k | 1.01 |
| 500k | 0.995 |
| 1M | 0.996 |

Write, insertion compute only (`insert_core`), Lance / hnswlib:

| rows | ratio |
|------|------|
| 100k | 0.99 |
| 500k | 0.98 |
| 1M | 0.96 |

Write, end-to-end incl. per-build graph alloc + teardown:

| rows | ratio |
|------|------|
| 100k | 0.96 |
| 500k | 0.89 |
| 1M | 0.87 |

Takeaways the improved bench makes visible:

- **Read is at parity** under sustained throttled load (confirms #7009 holds; the burst window wasn't hiding a regression).
- **Insertion compute is at parity**: AVX-512 distance keeps pace even while downclocked.
- The end-to-end write gap at scale is **entirely graph allocation/teardown** (Lance's per-node `Vec`/`Mutex`/`Arc` vs hnswlib's flat arrays), not the algorithm, and it's allocator-sensitive: with mimalloc/jemalloc as the global allocator Lance is actually faster than hnswlib (≈1.08–1.25×). No in-tree change is warranted; using a modern allocator for the memtable workload closes it.

cc @jackye1995, please review.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
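The fixed wall-clock loop behind `--insert-seconds` / `--query-seconds` can be pictured roughly as below. This is only a sketch of the idea, not the bench code: `run_sustained` and `Sustained` are hypothetical names, and the closure is assumed to run one full pass of the workload and return how many operations it performed.

```rust
use std::time::{Duration, Instant};

/// Aggregate result of repeating a workload for a fixed wall-clock budget.
/// (Hypothetical type, for illustration only.)
pub struct Sustained {
    pub passes: u64,
    pub ops: u64,
    pub elapsed: Duration,
}

impl Sustained {
    /// Operations per second over all passes, not per pass.
    pub fn throughput(&self) -> f64 {
        self.ops as f64 / self.elapsed.as_secs_f64()
    }
}

/// Runs `pass` repeatedly until `budget` has elapsed. At least one pass
/// always runs, and a pass in flight when the budget expires is allowed
/// to finish, so `elapsed` can exceed `budget` by up to one pass.
pub fn run_sustained<F: FnMut() -> u64>(budget: Duration, mut pass: F) -> Sustained {
    let start = Instant::now();
    let (mut passes, mut ops) = (0u64, 0u64);
    loop {
        ops += pass();
        passes += 1;
        if start.elapsed() >= budget {
            break;
        }
    }
    Sustained { passes, ops, elapsed: start.elapsed() }
}
```

Dividing total operations by total elapsed time, rather than averaging per-pass rates, is what keeps a slow throttled tail from being hidden by a fast first pass. A separate timer started and stopped inside the closure, around the insertion calls only, would give an `insert_core`-style figure that excludes allocation and teardown.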
Summary
Makes the in-memory MemWAL HNSW (`rust/lance/src/dataset/mem_wal/hnsw/`) as fast as hnswlib on insert and search.

Decomposing the gap against hnswlib showed the index was never algorithmically slower: recall was equal-or-better and memory ~44% lower at every size. The only real gap was SIMD width: the shipped binary targets `target-cpu=haswell`, so `f32::l2`/`f32::dot` only autovectorize to AVX2, while a `-march=native` hnswlib uses AVX-512 on capable CPUs (a `target-cpu=native` control build confirmed parity).

This adds runtime-dispatched `l2_f32`/`dot_f32` in `lance-linalg` (`#[target_feature(avx512f)]` 16-wide kernels gated by `SIMD_SUPPORT`, AVX2 fallback via the existing autovectorized path, the same pattern as `dot_u8`) and routes the MemWAL HNSW distance through them, so the shipped build uses AVX-512 at runtime.

Results (c7i.12xlarge, 48 threads, dim=1024, m=12, ef=64, k=10)

Shipped build with this change vs hnswlib:

vs the pre-change shipped build: insert @1m +14%, query @1m +57%.

`perf stat` @1m: cycles within 1% (Lance 0.99×), 41% fewer instructions (AVX-512 density + zero-copy Arrow vs hnswlib's per-vector memcpy), ~44% less RSS.

Net: matches or beats hnswlib on insert and query, with equivalent CPU cycles, lower memory, and comparable recall.
Changes
- `lance-linalg`: new `l2_f32`/`dot_f32` runtime AVX-512 dispatch (+ unit tests asserting they match the scalar reference across 16-multiple and tail lengths).
- `mem_wal/hnsw/storage.rs`: route `compute_f32_distance` (L2 + Dot) through the dispatchers.
- `benches/mem_wal/vector/hnsw/`: parity-suite driver and `--query-repeats` for a stable query window.

Follow-ups: AVX-512 cosine for the memtable; adopt the same dispatch in the broader `f32::l2`/`f32::dot` for all vector search.

cc @jackye1995, please review.