[Feature] Add Distributed Posting Router for SPANN by TerrenceZhangX · Pull Request #448 · microsoft/SPTAG

TerrenceZhangX · 2026-05-07T13:47:48Z

Scale results

Dataset: SIFT1B bigann_base.u8bin, 128d UInt8, L2. SPANN 2-layer index,
4 search threads, 4 insert threads, top-K=5, 200 queries.

1M base + 1M insert

Metric	1node	2node	Scale
Build time (s)	74.2	91.3	0.81×
Pre-insert QPS	429.3	696.3	1.62×
Pre-insert mean latency (ms)	9.26	8.63	—
Pre-insert p99 (ms)	27.31	29.18	—
Post-insert QPS	425.0	708.1	1.67×
Insert throughput (vec/s)	~900	1793	—
Recall@5 (pre-insert)	0.984	0.978	—
Recall@5 (post-insert)	0.983	0.984	—

Notes:

Build is slower on 2node at 1M scale: the cross-node coordination
overhead (head-sync RPCs, control plane) dominates when there are only
~40k head vectors to build.
Search scales well already at 1M; recall is unchanged.

100M base + 1M insert

Metric	1node	2node	Scale
Build time (s)	15292	16264	0.94×
Pre-insert QPS	183.3	360.5	1.97×
Pre-insert mean latency (ms)	21.56	21.92	—
Pre-insert p99 (ms)	32.81	39.55	—
Post-insert QPS (round 2)	183.2	337.2	1.84×
Insert throughput (vec/s)	738	1285	1.74×
Recall@5 (pre-insert)	0.912	0.904	—
Recall@5 (post-insert)	0.912	0.904	—

Notes:

Search scales near-linearly (1.97×) at the target scale — the per-node
query partition + remote KV reads are well-balanced.
Insert scales sublinearly (1.74×), expected: every insert that
promotes/splits a head triggers head-index sync across nodes, which is the
current bottleneck.
Build is essentially flat (0.94×): build is dominated by per-node local
graph construction; 2node has additional head-sync but no work split, so
near-no scaling here is expected for the current single-builder design.
Recall is stable across configurations (within 0.01).

…c, benchmarks Core routing (PostingRouter.h): - Hash routing: GetOwner uses headID %% NumNodes for deterministic assignment - RemoteLock RPC for cross-node Merge serialization (try_lock + retry) - BatchAppend, HeadSync, InsertBatch packet types and handlers - TCP-based server/client for inter-node communication ExtraDynamicSearcher.h integration: - EnableRouter/AdoptRouter for index lifecycle management - Split: BroadcastHeadSync after creating/deleting heads - MergePostings: Cross-node lock for neighbor headID on different node - MergePostings: BroadcastHeadSync for deleted head after merge - Reassign: Route Append to owner node + FlushRemoteAppends - AddIndex: Route appends to owner node via QueueRemoteAppend - SetHeadSyncCallback: Wire up HeadSync + RemoteLock callbacks Infrastructure: - IExtraSearcher/Index/VectorIndex: Add routing virtual method chain - Options/ParameterDefinitionList: RouterEnabled, RouterLocalNodeIndex, RouterNodeAddrs, RouterNodeStores config params - CMakeLists: Link Socket sources and Boost into SPTAGLibStatic - Connection.cpp: Safe remote_endpoint() with error_code (no throw) - Packet.h: Append, BatchAppend, InsertBatch, HeadSync, RemoteLock types - SPFreshTest.cpp: ApplyRouterParams, FlushRemoteAppends, WorkerNode test - Benchmark configs: 100k/1m/10m x 1/2/3 node - run_scale_benchmarks.sh: Automated benchmark runner - docker/tikv: TiKV cluster docker-compose + config Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix index dir creation: create parent dir only (not spann_index subdir) so the build code creates spann_index and the safety check passes - Clear checkpoint after build phase so driver re-runs all insert batches - Add VectorIndex::AdoptRouter to transfer router between batch clones instead of creating new TCP server per batch (port conflict fix) - Fix ExtraDynamicSearcher::AdoptRouter to override IExtraSearcher interface 100k results (routing works, ~50/50 local/remote split): 1-node steady-state: 90.2 vps 2-node steady-state: 80.5 vps 3-node steady-state: 77.7 vps No scaling at 100k due to small batch size (100 vectors) and shared TiKV. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

12 slides covering: problem statement, solution architecture, full flow comparison (single vs distributed), hash routing, append write path, split/merge/reassign routing, HeadSync broadcast, 3-node sequence diagram, design decisions, network protocol, config, and 100k results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix ini thread params: replace NumThreads with NumSearchThreads/NumInsertThreads - Add 100M benchmark ini files (1-node, 2-node, 3-node) - Update data paths to /mnt/data_disk/sift1b in all ini files - Add BENCHMARK_GUIDE_SCALE_COMPUTE.md (English) - Add BENCHMARK_RESULTS_SCALE_COMPUTE.md with 100K/1M/10M results - Update docker-compose and tikv.toml for 3-PD/3-TiKV cluster - Update run_scale_benchmarks.sh with multi-scale orchestration - Add .gitignore entries for generated benchmark artifacts

- Fix FullSearch routing for multi-node search (per-node build) - Update 10M benchmark: insert throughput 2-node 1.65x, 3-node 1.98x - Search latency 10M: 2-node -35%, 3-node -50% vs 1-node - Near-linear insert scaling across all data sizes (100K, 1M, 10M) - Update benchmark configs, test harness, and scale benchmark script

- Delete all 24 benchmark INI files from Test/ - Replace section 5 in BENCHMARK_GUIDE_SCALE_COMPUTE.md with complete deterministic generation rules (scale table, topology rules, template, Python generator script) - INI files can be regenerated on demand via the guide

- Embed full docker-compose.yml and tikv.toml contents in BENCHMARK_GUIDE_SCALE_COMPUTE.md section 4.1 - Remove docker/tikv/ from git tracking (files stay on disk) - Use <NVME_DIR> placeholder instead of absolute paths Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Bug fixes: - SPANNIndex.cpp: Remove redundant SortResult() in FullSearchCallback that corrupted remote search results (heapsort on already-sorted data) - TestDataGenerator.cpp: Fix EvaluateRecall truth NN stride from 1 to K Feature: - SPFreshTest.cpp: Add BuildOnly parameter to skip insert batches Benchmark results (Float32/dim64, 10M scale): - 1-node: 93.8 vps, 2-node: 200.0 vps (2.13x), 3-node: 271.4 vps (2.89x) - Recall stable within each config after double-sort fix Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Measure RPC round-trip time (sendTime → future.get()) and assign per-query latency for remote search results. Previously p50/min latency showed 0 for remote queries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert TiKVVersionMap node-scoped prefix optimization (vc:{nodeIndex}:...) that broke cross-node version check correctness. Version map now uses shared namespace (vc:{layer}:...) so all nodes can read/write the same version data. Node-scoped optimization deferred to future branch. Also add explanatory comments for AdoptRouter and HeadSync broadcast. Restore commented-out debug log lines. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

These were only used by the reverted node-scoped version map. LocalToGlobalVID is kept (used in insert path for VID uniqueness). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

InsertVectors now uses dual path: per-vector multi-threaded insert for single-node (original behavior), bulk AddIndex for router-enabled multi-node (amortizes RPC overhead via batched remote appends). Also add comment on AddIndex explaining caller-side shard partitioning and LocalToGlobalVID purpose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restore the original vidIdx counter loop for post-heap version filtering instead of the candidateIndices array approach. Both are functionally equivalent but the original pattern is simpler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Restore 'For quantized index, pass GetFeatureDim()' comment - Remove else { func(); } branch that was not in the original code Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use ConvertToString(valueType) and INI parameters to build the perftest_* filenames, matching TestDataGenerator::RunLargeBatches convention. Moves filename construction out of the per-batch loop. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document the 3-file INI pattern for multi-node benchmarks: - _build.ini: Rebuild=true, no Router (build phase) - _driver.ini: Rebuild=false, Router enabled (driver/n0) - _n{i}.ini: worker nodes (n1, n2, ...) Update Python generator and shell script to match. Remove the sed-based approach of patching _n0.ini at runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move Router parameters (RouterEnabled, RouterLocalNodeIndex, RouterNodeAddrs, RouterNodeStores) from DefineSSDParameter to DefineRouterParameter with its own [Router] section in: - ParameterDefinitionList.h: new DefineRouterParameter macro - Options.h: SetParameter/GetParameter handle 'Router' section - SPANNIndex.cpp: SaveConfig outputs [Router] block - SPFreshTest.cpp: read [Router] INI section, ApplyRouterParams uses 'Router' section - Benchmark guide: updated INI template and Python generator Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md, not the single-node 10M results file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

DefineRouterParameter was nested inside #ifdef DefineSSDParameter, causing router parameters to be silently ignored. Moved the #endif to the correct position so DefineRouterParameter is at top-level scope. Also removed a debug log line from the DefineRouterParameter macro in Options.h. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implement benchmark-level query distribution where each node independently searches its contiguous partition of the query set, coordinated by barrier files (same mechanism as insert distribution). This replaces the previous RPC-based approach, eliminating RPC overhead and serial head search. Key changes in SPFreshTest.cpp: - BenchmarkQueryPerformance: partition queries across nodes, use barrier files for synchronization, compute QPS = totalQueries / max(wallTime) - WorkerNode: unified command loop handling both search and insert commands via shared index directory Results (10M Float32, 200 queries, TopK=5): - 1-node: 194 QPS baseline - 2-node: 404 QPS (2.08x speedup, super-linear due to cache effects) - 3-node: 488 QPS (2.52x speedup) - Insert scaling: 1→2→3 node = 119→211→314 vec/s (1.77x, 2.64x) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1. Fix remote lock leak in MergePostings (ExtraDynamicSearcher.h) - RAII RemoteLockGuard ensures remote lock is released on all exit paths (continue/return/exception), preventing distributed deadlock 2. Fix buffer overflow in BatchRouteSearch (SPANNIndex.cpp) - Validate response array sizes before accessing result vectors - Fall back to local search on size mismatch 3. Fix missing send-failure callback in SendRemoteLock (PostingRouter.h) - Add failure callback to complete the promise on send error, matching the pattern used by all other SendPacket call sites - Prevents 5-second stall on every send failure 4. Normalize atomic operation in SendRemoteLock (PostingRouter.h) - Change m_nextResourceId++ to fetch_add(1) for consistency 5. Fix uninitialized workerTime in barrier coordination (SPFreshTest.cpp) - Initialize workerTime and validate ifstream read - Skip worker timing on parse failure instead of using garbage value Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove BatchRouteSearch, SetFullSearchCallback, GetSearchNodeCount, and all supporting RPC infrastructure (SearchPostingRequest/Response, FullSearchBatchRequest/Response structs, search callbacks, handler methods, and related member variables) from PostingRouter, SPANNIndex, Index, VectorIndex, ExtraDynamicSearcher, SPFresh, and Packet. Tests use barrier-based distributed search exclusively; the RPC-based search routing is dead code. Existing SearchRequest/SearchResponse packet types are preserved as they are used by the pre-existing Aggregator/Client/Server code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Critical fixes: - Fix integer overflow in DecodeVectorSearchResponse (ExtraTiKVController.h) Prevents OOB reads from corrupt TiKV responses with large numResults. - Fix m_mergeJobsInFlight counter underflow in MergePostings retry paths (ExtraDynamicSearcher.h) Add increment before re-enqueued MergeAsyncJob to match the unconditional decrement in exec(). High fixes: - Add FlushRemoteAppends after Split reassignment (ExtraDynamicSearcher.h) Ensures queued remote appends are sent after CollectReAssign in Split(). - Fix data race on m_nodeAddrs in ConnectToPeer (PostingRouter.h) Snapshot address under m_connMutex before retry loop. - Fix BroadcastHeadSync reading m_nodeAddrs without lock (PostingRouter.h) Snapshot node count under m_connMutex before iterating. Medium fixes: - Fix m_storeToNodes race in AddNode - move inside m_connMutex scope. - Fix unvalidated entryCount in HandleHeadSyncRequest with buffer-end tracking to prevent overruns from corrupt packets. - Add buffer-end tracking in BatchRemoteAppendRequest::Read to catch overruns during per-item deserialization. - Make m_asyncStatus atomic to fix race between async jobs and Checkpoint. Use exchange() for atomic read-and-reset in Checkpoint. - Make shared ErrorCode ret atomic in LoadIndex and WriteDownAllPostingToDB parallel loops. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Bug 32e cut HandleBatchAppendRequest/Put receiver fan-out from 4 to 2, which was the wrong direction: the worker's overload comes from its local AppendThread/InsertThread pools (~40 threads), not from the receiver pool (4 threads, ~5% of total). Halving the receiver pool only doubled per-chunk processing time on the receiving side, pushing the driver's 180s wait_for past timeout and causing v30/v31 BATCH 1 end-of-bulk hangs. This commit rebalances within the worker's 48-thread budget: - AppendThreadNum: 32 -> 16 (template ini) - NumInsertThreads: 8 -> 4 (template ini) - HandleBatchAppendRequest: 2 -> 8 (RemotePostingOps.h) - HandleBatchPutRequest: 2 -> 8 (RemotePostingOps.h) Net active TiKV concurrency on worker drops from ~42 to ~32, while receiver throughput quadruples. A 50k-item BatchAppend chunk now drains in ~62s typical (vs ~250s with pool=2), comfortably under the 180s wait_for timeout in SendBatchRemoteAppendChunk. Also adds LL_Info instrumentation around SendBatchRemoteAppendChunk's wait_for so we can confirm slow-vs-deadlock at end-of-bulk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Distributed BATCH 1 recall stuck at 0.62 in v32 even though latency and BATCH 0 recall met targets. Diagnosis from v32 logs: driver layer-1 BKTree: 117K samples, RMWs=451, splits=0 worker layer-1 BKTree: 657K samples, RMWs=4,197,126, splits=273 global alive heads: 200K Driver's BKTree (and underlying m_pSamples) is missing ~83K of the 200K alive heads — its query-time head index is blind to ~40% of the corpus, which matches the observed recall drop. Underlying mechanism is that BroadcastHeadSync was fire-and-forget: when the per-peer SendPacket completion reported failure, entries were logged at LL_Debug and dropped forever. Under v32's worker overload, many sends fail and the peer's headIndex / m_pSamples diverge. This commit: 1. RemotePostingOps: counters + per-peer retry queue - Atomic counters for broadcast entries, send OK / fail, recv entries, applied Add/Delete, retry enqueued / succeeded / dropped. - Per-peer HeadSyncBacklog (cap 256K entries) holds entries when a send completes with success=false; a dedicated retry thread periodically (every 500ms, tunable via SPTAG_HEADSYNC_RETRY_INTERVAL_MS) re-broadcasts up to 1024 entries per peer. - DumpHeadSyncStats() called at every layer-N ALL_DONE boundary + every 30s from the retry thread, so we can diff sender/receiver and detect any remaining gap. 2. ExtraDynamicSearcher Stage 2B fallback (GatherBarrier silent drop) - When a remote FetchPostings batch fails (rf.ok=false), instead of silently skipping ~rf.items.size() postings, fall back to local TiKV reads for those headIDs. Driver and worker share the same TiKV cluster so the routing was an optimization, not a correctness boundary. - Adds [v33] LL_Warning log + per-batch fallback hit counter so we can see when the path triggers. 3. ExtraDynamicSearcher worker FetchPostingsCallback diagnostic - Distinguish ErrorCode::Key_NotFound (legitimate empty posting) from any other failure inside GetPostingFromDB; log LL_Warning with fail / notFound counts so transient TiKV errors are no longer indistinguishable from empty-key misses. 4. WorkerNode forward methods for DumpHeadSyncStats / GetHeadSyncBacklogSize / DrainHeadSyncBacklog / NoteHeadSyncApplyAdd / NoteHeadSyncApplyDelete. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Root cause of BATCH 1 recall regression (v32 0.62 / v33 0.60 vs target 0.78): RemotePostingOps held SINGLE-SLOT callbacks (m_appendCallback, m_headSyncCallback, m_remoteLockCallback, m_putPostingCallback, m_fetchPostingsCallback). With Layers=2, each layer's ExtraDynamicSearcher::SetWorker overwrites the previous; layer 1's SetWorker runs last so all five callbacks captured layer-1's `this` and m_layer=1. Every incoming RemoteAppend/HeadSync/Lock/Put/Fetch event from peers — regardless of which layer it originated from — routed to layer-1's lambda and called AddHeadIndex(..., m_layer+1=2, ...) which routes to the BKT (Index.h:246-269 fallback). Layer-0 head VIDs polluted the BKT (m_pSamples is append-only, so churn caused unbounded growth: worker BKT 327K mid-batch tracking v32's 657K final). Forensics confirm HeadSync delivery is 100% correct (v33 counters: broadcast 6720, send_ok 6720, send_fail 0, recv 533580). The 'loss' suspected from v32 was a counter artifact (dispatcher counted as a peer in m_nodeAddrs leading to double-count on send). Fix is therefore NOT delivery — it is dispatch routing. Generic per-layer design (supports >2 layers): - DistributedProtocol.h: add std::int32_t m_layer = 0 to all 5 wire formats. Bumped MirrorVersion 0->1 on the four formats with a preamble (RemoteAppendRequest, RemoteLockRequest, RemotePutPostingRequest, RemoteFetchPostingsRequest) with conditional read on mirrorVer >= 1 so legacy packets default to layer=0. HeadSyncEntry has no preamble; appended unconditionally (driver and worker rebuilt monolithically). - RemotePostingOps.h: replaced single-slot callbacks with std::vector<Callback> indexed by layer. EnsureLayerSlot_NoLock(layer) lazily grows. Owners array is std::vector<std::atomic<const void*>> with manual element migration on growth (atomic isn't movable). All 5 Handle*Request dispatch sites look up by req.m_layer; missing callbacks emit LL_Warning rather than crash. - WorkerNode.h: pass-throughs take int layer first arg. Default layer-agnostic Put/Fetch installed at layer 0. Lookup helpers fall back to layer 0 for PutPosting/FetchPostings only (layer-agnostic raw m_db ops; build-receiver mode still works without ES binding). Append/HeadSync/RemoteLock have NO fallback (layer semantics are required for correct routing). - ExtraDynamicSearcher.h: SetWorker passes m_layer to all 5 setters, ClaimCallbackOwnership and ClearCallbacksIfOwner. All call sites set req.m_layer = m_layer before sending (4 QueueRemoteAppend, 2 HeadSync entry construction blocks, SendRemoteFetchPostings, SendRemoteLock RAII guard). This makes the per-layer callback routing the source of truth: the dispatch table is keyed by layer, the captured `this` in each lambda is the right instance for that layer, and the wire format carries the layer so the receiver can route correctly without inferring it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…it lock storm Root cause: - AppendBatchAsync / Append() had no ownership check. For splits with ~119 reassign-target heads, ~50% are remote-owned. The lock pipeline in AddIndexAsyncSingleKey acquired m_rwLocks[headID] for ALL of them (held across Pass 1-4, blocking 16 concurrent split workers), then did MultiGet against LOCAL TiKV (returning NotFound for remote heads), then Pass-4 serial Append() retries that wrote the new bytes under the remote head's key, creating orphan data the real owner never sees. - Effect: 99-225s per split (avg) vs 2.2s on 1-node = 45-100x slower. Fix (two complementary guards): 1. AppendBatchAsync entry-time pre-filter: when worker is enabled, partition headAppends into local vs remote; remote -> QueueRemoteAppend immediately; local-only subset flows through existing lock pipeline. 2. Append() entry-time defensive check mirroring the same routing. Catches direct callers that bypass AppendBatchAsync (ReassignAsync, recursive Split, Pass-4 sync-retry fallback). Consistent with existing ownership checks in Split (line 996), MergePostings (line 1714), AddIndex distributed path (line 3782). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… storm In 2-node sharded mode, each node only writes half of layer-1 posting data locally. This makes layer-1 postings unnaturally small (<10 entries), so AsyncMergeInSearch trigger at SearchIndex (realNum<=mergeThreshold) fires on nearly every visited layer-1 head: 1-node BATCH 1: 182 layer-1 merges total 2-node BATCH 1: 14M (driver) + 18M (worker) = 32M layer-1 merges That floods m_splitThreadPool with noop merge jobs (Bug-11 fast path drops 50% as remote-owned), starving real layer-0 split/append work. Layer 1+ in SPANN holds BKT-structural data; merge-on-small is meaningless there. Fix: in distributed mode gate MergeAsync to (a) leaf layer (m_layer==0) (b) locally-owned head. 1-node behavior preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…sync In sharded distributed mode, AddIndex was synchronously calling Append() per vector for the ~50% of selections owned by the local node. Each Append() does a TiKV Get+Put roundtrip (~8ms), and AddIndex runs on a single insert thread, capping pre-split insert throughput at ~125/s/node vs 466/s for the 1-node baseline (which batches all appends via AppendBatchAsync and fan-outs through m_splitThreadPool). This change mirrors the 1-node path: accumulate local-owned appends into a headID->buffer map and call AppendBatchAsync once at end-of-batch. Remote-owned appends continue to flow through QueueRemoteAppend. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move 2-node from independent-PD-per-node to a single PD-raft group with max-replicas=1. Compute nodes become stateless TiKV clients: PD routes each region read/write to whichever store owns it, so the distributed routing layer no longer needs cross-compute fetch RPCs. Files run_distributed.sh - tikv_start: build a shared initial-cluster and pd-endpoints list; every PD/TiKV joins the same raft group. - PD readiness now waits for the expected member count, not just /pd/api/v1/members reachability. - Fix max-replicas POST: previous code used GET-only /config/replicate and silently failed. Now POSTs to /pd/api/v1/config, verifies via GET, retries up to 5x. (Still racey if PD raft is slow to settle; manual workaround is to POST after Phase 0.) - tikv_switch_to_nocache mirrors the shared pd-endpoints layout. SPFreshTest.cpp - BenchmarkFromConfig now sets TiKVPDAddresses to the FULL endpoint list (was per-worker local PD only). Same for the build-distribute receiver path. cluster_2node.conf - Documentation block on shared raft mode. tikv.toml (server tuning, v41) - apply-pool-size 4 -> 32, store-pool-size 4 -> 16: apply pool was the primary write-amp source (TiKV pegged at ~8/96 cores while client ops queued). Moves us to ~10 cores per node under burst. - grpc-concurrency 16 -> 32, grpc-memory-pool-quota 8GB -> 16GB. - scheduler-worker-pool-size 16 (new). - readpool.unified max=32 min=8 (new): cap reads so they don't steal CPU from writes. - max-background-flushes=8, raft-write-batch-size=1MB (new). - defaultcf write-buffer-size 512MB -> 1GB, max-write-buffer-number 5 -> 8, level0 slowdown/stop 28/40 -> 40/60. ExtraDynamicSearcher.h (HeadSync top-only + remote-fetch removal) - Broadcast HeadSync for split/merge only when the head update targets the in-memory BKT (m_headIndex->GetDiskIndex(m_layer+1) == nullptr). Lower-layer head updates already write to shared TiKV via the local DiskIndex chain and are visible to every compute, so broadcasting them duplicates writes and can pick a different posting if peer BKT state diverges. - MultiGet/MultiScan no longer fan out SendRemoteFetchPostings; everything goes through the local PD-routed TiKVIO path. WorkerNode.h (auto-flush + per-node inflight cap) - QueueRemoteAppend auto-flushes per node once the per-node queue reaches kAutoFlushThreshold=50000 (was: hold everything until end-of-batch FlushRemoteAppends, then serial drain). Up to kMaxInflightPerNode=4 chunks may be in flight per node so a producer burst (split fan-out, reassign wave) can saturate the receiver bg-executor pool. - FlushRemoteAppends waits for any straggler auto-flush to drain before sending the final tail. Per-node mutex on the end-of-batch sender keeps tail-to-same-node sends ordered. RemotePostingOps.h - kChunkSize history block; current setting 50000. v42 (10k) was throughput-best (906/s) but during-insert p50 222ms; v43 (50k) trades throughput (-22% -> 704/s) for during-insert p50 (-36% -> 141ms) and post-insert r1 QPS 47 -> 85. v44 (100k) blew up tail drain: a single 100k chunk took 116s on the receiver -> 40+ min end-of-batch drain (vs 8 min at 50k). 50k is the sweet spot. - Per-chunk sub-worker fan-out comment: kept at 8 (v39 baseline); combined with cap=4 inflight yields TiKV-side concurrency 4*8=32. benchmark_insert_dominant_2node.ini - VersionCacheTTLMs=0, VersionCacheMaxChunks=0 for nocache mode. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Background ExtraDynamicSearcher::AddIndex had two near-identical per-vector loops: one for the distributed (router-enabled) path with cross-node ownership routing, and one for the single-node fallback with the WAL hook. PutAssignment was only called in the second loop, so the in-flight queue / RPC pipeline in the distributed path bypassed the WAL even when EnableWAL=true. Change Collapse the two loops into one. RNGSelection, version bump, Serialize, and the WAL PutAssignment happen exactly once per vector, before any routing decision. The routing branch (m_worker->GetOwner -> QueueRemoteAppend) lives inside the per-replica inner loop and only fires when m_worker && m_worker->IsEnabled(). Local-owned (and non-routed) heads accumulate into a single headAppends map and are flushed via AppendBatchAsync at the end. VID encoding routed mode serializes LocalToGlobalVID(VID) for cross-node uniqueness; non-routed mode keeps the local VID. Same behavior as before, just expressed via a single payloadVID local. Durability semantics Gated by m_opt->m_enableWAL (default false in ParameterDefinitionList.h), so this commit is a no-op at runtime for existing benchmarks. When EnableWAL=true is set, AddIndex Success now implies that every vector — including remote-owned ones — has been persisted to the local PersistentBuffer (RocksDB write, default sync=false) before the producer returns. The in-memory append queue and BatchAppendChunk RPC pipeline can then be replayed from the WAL on restart. Note: PersistentBuffer's underlying RocksDBIO::Put still uses default WriteOptions (sync=false), so this protects against producer crash but not power loss. Upgrading to options.sync=true is a follow-up and should be measured against a durable-throughput target separately. Replay caveat PersistentBuffer's existing replay (StartToScan/NextToScan) was written for local-VID-only records. When EnableWAL=true is turned on for a distributed run, replay will need to be made VID-mode aware (decode the serialized record and re-dispatch via the same routed/non-routed split this commit just unified). Left as a follow-up since the immediate goal was to have the code path present without measuring overhead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ibuted-to-tikv Merged SearchStats latency breakdown support into distributed query paths. Both distributed (multi-node dispatch) and single-node search now use spannIndex->SearchIndex(result, &searchStats) for per-query latency breakdown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…batch version checks) Conflicts resolved: - .gitignore: kept both sides' ignore patterns - Test/src/SPFreshTest.cpp: kept HEAD's distributed-mode BenchmarkQueryPerformance routing (dispatcher broadcast + per-node query partition); HEAD already has the SearchBreakdown JSON output block that qiazh added at a different location, so qiazh's duplicate was dropped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

gRPC dedupes subchannels by (address, channel-args). With identical args across all kStubPoolSize stubs, every stub multiplexed onto a SINGLE TCP socket — serviced by ONE grpc-server worker thread regardless of the TiKV-side grpc-concurrency setting. Under high client concurrency this single server thread saturates near 85% CPU, queues all incoming RPCs, and the 100ms client deadline expires while requests sit in the server completion queue (manifests as a flood of "Deadline Exceeded" even though RocksDB engine_get p99 is <100us and there is no write stall). Setting GRPC_ARG_CHANNEL_POOL_DOMAIN to a per-stub unique string ("sptag-stub-" + index) puts each channel in its own subchannel pool, forcing a separate TCP connection per stub. After this fix: * TCP connections to TiKV: 1 → 96 (48 stubs × {TiKV, PD}) * grpc-server-0 CPU: 84% → 3.6% (load balanced across all 32 threads) * Deadline Exceeded / retry exhausted / read postings fail: all 0 during a 26-min insert_dominant bench (1M base + 1M insert). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When MergePostings runs locally and the surviving head wins over a remote candidate's posting, the initiator now sends a RemoteDeletePosting to the owner node which deletes its posting, removes the local head entry, and broadcasts a HeadSync(Delete) to peers so other BKTs drop the head. Idempotent on the owner side (already-deleted returns Success). Also adds HeadSync aggregation during RefineIndex: each MergePostings firing its own single-entry BroadcastHeadSync produces |mergelist| * (numNodes-1) tiny RPCs at RefineIndex time. The new m_refineHeadSyncBuffer collects them so RefineIndex flushes one batched broadcast after all submitted merges drain. Files: * Packet.h - DeletePostingRequest/Response opcodes (0x10) * DistributedProtocol.h - RemoteDeletePostingRequest/Response structs * RemotePostingOps.h - DeletePostingCallback registration + send path * WorkerNode.h - SendRemoteDeletePosting / SetDeletePostingCallback * ExtraDynamicSearcher.h - integrate delete-posting callback, HeadSync aggregation, RefineIndex flush Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add benchmark_10m_template.ini: 9M base + 1M insert (1.2GB head index, ~13 min build). Useful intermediate between insert_dominant (1M+1M, ~10 min total) and 100m (99M+1M, ~5h total) for validating 1node→2node scaling without paying the full 100M build cost. * Fix benchmark_100m_template.ini + benchmark_insert_dominant_template.ini dataset paths from /mnt/data/sift1b/{base.1B,query.public.10K}.u8bin (reference machine layout) to /mnt/nvme/sift1b/{bigann_base,query.10K}.u8bin (our layout). insert_dominant scale fixed at 1M base + 1M insert (the 1M+10M variant was too long to drive iteration). * cluster_{2,3}node.conf: document image-ref overrides (tikv_image, pd_image, helper_image) for environments that cannot push to the registry — pull-only ACR usage now self-explained. * run_distributed.sh (+181/-40): assorted benchmark driver fixes including better TiKV start/stop ordering, SKIP_HEAD_BUILD support at 100M scale, and improved logging. * .gitignore: ignore generated benchmark_*_{1,2,3}node.ini snapshots (created at runtime from templates) and tikv.toml.v*bak backups. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

These three docs predate the current TiKV-backed distributed design and have diverged from the actual implementation: * distributed-bugs.md - early bug list, no longer accurate * distributed-flow-diagrams.md - stride-sharding-era flow diagrams * distributed-job-routing.md - superseded job-routing design The authoritative distributed design now lives in docs/TiKVDistributedVersionMapDesign.md (kept). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The build-receiver path in SPFreshTest had a fixed BuildReceiverTimeout (default 7200s) used as a defensive cap so a worker would not hang forever if the driver crashed or its TCP connection silently dropped. That timeout had to be retuned every time the build duration grew (it fired prematurely at 100M scale where BKT+head build alone takes ~5h), which is both bug-prone and the wrong scaling axis -- worker lifetime should track driver liveness, not absolute wall-clock build time. Replace it with an explicit driver-to-worker Heartbeat dispatch: - DispatchCommand::Type::Heartbeat (=3) joins the existing Search / Insert / Stop enum. BroadcastDispatchCommand treats it like Stop: no PendingDispatch state, no DispatchResult sent back, no log spam. - DispatcherNode owns a dedicated heartbeat pump thread that issues a Heartbeat broadcast every HeartbeatIntervalSec seconds (30 s by default). StopHeartbeat() joins within ~100 ms. - The driver starts the pump in BuildOnly+Distributed mode after the ring is synced and stops it just before broadcasting Stop, so the pump's lifetime is exactly the build window during which workers are blocked in their dispatch callback. - The worker BUILD_DISTRIBUTE_RECEIVER block polls the stop promise in 5 s slices and tracks lastHeartbeatNs as an atomic. If no heartbeat arrives for HeartbeatTimeoutSec (default 180 s) it exits with an error message; otherwise Stop reception exits cleanly. Net effect: at any build duration the worker exits within 180 s of true driver death and never bails on a healthy long build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The C++ heartbeat watchdog (previous commit) bounds worker lifetime to ~180 s after the driver dies, which is correct but conservative. A shell-side watchdog spawned from run_distributed.sh closes that gap faster: as soon as the driver process is gone we tear down the local SSH wrappers and issue a name-based terminate on the remote SPTAGTest workers, so a crashed driver does not leak workers across iterations. - start_driver_watchdog forks a subshell that polls the driver PID every 5 s. On exit it signals each WORKER_SSH_PID and issues a remote name-based terminate via SSH on every worker host (with a hard fallback after 5 s). - Both phase-1 (build) and phase-3 (run) driver launches invoke it immediately after starting workers, and stop_driver_watchdog runs after the driver's wait returns, so the watchdog is alive exactly when there is an actual driver to watch. - The INT/TERM trap also calls stop_driver_watchdog so Ctrl-C never leaves an orphan watchdog hanging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

MaggieQi · 2026-05-15T09:19:22Z

+                        SPTAGLIB_LOG(Helper::LogLevel::LL_Error,
+                            "[Bug 25] Search post-fetch: out-of-range VID:%d (count:%d, deficit:%d > %d, numWorkers:%d); records will be skipped\n",
+                            maxVid, currentCount, deficit, maxAllowed, numWorkers);
+                    } else if (m_versionMap->AddBatch(deficit) != ErrorCode::Success) {


why search need to be changed?

Each node's local m_count tracks only its own AddIndex progress, not peers'. When search reads a TiKV posting whose bytes were written by a peer running ahead of us, those record VIDs can fall above our local m_count. GetVersion short-circuits any VID >= m_count to 0xfe, so without growing m_count first the post-heap filter would drop those records as deleted and recall would silently regress. This loop grows m_count just enough to cover the largest VID we actually fetched

MaggieQi · 2026-05-15T09:20:44Z

+                //     at the trigger site avoids the round trip entirely.
+                // Single-node behavior (m_worker disabled) is preserved.
+                if (m_opt->m_asyncMergeInSearch && realNum <= m_mergeThreshold) {
+                    if (m_worker && m_worker->IsEnabled()) {


Even though it is not local and not layer 0, it still needs to do mergeasync.

Agree. This is over-simplification. Fixed to mergeAsync trigger based on headID owner.

…ug tags * ExtraDynamicSearcher: drop the per-record SetVersionBatch loop in the TiKV search path. With the LRU chunk cache removed from the distributed versionMap design, BatchGetVersions reads chunks directly from TiKV and already sees peer-written versions, so re-writing those bytes from search is pure overhead and exposes the chunked-RMW lost-update race. Keep the AddBatch capacity catch-up: TiKVVersionMap::BatchGetVersions short-circuits any VID >= m_count to 0xfe, and m_count is a process- local atomic that does not auto-refresh from peers; without growing it first, the post-heap filter silently drops peer-allocated VIDs and recall regresses. Comment rewritten to focus on why search needs to do this at all. * Rename LocalToGlobalVID to AllocateGlobalVID. The function is the cross-node global-ID allocator (stripes new VIDs across compute workers), not a passive mapping; the new name reflects intent. * Drop [Bug NN] tags from comments and log messages in ExtraDynamicSearcher.h and Distributed/NetworkNode.h. These were iteration markers during distributed bring-up; with the design now stable the tags read as patches rather than design. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Search-side AsyncMergeInSearch previously gated on layer==0 && isLocal: remote-owned heads under m_mergeThreshold were silently dropped. The 'layer 0 only' rationale (each node only writes half a layer-1 posting) doesn't hold on shared TiKV with DBKey routing -- the owner stores the full posting, so layer-1 observations are real underfull signals worth acting on. And the isLocal gate threw away peer observations entirely: MergeAsync is local-only, MergePostings drops non-owner jobs at entry, and no other channel (RefineIndex only runs on SaveIndex) was telling the owner about underfullness in a search-heavy workload. Add a fire-and-forget MergeRequest packet (PacketType 0x11, mirrors HeadSyncRequest semantics -- no response, no retry queue) so peers can notify the owner. Search trigger now routes by owner: - target.isLocal -> MergeAsync (unchanged) - target remote -> QueueRemoteMerge(node, layer, headID) WorkerNode batches hints per-target with (layer, headID) set-dedup; auto-flush at 8192 entries (merge hints are non-urgent so the larger window dedupes harder and amortizes TCP overhead). End-of-batch FlushRemoteMerges() is called from SPFreshTest at every insert-batch boundary and after every search round so no hint is dropped at shutdown. Receiver's HandleMergeRequest fans out to the per-layer MergeCallback registered by ExtraDynamicSearcher::Configure, which just calls MergeAsync(headID) locally -- MergePostings then enforces the same owner / size / dedup invariants as the in-process path. New stats (DumpMergeRequestStats, printed periodic / ALL_DONE / shutdown alongside the HeadSync stats): send_ok send_fail recv_hints recv_dropped Verified on insert_dominant_2node (1M base + 1M insert): 47/93 hints delivered each way with zero send_fail / recv_dropped, total merge job count stays ~450 across build+insert+search (no flood: layer-1 merges stay in the hundreds, not millions). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two pure-noise differences vs origin/main that drifted in from older commits. Removing them shrinks the branch diff without changing behavior. - Around the per-vector dedup loop: revert the blank-line / debug comment ordering back to main's layout. The line content (the only the ordering of the commented-out DEBUG log and the blank line is restored. - Drop the unused 'postingsForSplit' ConcurrentSet introduced by commit 291f3f0. It is only ever inserted into and never read, so the entire variable plus its single insert call is dead code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The fresh-build / resume-mode split was added by 77ed787 (SPANN distributed: sharded build/search). The resume-mode branch is dead in practice — when we resume from a checkpoint, p_localToGlobal is empty (R() == 0) and the outer 'if (p_localToGlobal.R() > 0)' guard already skips the whole block. The else branch only ran in invalid call combinations and just re-wrote the same mapping the build path would have produced. Removing it restores the single-loop pre-distributed shape and shrinks the diff vs main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Removes the cross-node PutPosting routing in WriteDownAllPostingToDB along with the protocol (RemotePutPosting*, packet 0x0D/0x0E), WorkerNode queue/flush, RemotePostingOps callbacks, the BUILD_DISTRIBUTE_RECEIVER worker mode, and the matching Phase-1 worker launch in run_distributed.sh. With a shared TiKV cluster (PD-routed writes), the driver writes every posting directly via TiKVIO -- no TCP hop to a peer that just re-Puts. Build now runs as true single-node; workers come up only in the run phase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Removes 4 gratuitous blank-line additions/deletions in main-side code (no semantic change) to reduce review noise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

zhol01825 and others added 30 commits April 7, 2026 07:33

remove passing disablereassign in Split and Merge jobs

a1904d6

Set dir to nvme disk to avoid disk full

4cb8886

Remove docs/DISTRIBUTED_API.md from repo

ea12fb2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove unused GlobalToLocalVID and IsLocalVID

cb60167

These were only used by the reverted node-scoped version map. LocalToGlobalVID is kept (used in insert path for VID uniqueness). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restore deleted comment, remove unnecessary else branch

90cbe8f

- Restore 'For quantized index, pass GetFeatureDim()' comment - Remove else { func(); } branch that was not in the original code Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert BENCHMARK_RESULTS_10M.md to base branch version

9fe21d2

Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md, not the single-node 10M results file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

add delete posting if addhead fail

be09800

fix split bug

291f3f0

fix split gc

5cedf60

fix separate DB per layer and remove aggressive split

983b10e

TerrenceZhangX and others added 11 commits May 10, 2026 05:11

update configs with search only layer 0

605ae54

Add SPFresh latency breakdown and batch version checks

1df0b8f

TerrenceZhangX marked this pull request as ready for review May 14, 2026 10:54

TerrenceZhangX requested review from MaggieQi and zhol01825 May 14, 2026 10:54

TerrenceZhangX and others added 7 commits May 14, 2026 13:10

MaggieQi requested changes May 15, 2026

View reviewed changes

TerrenceZhangX and others added 3 commits May 18, 2026 02:08

Fix hash_func and remote RPC lock pool

321388c

TerrenceZhangX requested a review from zqxjjj May 18, 2026 06:59

TerrenceZhangX and others added 4 commits May 18, 2026 07:12

Cleanup: drop stray blank-line noise vs main

b3b2f2b

Removes 4 gratuitous blank-line additions/deletions in main-side code (no semantic change) to reduce review noise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Add Distributed Posting Router for SPANN#448

[Feature] Add Distributed Posting Router for SPANN#448
TerrenceZhangX wants to merge 128 commits into
users/qiazh/pre-merge-tikv-bugfixfrom
users/zhangt/merge-distributed-to-tikv

TerrenceZhangX commented May 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

MaggieQi May 15, 2026

Uh oh!

TerrenceZhangX May 18, 2026

Uh oh!

MaggieQi May 15, 2026

Uh oh!

TerrenceZhangX May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

TerrenceZhangX commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scale results

1M base + 1M insert

100M base + 1M insert

Uh oh!

Uh oh!

MaggieQi May 15, 2026

Choose a reason for hiding this comment

Uh oh!

TerrenceZhangX May 18, 2026

Choose a reason for hiding this comment

Uh oh!

MaggieQi May 15, 2026

Choose a reason for hiding this comment

Uh oh!

TerrenceZhangX May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

TerrenceZhangX commented May 7, 2026 •

edited

Loading