Skip to content

[Feature] Add Distributed Posting Router for SPANN#448

Open
TerrenceZhangX wants to merge 128 commits into
users/qiazh/pre-merge-tikv-bugfixfrom
users/zhangt/merge-distributed-to-tikv
Open

[Feature] Add Distributed Posting Router for SPANN#448
TerrenceZhangX wants to merge 128 commits into
users/qiazh/pre-merge-tikv-bugfixfrom
users/zhangt/merge-distributed-to-tikv

Conversation

@TerrenceZhangX
Copy link
Copy Markdown

@TerrenceZhangX TerrenceZhangX commented May 7, 2026

Scale results

Dataset: SIFT1B bigann_base.u8bin, 128d UInt8, L2. SPANN 2-layer index,
4 search threads, 4 insert threads, top-K=5, 200 queries.

1M base + 1M insert

Metric 1node 2node Scale
Build time (s) 74.2 91.3 0.81×
Pre-insert QPS 429.3 696.3 1.62×
Pre-insert mean latency (ms) 9.26 8.63
Pre-insert p99 (ms) 27.31 29.18
Post-insert QPS 425.0 708.1 1.67×
Insert throughput (vec/s) ~900 1793
Recall@5 (pre-insert) 0.984 0.978
Recall@5 (post-insert) 0.983 0.984

Notes:

  • Build is slower on 2node at 1M scale: the cross-node coordination
    overhead (head-sync RPCs, control plane) dominates when there are only
    ~40k head vectors to build.
  • Search scales well already at 1M; recall is unchanged.

100M base + 1M insert

Metric 1node 2node Scale
Build time (s) 15292 16264 0.94×
Pre-insert QPS 183.3 360.5 1.97×
Pre-insert mean latency (ms) 21.56 21.92
Pre-insert p99 (ms) 32.81 39.55
Post-insert QPS (round 2) 183.2 337.2 1.84×
Insert throughput (vec/s) 738 1285 1.74×
Recall@5 (pre-insert) 0.912 0.904
Recall@5 (post-insert) 0.912 0.904

Notes:

  • Search scales near-linearly (1.97×) at the target scale — the per-node
    query partition + remote KV reads are well-balanced.
  • Insert scales sublinearly (1.74×), expected: every insert that
    promotes/splits a head triggers head-index sync across nodes, which is the
    current bottleneck.
  • Build is essentially flat (0.94×): build is dominated by per-node local
    graph construction; 2node has additional head-sync but no work split, so
    near-no scaling here is expected for the current single-builder design.
  • Recall is stable across configurations (within 0.01).

zhol01825 and others added 30 commits April 7, 2026 07:33
…c, benchmarks

Core routing (PostingRouter.h):
- Hash routing: GetOwner uses headID %% NumNodes for deterministic assignment
- RemoteLock RPC for cross-node Merge serialization (try_lock + retry)
- BatchAppend, HeadSync, InsertBatch packet types and handlers
- TCP-based server/client for inter-node communication

ExtraDynamicSearcher.h integration:
- EnableRouter/AdoptRouter for index lifecycle management
- Split: BroadcastHeadSync after creating/deleting heads
- MergePostings: Cross-node lock for neighbor headID on different node
- MergePostings: BroadcastHeadSync for deleted head after merge
- Reassign: Route Append to owner node + FlushRemoteAppends
- AddIndex: Route appends to owner node via QueueRemoteAppend
- SetHeadSyncCallback: Wire up HeadSync + RemoteLock callbacks

Infrastructure:
- IExtraSearcher/Index/VectorIndex: Add routing virtual method chain
- Options/ParameterDefinitionList: RouterEnabled, RouterLocalNodeIndex,
  RouterNodeAddrs, RouterNodeStores config params
- CMakeLists: Link Socket sources and Boost into SPTAGLibStatic
- Connection.cpp: Safe remote_endpoint() with error_code (no throw)
- Packet.h: Append, BatchAppend, InsertBatch, HeadSync, RemoteLock types
- SPFreshTest.cpp: ApplyRouterParams, FlushRemoteAppends, WorkerNode test
- Benchmark configs: 100k/1m/10m x 1/2/3 node
- run_scale_benchmarks.sh: Automated benchmark runner
- docker/tikv: TiKV cluster docker-compose + config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix index dir creation: create parent dir only (not spann_index subdir)
  so the build code creates spann_index and the safety check passes
- Clear checkpoint after build phase so driver re-runs all insert batches
- Add VectorIndex::AdoptRouter to transfer router between batch clones
  instead of creating new TCP server per batch (port conflict fix)
- Fix ExtraDynamicSearcher::AdoptRouter to override IExtraSearcher interface

100k results (routing works, ~50/50 local/remote split):
  1-node steady-state: 90.2 vps
  2-node steady-state: 80.5 vps
  3-node steady-state: 77.7 vps
No scaling at 100k due to small batch size (100 vectors) and shared TiKV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 slides covering: problem statement, solution architecture, full flow
comparison (single vs distributed), hash routing, append write path,
split/merge/reassign routing, HeadSync broadcast, 3-node sequence
diagram, design decisions, network protocol, config, and 100k results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ini thread params: replace NumThreads with NumSearchThreads/NumInsertThreads
- Add 100M benchmark ini files (1-node, 2-node, 3-node)
- Update data paths to /mnt/data_disk/sift1b in all ini files
- Add BENCHMARK_GUIDE_SCALE_COMPUTE.md (English)
- Add BENCHMARK_RESULTS_SCALE_COMPUTE.md with 100K/1M/10M results
- Update docker-compose and tikv.toml for 3-PD/3-TiKV cluster
- Update run_scale_benchmarks.sh with multi-scale orchestration
- Add .gitignore entries for generated benchmark artifacts
- Fix FullSearch routing for multi-node search (per-node build)
- Update 10M benchmark: insert throughput 2-node 1.65x, 3-node 1.98x
- Search latency 10M: 2-node -35%, 3-node -50% vs 1-node
- Near-linear insert scaling across all data sizes (100K, 1M, 10M)
- Update benchmark configs, test harness, and scale benchmark script
- Delete all 24 benchmark INI files from Test/
- Replace section 5 in BENCHMARK_GUIDE_SCALE_COMPUTE.md with complete
  deterministic generation rules (scale table, topology rules, template,
  Python generator script)
- INI files can be regenerated on demand via the guide
- Embed full docker-compose.yml and tikv.toml contents in
  BENCHMARK_GUIDE_SCALE_COMPUTE.md section 4.1
- Remove docker/tikv/ from git tracking (files stay on disk)
- Use <NVME_DIR> placeholder instead of absolute paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug fixes:
- SPANNIndex.cpp: Remove redundant SortResult() in FullSearchCallback that
  corrupted remote search results (heapsort on already-sorted data)
- TestDataGenerator.cpp: Fix EvaluateRecall truth NN stride from 1 to K

Feature:
- SPFreshTest.cpp: Add BuildOnly parameter to skip insert batches

Benchmark results (Float32/dim64, 10M scale):
- 1-node: 93.8 vps, 2-node: 200.0 vps (2.13x), 3-node: 271.4 vps (2.89x)
- Recall stable within each config after double-sort fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Measure RPC round-trip time (sendTime → future.get()) and assign
per-query latency for remote search results. Previously p50/min
latency showed 0 for remote queries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert TiKVVersionMap node-scoped prefix optimization (vc:{nodeIndex}:...)
that broke cross-node version check correctness. Version map now uses
shared namespace (vc:{layer}:...) so all nodes can read/write the same
version data. Node-scoped optimization deferred to future branch.

Also add explanatory comments for AdoptRouter and HeadSync broadcast.
Restore commented-out debug log lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These were only used by the reverted node-scoped version map.
LocalToGlobalVID is kept (used in insert path for VID uniqueness).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertVectors now uses dual path: per-vector multi-threaded insert
for single-node (original behavior), bulk AddIndex for router-enabled
multi-node (amortizes RPC overhead via batched remote appends).

Also add comment on AddIndex explaining caller-side shard partitioning
and LocalToGlobalVID purpose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the original vidIdx counter loop for post-heap version
filtering instead of the candidateIndices array approach. Both are
functionally equivalent but the original pattern is simpler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore 'For quantized index, pass GetFeatureDim()' comment
- Remove else { func(); } branch that was not in the original code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ConvertToString(valueType) and INI parameters to build the
perftest_* filenames, matching TestDataGenerator::RunLargeBatches
convention. Moves filename construction out of the per-batch loop.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the 3-file INI pattern for multi-node benchmarks:
- _build.ini: Rebuild=true, no Router (build phase)
- _driver.ini: Rebuild=false, Router enabled (driver/n0)
- _n{i}.ini: worker nodes (n1, n2, ...)

Update Python generator and shell script to match. Remove the
sed-based approach of patching _n0.ini at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Router parameters (RouterEnabled, RouterLocalNodeIndex,
RouterNodeAddrs, RouterNodeStores) from DefineSSDParameter to
DefineRouterParameter with its own [Router] section in:
- ParameterDefinitionList.h: new DefineRouterParameter macro
- Options.h: SetParameter/GetParameter handle 'Router' section
- SPANNIndex.cpp: SaveConfig outputs [Router] block
- SPFreshTest.cpp: read [Router] INI section, ApplyRouterParams
  uses 'Router' section
- Benchmark guide: updated INI template and Python generator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md,
not the single-node 10M results file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DefineRouterParameter was nested inside #ifdef DefineSSDParameter, causing
router parameters to be silently ignored. Moved the #endif to the correct
position so DefineRouterParameter is at top-level scope. Also removed a
debug log line from the DefineRouterParameter macro in Options.h.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement benchmark-level query distribution where each node independently
searches its contiguous partition of the query set, coordinated by barrier
files (same mechanism as insert distribution). This replaces the previous
RPC-based approach, eliminating RPC overhead and serial head search.

Key changes in SPFreshTest.cpp:
- BenchmarkQueryPerformance: partition queries across nodes, use barrier
  files for synchronization, compute QPS = totalQueries / max(wallTime)
- WorkerNode: unified command loop handling both search and insert commands
  via shared index directory

Results (10M Float32, 200 queries, TopK=5):
- 1-node: 194 QPS baseline
- 2-node: 404 QPS (2.08x speedup, super-linear due to cache effects)
- 3-node: 488 QPS (2.52x speedup)
- Insert scaling: 1→2→3 node = 119→211→314 vec/s (1.77x, 2.64x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Fix remote lock leak in MergePostings (ExtraDynamicSearcher.h)
   - RAII RemoteLockGuard ensures remote lock is released on all exit
     paths (continue/return/exception), preventing distributed deadlock

2. Fix buffer overflow in BatchRouteSearch (SPANNIndex.cpp)
   - Validate response array sizes before accessing result vectors
   - Fall back to local search on size mismatch

3. Fix missing send-failure callback in SendRemoteLock (PostingRouter.h)
   - Add failure callback to complete the promise on send error,
     matching the pattern used by all other SendPacket call sites
   - Prevents 5-second stall on every send failure

4. Normalize atomic operation in SendRemoteLock (PostingRouter.h)
   - Change m_nextResourceId++ to fetch_add(1) for consistency

5. Fix uninitialized workerTime in barrier coordination (SPFreshTest.cpp)
   - Initialize workerTime and validate ifstream read
   - Skip worker timing on parse failure instead of using garbage value

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove BatchRouteSearch, SetFullSearchCallback, GetSearchNodeCount, and all
supporting RPC infrastructure (SearchPostingRequest/Response,
FullSearchBatchRequest/Response structs, search callbacks, handler methods,
and related member variables) from PostingRouter, SPANNIndex, Index,
VectorIndex, ExtraDynamicSearcher, SPFresh, and Packet.

Tests use barrier-based distributed search exclusively; the RPC-based search
routing is dead code. Existing SearchRequest/SearchResponse packet types are
preserved as they are used by the pre-existing Aggregator/Client/Server code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix integer overflow in DecodeVectorSearchResponse (ExtraTiKVController.h)
  Prevents OOB reads from corrupt TiKV responses with large numResults.
- Fix m_mergeJobsInFlight counter underflow in MergePostings retry paths
  (ExtraDynamicSearcher.h) Add increment before re-enqueued MergeAsyncJob
  to match the unconditional decrement in exec().

High fixes:
- Add FlushRemoteAppends after Split reassignment (ExtraDynamicSearcher.h)
  Ensures queued remote appends are sent after CollectReAssign in Split().
- Fix data race on m_nodeAddrs in ConnectToPeer (PostingRouter.h)
  Snapshot address under m_connMutex before retry loop.
- Fix BroadcastHeadSync reading m_nodeAddrs without lock (PostingRouter.h)
  Snapshot node count under m_connMutex before iterating.

Medium fixes:
- Fix m_storeToNodes race in AddNode - move inside m_connMutex scope.
- Fix unvalidated entryCount in HandleHeadSyncRequest with buffer-end
  tracking to prevent overruns from corrupt packets.
- Add buffer-end tracking in BatchRemoteAppendRequest::Read to catch
  overruns during per-item deserialization.
- Make m_asyncStatus atomic to fix race between async jobs and Checkpoint.
  Use exchange() for atomic read-and-reset in Checkpoint.
- Make shared ErrorCode ret atomic in LoadIndex and WriteDownAllPostingToDB
  parallel loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TerrenceZhangX and others added 11 commits May 10, 2026 05:11
Bug 32e cut HandleBatchAppendRequest/Put receiver fan-out from 4 to 2,
which was the wrong direction: the worker's overload comes from its
local AppendThread/InsertThread pools (~40 threads), not from the
receiver pool (4 threads, ~5% of total). Halving the receiver pool
only doubled per-chunk processing time on the receiving side, pushing
the driver's 180s wait_for past timeout and causing v30/v31 BATCH 1
end-of-bulk hangs.

This commit rebalances within the worker's 48-thread budget:
- AppendThreadNum:           32 -> 16  (template ini)
- NumInsertThreads:           8 -> 4   (template ini)
- HandleBatchAppendRequest:   2 -> 8   (RemotePostingOps.h)
- HandleBatchPutRequest:      2 -> 8   (RemotePostingOps.h)

Net active TiKV concurrency on worker drops from ~42 to ~32, while
receiver throughput quadruples. A 50k-item BatchAppend chunk now
drains in ~62s typical (vs ~250s with pool=2), comfortably under the
180s wait_for timeout in SendBatchRemoteAppendChunk.

Also adds LL_Info instrumentation around SendBatchRemoteAppendChunk's
wait_for so we can confirm slow-vs-deadlock at end-of-bulk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Distributed BATCH 1 recall stuck at 0.62 in v32 even though latency and
BATCH 0 recall met targets. Diagnosis from v32 logs:

  driver layer-1 BKTree:  117K samples, RMWs=451,        splits=0
  worker layer-1 BKTree:  657K samples, RMWs=4,197,126,  splits=273
  global alive heads:     200K

Driver's BKTree (and underlying m_pSamples) is missing ~83K of the 200K
alive heads — its query-time head index is blind to ~40% of the corpus,
which matches the observed recall drop. Underlying mechanism is that
BroadcastHeadSync was fire-and-forget: when the per-peer SendPacket
completion reported failure, entries were logged at LL_Debug and
dropped forever. Under v32's worker overload, many sends fail and the
peer's headIndex / m_pSamples diverge.

This commit:

1. RemotePostingOps: counters + per-peer retry queue
   - Atomic counters for broadcast entries, send OK / fail, recv
     entries, applied Add/Delete, retry enqueued / succeeded /
     dropped.
   - Per-peer HeadSyncBacklog (cap 256K entries) holds entries when
     a send completes with success=false; a dedicated retry thread
     periodically (every 500ms, tunable via SPTAG_HEADSYNC_RETRY_INTERVAL_MS)
     re-broadcasts up to 1024 entries per peer.
   - DumpHeadSyncStats() called at every layer-N ALL_DONE boundary +
     every 30s from the retry thread, so we can diff sender/receiver
     and detect any remaining gap.

2. ExtraDynamicSearcher Stage 2B fallback (GatherBarrier silent drop)
   - When a remote FetchPostings batch fails (rf.ok=false), instead
     of silently skipping ~rf.items.size() postings, fall back to
     local TiKV reads for those headIDs. Driver and worker share
     the same TiKV cluster so the routing was an optimization, not
     a correctness boundary.
   - Adds [v33] LL_Warning log + per-batch fallback hit counter so
     we can see when the path triggers.

3. ExtraDynamicSearcher worker FetchPostingsCallback diagnostic
   - Distinguish ErrorCode::Key_NotFound (legitimate empty posting)
     from any other failure inside GetPostingFromDB; log
     LL_Warning with fail / notFound counts so transient TiKV
     errors are no longer indistinguishable from empty-key misses.

4. WorkerNode forward methods for DumpHeadSyncStats /
   GetHeadSyncBacklogSize / DrainHeadSyncBacklog /
   NoteHeadSyncApplyAdd / NoteHeadSyncApplyDelete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause of BATCH 1 recall regression (v32 0.62 / v33 0.60 vs target
0.78): RemotePostingOps held SINGLE-SLOT callbacks (m_appendCallback,
m_headSyncCallback, m_remoteLockCallback, m_putPostingCallback,
m_fetchPostingsCallback). With Layers=2, each layer's
ExtraDynamicSearcher::SetWorker overwrites the previous; layer 1's
SetWorker runs last so all five callbacks captured layer-1's `this`
and m_layer=1. Every incoming RemoteAppend/HeadSync/Lock/Put/Fetch
event from peers — regardless of which layer it originated from —
routed to layer-1's lambda and called AddHeadIndex(..., m_layer+1=2,
...) which routes to the BKT (Index.h:246-269 fallback). Layer-0 head
VIDs polluted the BKT (m_pSamples is append-only, so churn caused
unbounded growth: worker BKT 327K mid-batch tracking v32's 657K final).

Forensics confirm HeadSync delivery is 100% correct (v33 counters:
broadcast 6720, send_ok 6720, send_fail 0, recv 533580). The 'loss'
suspected from v32 was a counter artifact (dispatcher counted as a
peer in m_nodeAddrs leading to double-count on send). Fix is therefore
NOT delivery — it is dispatch routing.

Generic per-layer design (supports >2 layers):
- DistributedProtocol.h: add std::int32_t m_layer = 0 to all 5 wire
  formats. Bumped MirrorVersion 0->1 on the four formats with a
  preamble (RemoteAppendRequest, RemoteLockRequest,
  RemotePutPostingRequest, RemoteFetchPostingsRequest) with conditional
  read on mirrorVer >= 1 so legacy packets default to layer=0.
  HeadSyncEntry has no preamble; appended unconditionally (driver and
  worker rebuilt monolithically).
- RemotePostingOps.h: replaced single-slot callbacks with
  std::vector<Callback> indexed by layer. EnsureLayerSlot_NoLock(layer)
  lazily grows. Owners array is std::vector<std::atomic<const void*>>
  with manual element migration on growth (atomic isn't movable). All
  5 Handle*Request dispatch sites look up by req.m_layer; missing
  callbacks emit LL_Warning rather than crash.
- WorkerNode.h: pass-throughs take int layer first arg. Default
  layer-agnostic Put/Fetch installed at layer 0. Lookup helpers fall
  back to layer 0 for PutPosting/FetchPostings only (layer-agnostic
  raw m_db ops; build-receiver mode still works without ES binding).
  Append/HeadSync/RemoteLock have NO fallback (layer semantics are
  required for correct routing).
- ExtraDynamicSearcher.h: SetWorker passes m_layer to all 5 setters,
  ClaimCallbackOwnership and ClearCallbacksIfOwner. All call sites
  set req.m_layer = m_layer before sending (4 QueueRemoteAppend, 2
  HeadSync entry construction blocks, SendRemoteFetchPostings,
  SendRemoteLock RAII guard).

This makes the per-layer callback routing the source of truth: the
dispatch table is keyed by layer, the captured `this` in each lambda
is the right instance for that layer, and the wire format carries
the layer so the receiver can route correctly without inferring it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…it lock storm

Root cause:
- AppendBatchAsync / Append() had no ownership check. For splits with
  ~119 reassign-target heads, ~50% are remote-owned. The lock pipeline
  in AddIndexAsyncSingleKey acquired m_rwLocks[headID] for ALL of them
  (held across Pass 1-4, blocking 16 concurrent split workers), then
  did MultiGet against LOCAL TiKV (returning NotFound for remote heads),
  then Pass-4 serial Append() retries that wrote the new bytes under the
  remote head's key, creating orphan data the real owner never sees.
- Effect: 99-225s per split (avg) vs 2.2s on 1-node = 45-100x slower.

Fix (two complementary guards):
1. AppendBatchAsync entry-time pre-filter: when worker is enabled,
   partition headAppends into local vs remote; remote -> QueueRemoteAppend
   immediately; local-only subset flows through existing lock pipeline.
2. Append() entry-time defensive check mirroring the same routing.
   Catches direct callers that bypass AppendBatchAsync (ReassignAsync,
   recursive Split, Pass-4 sync-retry fallback).

Consistent with existing ownership checks in Split (line 996),
MergePostings (line 1714), AddIndex distributed path (line 3782).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… storm

In 2-node sharded mode, each node only writes half of layer-1 posting
data locally. This makes layer-1 postings unnaturally small (<10
entries), so AsyncMergeInSearch trigger at SearchIndex
(realNum<=mergeThreshold) fires on nearly every visited layer-1 head:
  1-node BATCH 1: 182 layer-1 merges total
  2-node BATCH 1: 14M (driver) + 18M (worker) = 32M layer-1 merges

That floods m_splitThreadPool with noop merge jobs (Bug-11 fast path
drops 50% as remote-owned), starving real layer-0 split/append work.
Layer 1+ in SPANN holds BKT-structural data; merge-on-small is
meaningless there.

Fix: in distributed mode gate MergeAsync to (a) leaf layer (m_layer==0)
(b) locally-owned head. 1-node behavior preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sync

In sharded distributed mode, AddIndex was synchronously calling Append()
per vector for the ~50% of selections owned by the local node. Each
Append() does a TiKV Get+Put roundtrip (~8ms), and AddIndex runs on a
single insert thread, capping pre-split insert throughput at ~125/s/node
vs 466/s for the 1-node baseline (which batches all appends via
AppendBatchAsync and fan-outs through m_splitThreadPool).

This change mirrors the 1-node path: accumulate local-owned appends into
a headID->buffer map and call AppendBatchAsync once at end-of-batch.
Remote-owned appends continue to flow through QueueRemoteAppend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move 2-node from independent-PD-per-node to a single PD-raft group with
max-replicas=1. Compute nodes become stateless TiKV clients: PD routes
each region read/write to whichever store owns it, so the distributed
routing layer no longer needs cross-compute fetch RPCs.

Files

run_distributed.sh
  - tikv_start: build a shared initial-cluster and pd-endpoints list;
    every PD/TiKV joins the same raft group.
  - PD readiness now waits for the expected member count, not just
    /pd/api/v1/members reachability.
  - Fix max-replicas POST: previous code used GET-only /config/replicate
    and silently failed. Now POSTs to /pd/api/v1/config, verifies via
    GET, retries up to 5x. (Still racey if PD raft is slow to settle;
    manual workaround is to POST after Phase 0.)
  - tikv_switch_to_nocache mirrors the shared pd-endpoints layout.

SPFreshTest.cpp
  - BenchmarkFromConfig now sets TiKVPDAddresses to the FULL endpoint
    list (was per-worker local PD only). Same for the build-distribute
    receiver path.

cluster_2node.conf
  - Documentation block on shared raft mode.

tikv.toml (server tuning, v41)
  - apply-pool-size 4 -> 32, store-pool-size 4 -> 16: apply pool was the
    primary write-amp source (TiKV pegged at ~8/96 cores while client
    ops queued). Moves us to ~10 cores per node under burst.
  - grpc-concurrency 16 -> 32, grpc-memory-pool-quota 8GB -> 16GB.
  - scheduler-worker-pool-size 16 (new).
  - readpool.unified max=32 min=8 (new): cap reads so they don't steal
    CPU from writes.
  - max-background-flushes=8, raft-write-batch-size=1MB (new).
  - defaultcf write-buffer-size 512MB -> 1GB, max-write-buffer-number
    5 -> 8, level0 slowdown/stop 28/40 -> 40/60.

ExtraDynamicSearcher.h (HeadSync top-only + remote-fetch removal)
  - Broadcast HeadSync for split/merge only when the head update targets
    the in-memory BKT (m_headIndex->GetDiskIndex(m_layer+1) == nullptr).
    Lower-layer head updates already write to shared TiKV via the local
    DiskIndex chain and are visible to every compute, so broadcasting
    them duplicates writes and can pick a different posting if peer BKT
    state diverges.
  - MultiGet/MultiScan no longer fan out SendRemoteFetchPostings;
    everything goes through the local PD-routed TiKVIO path.

WorkerNode.h (auto-flush + per-node inflight cap)
  - QueueRemoteAppend auto-flushes per node once the per-node queue
    reaches kAutoFlushThreshold=50000 (was: hold everything until
    end-of-batch FlushRemoteAppends, then serial drain). Up to
    kMaxInflightPerNode=4 chunks may be in flight per node so a
    producer burst (split fan-out, reassign wave) can saturate the
    receiver bg-executor pool.
  - FlushRemoteAppends waits for any straggler auto-flush to drain
    before sending the final tail. Per-node mutex on the end-of-batch
    sender keeps tail-to-same-node sends ordered.

RemotePostingOps.h
  - kChunkSize history block; current setting 50000. v42 (10k) was
    throughput-best (906/s) but during-insert p50 222ms; v43 (50k)
    trades throughput (-22% -> 704/s) for during-insert p50 (-36% ->
    141ms) and post-insert r1 QPS 47 -> 85. v44 (100k) blew up tail
    drain: a single 100k chunk took 116s on the receiver -> 40+ min
    end-of-batch drain (vs 8 min at 50k). 50k is the sweet spot.
  - Per-chunk sub-worker fan-out comment: kept at 8 (v39 baseline);
    combined with cap=4 inflight yields TiKV-side concurrency 4*8=32.

benchmark_insert_dominant_2node.ini
  - VersionCacheTTLMs=0, VersionCacheMaxChunks=0 for nocache mode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Background

ExtraDynamicSearcher::AddIndex had two near-identical per-vector loops:
one for the distributed (router-enabled) path with cross-node ownership
routing, and one for the single-node fallback with the WAL hook.
PutAssignment was only called in the second loop, so the in-flight
queue / RPC pipeline in the distributed path bypassed the WAL even when
EnableWAL=true.

Change

Collapse the two loops into one. RNGSelection, version bump, Serialize,
and the WAL PutAssignment happen exactly once per vector, before any
routing decision. The routing branch (m_worker->GetOwner ->
QueueRemoteAppend) lives inside the per-replica inner loop and only
fires when m_worker && m_worker->IsEnabled(). Local-owned (and
non-routed) heads accumulate into a single headAppends map and are
flushed via AppendBatchAsync at the end.

VID encoding

routed mode serializes LocalToGlobalVID(VID) for cross-node uniqueness;
non-routed mode keeps the local VID. Same behavior as before, just
expressed via a single payloadVID local.

Durability semantics

Gated by m_opt->m_enableWAL (default false in
ParameterDefinitionList.h), so this commit is a no-op at runtime for
existing benchmarks. When EnableWAL=true is set, AddIndex Success now
implies that every vector — including remote-owned ones — has been
persisted to the local PersistentBuffer (RocksDB write, default
sync=false) before the producer returns. The in-memory append queue
and BatchAppendChunk RPC pipeline can then be replayed from the WAL on
restart.

Note: PersistentBuffer's underlying RocksDBIO::Put still uses default
WriteOptions (sync=false), so this protects against producer crash but
not power loss. Upgrading to options.sync=true is a follow-up and
should be measured against a durable-throughput target separately.

Replay caveat

PersistentBuffer's existing replay (StartToScan/NextToScan) was written
for local-VID-only records. When EnableWAL=true is turned on for a
distributed run, replay will need to be made VID-mode aware (decode the
serialized record and re-dispatch via the same routed/non-routed split
this commit just unified). Left as a follow-up since the immediate goal
was to have the code path present without measuring overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ibuted-to-tikv

Merged SearchStats latency breakdown support into distributed query paths.
Both distributed (multi-node dispatch) and single-node search now use
spannIndex->SearchIndex(result, &searchStats) for per-query latency breakdown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@TerrenceZhangX TerrenceZhangX marked this pull request as ready for review May 14, 2026 10:54
TerrenceZhangX and others added 7 commits May 14, 2026 13:10
…batch version checks)

Conflicts resolved:
  - .gitignore: kept both sides' ignore patterns
  - Test/src/SPFreshTest.cpp: kept HEAD's distributed-mode BenchmarkQueryPerformance
    routing (dispatcher broadcast + per-node query partition); HEAD already has
    the SearchBreakdown JSON output block that qiazh added at a different
    location, so qiazh's duplicate was dropped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gRPC dedupes subchannels by (address, channel-args). With identical args
across all kStubPoolSize stubs, every stub multiplexed onto a SINGLE TCP
socket — serviced by ONE grpc-server worker thread regardless of the
TiKV-side grpc-concurrency setting. Under high client concurrency this
single server thread saturates near 85% CPU, queues all incoming RPCs,
and the 100ms client deadline expires while requests sit in the server
completion queue (manifests as a flood of "Deadline Exceeded" even
though RocksDB engine_get p99 is <100us and there is no write stall).

Setting GRPC_ARG_CHANNEL_POOL_DOMAIN to a per-stub unique string
("sptag-stub-" + index) puts each channel in its own subchannel pool,
forcing a separate TCP connection per stub. After this fix:
  * TCP connections to TiKV: 1 → 96 (48 stubs × {TiKV, PD})
  * grpc-server-0 CPU: 84% → 3.6% (load balanced across all 32 threads)
  * Deadline Exceeded / retry exhausted / read postings fail: all 0
during a 26-min insert_dominant bench (1M base + 1M insert).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When MergePostings runs locally and the surviving head wins over a remote
candidate's posting, the initiator now sends a RemoteDeletePosting to the
owner node which deletes its posting, removes the local head entry, and
broadcasts a HeadSync(Delete) to peers so other BKTs drop the head.
Idempotent on the owner side (already-deleted returns Success).

Also adds HeadSync aggregation during RefineIndex: each MergePostings
firing its own single-entry BroadcastHeadSync produces
|mergelist| * (numNodes-1) tiny RPCs at RefineIndex time. The new
m_refineHeadSyncBuffer collects them so RefineIndex flushes one
batched broadcast after all submitted merges drain.

Files:
  * Packet.h           - DeletePostingRequest/Response opcodes (0x10)
  * DistributedProtocol.h - RemoteDeletePostingRequest/Response structs
  * RemotePostingOps.h - DeletePostingCallback registration + send path
  * WorkerNode.h       - SendRemoteDeletePosting / SetDeletePostingCallback
  * ExtraDynamicSearcher.h - integrate delete-posting callback,
                             HeadSync aggregation, RefineIndex flush

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  * Add benchmark_10m_template.ini: 9M base + 1M insert (1.2GB head index,
    ~13 min build). Useful intermediate between insert_dominant (1M+1M,
    ~10 min total) and 100m (99M+1M, ~5h total) for validating 1node→2node
    scaling without paying the full 100M build cost.

  * Fix benchmark_100m_template.ini + benchmark_insert_dominant_template.ini
    dataset paths from /mnt/data/sift1b/{base.1B,query.public.10K}.u8bin
    (reference machine layout) to /mnt/nvme/sift1b/{bigann_base,query.10K}.u8bin
    (our layout). insert_dominant scale fixed at 1M base + 1M insert (the
    1M+10M variant was too long to drive iteration).

  * cluster_{2,3}node.conf: document image-ref overrides (tikv_image,
    pd_image, helper_image) for environments that cannot push to the
    registry — pull-only ACR usage now self-explained.

  * run_distributed.sh (+181/-40): assorted benchmark driver fixes
    including better TiKV start/stop ordering, SKIP_HEAD_BUILD support
    at 100M scale, and improved logging.

  * .gitignore: ignore generated benchmark_*_{1,2,3}node.ini snapshots
    (created at runtime from templates) and tikv.toml.v*bak backups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These three docs predate the current TiKV-backed distributed design and
have diverged from the actual implementation:

  * distributed-bugs.md          - early bug list, no longer accurate
  * distributed-flow-diagrams.md - stride-sharding-era flow diagrams
  * distributed-job-routing.md   - superseded job-routing design

The authoritative distributed design now lives in
docs/TiKVDistributedVersionMapDesign.md (kept).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The build-receiver path in SPFreshTest had a fixed BuildReceiverTimeout
(default 7200s) used as a defensive cap so a worker would not hang
forever if the driver crashed or its TCP connection silently dropped.
That timeout had to be retuned every time the build duration grew (it
fired prematurely at 100M scale where BKT+head build alone takes ~5h),
which is both bug-prone and the wrong scaling axis -- worker lifetime
should track driver liveness, not absolute wall-clock build time.

Replace it with an explicit driver-to-worker Heartbeat dispatch:

  - DispatchCommand::Type::Heartbeat (=3) joins the existing Search /
    Insert / Stop enum. BroadcastDispatchCommand treats it like Stop:
    no PendingDispatch state, no DispatchResult sent back, no log spam.
  - DispatcherNode owns a dedicated heartbeat pump thread that issues
    a Heartbeat broadcast every HeartbeatIntervalSec seconds (30 s by
    default). StopHeartbeat() joins within ~100 ms.
  - The driver starts the pump in BuildOnly+Distributed mode after the
    ring is synced and stops it just before broadcasting Stop, so the
    pump's lifetime is exactly the build window during which workers
    are blocked in their dispatch callback.
  - The worker BUILD_DISTRIBUTE_RECEIVER block polls the stop promise
    in 5 s slices and tracks lastHeartbeatNs as an atomic. If no
    heartbeat arrives for HeartbeatTimeoutSec (default 180 s) it exits
    with an error message; otherwise Stop reception exits cleanly.

Net effect: at any build duration the worker exits within 180 s of
true driver death and never bails on a healthy long build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The C++ heartbeat watchdog (previous commit) bounds worker lifetime
to ~180 s after the driver dies, which is correct but conservative.
A shell-side watchdog spawned from run_distributed.sh closes that gap
faster: as soon as the driver process is gone we tear down the local
SSH wrappers and issue a name-based terminate on the remote SPTAGTest
workers, so a crashed driver does not leak workers across iterations.

  - start_driver_watchdog forks a subshell that polls the driver PID
    every 5 s. On exit it signals each WORKER_SSH_PID and issues a
    remote name-based terminate via SSH on every worker host (with
    a hard fallback after 5 s).
  - Both phase-1 (build) and phase-3 (run) driver launches invoke it
    immediately after starting workers, and stop_driver_watchdog runs
    after the driver's wait returns, so the watchdog is alive exactly
    when there is an actual driver to watch.
  - The INT/TERM trap also calls stop_driver_watchdog so Ctrl-C never
    leaves an orphan watchdog hanging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread AnnService/inc/Core/Common/FineGrainedLock.h Outdated
SPTAGLIB_LOG(Helper::LogLevel::LL_Error,
"[Bug 25] Search post-fetch: out-of-range VID:%d (count:%d, deficit:%d > %d, numWorkers:%d); records will be skipped\n",
maxVid, currentCount, deficit, maxAllowed, numWorkers);
} else if (m_versionMap->AddBatch(deficit) != ErrorCode::Success) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why search need to be changed?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each node's local m_count tracks only its own AddIndex progress, not peers'. When search reads a TiKV posting whose bytes were written by a peer running ahead of us, those record VIDs can fall above our local m_count. GetVersion short-circuits any VID >= m_count to 0xfe, so without growing m_count first the post-heap filter would drop those records as deleted and recall would silently regress. This loop grows m_count just enough to cover the largest VID we actually fetched

// at the trigger site avoids the round trip entirely.
// Single-node behavior (m_worker disabled) is preserved.
if (m_opt->m_asyncMergeInSearch && realNum <= m_mergeThreshold) {
if (m_worker && m_worker->IsEnabled()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it is not local and not layer 0, it still needs to do mergeasync.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. This is over-simplification. Fixed to mergeAsync trigger based on headID owner.

TerrenceZhangX and others added 3 commits May 18, 2026 02:08
…ug tags

* ExtraDynamicSearcher: drop the per-record SetVersionBatch loop in the
  TiKV search path. With the LRU chunk cache removed from the distributed
  versionMap design, BatchGetVersions reads chunks directly from TiKV and
  already sees peer-written versions, so re-writing those bytes from
  search is pure overhead and exposes the chunked-RMW lost-update race.
  Keep the AddBatch capacity catch-up: TiKVVersionMap::BatchGetVersions
  short-circuits any VID >= m_count to 0xfe, and m_count is a process-
  local atomic that does not auto-refresh from peers; without growing
  it first, the post-heap filter silently drops peer-allocated VIDs and
  recall regresses. Comment rewritten to focus on why search needs to
  do this at all.

* Rename LocalToGlobalVID to AllocateGlobalVID. The function is the
  cross-node global-ID allocator (stripes new VIDs across compute
  workers), not a passive mapping; the new name reflects intent.

* Drop [Bug NN] tags from comments and log messages in
  ExtraDynamicSearcher.h and Distributed/NetworkNode.h. These were
  iteration markers during distributed bring-up; with the design now
  stable the tags read as patches rather than design.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Search-side AsyncMergeInSearch previously gated on layer==0 && isLocal:
remote-owned heads under m_mergeThreshold were silently dropped. The
'layer 0 only' rationale (each node only writes half a layer-1 posting)
doesn't hold on shared TiKV with DBKey routing -- the owner stores the
full posting, so layer-1 observations are real underfull signals worth
acting on. And the isLocal gate threw away peer observations entirely:
MergeAsync is local-only, MergePostings drops non-owner jobs at entry,
and no other channel (RefineIndex only runs on SaveIndex) was telling
the owner about underfullness in a search-heavy workload.

Add a fire-and-forget MergeRequest packet (PacketType 0x11, mirrors
HeadSyncRequest semantics -- no response, no retry queue) so peers can
notify the owner. Search trigger now routes by owner:
  - target.isLocal  -> MergeAsync (unchanged)
  - target remote   -> QueueRemoteMerge(node, layer, headID)

WorkerNode batches hints per-target with (layer, headID) set-dedup;
auto-flush at 8192 entries (merge hints are non-urgent so the larger
window dedupes harder and amortizes TCP overhead). End-of-batch
FlushRemoteMerges() is called from SPFreshTest at every insert-batch
boundary and after every search round so no hint is dropped at
shutdown. Receiver's HandleMergeRequest fans out to the per-layer
MergeCallback registered by ExtraDynamicSearcher::Configure, which
just calls MergeAsync(headID) locally -- MergePostings then enforces
the same owner / size / dedup invariants as the in-process path.

New stats (DumpMergeRequestStats, printed periodic / ALL_DONE /
shutdown alongside the HeadSync stats):
  send_ok    send_fail    recv_hints    recv_dropped

Verified on insert_dominant_2node (1M base + 1M insert): 47/93 hints
delivered each way with zero send_fail / recv_dropped, total merge job
count stays ~450 across build+insert+search (no flood: layer-1 merges
stay in the hundreds, not millions).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@TerrenceZhangX TerrenceZhangX requested a review from zqxjjj May 18, 2026 06:59
TerrenceZhangX and others added 4 commits May 18, 2026 07:12
Two pure-noise differences vs origin/main that drifted in from older
commits. Removing them shrinks the branch diff without changing
behavior.

  - Around the per-vector dedup loop: revert the blank-line / debug
    comment ordering back to main's layout. The line content (the
    only the ordering of the commented-out DEBUG log and the blank
    line is restored.

  - Drop the unused 'postingsForSplit' ConcurrentSet introduced by
    commit 291f3f0. It is only ever inserted into and never read,
    so the entire variable plus its single insert call is dead code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The fresh-build / resume-mode split was added by 77ed787 (SPANN
distributed: sharded build/search). The resume-mode branch is dead in
practice — when we resume from a checkpoint, p_localToGlobal is empty
(R() == 0) and the outer 'if (p_localToGlobal.R() > 0)' guard already
skips the whole block. The else branch only ran in invalid call
combinations and just re-wrote the same mapping the build path would
have produced. Removing it restores the single-loop pre-distributed
shape and shrinks the diff vs main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the cross-node PutPosting routing in WriteDownAllPostingToDB
along with the protocol (RemotePutPosting*, packet 0x0D/0x0E),
WorkerNode queue/flush, RemotePostingOps callbacks, the BUILD_DISTRIBUTE_RECEIVER
worker mode, and the matching Phase-1 worker launch in run_distributed.sh.

With a shared TiKV cluster (PD-routed writes), the driver writes every
posting directly via TiKVIO -- no TCP hop to a peer that just re-Puts.
Build now runs as true single-node; workers come up only in the run phase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes 4 gratuitous blank-line additions/deletions in main-side code
(no semantic change) to reduce review noise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants