Summary
The RoCE/RDMA path (daqiri_bench_rdma + libdaqiri_rdma) intermittently collapses
from line rate (~97 Gb/s) to near-zero (~0.001–0.7 Gb/s) on a true over-the-wire
loopback. The collapse is not a clean function of message size — it is a race:
the same configuration runs healthy in some cells and cliffs in others. Large
messages (1–8 MB) are usually fine; sub-1 MB messages cliff almost every time, and
even an 8 MB run can cliff at a specific pace target. Throughput collapses to ~11–230
ops/s with zero CQE errors and zero drops, so the QP/wire are healthy — the stall
is software-side, on the bench/manager op-rate path.
This is the issue previously tracked internally as the "RDMA pps cliff" (follow-up #5).
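Environment
DGX Spark (GB10), single ConnectX-7, two cabled ports in a netns wire loopback (dq_wire_client 10.250.0.1 / dq_wire_server 10.250.0.2, RoCEv2, RC).
daqiri_bench_rdma run split-process: --mode server and --mode client, each in its own network namespace (see scripts/setup_spark_wire_loopback_netns.sh, examples/daqiri_bench_rdma_tx_rx_spark_netns_{server,client}.yaml).
kind: host_pinned, num_bufs: 20. Driver: examples/run_spark_bench.sh rdma.
15-bench-spark-perf-doc, library built after "#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock" (#115), the cross-host SEND timeout + shutdown deadlock fix. Crash/0-byte symptoms from the stale pre-#115 library are a separate, already-resolved problem; this is observed on the fixed lib.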
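Evidence (2026-06-05)
Message-size sweep (bench-results/20260605T194748Z-rdma-sweep, batch 1, unpaced):
pps falls as messages shrink (it should rise), i.e. a per-op stall, not an efficiency loss.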
Pace sweep at 8 MB (bench-results/20260605T195211Z-rdma-drop-curve):
| target (Gb/s) | achieved (Gb/s) | drops |
|---|---|---|
| 1 / 5 / 10 | 0.99 / 4.97 / 9.97 | 0 |
| 25 | 0.695 (cliff) | 0 |
| 50 / 75 / 100 | 49.98 / 74.98 / 96.40 | 0 |
| unpaced | 96.89 | 0 |
target=25 collapsed while its neighbours were perfect — confirms the cliff is intermittent / a race, not deterministic by shape.
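To put the cliffed cell in op-rate terms, the conversion from throughput to operations per second can be sketched as below. This is only a worked calculation, not bench code: the helper name is hypothetical, and it assumes "8 MB" means 8 × 10^6 bytes (the bench may well use MiB, which would shift the numbers slightly).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a throughput in Gb/s (10^9 bits per second)
// into operations per second for a fixed message size in bytes.
double ops_per_second(double gbps, std::uint64_t message_bytes) {
    assert(message_bytes > 0);
    const double bytes_per_second = gbps * 1e9 / 8.0;
    return bytes_per_second / static_cast<double>(message_bytes);
}

// Assumption: decimal megabytes. Swap in 8ULL * 1024 * 1024 for MiB.
constexpr std::uint64_t kEightMB = 8ULL * 1000 * 1000;
```

Under that assumption, `ops_per_second(0.695, kEightMB)` comes to roughly 11 ops/s, in line with the low end of the ~11–230 ops/s range quoted in the summary, while the 96.40 Gb/s neighbour works out to roughly 1,500 ops/s.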
Ruled out (as the explanation for the cliff variance)
CQE errors / drops — zero in every affected cell; the QP and wire are healthy.
get_rx_burst "No server RX ring found" criticals as a predictor — the count is
~26,000×/run in every cell, healthy and cliffed alike (e.g. 26,037 in the 96.4 Gb/s
cell vs 25,968 in the 0.695 Gb/s cell). Because the count does not vary with
throughput, it does not explain why one cell cliffs and its neighbour doesn't.
Important: this only rules the spam out as a predictor of the per-cell variance —
it does not mean the spam is harmless or unrelated. The underlying condition (the
server's RX ring lookup returning null in steady state) is itself wrong and may share a
root cause with the cliff (see below).
The error spam is itself a bug (not valid behavior)
get_rx_burst emits CRITICAL ~866×/second in steady state on a healthy run. A
hot-path CRITICAL is wrong by definition — that level means "something is broken" — and
it has two concrete problems:
1. Wrong lookup. rx_rings_map_[conn_id] (server branch src/managers/rdma/daqiri_rdma_mgr.cpp:1777, client branch :1787) uses operator[], which inserts a null entry on a miss instead of reporting absence. It should be .find(). The null-insert is also actively harmful: it can poison rdma_get_server_conn_id's gate at :854, which checks rx_rings_map_.find(client_id) == end(). Once operator[] has inserted a null-valued entry, that key exists, so the gate passes and hands back a conn_id whose ring is null. The result is a guaranteed miss on the next get_rx_burst, repeated on every drain attempt.
2. Wrong level / control flow. Returning INVALID_PARAMETER makes the bench's drain loop break immediately (examples/rdma_bench.cpp:133), so the server stops draining RX completions for that iteration. If an absent ring is ever a legitimate transient (polled before the ring is wired up, or a role with no RX ring), the correct response is a quiet NOT_READY, not a CRITICAL + INVALID_PARAMETER. At minimum the log should be DEBUG and/or rate-limited.
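The difference between the two lookup styles, and how a null insert can defeat a presence check, can be shown in isolation. The sketch below is not the manager's code: `Ring`, `Status`, `RingMap` and all three functions are hypothetical stand-ins, and the container is assumed to be a standard map keyed by connection id with a raw pointer value.

```cpp
#include <cassert>
#include <unordered_map>

struct Ring {};  // stand-in for the real RX ring type (hypothetical)

enum class Status { SUCCESS, NOT_READY };

using RingMap = std::unordered_map<int, Ring*>;

// Problematic pattern: operator[] value-initializes and inserts a nullptr
// entry when the key is missing, so a read mutates the map.
Ring* lookup_with_subscript(RingMap& rings, int conn_id) {
    return rings[conn_id];
}

// Proposed pattern: find() reports absence without touching the map, and a
// missing or null ring maps to a quiet NOT_READY.
Status lookup_with_find(const RingMap& rings, int conn_id, Ring*& out) {
    const auto it = rings.find(conn_id);
    if (it == rings.end() || it->second == nullptr) {
        out = nullptr;
        return Status::NOT_READY;
    }
    out = it->second;
    return Status::SUCCESS;
}

// Presence-only gate, modelled on the description of the check at :854:
// any key that exists is treated as a usable connection.
bool gate_passes(const RingMap& rings, int client_id) {
    return rings.find(client_id) != rings.end();
}
```

With an empty map, `lookup_with_find` leaves the map empty and the gate closed. A single `lookup_with_subscript` call on the same key returns null, grows the map by one entry, and from then on the presence-only gate passes for a connection that has no ring.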
Suspected area
The collapse looks like a credit/completion-draining stall in the bench worker loop
(examples/rdma_bench.cpp:130-166) and/or the manager's RX completion path
(src/managers/rdma/daqiri_rdma_mgr.cpp), where the server worker fails to drain
RX completions for a connection and stops returning credits, throttling the op-rate.
The race-like dependence on timing (per-cell, per-target) points at the
connection/ring lifecycle across the CM thread, worker threads, and threads_mutex_
(rdma_get_server_conn_id gating at ~daqiri_rdma_mgr.cpp:854; ring populate at :375, erase at :1417).
These two symptoms — the steady-state null RX ring and the intermittent cliff — may
share one root cause. A server that cannot reliably resolve its RX ring to drain
completions is exactly the kind of thing that stalls the op-rate; the constant ~26k count
only tells us the spam frequency doesn't vary with throughput, not that the underlying
null-ring condition is unrelated to the stall. The investigation should start by
explaining why rx_rings_map_ yields a null ring at all (is qp_params.rx_ring
null when stored at :375? is the entry erased/re-created mid-run at :1417? is the
poisoned-operator[] gate at :854 handing back a dead conn_id?) and only then decide
whether the cliff is the same bug or a second one.
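One way to answer the three lifecycle questions is to route every populate, erase and lookup through a small guarded wrapper that counts the suspicious cases. The class below is a hypothetical sketch for instrumentation only; it does not reflect how rx_rings_map_ or threads_mutex_ are actually organised in the manager.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Ring {};  // stand-in for the real RX ring type (hypothetical)

// Hypothetical instrumentation wrapper: one mutex guards the map, a null
// ring is never stored, and each suspicious event is counted.
class RingRegistry {
public:
    // Refuses (and counts) an attempt to store a null ring.
    bool populate(int conn_id, Ring* ring) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (ring == nullptr) {
            ++null_populates_;
            return false;
        }
        rings_[conn_id] = ring;
        return true;
    }

    // Counts only erases that actually removed an entry.
    void erase(int conn_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        erases_ += rings_.erase(conn_id);
    }

    // Read-only lookup: returns nullptr on a miss and never inserts.
    Ring* lookup(int conn_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        const auto it = rings_.find(conn_id);
        if (it == rings_.end()) {
            ++misses_;
            return nullptr;
        }
        return it->second;
    }

    std::uint64_t null_populates() { std::lock_guard<std::mutex> l(mutex_); return null_populates_; }
    std::uint64_t erases() { std::lock_guard<std::mutex> l(mutex_); return erases_; }
    std::uint64_t misses() { std::lock_guard<std::mutex> l(mutex_); return misses_; }

private:
    std::mutex mutex_;
    std::unordered_map<int, Ring*> rings_;
    std::uint64_t null_populates_ = 0;
    std::uint64_t erases_ = 0;
    std::uint64_t misses_ = 0;
};
```

A non-zero `null_populates()` would point at the store site, `erases()` climbing mid-run would point at the erase site, and `misses()` with both of the others at zero would point at a conn_id being handed out for a key that was never populated.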
Defects to address (likely related)
1. (Contained, safe quick-win) Fix the get_rx_burst lookup and logging (src/managers/rdma/daqiri_rdma_mgr.cpp:1777, client branch :1787): use .find() instead of rx_rings_map_[conn_id] so a miss does not insert a null entry (which poisons the :854 gate), and stop emitting CRITICAL ~866×/s on the hot path. Return NOT_READY quietly for a legitimately-absent ring and/or demote to DEBUG. This can land on its own, ahead of the deeper investigation.
2. (Investigation, may be the same bug) Explain the steady-state null RX ring and the intermittent throughput cliff together. This needs a dedicated trace (per-iteration timing in the worker loop, rx_rings_map_ contents vs the conn_id handed out, completion-queue occupancy, credit accounting). The previously-leading "ring-miss predicts the cliff" hypothesis is disproven (constant count), but the null-ring lifecycle is the most promising starting point.
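For the per-iteration timing part of that trace, a minimal sketch is a counter object that the worker loop feeds once per iteration. `StallTracer` and its methods are hypothetical names, not existing bench code; timestamps are passed in explicitly so the logic can be exercised without real sleeps.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical tracing helper: tracks the gap between iterations that
// drained at least one completion and counts gaps above a threshold.
class StallTracer {
public:
    using Clock = std::chrono::steady_clock;

    explicit StallTracer(Clock::duration threshold) : threshold_(threshold) {
        assert(threshold_ > Clock::duration::zero());
    }

    // Call once per worker-loop iteration with the completions drained.
    void record(Clock::time_point now, std::uint32_t drained) {
        ++iterations_;
        if (drained == 0) {
            ++empty_polls_;
            return;
        }
        if (has_progress_) {
            const Clock::duration gap = now - last_progress_;
            if (gap > threshold_) ++stalls_;
            if (gap > longest_gap_) longest_gap_ = gap;
        }
        last_progress_ = now;
        has_progress_ = true;
    }

    std::uint64_t iterations() const { return iterations_; }
    std::uint64_t empty_polls() const { return empty_polls_; }
    std::uint64_t stalls() const { return stalls_; }
    Clock::duration longest_gap() const { return longest_gap_; }

private:
    Clock::duration threshold_;
    Clock::time_point last_progress_{};
    Clock::duration longest_gap_{Clock::duration::zero()};
    bool has_progress_ = false;
    std::uint64_t iterations_ = 0;
    std::uint64_t empty_polls_ = 0;
    std::uint64_t stalls_ = 0;
};
```

In a real loop the caller would pass `StallTracer::Clock::now()`. Comparing `stalls()` and `longest_gap()` between a healthy cell and a cliffed one would show whether the loop is spinning on empty polls or blocked between iterations, which the constant ring-miss count cannot distinguish.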
Acceptance criteria
Message-size sweep (rdma sweep) shows monotonic-or-better pps as message size
decreases; no cell collapses to <1 Gb/s without a corresponding drop/error cause.
Pace sweep (rdma drop-curve) tracks every target (incl. 25) within tolerance,
repeatably across runs.
get_rx_burst no longer emits steady-state CRITICAL spam; lookups use .find().
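The first criterion can be turned into an automated check over sweep results. The sketch below is one possible reading of it: `SweepCell` and `sweep_passes` are hypothetical, the field layout is not the bench's actual output format, and cells are assumed to be ordered from largest to smallest message size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-cell record; the real bench output may differ.
struct SweepCell {
    std::uint64_t message_bytes;
    double gbps;
    double pps;
    std::uint64_t drops;
    std::uint64_t cqe_errors;
};

// Returns true when pps never decreases as messages shrink and no cell
// falls below min_gbps without a recorded drop or CQE error.
// Precondition: cells are sorted by message_bytes, largest first.
bool sweep_passes(const std::vector<SweepCell>& cells, double min_gbps = 1.0) {
    for (std::size_t i = 0; i < cells.size(); ++i) {
        const SweepCell& cell = cells[i];
        const bool has_cause = cell.drops > 0 || cell.cqe_errors > 0;
        if (cell.gbps < min_gbps && !has_cause) {
            return false;  // unexplained collapse
        }
        if (i > 0) {
            assert(cells[i - 1].message_bytes >= cell.message_bytes);
            if (cell.pps < cells[i - 1].pps) {
                return false;  // pps fell as the message size shrank
            }
        }
    }
    return true;
}
```

Any values fed to it for illustration should be treated as synthetic. A run where a small-message cell reports near-zero throughput with zero drops and zero CQE errors fails the check, which is the cliff signature described above.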