[BUG] RDMA bench: intermittent throughput cliff at small messages / low pace targets #126

@RamyaGuru

Summary

The RoCE/RDMA path (daqiri_bench_rdma + libdaqiri_rdma) intermittently collapses
from line rate (~97 Gb/s) to near-zero (~0.001–0.7 Gb/s) on a true over-the-wire
loopback. The collapse is not a clean function of message size — it is a race:
the same configuration runs healthy in some cells and cliffs in others. Large
messages (1–8 MB) are usually fine; sub-1 MB messages cliff almost every time, and
even an 8 MB run can cliff at a specific pace target. Throughput collapses to ~11–230
ops/s with zero CQE errors and zero drops, so the QP/wire are healthy — the stall
is software-side, on the bench/manager op-rate path.

This is the issue previously tracked internally as the "RDMA pps cliff" (follow-up #5).

Environment

  • DGX Spark (GB10), single ConnectX-7, two cabled ports in a netns wire loopback
    (dq_wire_client 10.250.0.1 / dq_wire_server 10.250.0.2, RoCEv2, RC).
  • daqiri_bench_rdma run split-process: --mode server and --mode client, each in
    its own network namespace (see scripts/setup_spark_wire_loopback_netns.sh,
    examples/daqiri_bench_rdma_tx_rx_spark_netns_{server,client}.yaml).
  • MRs kind: host_pinned, num_bufs: 20. Driver: examples/run_spark_bench.sh rdma.
  • Branch 15-bench-spark-perf-doc, library built after #113 - #115 (cross-host SEND
    timeout + shutdown deadlock fix). Crash/0-byte symptoms from the stale pre-#113 - #115
    library are a separate, already-resolved problem; this is observed on the fixed lib.

Reproduction

# bring up the netns wire loopback (in the privileged container, as root)
./scripts/setup_spark_wire_loopback_netns.sh up && ./scripts/setup_spark_wire_loopback_netns.sh verify

# message-size sweep (batch 1, unpaced) -- sub-1 MB cells cliff
./examples/run_spark_bench.sh rdma sweep

# pace sweep at 8 MB -- an intermittent target collapses
./examples/run_spark_bench.sh rdma drop-curve

Evidence (2026-06-05)

Message-size sweep (bench-results/20260605T194748Z-rdma-sweep, batch 1, unpaced):

message_size    gbps      pps   drops
8 MB           97.49    1,523       0
1 MB           96.92   11,553       0
64 KB          0.121      231       0
8 KB           0.005       74       0
4 KB           0.001       33       0

Below 1 MB, pps falls as messages shrink (it should rise), i.e. a per-op stall, not an
efficiency loss.
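
To put a number on "it should rise", the relation between message rate and throughput can be sketched as below. This is only illustrative arithmetic: it assumes payload-only accounting, 1 Gb = 1e9 bits, and binary message sizes (64 KB = 65,536 bytes), and both function names are hypothetical. It gives an upper bound for a link-limited run, not a claim about what the bench can reach at small sizes.

```cpp
#include <cassert>
#include <cstdint>

// Message rate needed to sustain `gbps` with messages of `msg_bytes` bytes.
// Assumes payload-only accounting (no header overhead) and 1 Gb = 1e9 bits.
inline double pps_for_rate(double gbps, std::uint64_t msg_bytes) {
    assert(msg_bytes > 0);
    return gbps * 1e9 / (static_cast<double>(msg_bytes) * 8.0);
}

// Throughput in Gb/s implied by a measured message rate.
inline double gbps_from_pps(double pps, std::uint64_t msg_bytes) {
    return pps * static_cast<double>(msg_bytes) * 8.0 / 1e9;
}
```

Under these assumptions, gbps_from_pps(231, 65536) is about 0.121, matching the 64 KB row, while pps_for_rate(97.0, 4096) is roughly 2.96 million messages/s, against the 33 measured in the 4 KB cell.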

Pace sweep at 8 MB (bench-results/20260605T195211Z-rdma-drop-curve):

target gbps      achieved                 drops
1 / 5 / 10       0.99 / 4.97 / 9.97       0
25               0.695 (cliff)            0
50 / 75 / 100    49.98 / 74.98 / 96.40    0
unpaced          96.89                    0

target=25 collapsed while its neighbours were perfect — confirms the cliff is
intermittent / a race, not deterministic by shape.

Ruled out (as the explanation for the cliff variance)

  • CQE errors / drops — zero in every affected cell; the QP and wire are healthy.

  • get_rx_burst "No server RX ring found" criticals as a predictor — the count is
    ~26,000×/run in every cell, healthy and cliffed alike (e.g. 26,037 in the 96.4 Gb/s
    cell vs 25,968 in the 0.695 Gb/s cell). Because the count does not vary with
    throughput, it does not explain why one cell cliffs and its neighbour doesn't.

    Important: this only rules the spam out as a predictor of the per-cell variance;
    it does not mean the spam is harmless or unrelated. The underlying condition (the
    server's RX ring lookup returning null in steady state) is itself wrong and may share a
    root cause with the cliff (see below).

The error spam is itself a bug (not valid behavior)

get_rx_burst emits CRITICAL ~866×/second in steady state on a healthy run. A
hot-path CRITICAL is wrong by definition — that level means "something is broken" — and
it has two concrete problems:

  1. Wrong lookup. rx_rings_map_[conn_id] (server branch
    src/managers/rdma/daqiri_rdma_mgr.cpp:1777, client branch :1787) uses operator[],
    which inserts a null entry on a miss instead of reporting absence. It should be
    .find(). The null-insert is also actively harmful: it can poison
    rdma_get_server_conn_id's gate at :854, which checks
    rx_rings_map_.find(client_id) == end(). Once operator[] has inserted a null-valued
    entry, that key now exists, so the gate passes and hands back a conn_id whose ring is
    null → guaranteed miss on the next get_rx_burst, repeated every drain attempt.
  2. Wrong level / control flow. Returning INVALID_PARAMETER makes the bench's drain
    loop break immediately (examples/rdma_bench.cpp:133), so the server stops draining
    RX completions for that iteration. If a missing/absent ring is ever a legitimate
    transient (polled before the ring is wired up, or a role with no RX ring), the correct
    response is a quiet NOT_READY, not a CRITICAL + INVALID_PARAMETER. At minimum the
    log should be DEBUG and/or rate-limited.
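
The poisoning sequence in item 1 can be reproduced in isolation. The sketch below is a minimal illustration, assuming rx_rings_map_ behaves like a std::unordered_map from connection id to ring pointer. RxRing, ConnId, RingMap and the three helper functions are hypothetical names chosen for the example, not the manager's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins; the real types in daqiri_rdma_mgr.cpp may differ.
struct RxRing {};
using ConnId = std::uint64_t;
using RingMap = std::unordered_map<ConnId, RxRing*>;

// Buggy shape: operator[] default-inserts a nullptr value on a miss,
// so a failed lookup silently creates the key.
inline RxRing* lookup_with_subscript(RingMap& rings, ConnId id) {
    return rings[id];
}

// Safer shape: find() reports absence and never mutates the map,
// which is why it can take the map by const reference.
inline RxRing* lookup_with_find(const RingMap& rings, ConnId id) {
    auto it = rings.find(id);
    return it == rings.end() ? nullptr : it->second;
}

// A key-existence check in the style the issue describes for the :854 gate.
inline bool gate_sees_key(const RingMap& rings, ConnId id) {
    return rings.find(id) != rings.end();
}
```

On an empty map, lookup_with_find leaves the map empty and gate_sees_key stays false. After a single lookup_with_subscript on the same id, the map holds one null-valued entry and gate_sees_key returns true, which is the state that lets a key-existence gate hand out a conn_id with no usable ring.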

Suspected area

The collapse looks like a credit/completion-draining stall in the bench worker loop
(examples/rdma_bench.cpp:130-166) and/or the manager's RX completion path
(src/managers/rdma/daqiri_rdma_mgr.cpp), where the server worker fails to drain
RX completions for a connection and stops returning credits, throttling the op-rate.
The race-like dependence on timing (per-cell, per-target) points at the
connection/ring lifecycle across the CM thread, worker threads, and threads_mutex_
(rdma_get_server_conn_id gating at ~daqiri_rdma_mgr.cpp:854; ring populate at
:375, erase at :1417).

These two symptoms — the steady-state null RX ring and the intermittent cliff — may
share one root cause. A server that cannot reliably resolve its RX ring to drain
completions is exactly the kind of thing that stalls the op-rate; the constant ~26k count
only tells us the spam frequency doesn't vary with throughput, not that the underlying
null-ring condition is unrelated to the stall. The investigation should start by
explaining why rx_rings_map_ yields a null ring at all (is qp_params.rx_ring
null when stored at :375? is the entry erased/re-created mid-run at :1417? is the
poisoned-operator[] gate at :854 handing back a dead conn_id?) and only then decide
whether the cliff is the same bug or a second one.

Defects to address (likely related)

  1. (Contained, safe quick-win) Fix the get_rx_burst lookup and logging
    (src/managers/rdma/daqiri_rdma_mgr.cpp:1777, client branch :1787): use .find()
    instead of rx_rings_map_[conn_id] so a miss does not insert a null entry (which
    poisons the :854 gate), and stop emitting CRITICAL ~866×/s on the hot path —
    return NOT_READY quietly for a legitimately-absent ring and/or demote to DEBUG.
    This can land on its own, ahead of the deeper investigation.
  2. (Investigation, may be the same bug) Explain the steady-state null RX ring and the
    intermittent throughput cliff together. Needs a dedicated trace (per-iteration timing
    in the worker loop, rx_rings_map_ contents vs the conn_id handed out, completion-queue
    occupancy, credit accounting). The previously-leading "ring-miss predicts the cliff"
    hypothesis is disproven (constant count), but the null-ring lifecycle is the most
    promising starting point.
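
For the logging half of defect 1, one possible shape for a rate limit is sketched below. LogRateLimiter is a hypothetical helper, not an existing daqiri class. It assumes a single caller thread (it is not thread-safe) and takes the current time as a parameter so that the behaviour can be tested without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: allows at most one log line per `interval` and counts
// how many calls were suppressed in between. Not thread-safe.
class LogRateLimiter {
public:
    using Clock = std::chrono::steady_clock;

    explicit LogRateLimiter(Clock::duration interval) : interval_(interval) {
        assert(interval > Clock::duration::zero());
    }

    // Returns true if the caller should log now. On true, `suppressed_out`
    // receives the number of calls swallowed since the last allowed log.
    // On false, `suppressed_out` is left unchanged.
    bool allow(Clock::time_point now, std::uint64_t& suppressed_out) {
        if (!has_logged_ || now - last_ >= interval_) {
            suppressed_out = suppressed_;
            suppressed_ = 0;
            last_ = now;
            has_logged_ = true;
            return true;
        }
        ++suppressed_;
        return false;
    }

private:
    Clock::duration interval_;
    Clock::time_point last_{};
    std::uint64_t suppressed_ = 0;
    bool has_logged_ = false;
};
```

In the miss path of get_rx_burst, a caller could invoke allow(LogRateLimiter::Clock::now(), n) and emit a single DEBUG line that includes n only when it returns true. With multiple worker threads, each would need its own instance or external locking.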

Acceptance criteria

  • Message-size sweep (rdma sweep) shows monotonic-or-better pps as message size
    decreases; no cell collapses to <1 Gb/s without a corresponding drop/error cause.
  • Pace sweep (rdma drop-curve) tracks every target (incl. 25) within tolerance,
    repeatably across runs.
  • get_rx_burst no longer emits steady-state CRITICAL spam; lookups use .find().
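
The first criterion could be checked mechanically against parsed sweep results. The sketch below is one reading of it, with assumptions stated plainly: SweepCell and sweep_passes are hypothetical names, cells are expected in order from largest to smallest message size, "monotonic-or-better" is taken to mean pps never decreases as size shrinks, and a zero drop count stands in for "no drop/error cause".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row shape for one cell of the message-size sweep.
struct SweepCell {
    std::uint64_t msg_bytes;
    double gbps;
    double pps;
    std::uint64_t drops;
};

// Cells must be ordered from largest to smallest message size.
// Passes when pps never decreases as messages shrink and no drop-free cell
// falls below `min_gbps`.
inline bool sweep_passes(const std::vector<SweepCell>& cells,
                         double min_gbps = 1.0) {
    for (std::size_t i = 0; i < cells.size(); ++i) {
        if (i > 0) {
            assert(cells[i].msg_bytes < cells[i - 1].msg_bytes);
            if (cells[i].pps < cells[i - 1].pps) return false;
        }
        if (cells[i].gbps < min_gbps && cells[i].drops == 0) return false;
    }
    return true;
}
```

Fed the 2026-06-05 sweep rows, this check would fail at the step from 1 MB to 64 KB, where pps drops from 11,553 to 231.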
