[BUG] RDMA bench: intermittent throughput cliff at small messages / low pace targets

## Summary

The RoCE/RDMA path (`daqiri_bench_rdma` + `libdaqiri_rdma`) intermittently collapses
from line rate (~97 Gb/s) to near-zero (~0.001–0.7 Gb/s) on a true over-the-wire
loopback. The collapse is **not** a clean function of message size — it is a race:
the same configuration runs healthy in some cells and cliffs in others. Large
messages (1–8 MB) are usually fine; sub-1 MB messages cliff almost every time, and
even an 8 MB run can cliff at a specific pace target. Throughput collapses to ~11–230
ops/s with **zero CQE errors and zero drops**, so the QP/wire are healthy — the stall
is software-side, on the bench/manager op-rate path.

This is the issue previously tracked internally as the "RDMA pps cliff" (follow-up #5).

## Environment

- DGX Spark (GB10), single ConnectX-7, two cabled ports in a netns wire loopback
  (`dq_wire_client` 10.250.0.1 / `dq_wire_server` 10.250.0.2, RoCEv2, RC).
- `daqiri_bench_rdma` run split-process: `--mode server` and `--mode client`, each in
  its own network namespace (see `scripts/setup_spark_wire_loopback_netns.sh`,
  `examples/daqiri_bench_rdma_tx_rx_spark_netns_{server,client}.yaml`).
- MRs `kind: host_pinned`, `num_bufs: 20`. Driver: `examples/run_spark_bench.sh rdma`.
- Branch `15-bench-spark-perf-doc`, library built **after** #115 (cross-host SEND
  timeout + shutdown deadlock fix). Crash/0-byte symptoms from the stale pre-#115
  library are a separate, already-resolved problem — this is observed on the fixed lib.

## Reproduction

```bash
# bring up the netns wire loopback (in the privileged container, as root)
./scripts/setup_spark_wire_loopback_netns.sh up && ./scripts/setup_spark_wire_loopback_netns.sh verify

# message-size sweep (batch 1, unpaced) -- sub-1 MB cells cliff
./examples/run_spark_bench.sh rdma sweep

# pace sweep at 8 MB -- an intermittent target collapses
./examples/run_spark_bench.sh rdma drop-curve
```

## Evidence (2026-06-05)

**Message-size sweep** (`bench-results/20260605T194748Z-rdma-sweep`, batch 1, unpaced):

| message_size | gbps  | pps    | drops |
|--------------|-------|--------|-------|
| 8 MB         | 97.49 | 1,523  | 0     |
| 1 MB         | 96.92 | 11,553 | 0     |
| 64 KB        | 0.121 | 231    | 0     |
| 8 KB         | 0.005 | 74     | 0     |
| 4 KB         | 0.001 | 33     | 0     |

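Below 1 MB, pps *falls* as messages shrink, where it should rise: from 8 MB to 1 MB it
climbs as expected (1,523 to 11,553), then drops to 231 at 64 KB. That pattern indicates
a per-op stall, not an efficiency loss.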

**Pace sweep at 8 MB** (`bench-results/20260605T195211Z-rdma-drop-curve`):

| target gbps | achieved | drops |
|-------------|----------|-------|
| 1 / 5 / 10  | 0.99 / 4.97 / 9.97 | 0 |
| **25**      | **0.695** (cliff) | 0 |
| 50 / 75 / 100 | 49.98 / 74.98 / 96.40 | 0 |
| unpaced     | 96.89 | 0 |

`target=25` collapsed while its neighbours tracked their targets, which indicates the
cliff is **intermittent / a race**, not determined by configuration alone.

## Ruled out (as the *explanation for the cliff variance*)

- **CQE errors / drops** — zero in every affected cell; the QP and wire are healthy.
- **`get_rx_burst` "No server RX ring found" criticals as a *predictor*** — the count is
  ~26,000×/run in **every** cell, healthy and cliffed alike (e.g. 26,037 in the 96.4 Gb/s
  cell vs 25,968 in the 0.695 Gb/s cell). Because the count does **not** vary with
  throughput, it does not *explain why one cell cliffs and its neighbour doesn't*.

  **Important:** this only rules the spam out as a *predictor of the per-cell variance* —
  it does **not** mean the spam is harmless or unrelated. The underlying condition (the
  server's RX ring lookup returning null in steady state) is itself wrong and may share a
  root cause with the cliff (see below).

## The error spam is itself a bug (not valid behavior)

`get_rx_burst` emits `CRITICAL` ~866×/second in steady state on a healthy run. A
hot-path `CRITICAL` is wrong by definition — that level means "something is broken" — and
it has two concrete problems:

1. **Wrong lookup.** `rx_rings_map_[conn_id]` (server branch
   `src/managers/rdma/daqiri_rdma_mgr.cpp:1777`, client branch `:1787`) uses `operator[]`,
   which **inserts a null entry on a miss** instead of reporting absence. It should be
   `.find()`. The null-insert is also actively harmful: it can **poison
   `rdma_get_server_conn_id`'s gate at `:854`**, which checks
   `rx_rings_map_.find(client_id) == end()`. Once `operator[]` has inserted a null-valued
   entry, that key now *exists*, so the gate passes and hands back a conn_id whose ring is
   null → guaranteed miss on the next `get_rx_burst`, repeated every drain attempt.
2. **Wrong level / control flow.** Returning `INVALID_PARAMETER` makes the bench's drain
   loop `break` immediately (`examples/rdma_bench.cpp:133`), so the server stops draining
   RX completions for that iteration. If a missing/absent ring is ever a legitimate
   transient (polled before the ring is wired up, or a role with no RX ring), the correct
   response is a quiet `NOT_READY`, not a `CRITICAL` + `INVALID_PARAMETER`. At minimum the
   log should be `DEBUG` and/or rate-limited.

## Suspected area

The collapse looks like a credit/completion-draining stall in the bench worker loop
(`examples/rdma_bench.cpp:130-166`) and/or the manager's RX completion path
(`src/managers/rdma/daqiri_rdma_mgr.cpp`), where the server worker fails to drain
RX completions for a connection and stops returning credits, throttling the op-rate.
The race-like dependence on timing (per-cell, per-target) points at the
connection/ring lifecycle across the CM thread, worker threads, and `threads_mutex_`
(`rdma_get_server_conn_id` gating at ~`daqiri_rdma_mgr.cpp:854`; ring populate at
`:375`, erase at `:1417`).

**These two symptoms — the steady-state null RX ring and the intermittent cliff — may
share one root cause.** A server that cannot reliably resolve its RX ring to drain
completions is exactly the kind of thing that stalls the op-rate; the constant ~26k count
only tells us the *spam frequency* doesn't vary with throughput, not that the underlying
null-ring condition is unrelated to the stall. The investigation should start by
explaining **why `rx_rings_map_` yields a null ring at all** (is `qp_params.rx_ring`
null when stored at `:375`? is the entry erased/re-created mid-run at `:1417`? is the
poisoned-`operator[]` gate at `:854` handing back a dead conn_id?) and only then decide
whether the cliff is the same bug or a second one.

## Defects to address (likely related)

1. **(Contained, safe quick-win)** Fix the `get_rx_burst` lookup and logging
   (`src/managers/rdma/daqiri_rdma_mgr.cpp:1777`, client branch `:1787`): use `.find()`
   instead of `rx_rings_map_[conn_id]` so a miss does not insert a null entry (which
   poisons the `:854` gate), and stop emitting `CRITICAL` ~866×/s on the hot path —
   return `NOT_READY` quietly for a legitimately-absent ring and/or demote to `DEBUG`.
   This can land on its own, ahead of the deeper investigation.
2. **(Investigation, may be the same bug)** Explain the steady-state null RX ring and the
   intermittent throughput cliff together. Needs a dedicated trace (per-iteration timing
   in the worker loop, `rx_rings_map_` contents vs the conn_id handed out, completion-queue
   occupancy, credit accounting). The previously-leading "ring-miss predicts the cliff"
   hypothesis is disproven (constant count), but the null-ring lifecycle is the most
   promising starting point.

## Acceptance criteria

- Message-size sweep (`rdma sweep`) shows non-decreasing pps as message size
  decreases; no cell collapses to <1 Gb/s without a corresponding drop/error cause.
- Pace sweep (`rdma drop-curve`) tracks every target (incl. 25) within tolerance,
  repeatably across runs.
- `get_rx_burst` no longer emits steady-state `CRITICAL` spam; lookups use `.find()`.
