#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock by chloecrozier · Pull Request #115 · NVIDIA/daqiri

chloecrozier · 2026-06-02T23:13:56Z

Summary

Closes #113 (RDMA cross-host SEND fails with IBV_WC_RETRY_EXC_ERR) plus a graceful-shutdown deadlock the #113 fix exposes, and adds cross-host DGX Spark example YAMLs as the regression test.

Fixes (`src/managers/rdma/daqiri_rdma_mgr.{h,cpp}`)

Wrong interface index on the server ([BUG] RDMA cross-host SEND immediately fails with transport_retry_exceeded and the server never posts receives #113). When a client connects, the server picks which configured interface to use for that connection. It was using the wrong number: the hardware's port number from the InfiniBand driver, instead of the index into our own configured interface list. The two happen to look similar but mean different things, so the server thread immediately crashed trying to look up an interface that wasn't there. Because the server crashed before posting any receives, the client's first send had nothing to land in and timed out with RETRY_EXC_ERR. Fix: use the interface index we already saved when the listener was set up.
Server hangs on graceful shutdown (only reachable once the first issue was fixed). When the client disconnects, the server tries to clean up its worker thread. It was doing this badly in three ways at once: it held a lock while waiting for the worker, the worker had no way to know it should stop (it only watched the global "everything is shutting down" flag), and the stats-printing code also wanted that same lock. Once issue 1 stopped the server from crashing early, every clean run ended up stuck in this three-way deadlock and had to be SIGKILLed. Fix: give each worker its own "please exit" flag, set it when the client disconnects, and release the lock before waiting for the worker to actually stop.

The Couldn't find server params for address … line that may show up once at startup is a benign timing race (the app polls slightly before the listener is registered), not the bug.

Regression test

Three new examples/ YAMLs exercising the cross-host data plane that no existing config reached:

daqiri_bench_raw_{tx,rx}_spark_xhost.yaml: split per-host (raw bench has no --mode).
daqiri_bench_rdma_tx_rx_spark_xhost.yaml: single config selected via --mode {client,server}, matching the existing one-file-per-workload pattern.

Reuses the 1.1.1.1 / 2.2.2.2 IPs and CPU pinning of the existing _spark.yaml configs, with each address bound to its host's p0 instead of both on one box.

Tests ran to verify

python scripts/check_doc_refs.py: 23 YAMLs, all covered.
mkdocs build --strict: clean.
python scripts/check_html_links.py site/: clean.

Hardware (two cross-cabled DGX Sparks, GB10, ConnectX-7 fw 28.45.4028, IGX OS / Ubuntu 24.04 ARM, kernel 6.14, CUDA 13.0, driver 580.95.05):

RDMA xhost: Server received: 21761, Client received: 15256 (was 0 / 0 pre-fix); zero transport retry counter exceeded; server reaches Removed leftover hugepage file … and exits ~3 s after its --seconds timer with no SIGKILL. Logs: logs/spark_rdma_{rx,tx}_post_fix3.log (vs. pre-fix logs/spark_rdma_{rx,tx}.log).
Raw GPUDirect xhost: lossless over 15 s. TX 24,657,920 / RX 24,657,920 / dropped 0 (~106 Gbps). Logs: logs/spark_raw_{rx,tx}_post_fix.log.
Single-host regression (daqiri_bench_rdma_tx_rx_spark.yaml): same code paths, unchanged behavior.

Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>

greptile-apps · 2026-06-02T23:18:15Z

Greptile Summary

This PR fixes two RDMA cross-host correctness bugs: the server was indexing into the configured interface list using the IB hardware port number (cm_event->id->port_num) instead of the saved cfg-interface index (listen_iter->second.if_idx), causing the worker to crash before posting any receives and triggering RETRY_EXC_ERR on the client; and a three-way deadlock on graceful shutdown, where the CM event loop held threads_mutex_ across a join() that could only complete after ready_to_exit was set — which itself was under that same lock — while print_stats() also competed for it.

Interface index fix (daqiri_rdma_mgr.cpp line ~1311): replaces cm_event->id->port_num with listen_iter->second.if_idx, the value stored when the listener was registered.
Shutdown deadlock fix (daqiri_rdma_mgr.cpp DISCONNECTED handler): signals ready_to_exit, moves the thread handle out of worker_threads_ under a scoped lock, releases the lock, then calls join() — ensuring print_stats() can never block behind the join.
ready_to_exit atomicity (daqiri_rdma_mgr.h): upgraded from plain bool to std::atomic<bool>; the non-movable consequence is handled by switching server_q_params_ population to try_emplace, and three new cross-host DGX Spark example YAMLs are added as regression configs with full doc updates.

Confidence Score: 5/5

Safe to merge — both bugs are narrowly scoped to the RDMA server connection setup and shutdown path, the fixes are straightforward and well-reasoned, all commits carry DCO sign-offs, and hardware regression results confirm the fixed behavior.

The interface-index fix is a one-liner swap of the wrong field for the correct one already in scope. The deadlock fix correctly sequences signal → lock → move → unlock → join → cleanup, with ring returns happening only after the join. The atomicity upgrade for ready_to_exit and the try_emplace workaround for the non-movable type are both sound. All items flagged in the previous review round are addressed.

No files require special attention.

Important Files Changed

Filename	Overview
src/managers/rdma/daqiri_rdma_mgr.h	Changes `ready_to_exit` from `bool` to `std::atomic<bool>` with an in-place initializer; adds explanatory comment. Both issues raised in previous review are resolved.
src/managers/rdma/daqiri_rdma_mgr.cpp	Three targeted fixes: (1) uses `listen_iter->second.if_idx` instead of `cm_event->id->port_num`; (2) restructures DISCONNECTED handler to release `threads_mutex_` before `join()`, pushing rings back only after join; (3) switches `server_q_params_` construction to `try_emplace` for the non-movable type.
examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml	New cross-host RDMA bench config; single file, both roles selected via `--mode {client,server}`. IPs, CPU cores, and `host_pinned` kind match existing Spark configs.
examples/daqiri_bench_raw_tx_spark_xhost.yaml	New TX-side raw GPUDirect cross-host config. `eth_dst_addr` correctly left as placeholder. All other values match existing Spark profile.
examples/daqiri_bench_raw_rx_spark_xhost.yaml	New RX-side raw GPUDirect cross-host config. Flow rule matches UDP src/dst ports sent by the TX companion config.
AGENTS.md	Benchmark table updated to include all three new xhost YAML configs, addressing the doc-sync concern from the previous review.
docs/benchmarks/raw_benchmarking.md	Adds a new cross-host two-DGX-Spark loopback subsection with correct run instructions for both raw GPUDirect and RDMA variants.
docs/tutorials/configuration-walkthrough.md	Adds one-line entries for each new xhost YAML in the raw GPUDirect and RDMA sections, with correct links.

_{Reviews (3): Last reviewed commit: "Merge branch 'main' into feature/spark-x..." | Re-trigger Greptile}

Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>

chloecrozier · 2026-06-02T23:29:58Z

Addressed Greptile's review:

ready_to_exit is now std::atomic (matching the existing rdma_force_quit pattern, so the worker actually observes the write under the C++ memory model), and the per-connection rings are returned to the pool only after worker_to_join.join() so the worker can't be observed touching a "freed" ring.

greptile-apps · 2026-06-02T23:33:54Z

Want your agent to iterate on Greptile's feedback? Try greploops.

Resolve conflicts from the docs PR #114 restructure: keep the #113 cross-host DGX-Spark bullets and benchmarking section, but repoint their links to the relocated docs/benchmarks/raw_benchmarking.md (single-host RDMA now lives in socket_benchmarking.md). Also fix the cross-host section's system_configuration.md link for its new docs/benchmarks/ location. check_doc_refs.py, mkdocs build --strict, and check_html_links.py all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>

#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock

3854d9d

Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>

chloecrozier requested review from cliffburdick and dleshchev June 2, 2026 23:13

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/managers/rdma/daqiri_rdma_mgr.h Outdated

Comment thread src/managers/rdma/daqiri_rdma_mgr.cpp Outdated

Comment thread examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml

#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock

27347bb

Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>

dleshchev merged commit 562f054 into main Jun 4, 2026
3 checks passed

RamyaGuru mentioned this pull request Jun 5, 2026

[BUG] RDMA bench: intermittent throughput cliff at small messages / low pace targets #126

Open

chloecrozier deleted the feature/spark-xhost-configs branch June 8, 2026 19:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock#115

#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock#115
dleshchev merged 3 commits into
mainfrom
feature/spark-xhost-configs

chloecrozier commented Jun 2, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

Greptile Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chloecrozier commented Jun 2, 2026

Uh oh!

greptile-apps Bot commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chloecrozier commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fixes (src/managers/rdma/daqiri_rdma_mgr.{h,cpp})

Regression test

Tests ran to verify

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chloecrozier commented Jun 2, 2026

Addressed Greptile's review:

Uh oh!

greptile-apps Bot commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chloecrozier commented Jun 2, 2026 •

edited

Loading

Fixes (`src/managers/rdma/daqiri_rdma_mgr.{h,cpp}`)

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading