#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock #115
Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>
| Filename | Overview |
|---|---|
| src/managers/rdma/daqiri_rdma_mgr.h | Changes ready_to_exit from bool to std::atomic<bool> with an in-place initializer; adds explanatory comment. Both issues raised in previous review are resolved. |
| src/managers/rdma/daqiri_rdma_mgr.cpp | Three targeted fixes: (1) uses listen_iter->second.if_idx instead of cm_event->id->port_num; (2) restructures DISCONNECTED handler to release threads_mutex_ before join(), pushing rings back only after join; (3) switches server_q_params_ construction to try_emplace for the non-movable type. |
| examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml | New cross-host RDMA bench config; single file, both roles selected via --mode {client,server}. IPs, CPU cores, and host_pinned kind match existing Spark configs. |
| examples/daqiri_bench_raw_tx_spark_xhost.yaml | New TX-side raw GPUDirect cross-host config. eth_dst_addr correctly left as placeholder. All other values match existing Spark profile. |
| examples/daqiri_bench_raw_rx_spark_xhost.yaml | New RX-side raw GPUDirect cross-host config. Flow rule matches UDP src/dst ports sent by the TX companion config. |
| AGENTS.md | Benchmark table updated to include all three new xhost YAML configs, addressing the doc-sync concern from the previous review. |
| docs/benchmarks/raw_benchmarking.md | Adds a new cross-host two-DGX-Spark loopback subsection with correct run instructions for both raw GPUDirect and RDMA variants. |
| docs/tutorials/configuration-walkthrough.md | Adds one-line entries for each new xhost YAML in the raw GPUDirect and RDMA sections, with correct links. |
Reviews (3): Last reviewed commit: "Merge branch 'main' into feature/spark-x..."
Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>
Addressed Greptile's review: `ready_to_exit` is now `std::atomic<bool>` (matching the existing `rdma_force_quit` pattern, so the worker actually observes the write under the C++ memory model), and the per-connection rings are returned to the pool only after `worker_to_join.join()`, so the worker can't be observed touching a "freed" ring.
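The ordering described in this comment can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from `daqiri_rdma_mgr.cpp`: the `Worker` and `Manager` types, `on_connect`, `on_disconnect`, and the integer "rings" are hypothetical stand-ins; only the names `ready_to_exit`, `threads_mutex`, and the use of `try_emplace` come from the PR discussion.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-connection worker state (illustrative, not DAQIRI's type).
struct Worker {
  std::atomic<bool> ready_to_exit{false};  // per-worker "please exit" flag
  std::thread thread;
  int ring_id = -1;  // stand-in for the per-connection rings
};

struct Manager {
  std::mutex threads_mutex;
  std::map<std::uintptr_t, Worker> workers;  // keyed by a connection id
  std::vector<int> ring_pool;                // rings not currently in use

  void on_connect(std::uintptr_t conn) {
    std::lock_guard<std::mutex> lock(threads_mutex);
    if (ring_pool.empty()) return;
    // Worker holds a std::atomic, so it is non-movable: construct it in place.
    auto result = workers.try_emplace(conn);
    if (!result.second) return;
    Worker& w = result.first->second;
    w.ring_id = ring_pool.back();
    ring_pool.pop_back();
    w.thread = std::thread([&w] {
      // The worker polls its own flag rather than only a global quit flag.
      while (!w.ready_to_exit.load(std::memory_order_acquire)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  void on_disconnect(std::uintptr_t conn) {
    std::thread worker_to_join;
    int ring = -1;
    {
      std::lock_guard<std::mutex> lock(threads_mutex);
      auto it = workers.find(conn);
      if (it == workers.end()) return;
      it->second.ready_to_exit.store(true, std::memory_order_release);
      worker_to_join = std::move(it->second.thread);
      ring = it->second.ring_id;
    }  // lock released here, before join(), so other lock users are not blocked
    if (worker_to_join.joinable()) worker_to_join.join();
    // Only after the worker has stopped is its ring returned to the pool.
    std::lock_guard<std::mutex> lock(threads_mutex);
    workers.erase(conn);
    ring_pool.push_back(ring);
  }
};
```

In this sketch, any other code that takes `threads_mutex` (for example a stats printer) can make progress while the join is pending, and the ring never reappears in the pool while the worker might still touch it.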
Resolve conflicts from the docs PR #114 restructure: keep the #113 cross-host DGX-Spark bullets and benchmarking section, but repoint their links to the relocated docs/benchmarks/raw_benchmarking.md (single-host RDMA now lives in socket_benchmarking.md). Also fix the cross-host section's system_configuration.md link for its new docs/benchmarks/ location. check_doc_refs.py, mkdocs build --strict, and check_html_links.py all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Summary

Closes #113 (RDMA cross-host SEND fails with `IBV_WC_RETRY_EXC_ERR`) plus a graceful-shutdown deadlock the #113 fix exposes, and adds cross-host DGX Spark example YAMLs as the regression test.

Fixes (`src/managers/rdma/daqiri_rdma_mgr.{h,cpp}`)

1. Wrong interface index on the server (#113: "[BUG] RDMA cross-host SEND immediately fails with transport_retry_exceeded and the server never posts receives"). When a client connects, the server picks which configured interface to use for that connection. It was using the wrong number: the hardware's port number from the InfiniBand driver, instead of the index into our own configured interface list. The two happen to look similar but mean different things, so the server thread immediately crashed trying to look up an interface that wasn't there. Because the server crashed before posting any receives, the client's first send had nothing to land in and timed out with `RETRY_EXC_ERR`. Fix: use the interface index we already saved when the listener was set up.
2. Server hangs on graceful shutdown (only reachable once the first issue was fixed). When the client disconnects, the server tries to clean up its worker thread. Three things went wrong at once: it held a lock while waiting for the worker, the worker had no way to know it should stop (it only watched the global "everything is shutting down" flag), and the stats-printing code also wanted that same lock. Once issue 1 stopped the server from crashing early, every clean run ended up stuck in this three-way deadlock and had to be `SIGKILL`ed. Fix: give each worker its own "please exit" flag, set it when the client disconnects, and release the lock before waiting for the worker to actually stop.

The `Couldn't find server params for address …` line that may show up once at startup is a benign timing race (the app polls slightly before the listener is registered), not the bug.

Regression test

Three new `examples/` YAMLs exercising the cross-host data plane that no existing config reached:

- `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`: split per-host (raw bench has no `--mode`).
- `daqiri_bench_rdma_tx_rx_spark_xhost.yaml`: single config selected via `--mode {client,server}`, matching the existing one-file-per-workload pattern.

Reuses the `1.1.1.1`/`2.2.2.2` IPs and CPU pinning of the existing `_spark.yaml` configs, with each address bound to its host's p0 instead of both on one box.

Tests run to verify

- `python scripts/check_doc_refs.py`: 23 YAMLs, all covered.
- `mkdocs build --strict`: clean.
- `python scripts/check_html_links.py site/`: clean.

Hardware (two cross-cabled DGX Sparks, GB10, ConnectX-7 fw `28.45.4028`, IGX OS / Ubuntu 24.04 ARM, kernel 6.14, CUDA 13.0, driver 580.95.05):

- RDMA cross-host: `Server received: 21761`, `Client received: 15256` (was `0`/`0` pre-fix); zero `transport retry counter exceeded`; server reaches `Removed leftover hugepage file …` and exits ~3 s after its `--seconds` timer with no `SIGKILL`. Logs: `logs/spark_rdma_{rx,tx}_post_fix3.log` (vs. pre-fix `logs/spark_rdma_{rx,tx}.log`).
- Raw cross-host: TX `24,657,920` / RX `24,657,920` / dropped `0` (~106 Gbps). Logs: `logs/spark_raw_{rx,tx}_post_fix.log`.
- Single-host RDMA (`daqiri_bench_rdma_tx_rx_spark.yaml`): same code paths, unchanged behavior.
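
For readers who want the first fix in miniature, the sketch below shows why a hardware port number cannot stand in for an index into the configured interface list. Everything here is a hypothetical stand-in (`InterfaceCfg`, `ListenerInfo`, `interface_for_connection`, the listener key `42`); only the idea of looking up the `if_idx` saved at listen time comes from the PR. The example assumes a single configured interface and a port number of 1, which is how verbs numbers the first port.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the manager's state; not DAQIRI's real types.
struct InterfaceCfg {
  std::string address;
};

struct ListenerInfo {
  std::size_t if_idx;  // index into the configured interface list, saved at listen time
};

// Returns the configured interface for a connection that arrived on `listen_id`,
// or std::nullopt if the listener is unknown or its saved index is out of range.
std::optional<InterfaceCfg> interface_for_connection(
    const std::vector<InterfaceCfg>& interfaces,
    const std::map<std::uintptr_t, ListenerInfo>& listeners,
    std::uintptr_t listen_id) {
  auto listen_iter = listeners.find(listen_id);
  if (listen_iter == listeners.end()) return std::nullopt;
  const std::size_t if_idx = listen_iter->second.if_idx;
  if (if_idx >= interfaces.size()) return std::nullopt;
  return interfaces[if_idx];
}
```

With one configured interface, the only valid index is 0, so indexing the list with a port number of 1 would read past the end, while the listener's saved index resolves correctly. The bounds check is a choice made for this sketch; the PR text does not say whether the real code performs one.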