Summary
MX/NIXL's per-receiver RDMA pull is single-NIC-pinned by default, which leaves 75-87.5% of allocated NIC bandwidth idle on multi-NIC topologies (GB200 NVL72 / GB300 / AWS p5). For workloads with many concurrent receivers pulling from a small set of sources, this is the dominant factor capping refit-cycle wall time.
The immediate fix is a deployment-config + UCX-config change that requires no new C++ code. Follow-ups may add a new MX env var to override the NIC pin behaviour and a new NIXL UCX backend param to raise MAX_RMA_RAILS past 2.
Evidence
Cluster validation on a 4-RDMA-NIC GB200 pod / Qwen3-4B-Thinking, 2026-06-23:
- Each DGD worker pod claims rdma-0..3 (4 NICs via GKE multi-network).
- Single-receiver bulk pull measured at 308 Gbps.
- A single RoCE NIC on this hardware has a line rate of ~400 Gbps, so 308 Gbps (~77% of line rate) means we are already close to saturating ONE NIC.
- Theoretical 4× striping ceiling: ~1232 Gbps (4 × 308 Gbps). We never see anywhere near that.
Why we don't stripe:
- MX_RDMA_NIC_PIN=auto in the production example manifests (e.g. examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx.gb300.infra.yaml) → modelexpress.ucx_utils.apply_nic_pin_for_device resolves each device to ONE topologically-nearest NIC and sets UCX_NET_DEVICES=mlx5_X:1.
- Even if MX_RDMA_NIC_PIN is unset, NIXL's UCX backend pins MAX_RMA_RAILS=2 (in nixl/src/plugins/ucx/ucx_utils.cpp:422), capping per-endpoint NIC count at 2 regardless of how many NICs UCX sees.
What other collective libraries do differently
NCCL (and other established collective libraries) automatically stripe traffic across all available HCA channels (default = 1 channel per NIC, up to N channels) and amortize the cost of the broadcast tree across all receivers. For full-tensor broadcast in a tightly-coupled topology, this hits NIC-saturation per receiver.
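For MX, per-receiver bandwidth = single-NIC bandwidth, so pulling 16 GB (128 Gb) on a 4-NIC pod takes ~0.32 s at the ~400 Gbps line rate, or ~0.42 s at the measured 308 Gbps, for the pull alone. Multi-NIC striping should reduce this roughly in proportion to the number of NICs used.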
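The figures quoted so far follow from simple arithmetic, sketched below. This is a back-of-the-envelope calculation only: it assumes decimal units (1 GB = 8 Gb), ideal linear scaling across NICs, and no protocol overhead. The helper names `idle_fraction` and `pull_seconds` are illustrative and not part of MX or NIXL.

```python
MEASURED_GBPS = 308.0   # single-receiver bulk pull measured on the GB200 pod
LINE_RATE_GBPS = 400.0  # approximate per-NIC RoCE line rate
MODEL_GB = 16.0         # transfer size in decimal gigabytes


def idle_fraction(nics_allocated: int, nics_used: int = 1) -> float:
    """Fraction of allocated NIC bandwidth left unused."""
    return 1.0 - nics_used / nics_allocated


def pull_seconds(size_gb: float, gbps: float) -> float:
    """Ideal transfer time: gigabytes -> gigabits, divided by link rate."""
    return size_gb * 8.0 / gbps


print(f"idle with 4 NICs: {idle_fraction(4):.1%}")   # 75.0%
print(f"idle with 8 NICs: {idle_fraction(8):.1%}")   # 87.5%
print(f"4x striping ceiling: {4 * MEASURED_GBPS:.0f} Gbps")  # 1232 Gbps
print(f"1 NIC at line rate: {pull_seconds(MODEL_GB, LINE_RATE_GBPS):.2f} s")   # 0.32 s
print(f"1 NIC as measured:  {pull_seconds(MODEL_GB, MEASURED_GBPS):.2f} s")    # 0.42 s
print(f"4 NICs as measured: {pull_seconds(MODEL_GB, 4 * MEASURED_GBPS):.2f} s")  # 0.10 s
```

Real striping is unlikely to scale perfectly, so these values are best read as upper bounds on the achievable improvement.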
Proposal — three layers
Layer 1: deployment-side env-var changes (zero code, immediate)
Update the example manifests in ai-dynamo/dynamo and nemo-rl to:
- Drop MX_RDMA_NIC_PIN=auto (or set to "" to disable).
- Explicitly set UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 (all claimed NICs).
- Add UCX_MAX_RMA_RAILS=4 as an env override (intended to override NIXL's default; Layer 3 notes that NIXL's hard-set value can take precedence, so verify the effective rail count when validating).
- Add UCX_RNDV_SCHEME=get_zcopy + UCX_RNDV_THRESH=0 per the existing Dynamo deployment guide for KV-cache transfers.
This is what the Dynamo disagg-communication doc already recommends for production deployments; the recommendation has not yet been carried over to the MX example manifests.
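As a rough illustration of the Layer 1 settings, the sketch below builds the env-var set for a list of claimed NICs and renders it in the name/value shape used by Kubernetes container `env` lists. The helpers `striping_env` and `as_k8s_env` are hypothetical and exist only for this example; the NIC names are the ones listed above and should be replaced with whatever the pod actually claims.

```python
def striping_env(nic_names, port=1):
    """Build the Layer 1 UCX overrides for the given NICs (hypothetical helper).

    MX_RDMA_NIC_PIN is deliberately absent: Layer 1 drops it.
    """
    if not nic_names:
        raise ValueError("at least one NIC name is required")
    return {
        "UCX_NET_DEVICES": ",".join(f"{name}:{port}" for name in nic_names),
        "UCX_MAX_RMA_RAILS": str(len(nic_names)),
        "UCX_RNDV_SCHEME": "get_zcopy",
        "UCX_RNDV_THRESH": "0",
    }


def as_k8s_env(env):
    """Render a dict as a list of {name, value} entries for a container spec."""
    return [{"name": key, "value": value} for key, value in env.items()]


env = striping_env(["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"])
for entry in as_k8s_env(env):
    print(f"- name: {entry['name']}\n  value: \"{entry['value']}\"")
```

Deriving UCX_MAX_RMA_RAILS from the NIC count keeps the two settings in step if a manifest is later adapted to a different NIC count.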
Layer 2: MX client tweak
modelexpress.ucx_utils.apply_nic_pin_for_device should default to "no pin" when there are 4+ NICs visible to the pod (the high-bandwidth striping case). Today it always picks one.
Concretely: add an MX_RDMA_NIC_PIN=stripe mode (or treat unset as stripe-by-default) that:
- Lists all visible IB/RoCE devices.
- Sets UCX_NET_DEVICES to the comma-separated list with :1 suffix on each.
- Lets UCX/NIXL decide per-rail load-balancing.
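The stripe mode described above could look roughly like the following. This is a minimal sketch, not the existing `modelexpress.ucx_utils` code: the function names `list_rdma_devices` and `apply_nic_stripe` are hypothetical, and it assumes RDMA devices are discoverable as directory entries under the Linux sysfs path `/sys/class/infiniband`. The demo at the bottom uses a temporary directory as a stand-in for sysfs so it runs anywhere.

```python
import os
import tempfile


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return sorted RDMA device names found under sysfs (empty if none)."""
    try:
        return sorted(os.listdir(sysfs_root))
    except FileNotFoundError:
        return []


def apply_nic_stripe(env, sysfs_root="/sys/class/infiniband", port=1):
    """Point UCX at every visible device instead of pinning to one.

    Mutates `env` (for example os.environ) and returns the device list.
    If no devices are found, `env` is left untouched.
    """
    devices = list_rdma_devices(sysfs_root)
    if devices:
        env["UCX_NET_DEVICES"] = ",".join(f"{dev}:{port}" for dev in devices)
    return devices


# Demo against a fake sysfs tree with four devices.
with tempfile.TemporaryDirectory() as fake_sysfs:
    for i in range(4):
        os.mkdir(os.path.join(fake_sysfs, f"mlx5_{i}"))
    demo_env = {}
    found = apply_nic_stripe(demo_env, sysfs_root=fake_sysfs)

print(found)                        # ['mlx5_0', 'mlx5_1', 'mlx5_2', 'mlx5_3']
print(demo_env["UCX_NET_DEVICES"])  # mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1
```

A real implementation would probably also need to filter out devices the pod has not claimed; that filtering is omitted here.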
Layer 3: NIXL backend param
Add max_rma_rails to NIXL's UCX backend params (currently hardcoded at 2 in ucx_utils.cpp:422). Expose via:
    config.modify("MAX_RMA_RAILS",
                  params.get("max_rma_rails", default_max_rma_rails));
Default to min(num_devices, 8). Allow override via env (UCX_MAX_RMA_RAILS already does this, but the NIXL hard-set overrides it — that's the actual bug).
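The resolution order for the rail count can be sketched as follows. The actual change would live in NIXL's C++ UCX backend; Python is used here only to illustrate the logic. The precedence shown (explicit backend param, then the UCX_MAX_RMA_RAILS env var, then the min(num_devices, 8) default) is an assumption for this example, and `resolve_max_rma_rails` is a hypothetical name.

```python
import os

RAIL_CAP = 8  # upper bound from the proposed default, min(num_devices, 8)


def resolve_max_rma_rails(num_devices, params=None, env=None):
    """Pick the rail count: backend param > env var > device-based default."""
    params = params or {}
    env = os.environ if env is None else env
    if "max_rma_rails" in params:
        return int(params["max_rma_rails"])
    if "UCX_MAX_RMA_RAILS" in env:
        # Unlike the current hard-set, the env var is honoured here.
        return int(env["UCX_MAX_RMA_RAILS"])
    return max(1, min(num_devices, RAIL_CAP))


print(resolve_max_rma_rails(4, env={}))                            # 4
print(resolve_max_rma_rails(16, env={}))                           # 8
print(resolve_max_rma_rails(4, env={"UCX_MAX_RMA_RAILS": "2"}))    # 2
print(resolve_max_rma_rails(4, params={"max_rma_rails": 3},
                            env={"UCX_MAX_RMA_RAILS": "2"}))       # 3
```

The key behavioural difference from today is the middle branch: the backend only falls back to its own default when neither the param nor the env var is set.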
Expected impact
For a 16 GB model on a many-receiver multi-NIC topology, paired with the buffer-caching fixes in #421 / #429 / ai-dynamo/dynamo#10901 / jthomson04/RL#7:
| Setup | Per-receiver pull |
| --- | --- |
| Today (1 NIC + caching) | ~0.32 s |
| Layer 1 only (env var override, 4 NICs) | ~0.10 s |
| Layer 1 + 2 + 3 (proper multi-rail) | ~0.05-0.08 s |
Substantial reduction in per-receiver pull time, which translates directly to lower wall time on wave-parallel multi-receiver dispatches.
Validation plan
- Land Layer 1 first (env-var only, no code) — re-run multi-cycle benchmark on a 4-NIC pod to confirm bandwidth improvement.
- If Layer 1 confirms the thesis, land Layer 3 (NIXL UCX param) to make this the default for everyone, not just users who know the magic env vars.
- Layer 2 last — converts MX's NIC pin behavior into a stripe-friendly default.
Out of scope (future)
- GB200 NVL72 cross-pod NVLink via NVIDIA IMEX + DRA + ComputeDomain resources. Real ~10-20× win on top of multi-NIC striping, but requires platform-level migration (Kubernetes DRA driver + NVIDIA nvidia.com/gpu.clique affinity + IMEX service). See https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html. Tracking separately.
- Multi-receiver tree fan-out (publish_self_as_source): an orthogonal optimization for many-receiver setups, already implemented in MX but not enabled in production DGD configs.
Related PRs