perf: multi-NIC parallel RDMA pull — MX is single-NIC-pinned by default, leaves 75-87% of allocated bandwidth idle

## Summary

MX/NIXL's per-receiver RDMA pull is single-NIC-pinned by default, which leaves 75-87.5% of allocated NIC bandwidth idle on multi-NIC topologies (GB200 NVL72 / GB300 / AWS p5). For workloads with many concurrent receivers pulling from a small set of sources, this is the dominant factor capping refit-cycle wall time.

The fix is primarily a deployment-config and UCX-config change. The first layer needs no code at all; the later layers may add a new MX env var to override the NIC-pin behaviour and a new NIXL UCX backend param to raise `MAX_RMA_RAILS` past 2.

## Evidence

Cluster validation on a 4-RDMA-NIC GB200 pod / Qwen3-4B-Thinking, 2026-06-23:

- Each DGD worker pod claims `rdma-0..3` (4 NICs via GKE multi-network).
- Single-receiver bulk pull measured at **308 Gbps**.
- A single RoCE NIC on this hardware has a line rate of ~400 Gbps, so 308 Gbps (about 77% of line rate) means we are already close to saturating ONE NIC.
- Theoretical 4× striping ceiling: ~1232 Gbps (4 × 308 Gbps as measured; 1600 Gbps at line rate). We never see anywhere near that.
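
The idle-bandwidth range in the summary and the striping ceiling above both follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative only and are not part of MX or NIXL.

```python
def idle_fraction(nics_claimed: int, nics_used: int = 1) -> float:
    """Fraction of claimed NIC bandwidth left unused when only some NICs carry traffic."""
    if nics_claimed <= 0 or not 0 <= nics_used <= nics_claimed:
        raise ValueError("need 0 <= nics_used <= nics_claimed and nics_claimed > 0")
    return 1.0 - nics_used / nics_claimed


def striping_ceiling_gbps(per_nic_gbps: float, nics: int) -> float:
    """Upper bound on aggregate bandwidth if traffic striped perfectly across all NICs."""
    return per_nic_gbps * nics


# One NIC used out of 4 or 8 claimed gives the 75% to 87.5% idle range.
print(idle_fraction(4))                   # 0.75
print(idle_fraction(8))                   # 0.875
# 4-way ceiling based on the measured 308 Gbps single-NIC pull.
print(striping_ceiling_gbps(308.0, 4))    # 1232.0
```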

Why we don't stripe:

1. `MX_RDMA_NIC_PIN=auto` in the production example manifests (e.g. `examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx.gb300.infra.yaml`) → `modelexpress.ucx_utils.apply_nic_pin_for_device` resolves each device to ONE topologically-nearest NIC and sets `UCX_NET_DEVICES=mlx5_X:1`.
2. Even if `MX_RDMA_NIC_PIN` is unset, NIXL's UCX backend pins `MAX_RMA_RAILS=2` (in `nixl/src/plugins/ucx/ucx_utils.cpp:422`), capping per-endpoint NIC count at 2 regardless of how many NICs UCX sees.

## What other collective libraries do differently

NCCL (and other established collective libraries) automatically stripe traffic across all available HCA channels (default = 1 channel per NIC, up to N channels) and amortize the cost of the broadcast tree across all receivers. For full-tensor broadcast in a tightly-coupled topology, this hits NIC-saturation per receiver.

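For MX, per-receiver bandwidth = single-NIC bandwidth, so a 16 GB transfer on a 4-NIC pod takes ~0.32 s for the pull alone at the 400 Gbps line rate (about 0.42 s at the measured 308 Gbps). Multi-NIC striping should bring this down roughly in proportion to the number of NICs used.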
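
As a rough check on these numbers, pull time is just payload size divided by bandwidth. The sketch below assumes decimal gigabytes (1 GB = 8 Gb) and ignores protocol overhead, registration cost, and contention; `pull_seconds` is a hypothetical helper, not an MX function.

```python
def pull_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Ideal transfer time for size_gb gigabytes at bandwidth_gbps gigabits per second."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb * 8.0 / bandwidth_gbps


scenarios = [
    ("1 NIC at 400 Gbps line rate", 400.0),
    ("1 NIC at measured 308 Gbps", 308.0),
    ("4 NICs at 4 x 308 Gbps", 1232.0),
    ("4 NICs at 4 x 400 Gbps", 1600.0),
]
for label, gbps in scenarios:
    print(f"{label}: {pull_seconds(16, gbps):.2f} s")
```

These are idealized lower bounds; real pulls will be slower by whatever overhead the transfer path adds.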

## Proposal — three layers

### Layer 1: deployment-side env-var changes (zero code, immediate)

Update the example manifests in `ai-dynamo/dynamo` and `nemo-rl` to:
1. Drop `MX_RDMA_NIC_PIN=auto` (or set to `""` to disable).
2. Explicitly set `UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1` (all claimed NICs).
3. Add a `UCX_MAX_RMA_RAILS=4` env override (intended to override NIXL's default of 2; whether it takes effect depends on the NIXL hard-set described under Layer 3).
4. Add `UCX_RNDV_SCHEME=get_zcopy` + `UCX_RNDV_THRESH=0` per the existing Dynamo deployment guide for KV-cache transfers.

This is what the Dynamo disagg-communication doc already recommends for production deployments; the MX example manifests haven't been carried over yet.
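
The env changes listed above can be sketched as a small generator for the container `env` entries. This is a minimal illustration, assuming the NIC names shown in step 2 and port 1 on each; `striping_env` is a hypothetical helper and is not part of any existing manifest tooling. `MX_RDMA_NIC_PIN` is deliberately left out, matching step 1.

```python
from typing import Dict, List, Sequence


def striping_env(nic_names: Sequence[str], port: int = 1) -> Dict[str, str]:
    """Build the UCX env vars from Layer 1 for the given NIC device names."""
    if not nic_names:
        raise ValueError("at least one NIC name is required")
    return {
        # All claimed NICs, each with an explicit port suffix.
        "UCX_NET_DEVICES": ",".join(f"{name}:{port}" for name in nic_names),
        # One rail per NIC.
        "UCX_MAX_RMA_RAILS": str(len(nic_names)),
        # Rendezvous settings quoted from the Layer 1 list.
        "UCX_RNDV_SCHEME": "get_zcopy",
        "UCX_RNDV_THRESH": "0",
    }


def as_k8s_env(env: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a flat dict into Kubernetes-style name/value env entries."""
    return [{"name": key, "value": value} for key, value in env.items()]


env = striping_env(["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"])
for entry in as_k8s_env(env):
    print(entry)
```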

### Layer 2: MX client tweak

`modelexpress.ucx_utils.apply_nic_pin_for_device` should default to "no pin" when there are 4+ NICs visible to the pod (the high-bandwidth striping case). Today it always picks one.

Concretely: add an `MX_RDMA_NIC_PIN=stripe` mode (or treat unset as stripe-by-default) that:
- Lists all visible IB/RoCE devices.
- Sets `UCX_NET_DEVICES` to the comma-separated list with `:1` suffix on each.
- Lets UCX/NIXL decide per-rail load-balancing.
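
A minimal sketch of what such a stripe mode could look like is below. It assumes devices can be enumerated from the Linux sysfs directory `/sys/class/infiniband` and that port 1 is the right port on every device; the function names are hypothetical and this is not the current `modelexpress.ucx_utils` implementation.

```python
import os
from pathlib import Path
from typing import List, MutableMapping, Optional


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> List[str]:
    """Return the names of visible IB/RoCE devices, or [] if none are exposed."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


def apply_stripe_mode(
    env: Optional[MutableMapping[str, str]] = None,
    sysfs_root: str = "/sys/class/infiniband",
    port: int = 1,
) -> List[str]:
    """Point UCX at every visible device instead of pinning to one.

    Leaves `env` untouched when no devices are found, so UCX falls back
    to its own device selection.
    """
    target = os.environ if env is None else env
    devices = list_rdma_devices(sysfs_root)
    if devices:
        target["UCX_NET_DEVICES"] = ",".join(f"{name}:{port}" for name in devices)
    return devices


if __name__ == "__main__":
    # Dry run against a scratch dict so the real process env is not modified.
    scratch: dict = {}
    print(apply_stripe_mode(env=scratch), scratch)
```

Passing `env` explicitly keeps the helper testable and avoids mutating the process environment by accident; per-rail load balancing is still left entirely to UCX/NIXL.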

### Layer 3: NIXL backend param

Add `max_rma_rails` to NIXL's UCX backend params (currently hardcoded at 2 in `ucx_utils.cpp:422`). Expose via:
```cpp
config.modify("MAX_RMA_RAILS",
  params.get("max_rma_rails", default_max_rma_rails));
```

Default to `min(num_devices, 8)`. Allow override via env: `UCX_MAX_RMA_RAILS` is meant to do this already, but the NIXL hard-set overrides it, and that is the actual bug.
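
The intended resolution order can be modelled in a few lines of Python. This is only an illustration of the proposed behaviour, not NIXL code: the precedence between an explicit backend param and the env var is an assumption (explicit param first, then `UCX_MAX_RMA_RAILS`, then the `min(num_devices, 8)` default), and `resolve_max_rma_rails` is a hypothetical name.

```python
import os
from typing import Mapping, Optional

MAX_RAILS_CAP = 8  # cap from the proposed default min(num_devices, 8)


def resolve_max_rma_rails(
    num_devices: int,
    backend_param: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Pick the rail count: explicit param, then env override, then default."""
    source = os.environ if env is None else env
    if backend_param is not None:
        return int(backend_param)
    env_value = source.get("UCX_MAX_RMA_RAILS")
    if env_value:
        return int(env_value)
    # Never return fewer than one rail, even if no devices were counted.
    return max(1, min(num_devices, MAX_RAILS_CAP))


print(resolve_max_rma_rails(4, env={}))                              # 4
print(resolve_max_rma_rails(16, env={}))                             # 8
print(resolve_max_rma_rails(4, env={"UCX_MAX_RMA_RAILS": "2"}))      # 2
```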

## Expected impact

For a 16 GB model on a many-receiver multi-NIC topology, paired with the buffer-caching fixes in #421 / #429 / ai-dynamo/dynamo#10901 / jthomson04/RL#7:

| Setup | Per-receiver pull |
|---|---|
| Today (1 NIC + caching) | ~0.32 s |
| Layer 1 only (env var override, 4 NICs) | ~0.10 s |
| Layer 1 + 2 + 3 (proper multi-rail) | ~0.05-0.08 s |

This is a substantial reduction in per-receiver pull time, and it translates directly to lower wall time on wave-parallel multi-receiver dispatches.

## Validation plan

1. Land Layer 1 first (env-var only, no code) — re-run multi-cycle benchmark on a 4-NIC pod to confirm bandwidth improvement.
2. If Layer 1 confirms the thesis, land Layer 3 (NIXL UCX param) to make this the default for everyone, not just users who know the magic env vars.
3. Layer 2 last — converts MX's NIC pin behavior into a stripe-friendly default.

## Out of scope (future)

- GB200 NVL72 cross-pod NVLink via NVIDIA IMEX + DRA + ComputeDomain resources. Real ~10-20× win on top of multi-NIC striping, but requires platform-level migration (Kubernetes DRA driver + NVIDIA `nvidia.com/gpu.clique` affinity + IMEX service). See https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html. Tracking separately.
- Multi-receiver tree fan-out (`publish_self_as_source`) — orthogonal optimization for many-receiver setups, already implemented in MX but not enabled in production DGD configs.

## Related PRs
- ai-dynamo/dynamo#10900 — Dynamo Megatron-MX extension upstream port
- ai-dynamo/dynamo#10901 — Buffer-caching perf fix (cycle 2+ overhead removal)
- jthomson04/RL#7 — Same caching fix on the NeMo-RL trainer side
