perf: multi-NIC parallel RDMA pull — MX is single-NIC-pinned by default, leaves 75-87% of allocated bandwidth idle #449

@KavinKrishnan

Summary

MX/NIXL's per-receiver RDMA pull is single-NIC-pinned by default, which leaves 75-87.5% of allocated NIC bandwidth idle on multi-NIC topologies (GB200 NVL72 / GB300 / AWS p5): one NIC in use out of four leaves 75% idle, and one out of eight leaves 87.5%. For workloads with many concurrent receivers pulling from a small set of sources, this is the dominant factor capping refit-cycle wall time.

The fix is primarily a deployment-config and UCX-config change. The first layer needs no code at all; the later layers may add a new MX env-var mode to override the NIC pin behaviour and a new NIXL UCX backend param to raise MAX_RMA_RAILS past 2.

Evidence

Cluster validation on a 4-RDMA-NIC GB200 pod / Qwen3-4B-Thinking, 2026-06-23:

  • Each DGD worker pod claims rdma-0..3 (4 NICs via GKE multi-network).
  • Single-receiver bulk pull measured at 308 Gbps.
  • A single RoCE NIC's line rate on this hardware is ~400 Gbps, so 308 Gbps (~77% of line rate) means we are already close to saturating ONE NIC.
  • Theoretical 4× striping ceiling: ~1232 Gbps (4 × 308 Gbps). We never see anywhere near that.
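
The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above (308 Gbps measured, ~400 Gbps line rate, 4 NICs); the function names are illustrative and are not part of MX or NIXL.

```python
def striping_ceiling_gbps(per_nic_gbps: float, num_nics: int) -> float:
    """Upper bound if every NIC delivered the measured single-NIC rate."""
    return per_nic_gbps * num_nics


def idle_fraction(nics_used: int, nics_allocated: int) -> float:
    """Fraction of the allocated NICs that carry no traffic."""
    return 1.0 - nics_used / nics_allocated


measured_gbps = 308.0   # single-receiver bulk pull from the run above
line_rate_gbps = 400.0  # approximate per-NIC RoCE line rate

print(f"single-NIC utilisation: {measured_gbps / line_rate_gbps:.0%}")
print(f"4x striping ceiling:    {striping_ceiling_gbps(measured_gbps, 4):.0f} Gbps")
print(f"idle with 1 of 4 NICs:  {idle_fraction(1, 4):.1%}")
print(f"idle with 1 of 8 NICs:  {idle_fraction(1, 8):.1%}")
```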

Why we don't stripe:

  1. MX_RDMA_NIC_PIN=auto in the production example manifests (e.g. examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx.gb300.infra.yaml) → modelexpress.ucx_utils.apply_nic_pin_for_device resolves each device to ONE topologically-nearest NIC and sets UCX_NET_DEVICES=mlx5_X:1.
  2. Even if MX_RDMA_NIC_PIN is unset, NIXL's UCX backend pins MAX_RMA_RAILS=2 (in nixl/src/plugins/ucx/ucx_utils.cpp:422), capping per-endpoint NIC count at 2 regardless of how many NICs UCX sees.
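
The two limits above compose as a simple minimum. The sketch below is a toy model of that interaction, not NIXL or UCX code: `effective_rails` is a hypothetical helper, and it assumes the number of rails an endpoint can use is bounded both by the NICs listed in UCX_NET_DEVICES and by MAX_RMA_RAILS, as described in the two points above.

```python
def effective_rails(visible_nics: int, max_rma_rails: int) -> int:
    """Rails one endpoint can use: bounded by visible NICs and by the rail cap."""
    if visible_nics < 1 or max_rma_rails < 1:
        raise ValueError("both counts must be at least 1")
    return min(visible_nics, max_rma_rails)


# MX_RDMA_NIC_PIN=auto: UCX_NET_DEVICES names a single NIC, so one rail.
pinned = effective_rails(visible_nics=1, max_rma_rails=2)

# Pin unset on a 4-NIC pod: UCX sees 4 NICs, but the cap of 2 still applies.
unpinned = effective_rails(visible_nics=4, max_rma_rails=2)

# Both limits lifted: all 4 NICs are usable.
striped = effective_rails(visible_nics=4, max_rma_rails=4)

print(pinned, unpinned, striped)  # 1 2 4
```

Under this model, removing the pin alone only doubles the usable rails on a 4-NIC pod; both limits have to move to reach all four.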

What other collective libraries do differently

NCCL (and other established collective libraries) automatically stripe traffic across all available HCA channels (default = 1 channel per NIC, up to N channels) and amortize the cost of the broadcast tree across all receivers. For full-tensor broadcast in a tightly-coupled topology, this hits NIC-saturation per receiver.

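For MX, per-receiver bandwidth = single-NIC bandwidth, so a 16 GB transfer on a 4-NIC pod takes ~0.32 s for the pull alone (128 Gb at the ~400 Gbps line rate; ~0.42 s at the measured 308 Gbps). Multi-NIC striping should reduce this roughly in proportion to the number of rails in use.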
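
A rough estimate of the pull time can be sketched as follows. It assumes decimal gigabytes and ideal linear scaling across rails, which real transfers will not fully reach; `pull_seconds` is an illustrative helper and not an MX function.

```python
def pull_seconds(size_gb: float, per_nic_gbps: float, rails: int = 1) -> float:
    """Idealised transfer time: payload bits divided by aggregate bandwidth."""
    if rails < 1:
        raise ValueError("rails must be at least 1")
    bits = size_gb * 8e9                      # decimal gigabytes to bits
    aggregate_bps = per_nic_gbps * 1e9 * rails
    return bits / aggregate_bps


for rails in (1, 2, 4):
    at_line_rate = pull_seconds(16, 400, rails)
    at_measured = pull_seconds(16, 308, rails)
    print(f"{rails} rail(s): {at_line_rate:.2f} s at line rate, "
          f"{at_measured:.2f} s at the measured per-NIC rate")
```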

Proposal — three layers

Layer 1: deployment-side env-var changes (zero code, immediate)

Update the example manifests in ai-dynamo/dynamo and nemo-rl to:

  1. Drop MX_RDMA_NIC_PIN=auto (or set to "" to disable).
  2. Explicitly set UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 (all claimed NICs).
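  3. Add a UCX_MAX_RMA_RAILS=4 env override. The UCX env var is expected to override NIXL's default of 2; since Layer 3 notes that NIXL's hard-set may take precedence, verify that the override actually takes effect in the Layer 1 benchmark.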
  4. Add UCX_RNDV_SCHEME=get_zcopy + UCX_RNDV_THRESH=0 per the existing Dynamo deployment guide for KV-cache transfers.
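
The four steps above can be sketched as a small generator for the container `env` entries. The variable names and values come from the list; `layer1_env` and `as_k8s_env` are hypothetical helpers, and port 1 on every NIC is an assumption carried over from the `:1` suffix used above.

```python
import json


def layer1_env(nic_names: list[str], port: int = 1) -> dict[str, str]:
    """Env vars for the Layer 1 change; MX_RDMA_NIC_PIN is deliberately absent."""
    if not nic_names:
        raise ValueError("at least one NIC is required")
    return {
        "UCX_NET_DEVICES": ",".join(f"{name}:{port}" for name in nic_names),
        "UCX_MAX_RMA_RAILS": str(len(nic_names)),
        "UCX_RNDV_SCHEME": "get_zcopy",
        "UCX_RNDV_THRESH": "0",
    }


def as_k8s_env(env: dict[str, str]) -> list[dict[str, str]]:
    """Render as the name/value list used in a Kubernetes container spec."""
    return [{"name": key, "value": value} for key, value in env.items()]


nics = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
print(json.dumps(as_k8s_env(layer1_env(nics)), indent=2))
```

Deriving UCX_MAX_RMA_RAILS from the NIC list keeps the two settings from drifting apart when a manifest is copied to a pod with a different NIC count.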

This is what the Dynamo disagg-communication doc already recommends for production deployments; the MX example manifests haven't been carried over yet.

Layer 2: MX client tweak

modelexpress.ucx_utils.apply_nic_pin_for_device should default to "no pin" when there are 4+ NICs visible to the pod (the high-bandwidth striping case). Today it always picks one.

Concretely: add an MX_RDMA_NIC_PIN=stripe mode (or treat unset as stripe-by-default) that:

  • Lists all visible IB/RoCE devices.
  • Sets UCX_NET_DEVICES to the comma-separated list with :1 suffix on each.
  • Lets UCX/NIXL decide per-rail load-balancing.
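
A minimal sketch of what such a stripe mode could look like is shown below. It is not the existing `apply_nic_pin_for_device` implementation: the function names are hypothetical, it assumes RDMA devices are discoverable under `/sys/class/infiniband` (the usual Linux sysfs location for both IB and RoCE devices), and it assumes port 1 on every device.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return visible RDMA device names, or an empty list if none are exposed."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    # Lexicographic order; the order only affects readability of the env var.
    return sorted(entry.name for entry in root.iterdir())


def stripe_net_devices(devices: list[str], port: int = 1) -> str:
    """Comma-separated UCX_NET_DEVICES value with a :port suffix on each device."""
    return ",".join(f"{dev}:{port}" for dev in devices)


def apply_stripe(env: dict, sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Set UCX_NET_DEVICES in env to every visible device; leave env alone if none."""
    devices = list_rdma_devices(sysfs_root)
    if devices:
        env["UCX_NET_DEVICES"] = stripe_net_devices(devices)
    return devices


if __name__ == "__main__":
    import os

    found = apply_stripe(os.environ)
    print("striping across:", found or "no RDMA devices found")
```

Passing the environment mapping in as an argument keeps the helper testable without touching the real process environment; per-rail load-balancing is left entirely to UCX/NIXL.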

Layer 3: NIXL backend param

Add max_rma_rails to NIXL's UCX backend params (currently hardcoded at 2 in ucx_utils.cpp:422). Expose via:

config.modify("MAX_RMA_RAILS",
  params.get("max_rma_rails", default_max_rma_rails));

Default to min(num_devices, 8) and allow an override via env. UCX_MAX_RMA_RAILS is meant to do this already, but the NIXL hard-set takes precedence over it; that is the actual bug.
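
The intended resolution order can be modelled in a few lines of Python. This is a sketch of the proposed behaviour, not NIXL code: `resolve_max_rma_rails` is hypothetical, and the precedence (env var first, then the `max_rma_rails` backend param, then the min(num_devices, 8) default) is an assumption about how the pieces above would combine.

```python
def resolve_max_rma_rails(params: dict, env: dict, num_devices: int,
                          hard_cap: int = 8) -> int:
    """Pick the rail count: env override, then backend param, then default."""
    if "UCX_MAX_RMA_RAILS" in env:
        return int(env["UCX_MAX_RMA_RAILS"])   # explicit user override wins
    if "max_rma_rails" in params:
        return int(params["max_rma_rails"])    # per-backend configuration
    return max(1, min(num_devices, hard_cap))  # proposed default


print(resolve_max_rma_rails({}, {}, num_devices=4))                     # 4
print(resolve_max_rma_rails({}, {}, num_devices=16))                    # 8
print(resolve_max_rma_rails({"max_rma_rails": "2"}, {}, num_devices=4))  # 2
print(resolve_max_rma_rails({"max_rma_rails": "2"},
                            {"UCX_MAX_RMA_RAILS": "4"}, num_devices=4))  # 4
```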

Expected impact

For a 16 GB model on a many-receiver multi-NIC topology, paired with the buffer-caching fixes in #421 / #429 / ai-dynamo/dynamo#10901 / jthomson04/RL#7:

| Setup | Per-receiver pull |
|---|---|
| Today (1 NIC + caching) | ~0.32 s |
| Layer 1 only (env var override, 4 NICs) | ~0.10 s |
| Layer 1 + 2 + 3 (proper multi-rail) | ~0.05-0.08 s |

Going from ~0.32 s to ~0.05-0.08 s is a roughly 4-6× reduction in per-receiver pull time, which translates directly to lower wall time on wave-parallel multi-receiver dispatches.

Validation plan

  1. Land Layer 1 first (env-var only, no code) — re-run multi-cycle benchmark on a 4-NIC pod to confirm bandwidth improvement.
  2. If Layer 1 confirms the thesis, land Layer 3 (NIXL UCX param) to make this the default for everyone, not just users who know the magic env vars.
  3. Layer 2 last — converts MX's NIC pin behavior into a stripe-friendly default.

Out of scope (future)

  • GB200 NVL72 cross-pod NVLink via NVIDIA IMEX + DRA + ComputeDomain resources. Real ~10-20× win on top of multi-NIC striping, but requires platform-level migration (Kubernetes DRA driver + NVIDIA nvidia.com/gpu.clique affinity + IMEX service). See https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html. Tracking separately.
  • Multi-receiver tree fan-out (publish_self_as_source) — orthogonal optimization for many-receiver setups, already implemented in MX but not enabled in production DGD configs.
