perf(megatron-mx): cache NIXL-registered dest buffers across refit cycles by KavinKrishnan · Pull Request #7 · jthomson04/RL

KavinKrishnan · 2026-06-23T21:07:37Z

Summary

Cache NIXL-registered dest buffers across Megatron-MX refit cycles in vllm_backend.py's mixed-TP branch. Plan shapes are deterministic for a fixed (source_tp, target_tp) layout, so re-allocating and re-registering NIXL buffers on every refit cycle is wasted work: ~0.15 s per cycle on Qwen3-4B and proportionally more on larger models.

Follow-up to #2. The matched-TP path in the same file already cached on _mx_megatron_buffers (line 815); this brings the mixed-TP plan_dests branch to parity.

Motivation

An audit of refit-cycle wall time on multi-receiver setups surfaced a per-cycle bottleneck: NIXL register_tensors was called on every refit for ~290 buffers on Qwen3-4B (8 GB), costing ~150 ms of ibv_reg_mr overhead per receiver per cycle. The buffer shapes never change for a fixed TP layout, so the registration cost only needs to be paid once, in cycle 1.

Change

One block in _update_weights_via_mx_megatron's mixed-TP branch:

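# Reuse the dest buffers cached on the worker by an earlier refit cycle, if any.
cached_plan_dests = getattr(self, "_mx_megatron_plan_dests", None)
plan_dests = cached_plan_dests or {}
newly_allocated_this_cycle = 0
for plan in plans:
    if plan.tensor_name in plan_dests:
        # Cache hit: this buffer was allocated and registered in a prior cycle.
        dest = plan_dests[plan.tensor_name]
    else:
        # Cache miss: allocate now and count it so the register call below runs.
        dest = torch.empty(plan.target_shape, ...)
        plan_dests[plan.tensor_name] = dest
        newly_allocated_this_cycle += 1
    # ...

# Register with NIXL only when this cycle allocated at least one new buffer.
if newly_allocated_this_cycle > 0 and plan_dests:
    self._mx_receiver._receiver._nixl.register_tensors(plan_dests)
    self._mx_megatron_plan_dests = plan_dests
# else: skip register entirely on cache hit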

The cache survives for the lifetime of the worker. Plans that fall back to the v0 scratch path don't break the cache for plans that were v1 in prior cycles.
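
The caching pattern above can be exercised without a GPU or NIXL. The following is a minimal sketch, not the code in vllm_backend.py: `Plan`, `FakeRegistrar`, and `Worker` are hypothetical stand-ins, `bytearray` replaces `torch.empty`, and the fake `register_tensors` only counts calls to show when registration would happen.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a mixed-TP refit plan."""
    tensor_name: str
    target_shape: tuple


@dataclass
class FakeRegistrar:
    """Hypothetical stand-in for the NIXL agent; it only counts calls."""
    calls: int = 0

    def register_tensors(self, tensors: dict) -> None:
        self.calls += 1


class Worker:
    def __init__(self, registrar: FakeRegistrar) -> None:
        self._nixl = registrar

    def refit(self, plans: list) -> int:
        """Run one refit cycle; return how many buffers were newly allocated."""
        plan_dests = getattr(self, "_mx_megatron_plan_dests", None) or {}
        newly_allocated = 0
        for plan in plans:
            if plan.tensor_name not in plan_dests:
                # bytearray stands in for torch.empty(plan.target_shape, ...)
                plan_dests[plan.tensor_name] = bytearray(prod(plan.target_shape))
                newly_allocated += 1
        # Register only when something new was allocated this cycle.
        if newly_allocated > 0 and plan_dests:
            self._nixl.register_tensors(plan_dests)
            self._mx_megatron_plan_dests = plan_dests
        return newly_allocated


plans = [Plan("layers.0.qkv", (4, 8)), Plan("layers.0.mlp", (8, 2))]
registrar = FakeRegistrar()
worker = Worker(registrar)
allocated_per_cycle = [worker.refit(plans) for _ in range(3)]
print(allocated_per_cycle, registrar.calls)
```

Three cycles over the same plans allocate two buffers in cycle 1 and none afterwards, so the registrar is called once. If a later cycle introduces a new tensor name, this sketch, like the snippet above, passes the whole dict to `register_tensors` again; whether passing already-registered buffers a second time is cheap depends on NIXL behaviour that this PR does not describe.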

Validation

Cluster-validated on GB200 + Qwen3-4B-Thinking-2507 via 3 back-to-back refit cycles. The benchmark exercises the same MxV2RefitReceiver + NIXL layer that this code path uses, with MX_CACHE_BUFFERS env toggle for A/B comparison (effectively a Python-level cache that mirrors the production code change line-for-line):

                 alloc  register   pull  translate  total   (seconds)
BEFORE FIX:
  Cycle 1       0.032     0.152   0.209     0.016   0.409
  Cycle 2       0.026     0.152   0.203     0.012   0.392
  Cycle 3       0.001     0.024   0.210     0.011   0.246

AFTER FIX:
  Cycle 1       0.028     0.085   0.206     0.014   0.333
  Cycle 2       0.000     0.000   0.204     0.010   0.215   (-45% vs cycle-2 baseline)
  Cycle 3       0.000     0.000   0.204     0.011   0.215

The pull step (~205-210 ms for 8 GB at 308 Gbps single-NIC) is unchanged — the fix touches setup overhead only.
Cycle 2's register cost (0.152 s) goes to zero after the fix.
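
The figures quoted above can be checked with a few lines of arithmetic. This is a sketch under two assumptions not stated in the PR: that 8 GB means 8 × 10^9 bytes, and that the link sustains the full 308 Gbps. The cycle totals are copied from the table.

```python
# Assumption: decimal units (1 GB = 1e9 bytes, 1 Gbps = 1e9 bits/s).
payload_bytes = 8e9
link_bits_per_s = 308e9

# Ideal wire time for the pull step at full line rate.
ideal_pull_s = payload_bytes * 8 / link_bits_per_s

# Cycle-2 totals from the table, in seconds.
before_total_s = 0.392
after_total_s = 0.215

saved_s = before_total_s - after_total_s
relative_change = (after_total_s - before_total_s) / before_total_s

print(f"ideal pull time: {ideal_pull_s * 1e3:.0f} ms")
print(f"warm-cycle saving: {saved_s * 1e3:.0f} ms ({relative_change:.0%})")
```

Under those assumptions the ideal transfer time is about 208 ms, inside the measured 205-210 ms range, and the cycle-2 total drops by about 177 ms, or 45%.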

Scope

Touches only the mixed-TP _update_weights_via_mx_megatron branch in vllm_backend.py.
No public API change; the cache lives entirely on self.
Does NOT address the Dynamo-side mx_refit/extension.py; a parallel PR is going up against ai-dynamo/dynamo for that (ai-dynamo/dynamo#10901, "perf(vllm/mx_refit): cache NIXL-registered dest buffers across refit cycles").

Test Plan

Multi-cycle benchmark on GB200 / Qwen3-4B-Thinking — cycle 1 + cycle 2 + cycle 3 captured before/after, summarised above.
Re-run on a larger multi-receiver setup to validate the extrapolation. (Requires the matching Dynamo extension fix to be deployed too.)

…cles

Audit of refit-cycle wall time on multi-receiver setups surfaced a receiver-side bug: the mixed-TP path in _update_weights_via_mx_megatron re-allocated + re-registered per-plan dest buffers with NIXL on every refit cycle, paying ~0.15 s of `ibv_reg_mr` per cycle on Qwen3-4B and larger costs on bigger models.

Plan shapes are deterministic for a fixed (source_tp, target_tp) layout, so cycle-N's allocations are identical to cycle-1's. Cache plan_dests on `self._mx_megatron_plan_dests` and skip the register call when newly_allocated_this_cycle == 0.

The matched-TP path was already caching via _mx_megatron_buffers (line 815); this commit brings the mixed-TP branch to parity.

Cluster validation on GB200 / Qwen3-4B-Thinking 2026-06-23, 3 back-to-back refit cycles via a standalone benchmark exercising the same MxV2RefitReceiver layer with MX_CACHE_BUFFERS env toggle for A/B:

                 alloc  register   pull  translate  total
Before fix:
  Cycle 1       0.032     0.152   0.209     0.016   0.409
  Cycle 2       0.026     0.152   0.203     0.012   0.392
  Cycle 3       0.001     0.024   0.210     0.011   0.246

After fix:
  Cycle 1       0.028     0.085   0.206     0.014   0.333
  Cycle 2       0.000     0.000   0.204     0.010   0.215   (-45%)
  Cycle 3       0.000     0.000   0.204     0.011   0.215

The pull step (~205-210 ms for 8 GB at 308 Gbps single-NIC) is unchanged; the fix touches setup overhead only. Cycle 2's register cost (0.152 s) goes to zero after the fix, dropping warm-cycle wall by ~180 ms.

Larger models (Llama 3.1 8B at 16 GB, 30B+ MoE) will see proportionally larger savings since `ibv_reg_mr` scales with both buffer count and pinned-memory size.

Companion fix needed in `dynamo/vllm/mx_refit/extension.py` for the matched-TP buffers + mixed-TP plan_dests paths in the Dynamo-side extension. Tracking separately in ai-dynamo/dynamo.

…cycles

Audit of refit-cycle wall time on multi-receiver setups surfaced a receiver-side bug: both the matched-TP and mixed-TP branches in `_update_weights_via_mx_megatron` re-allocated + re-registered per-rank buffers with NIXL on every refit cycle, paying ~0.15 s of `ibv_reg_mr` per cycle on Qwen3-4B (proportionally more on larger models).

Plan shapes (and buffer shapes more generally) are deterministic for a fixed `(source_tp, target_tp)` layout, so cycle-N's allocations are identical to cycle-1's. Cache the buffers / plan_dests dicts on `self._mx_megatron_buffers` and `self._mx_megatron_plan_dests` respectively, and skip the `register_tensors` call when nothing was newly allocated this cycle.

Cluster validation on GB200 / Qwen3-4B-Thinking 2026-06-23, 3 back-to-back refit cycles via a standalone benchmark exercising the same MxV2RefitReceiver + NIXL layer:

                 alloc  register   pull  translate  total
Before fix:
  Cycle 1       0.032     0.152   0.209     0.016   0.409
  Cycle 2       0.026     0.152   0.203     0.012   0.392
  Cycle 3       0.001     0.024   0.210     0.011   0.246

After fix:
  Cycle 1       0.028     0.085   0.206     0.014   0.333
  Cycle 2       0.000     0.000   0.204     0.010   0.215   (-45%)
  Cycle 3       0.000     0.000   0.204     0.011   0.215

Pull step unchanged (~205-210 ms for 8 GB at 308 Gbps single-NIC); the fix touches setup overhead only. Larger models will see proportionally larger savings since `ibv_reg_mr` scales with both buffer count and pinned-memory size.

The matching fix in the NeMo-RL trainer-side `vllm_backend.py` lives at jthomson04/RL#7.

KavinKrishnan force-pushed the kavink/megatron-mx-perf branch from f913302 to 4d81436 on June 24, 2026 04:56

KavinKrishnan mentioned this pull request Jun 26, 2026

feat(RL/post-2389): MX V2 Support in RL + WeightTransferEngine Support ai-dynamo/modelexpress#349

Open

KavinKrishnan wants to merge 1 commit into jthomson04:dynamo-k8s-integration from KavinKrishnan:kavink/megatron-mx-perf
