perf(megatron-mx): cache NIXL-registered dest buffers across refit cycles #7

Open
KavinKrishnan wants to merge 1 commit into jthomson04:dynamo-k8s-integration from KavinKrishnan:kavink/megatron-mx-perf

KavinKrishnan commented Jun 23, 2026

Summary

Cache NIXL-registered dest buffers across Megatron-MX refit cycles in vllm_backend.py's mixed-TP branch. Plan shapes are deterministic for a fixed (source_tp, target_tp) layout, so re-allocating and re-registering NIXL buffers on every refit cycle is wasted work: ~0.15 s per cycle on small models and proportionally more on larger ones.

Follow-up to #2. The matched-TP path in the same file already cached on _mx_megatron_buffers (line 815); this brings the mixed-TP plan_dests branch to parity.

Motivation

An audit of refit-cycle wall time on multi-receiver setups surfaced a per-cycle bottleneck: NIXL register_tensors was called on every refit for ~290 buffers on Qwen3-4B (8 GB), costing ~150 ms of ibv_reg_mr overhead per receiver per cycle. The buffer shapes never change for a fixed TP layout, so the registration cost can be paid once, in cycle 1, and skipped afterwards.
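
The caching argument rests on shapes staying fixed for a given (source_tp, target_tp) layout. A minimal sketch of how that assumption could be checked before trusting a cache is shown here. The helper names `needs_full_reset` and `stale_names` are hypothetical and are not part of this PR; shapes are plain tuples so the example runs without torch.

```python
def needs_full_reset(cached_layout, layout):
    """True when a cache exists but was built for a different TP layout."""
    return cached_layout is not None and cached_layout != layout


def stale_names(cached_shapes, planned_shapes):
    """Names that are cached but whose planned shape differs this cycle.

    Both arguments map tensor name -> shape tuple. Names that are planned
    but not yet cached are not stale; they simply need a first allocation.
    """
    return sorted(
        name
        for name, shape in planned_shapes.items()
        if name in cached_shapes and cached_shapes[name] != shape
    )


cached = {"a": (4, 8), "b": (16,)}
planned = {"a": (4, 8), "b": (32,), "c": (2,)}
print(stale_names(cached, planned))      # only "b" changed shape
print(needs_full_reset((2, 4), (2, 8)))  # layout changed, so drop everything
```

If the determinism claim holds, `stale_names` returns an empty list on every warm cycle and the check costs only a dict walk.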

Change

One block in _update_weights_via_mx_megatron's mixed-TP branch:

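# Reuse dest buffers allocated and registered in earlier refit cycles, if any.
cached_plan_dests = getattr(self, "_mx_megatron_plan_dests", None)
plan_dests = cached_plan_dests or {}
newly_allocated_this_cycle = 0
for plan in plans:
    if plan.tensor_name in plan_dests:
        # Cache hit: reuse the existing buffer for this tensor name.
        dest = plan_dests[plan.tensor_name]
    else:
        # Cache miss: allocate once and remember the buffer for later cycles.
        dest = torch.empty(plan.target_shape, ...)
        plan_dests[plan.tensor_name] = dest
        newly_allocated_this_cycle += 1
    # ...

# Register with NIXL only when this cycle allocated at least one new buffer.
if newly_allocated_this_cycle > 0 and plan_dests:
    self._mx_receiver._receiver._nixl.register_tensors(plan_dests)
    self._mx_megatron_plan_dests = plan_dests
# else: skip register entirely on cache hit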

The cache survives for the lifetime of the worker. Plans that fall back to the v0 scratch path don't break the cache for plans that were v1 in prior cycles.
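
The caching pattern described above can be exercised in isolation. The following is a minimal sketch, not the production code. `FakeRegistrar` and `DestBufferCache` are hypothetical stand-ins, `bytearray` replaces `torch.empty`, and the registrar only counts calls instead of pinning memory. One deliberate difference from the block above: this sketch passes only the newly allocated buffers to `register_tensors`, whereas the PR passes the whole dict. Whether that matters depends on how the real call treats already-registered buffers, which is not stated here.

```python
class FakeRegistrar:
    """Stand-in for the NIXL registration layer (assumption: counts calls only)."""

    def __init__(self):
        self.calls = 0
        self.registered = set()

    def register_tensors(self, tensors):
        self.calls += 1
        self.registered.update(tensors)  # records the tensor names


class DestBufferCache:
    """Keeps dest buffers alive across cycles and registers each one once."""

    def __init__(self, registrar):
        self._registrar = registrar
        self._dests = {}

    def prepare(self, plans):
        """plans: list of (tensor_name, nbytes). Returns name -> buffer."""
        new = {}
        for name, nbytes in plans:
            if name not in self._dests:
                new[name] = bytearray(nbytes)
        if new:  # skip registration entirely on a full cache hit
            self._registrar.register_tensors(new)
            self._dests.update(new)
        return {name: self._dests[name] for name, _ in plans}


registrar = FakeRegistrar()
cache = DestBufferCache(registrar)
plans = [("w1", 16), ("w2", 32)]
first = cache.prepare(plans)
second = cache.prepare(plans)
print(registrar.calls, first["w1"] is second["w1"])
```

A plan that first appears in a later cycle triggers one more registration for that buffer alone, and earlier entries stay untouched.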

Validation

Cluster-validated on GB200 + Qwen3-4B-Thinking-2507 over 3 back-to-back refit cycles. The benchmark exercises the same MxV2RefitReceiver + NIXL layer that this code path uses, with an MX_CACHE_BUFFERS env toggle for A/B comparison (effectively a Python-level cache that mirrors the production code change line-for-line). All times are in seconds:

                 alloc  register   pull  translate  total
BEFORE FIX:
  Cycle 1       0.032     0.152   0.209     0.016   0.409
  Cycle 2       0.026     0.152   0.203     0.012   0.392
  Cycle 3       0.001     0.024   0.210     0.011   0.246

AFTER FIX:
  Cycle 1       0.028     0.085   0.206     0.014   0.333
  Cycle 2       0.000     0.000   0.204     0.010   0.215   (-45% vs cycle-2 baseline)
  Cycle 3       0.000     0.000   0.204     0.011   0.215
  • The pull step (~205-210 ms for 8 GB at 308 Gbps single-NIC) is unchanged — the fix touches setup overhead only.
  • Cycle 2's register cost (0.152 s) goes to zero after the fix.
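
The headline figures can be re-derived from the table. This is a quick arithmetic check using only the numbers reported above; the one assumption is that "8 GB" means 8 × 10^9 bytes.

```python
# Totals (seconds) copied from the table above, keyed by cycle number.
before_total = {1: 0.409, 2: 0.392, 3: 0.246}
after_total = {1: 0.333, 2: 0.215, 3: 0.215}

# Warm-cycle (cycle 2) saving in absolute and relative terms.
saved_ms = (before_total[2] - after_total[2]) * 1000
pct_change = (after_total[2] - before_total[2]) / before_total[2] * 100

# Implied pull throughput: 8 GB = 64 Gb moved in 205-210 ms.
payload_gbit = 8 * 8
gbps_fast = payload_gbit / 0.205
gbps_slow = payload_gbit / 0.210

print(f"cycle-2 saving: {saved_ms:.0f} ms ({pct_change:.0f}%)")
print(f"pull throughput: {gbps_slow:.0f}-{gbps_fast:.0f} Gbps")
```

The 177 ms saving matches the -45% annotation on cycle 2, and the implied throughput range brackets the quoted 308 Gbps.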

Scope

Test Plan

  • Multi-cycle benchmark on GB200 / Qwen3-4B-Thinking — cycle 1 + cycle 2 + cycle 3 captured before/after, summarised above.
  • Re-run on a larger multi-receiver setup to validate the extrapolation. (Requires the matching Dynamo extension fix to be deployed too.)
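
For re-running the A/B comparison, a harness along the following lines could capture per-phase timings. This is a sketch under stated assumptions, not the benchmark used above. `timed`, `run_cycles`, and `cache_enabled_from_env` are hypothetical names, the `alloc` and `register` callables are supplied by the caller, and treating `MX_CACHE_BUFFERS=1` as "cache on" is an assumption, since the toggle's accepted values are not given here.

```python
import os
import time


def timed(phases, name, fn):
    """Run fn(), store its wall time in phases[name], and return its result."""
    start = time.perf_counter()
    result = fn()
    phases[name] = time.perf_counter() - start
    return result


def run_cycles(n_cycles, cache_enabled, alloc, register):
    """Return one {'alloc': s, 'register': s} dict per refit cycle."""
    cached = None
    rows = []
    for _ in range(n_cycles):
        phases = {"alloc": 0.0, "register": 0.0}
        if cached is None or not cache_enabled:
            bufs = timed(phases, "alloc", alloc)
            timed(phases, "register", lambda: register(bufs))
            cached = bufs
        rows.append(phases)
    return rows


def cache_enabled_from_env(environ=os.environ):
    # Assumption: "1" turns the cache on; anything else leaves it off.
    return environ.get("MX_CACHE_BUFFERS", "0") == "1"


rows = run_cycles(
    3,
    cache_enabled_from_env(),
    alloc=lambda: [bytearray(1024) for _ in range(4)],
    register=lambda bufs: None,  # placeholder for the real registration call
)
for i, row in enumerate(rows, start=1):
    print(f"cycle {i}: alloc={row['alloc']:.6f}s register={row['register']:.6f}s")
```

With the cache on, cycles after the first report exactly 0.0 for both phases because neither callable runs.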

…cles

Audit of refit-cycle wall time on multi-receiver setups surfaced a
receiver-side bug: the mixed-TP path in _update_weights_via_mx_megatron
re-allocated + re-registered per-plan dest buffers with NIXL on every
refit cycle, paying ~0.15s of `ibv_reg_mr` per cycle on Qwen3-4B and
larger costs on bigger models.

Plan shapes are deterministic for a fixed (source_tp, target_tp)
layout, so cycle-N's allocations are identical to cycle-1's. Cache
plan_dests on `self._mx_megatron_plan_dests` and skip the register
call when newly_allocated_this_cycle == 0. The matched-TP path was
already caching via _mx_megatron_buffers (line 815); this commit
brings the mixed-TP branch to parity.

Cluster validation on GB200 / Qwen3-4B-Thinking 2026-06-23, 3
back-to-back refit cycles via a standalone benchmark exercising the
same MxV2RefitReceiver layer with MX_CACHE_BUFFERS env toggle for A/B:

                 alloc  register   pull  translate  total
  Before fix:
    Cycle 1     0.032     0.152   0.209     0.016   0.409
    Cycle 2     0.026     0.152   0.203     0.012   0.392
    Cycle 3     0.001     0.024   0.210     0.011   0.246

  After fix:
    Cycle 1     0.028     0.085   0.206     0.014   0.333
    Cycle 2     0.000     0.000   0.204     0.010   0.215   (-45%)
    Cycle 3     0.000     0.000   0.204     0.011   0.215

The pull step (~205-210 ms for 8 GB at 308 Gbps single-NIC) is
unchanged — the fix touches setup overhead only. Cycle 2's
register cost (0.152 s) goes to zero after the fix, dropping
warm-cycle wall by ~180 ms.

Larger models (Llama 3.1 8B at 16 GB, 30B+ MoE) will see
proportionally larger savings since `ibv_reg_mr` scales with both
buffer count and pinned-memory size.

Companion fix needed in `dynamo/vllm/mx_refit/extension.py` for the
matched-TP buffers + mixed-TP plan_dests paths in the Dynamo-side
extension. Tracking separately in ai-dynamo/dynamo.

KavinKrishnan force-pushed the kavink/megatron-mx-perf branch from f913302 to 4d81436 June 24, 2026 04:56
KavinKrishnan added a commit to KavinKrishnan/dynamo that referenced this pull request Jun 24, 2026
…cycles

Audit of refit-cycle wall time on multi-receiver setups surfaced a
receiver-side bug: both the matched-TP and mixed-TP branches in
`_update_weights_via_mx_megatron` re-allocated + re-registered per-rank
buffers with NIXL on every refit cycle, paying ~0.15 s of `ibv_reg_mr`
per cycle on Qwen3-4B (proportionally more on larger models).

Plan shapes (and buffer shapes more generally) are deterministic for a
fixed `(source_tp, target_tp)` layout, so cycle-N's allocations are
identical to cycle-1's. Cache the buffers / plan_dests dicts on
`self._mx_megatron_buffers` and `self._mx_megatron_plan_dests`
respectively, and skip the `register_tensors` call when nothing was
newly allocated this cycle.

Cluster validation on GB200 / Qwen3-4B-Thinking 2026-06-23, 3
back-to-back refit cycles via a standalone benchmark exercising the
same MxV2RefitReceiver + NIXL layer:

                 alloc  register   pull  translate  total
  Before fix:
    Cycle 1     0.032     0.152   0.209     0.016   0.409
    Cycle 2     0.026     0.152   0.203     0.012   0.392
    Cycle 3     0.001     0.024   0.210     0.011   0.246

  After fix:
    Cycle 1     0.028     0.085   0.206     0.014   0.333
    Cycle 2     0.000     0.000   0.204     0.010   0.215   (-45%)
    Cycle 3     0.000     0.000   0.204     0.011   0.215

Pull step unchanged (~205-210 ms for 8 GB at 308 Gbps single-NIC); the
fix touches setup overhead only. Larger models will see proportionally
larger savings since `ibv_reg_mr` scales with both buffer count and
pinned-memory size.

The matching fix in the NeMo-RL trainer-side `vllm_backend.py` lives
at jthomson04/RL#7.
jthomson04 pushed a commit to ai-dynamo/dynamo that referenced this pull request Jun 25, 2026