perf(megatron-mx): cache NIXL-registered dest buffers across refit cycles #7
Open
KavinKrishnan wants to merge 1 commit into
This was referenced Jun 23, 2026
…cles
Audit of refit-cycle wall time on multi-receiver setups surfaced a
receiver-side bug: the mixed-TP path in _update_weights_via_mx_megatron
re-allocated + re-registered per-plan dest buffers with NIXL on every
refit cycle, paying ~0.15s of `ibv_reg_mr` per cycle on Qwen3-4B and
larger costs on bigger models.
Plan shapes are deterministic for a fixed (source_tp, target_tp)
layout, so cycle-N's allocations are identical to cycle-1's. Cache
plan_dests on `self._mx_megatron_plan_dests` and skip the register
call when newly_allocated_this_cycle == 0. The matched-TP path was
already caching via _mx_megatron_buffers (line 815); this commit
brings the mixed-TP branch to parity.
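The caching pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in `vllm_backend.py`: the `PlanDestCache` class, its `prepare` method, and the `register` callback are hypothetical names, and plain `bytearray` objects stand in for the pinned tensors that would be handed to NIXL.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]


class PlanDestCache:
    """Hypothetical stand-in for the receiver state that holds plan_dests."""

    def __init__(self, register: Callable[[List[bytearray]], None]) -> None:
        self._register = register  # stand-in for the NIXL register call
        self._plan_dests: Dict[str, bytearray] = {}
        self.register_calls = 0

    def prepare(self, plan_shapes: Dict[str, Shape], itemsize: int = 2) -> Dict[str, bytearray]:
        newly_allocated: List[bytearray] = []
        for name, shape in plan_shapes.items():
            if name in self._plan_dests:
                continue  # same layout as an earlier cycle: reuse the buffer
            nbytes = itemsize
            for dim in shape:
                nbytes *= dim
            buf = bytearray(nbytes)
            self._plan_dests[name] = buf
            newly_allocated.append(buf)
        # Register only what is new; warm cycles skip the call entirely.
        if newly_allocated:
            self._register(newly_allocated)
            self.register_calls += 1
        return self._plan_dests


registered: List[bytearray] = []
cache = PlanDestCache(register=registered.extend)
shapes = {"layers.0.qkv": (4, 8), "layers.0.mlp": (8, 8)}  # made-up plan shapes
first = dict(cache.prepare(shapes))
for _ in range(2):  # two further cycles with the identical layout
    dests = cache.prepare(shapes)
```

In this sketch the registration callback runs once for three cycles, and the buffers returned on the later cycles are the same objects allocated on the first.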
Cluster validation on GB200 / Qwen3-4B-Thinking 2026-06-23, 3
back-to-back refit cycles via a standalone benchmark exercising the
same MxV2RefitReceiver layer with MX_CACHE_BUFFERS env toggle for A/B:
(all times in seconds)
             alloc   register  pull    translate  total
Before fix:
  Cycle 1    0.032   0.152     0.209   0.016      0.409
  Cycle 2    0.026   0.152     0.203   0.012      0.392
  Cycle 3    0.001   0.024     0.210   0.011      0.246
After fix:
  Cycle 1    0.028   0.085     0.206   0.014      0.333
  Cycle 2    0.000   0.000     0.204   0.010      0.215  (-45%)
  Cycle 3    0.000   0.000     0.204   0.011      0.215
The pull step (~205-210 ms for 8 GB at 308 Gbps single-NIC) is
unchanged — the fix touches setup overhead only. Cycle 2's
register cost (0.152 s) goes to zero after the fix, dropping
warm-cycle wall by ~180 ms.
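As a quick sanity check on the figures quoted above, the arithmetic can be reproduced in a few lines. This assumes "8 GB" means 8e9 bytes and that 308 Gbps is the sustained rate; both are interpretations, not statements from the benchmark.

```python
# Assumption: decimal units (8 GB = 8e9 bytes, 308 Gbps = 308e9 bit/s).
payload_bytes = 8e9
link_bits_per_s = 308e9
ideal_pull_s = payload_bytes * 8 / link_bits_per_s

# Cycle 2 totals taken from the table (seconds).
warm_before_s = 0.392
warm_after_s = 0.215
saved_s = warm_before_s - warm_after_s
saved_pct = 100 * saved_s / warm_before_s

print(f"ideal pull: {ideal_pull_s * 1e3:.0f} ms")   # ideal pull: 208 ms
print(f"warm-cycle saving: {saved_s * 1e3:.0f} ms")  # warm-cycle saving: 177 ms
print(f"relative saving: {saved_pct:.0f}%")          # relative saving: 45%
```

The computed pull time falls inside the reported 205-210 ms range, and the 177 ms difference matches the "~180 ms" and "-45%" figures.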
Larger models (Llama 3.1 8B at 16 GB, 30B+ MoE) will see
proportionally larger savings since `ibv_reg_mr` scales with both
buffer count and pinned-memory size.
Companion fix needed in `dynamo/vllm/mx_refit/extension.py` for the
matched-TP buffers + mixed-TP plan_dests paths in the Dynamo-side
extension. Tracking separately in ai-dynamo/dynamo.
f913302 to
4d81436
KavinKrishnan
added a commit
to KavinKrishnan/dynamo
that referenced
this pull request
Jun 24, 2026
…cycles
Audit of refit-cycle wall time on multi-receiver setups surfaced a
receiver-side bug: both the matched-TP and mixed-TP branches in
`_update_weights_via_mx_megatron` re-allocated + re-registered per-rank
buffers with NIXL on every refit cycle, paying ~0.15 s of `ibv_reg_mr`
per cycle on Qwen3-4B (proportionally more on larger models).
Plan shapes (and buffer shapes more generally) are deterministic for a
fixed `(source_tp, target_tp)` layout, so cycle-N's allocations are
identical to cycle-1's. Cache the buffers / plan_dests dicts on
`self._mx_megatron_buffers` and `self._mx_megatron_plan_dests`
respectively, and skip the `register_tensors` call when nothing was
newly allocated this cycle.
Cluster validation on GB200 / Qwen3-4B-Thinking 2026-06-23, 3
back-to-back refit cycles via a standalone benchmark exercising the
same MxV2RefitReceiver + NIXL layer:
(all times in seconds)
             alloc   register  pull    translate  total
Before fix:
  Cycle 1    0.032   0.152     0.209   0.016      0.409
  Cycle 2    0.026   0.152     0.203   0.012      0.392
  Cycle 3    0.001   0.024     0.210   0.011      0.246
After fix:
  Cycle 1    0.028   0.085     0.206   0.014      0.333
  Cycle 2    0.000   0.000     0.204   0.010      0.215  (-45%)
  Cycle 3    0.000   0.000     0.204   0.011      0.215
Pull step unchanged (~205-210 ms for 8 GB at 308 Gbps single-NIC); the
fix touches setup overhead only. Larger models will see proportionally
larger savings since `ibv_reg_mr` scales with both buffer count and
pinned-memory size.
The matching fix in the NeMo-RL trainer-side `vllm_backend.py` lives
at jthomson04/RL#7.
jthomson04
pushed a commit
to ai-dynamo/dynamo
that referenced
this pull request
Jun 25, 2026
…cycles
Summary
Cache NIXL-registered dest buffers across Megatron-MX refit cycles in `vllm_backend.py`'s mixed-TP branch. Plan shapes are deterministic for a fixed `(source_tp, target_tp)` layout, so re-allocating and re-registering NIXL buffers on every refit cycle wastes ~0.15 s per cycle on small models and proportionally more on larger ones.
Follow-up to #2. The matched-TP path in the same file already cached on `_mx_megatron_buffers` (line 815); this brings the mixed-TP `plan_dests` branch to parity.
Motivation
Audit of refit-cycle wall time on multi-receiver setups surfaced a per-cycle bottleneck: NIXL `register_tensors` was being called every refit for ~290 buffers on Qwen3-4B (8 GB), costing ~150 ms of `ibv_reg_mr` overhead per receiver per cycle. The buffer shapes never change for a fixed TP layout, so the register call can be amortized to cycle 1 only.
Change
One block in `_update_weights_via_mx_megatron`'s mixed-TP branch:
The cache survives for the lifetime of the worker. Plans that fall back to the v0 scratch path don't break the cache for plans that were v1 in prior cycles.
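One way to read the fallback behaviour is that a plan using the scratch path in a given cycle simply bypasses the cache without evicting its entry. The sketch below illustrates that reading only; `run_cycle`, the plan tuples, and the `use_scratch` flag are hypothetical, and the actual v0/v1 plan handling in the PR may differ.

```python
from typing import Dict, Iterable, Tuple

# (plan name, buffer size in bytes, whether this cycle falls back to scratch)
Plan = Tuple[str, int, bool]


def run_cycle(cache: Dict[str, bytearray], plans: Iterable[Plan]) -> int:
    """Return how many buffers were newly allocated and would need registering."""
    newly_allocated = 0
    for name, nbytes, use_scratch in plans:
        if use_scratch:
            # Fallback path: use a throwaway buffer and leave the cache untouched.
            _scratch = bytearray(nbytes)
            continue
        if name not in cache:
            cache[name] = bytearray(nbytes)
            newly_allocated += 1
    return newly_allocated


cache: Dict[str, bytearray] = {}
first = run_cycle(cache, [("a", 16, False), ("b", 32, False)])
a_buf = cache["a"]
second = run_cycle(cache, [("a", 16, True), ("b", 32, False)])   # "a" falls back
third = run_cycle(cache, [("a", 16, False), ("b", 32, False)])   # "a" is cached again
```

Here only the first cycle allocates anything; the cycle in which plan "a" falls back leaves its cached buffer in place, so the third cycle reuses it.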
Validation
Cluster-validated on GB200 + Qwen3-4B-Thinking-2507 via 3 back-to-back refit cycles. The benchmark exercises the same `MxV2RefitReceiver` + NIXL layer that this code path uses, with `MX_CACHE_BUFFERS` env toggle for A/B comparison (effectively a Python-level cache that mirrors the production code change line-for-line):
Cycle 2's `register` cost (0.152 s) goes to zero after the fix.
Scope
`_update_weights_via_mx_megatron` branch in `vllm_backend.py`.
self.mx_refit/extension.py; a parallel PR is going up against `ai-dynamo/dynamo` for that (perf(vllm/mx_refit): cache NIXL-registered dest buffers across refit cycles, ai-dynamo/dynamo#10901).
Test Plan