perf(mx): improve Dynamo weight refit fanout by jthomson04 · Pull Request #8 · jthomson04/RL

jthomson04 · 2026-06-24T04:45:55Z

Summary

- add JSON-safe MX source candidates/source plans so trainer publishers can pass exact-version direct NIXL metadata into the Dynamo refit dispatcher
- update the Dynamo MX dispatcher to send source plans per fanout wave and collect newly-published inference replica candidates for later waves
- tighten Megatron/vLLM MX source selection to exact-version eligible trainer/replica sources and spread choices by receiver identity
- add stripe NIC pinning support/fallback plus temporary weight-sync timing telemetry for the MX/NCCL comparison work
- commit the temporary DGD sitecustomize.py overlay and exact tiny k8s/DGD benchmark configs needed to reproduce the PR benchmark
- add focused unit coverage for source plans, NIC pinning, source selection, and GRPO passing trainer source candidates into generation refit
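
For reviewers who want a mental model of the wave-based fanout, here is a minimal sketch. It is not the code in this PR: `SourceCandidate`, `eligible`, `pick_source`, and `plan_waves` are hypothetical names, the CRC32-based spreading is an assumed stand-in for "spread choices by receiver identity", and the wave sizing (one receiver per available source) is an assumption made only for illustration.

```python
import json
import zlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceCandidate:
    """Hypothetical JSON-safe description of one weight source."""
    source_id: str
    kind: str     # "trainer" or "replica"
    version: int  # weight version this source currently holds


def eligible(candidates, version):
    """Keep only sources that hold exactly the requested version."""
    return [c for c in candidates if c.version == version]


def pick_source(receiver_id, candidates):
    """Choose a source using a stable hash of the receiver identity."""
    ordered = sorted(candidates, key=lambda c: c.source_id)
    return ordered[zlib.crc32(receiver_id.encode()) % len(ordered)]


def plan_waves(trainers, receivers, version):
    """Build one plan per wave; refit receivers become sources for later waves."""
    sources = eligible(trainers, version)
    if not sources:
        raise ValueError(f"no source holds version {version}")
    pending = list(receivers)
    waves = []
    while pending:
        # Assumed sizing: a wave serves as many receivers as there are sources.
        batch, pending = pending[: len(sources)], pending[len(sources):]
        waves.append({r: asdict(pick_source(r, sources)) for r in batch})
        # Receivers that just loaded this version can serve the next wave.
        sources = sources + [SourceCandidate(r, "replica", version) for r in batch]
    return waves


trainers = [
    SourceCandidate("trainer-0", "trainer", 2),
    SourceCandidate("trainer-1", "trainer", 1),  # stale version, filtered out
]
waves = plan_waves(trainers, [f"decode-{i}" for i in range(7)], version=2)
print([len(w) for w in waves])  # wave sizes: [1, 2, 4]
print(json.dumps(waves[0]))     # plans are plain dicts, so they serialize directly
```

With one eligible trainer and seven receivers, the source pool doubles each wave, which is the reason later waves can draw on newly-published replicas instead of going back to the trainer.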

Benchmark

Both runs use the same tiny config (4 trainer GPUs, 4 Dynamo decode workers), with MX_POOL_REG=1 and striped RoCE on both sides.

| Version | Before | After | Delta | Speedup |
| --- | --- | --- | --- | --- |
| v1 | `1.128s` | `0.498s` | `-0.630s` / `-55.9%` | `2.27x` |
| v2 | `1.176s` | `0.506s` | `-0.670s` / `-57.0%` | `2.32x` |

Baseline run: training-1782262066
New run: training-1782275425
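
The Delta and Speedup columns follow directly from the Before and After timings. The short check below recomputes them; `refit_delta` is a hypothetical helper name, and the only inputs are the four timings reported in the table.

```python
def refit_delta(before_s, after_s):
    """Return (delta seconds, delta percent, speedup) for one refit timing pair."""
    delta = after_s - before_s
    pct = 100.0 * delta / before_s
    speedup = before_s / after_s
    return round(delta, 3), round(pct, 1), round(speedup, 2)


for version, before_s, after_s in [("v1", 1.128, 0.498), ("v2", 1.176, 0.506)]:
    print(version, refit_delta(before_s, after_s))
```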

Repro files now included in this branch:

sitecustomize.py provides the temporary Dynamo worker-side source-plan consumer, direct metadata receive path, and benchmark telemetry.
infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml
infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
infra/nrl_k8s/examples_dgd/k8s_exemplars/V2/llama3_1_8b_instruct_gb300_mx_tree4_samever_poolreg_debug.yaml

Expected reproduction command from a checkout of this branch at /mnt/rl-workspace/$USER/nemo-rl:

RECIPE=infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml
INFRA=infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
nrl-k8s check "$RECIPE" --infra "$INFRA"
nrl-k8s run "$RECIPE" --infra "$INFRA" --raycluster --no-wait --replace --recreate

A valid reproduction should show DGD logs with receive_path=source_plan, received_bytes=15.010GB, and client_get_metadata_count=0 for warm transfers.
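
One way to check those three fields across collected DGD logs is sketched below. The exact log line layout is not specified in this PR, so the sketch assumes the values appear as `key=value` tokens on a single line; `EXPECTED`, `parse_fields`, and `is_valid_warm_transfer` are hypothetical names, and the sample line is made up for illustration.

```python
import re

# Field values a valid warm transfer is expected to report (taken from the text above).
EXPECTED = {
    "receive_path": "source_plan",
    "received_bytes": "15.010GB",
    "client_get_metadata_count": "0",
}

# Assumption: fields are logged as key=value tokens separated by spaces or commas.
_FIELD = re.compile(r"(\w+)=([^\s,]+)")


def parse_fields(line):
    """Extract key=value tokens from one log line into a dict."""
    return dict(_FIELD.findall(line))


def is_valid_warm_transfer(line):
    """True if the line reports every expected field with the expected value."""
    fields = parse_fields(line)
    return all(fields.get(key) == value for key, value in EXPECTED.items())


# Hypothetical sample line, not copied from a real run.
sample = "refit done receive_path=source_plan, received_bytes=15.010GB, client_get_metadata_count=0"
print(is_valid_warm_transfer(sample))
```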

Draft Notes

This PR still contains temporary benchmark telemetry and a repo-root sitecustomize.py overlay. The overlay is included intentionally so reviewers can reproduce the benchmark result from this branch, but it is not the desired final production shape. The corresponding DGD/ModelExpress-side source-plan support should land upstream or be vendored properly before this is ready to merge.

Credit to @KavinKrishnan for the strided/striped NIC handling that this PR builds on.

Validation

python -m py_compile nemo_rl/algorithms/grpo.py nemo_rl/distributed/mx_helpers.py nemo_rl/distributed/mx_source_plan.py nemo_rl/models/generation/dynamo/dynamo_generation.py nemo_rl/models/generation/interfaces.py nemo_rl/models/generation/vllm/vllm_backend.py nemo_rl/models/generation/vllm/vllm_generation.py nemo_rl/models/policy/interfaces.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/packed_tensor.py tests/unit/algorithms/test_grpo.py tests/unit/distributed/test_mx_helpers.py tests/unit/distributed/test_mx_source_plan.py tests/unit/models/generation/test_vllm_mx_selection.py
python -m ruff check nemo_rl/algorithms/grpo.py nemo_rl/distributed/mx_helpers.py nemo_rl/distributed/mx_source_plan.py nemo_rl/models/generation/dynamo/dynamo_generation.py nemo_rl/models/generation/interfaces.py nemo_rl/models/generation/vllm/vllm_backend.py nemo_rl/models/generation/vllm/vllm_generation.py nemo_rl/models/policy/interfaces.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/packed_tensor.py tests/unit/algorithms/test_grpo.py tests/unit/distributed/test_mx_helpers.py tests/unit/distributed/test_mx_source_plan.py tests/unit/models/generation/test_vllm_mx_selection.py
git diff --cached --check
python -m pytest tests/unit/distributed/test_mx_source_plan.py tests/unit/distributed/test_mx_helpers.py tests/unit/models/generation/test_vllm_mx_selection.py tests/unit/algorithms/test_grpo.py::test_refit_policy_generation_mx_passes_kv_scales
uv run --no-project python -m py_compile sitecustomize.py
/opt/nemo_rl_venv/bin/ruff check sitecustomize.py
nrl-k8s check infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml --infra infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
k8s benchmark rerun: training-1782275425

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

jthomson04 added 2 commits June 24, 2026 04:45

perf(mx): improve Dynamo weight refit fanout

86ae042

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

perf(mx): add reproducible Dynamo benchmark overlay

774b4a8

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

jthomson04 wants to merge 2 commits into dynamo-k8s-integration from debug/mx-weight-sync-attribution-20260623
