perf(mx): improve Dynamo weight refit fanout #8

Draft
jthomson04 wants to merge 2 commits into dynamo-k8s-integration from debug/mx-weight-sync-attribution-20260623


jthomson04 (Owner) commented Jun 24, 2026

Summary

  • add JSON-safe MX source candidates/source plans so trainer publishers can pass exact-version direct NIXL metadata into the Dynamo refit dispatcher
  • update the Dynamo MX dispatcher to send source plans per fanout wave and collect newly-published inference replica candidates for later waves
  • tighten Megatron/vLLM MX source selection to exact-version eligible trainer/replica sources and spread choices by receiver identity
  • add stripe NIC pinning support/fallback plus temporary weight-sync timing telemetry for the MX/NCCL comparison work
  • commit the temporary DGD sitecustomize.py overlay and exact tiny k8s/DGD benchmark configs needed to reproduce the PR benchmark
  • add focused unit coverage for source plans, NIC pinning, source selection, and GRPO passing trainer source candidates into generation refit
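The first three bullets can be illustrated with a minimal sketch of a JSON-safe source candidate, exact-version filtering, receiver-keyed spreading, and wave-by-wave fanout. Every name here (`SourceCandidate`, `eligible`, `pick_source`, `plan_waves`, and the field names) is hypothetical and is not taken from `nemo_rl/distributed/mx_source_plan.py`; the structures in this branch may look quite different.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceCandidate:
    """Hypothetical JSON-safe description of one weight source."""

    source_id: str  # e.g. a trainer rank or an inference replica name
    version: int  # weight version this source currently holds
    kind: str  # "trainer" or "replica"
    metadata: str  # opaque, already string-encoded transfer metadata


def eligible(candidates, version):
    """Keep only sources that hold exactly the requested version."""
    return [c for c in candidates if c.version == version]


def pick_source(candidates, version, receiver_id):
    """Map a receiver to one eligible source using a stable hash of its id."""
    pool = sorted(eligible(candidates, version), key=lambda c: c.source_id)
    if not pool:
        raise LookupError(f"no source holds version {version}")
    digest = hashlib.sha256(receiver_id.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


def plan_waves(trainers, receivers, version):
    """Build one {receiver: source_id} plan per fanout wave.

    Each wave serves as many receivers as there are current sources. Receivers
    that finish a wave are added as replica sources for the next wave.
    """
    sources = eligible(trainers, version)
    pending, waves = list(receivers), []
    if pending and not sources:
        raise LookupError(f"no trainer holds version {version}")
    while pending:
        wave, pending = pending[: len(sources)], pending[len(sources):]
        waves.append({r: pick_source(sources, version, r).source_id for r in wave})
        # Assumption: a receiver that completed this wave can serve the next.
        sources = sources + [SourceCandidate(r, version, "replica", "") for r in wave]
    return waves


# The candidate is a flat dict of str/int values, so it survives a JSON round trip.
candidate = SourceCandidate("trainer-0", 2, "trainer", "opaque-metadata")
assert SourceCandidate(**json.loads(json.dumps(asdict(candidate)))) == candidate
```

In this sketch, one exact-version trainer and four receivers yield three waves of sizes 1, 2, and 1, because each completed wave adds its receivers to the source pool. Hashing the receiver identity keeps the choice stable across retries without any shared state, at the cost of imperfect balance within a wave.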

Benchmark

Both runs use the same tiny config: 4 trainer GPUs and 4 Dynamo decode workers, with MX_POOL_REG=1 and striped RoCE on both sides.

| Version | Before | After  | Delta             | Speedup |
|---------|--------|--------|-------------------|---------|
| v1      | 1.128s | 0.498s | -0.630s / -55.9%  | 2.27x   |
| v2      | 1.176s | 0.506s | -0.670s / -57.0%  | 2.32x   |
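The Delta and Speedup columns follow directly from the Before and After timings. A short sketch that recomputes them (the helper name `refit_delta` is illustrative only):

```python
def refit_delta(before_s, after_s):
    """Return (absolute delta in seconds, percent change, speedup factor)."""
    delta = after_s - before_s
    return delta, 100.0 * delta / before_s, before_s / after_s


# Timings copied from the benchmark table above.
for version, before_s, after_s in [("v1", 1.128, 0.498), ("v2", 1.176, 0.506)]:
    delta, pct, speedup = refit_delta(before_s, after_s)
    print(f"{version}: {delta:+.3f}s / {pct:+.1f}%  {speedup:.2f}x")
```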

Baseline run: training-1782262066
New run: training-1782275425

Repro files now included in this branch:

  • sitecustomize.py provides the temporary Dynamo worker-side source-plan consumer, direct metadata receive path, and benchmark telemetry.
  • infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml
  • infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
  • infra/nrl_k8s/examples_dgd/k8s_exemplars/V2/llama3_1_8b_instruct_gb300_mx_tree4_samever_poolreg_debug.yaml

Expected reproduction command from a checkout of this branch at /mnt/rl-workspace/$USER/nemo-rl:

RECIPE=infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml
INFRA=infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
nrl-k8s check "$RECIPE" --infra "$INFRA"
nrl-k8s run "$RECIPE" --infra "$INFRA" --raycluster --no-wait --replace --recreate

A valid reproduction should show DGD logs with receive_path=source_plan, received_bytes=15.010GB, and client_get_metadata_count=0 for warm transfers.
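A reproduction could be checked mechanically with something like the sketch below. It assumes the DGD log lines carry these fields as whitespace- or comma-separated `key=value` tokens; the actual line layout is not shown in this PR, so the regex and the `check_line` helper are assumptions, not the real telemetry format.

```python
import re

# Expected values quoted from the PR description (warm transfers only).
EXPECTED = {
    "receive_path": "source_plan",
    "received_bytes": "15.010GB",
    "client_get_metadata_count": "0",
}

# Assumed token shape: key=value, terminated by whitespace or a comma.
_KV = re.compile(r"(\w+)=([^\s,]+)")


def check_line(line, expected=EXPECTED):
    """Return {field: observed} for every mismatch; an empty dict means OK."""
    fields = dict(_KV.findall(line))
    return {k: fields.get(k) for k, v in expected.items() if fields.get(k) != v}
```

A field that is missing from the line is reported with an observed value of `None`, so a log line from the old receive path would be flagged rather than silently accepted.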

Draft Notes

This PR still contains temporary benchmark telemetry and a repo-root sitecustomize.py overlay. That overlay is intentionally included now so reviewers can reproduce the benchmark result from this branch, but it is not the desired final production shape. The corresponding DGD/ModelExpress-side source-plan support should still land upstream or be vendored properly before this is ready to merge.

Credit to @KavinKrishnan for the strided/striped NIC handling that this PR builds on.

Validation

  • python -m py_compile nemo_rl/algorithms/grpo.py nemo_rl/distributed/mx_helpers.py nemo_rl/distributed/mx_source_plan.py nemo_rl/models/generation/dynamo/dynamo_generation.py nemo_rl/models/generation/interfaces.py nemo_rl/models/generation/vllm/vllm_backend.py nemo_rl/models/generation/vllm/vllm_generation.py nemo_rl/models/policy/interfaces.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/packed_tensor.py tests/unit/algorithms/test_grpo.py tests/unit/distributed/test_mx_helpers.py tests/unit/distributed/test_mx_source_plan.py tests/unit/models/generation/test_vllm_mx_selection.py
  • python -m ruff check nemo_rl/algorithms/grpo.py nemo_rl/distributed/mx_helpers.py nemo_rl/distributed/mx_source_plan.py nemo_rl/models/generation/dynamo/dynamo_generation.py nemo_rl/models/generation/interfaces.py nemo_rl/models/generation/vllm/vllm_backend.py nemo_rl/models/generation/vllm/vllm_generation.py nemo_rl/models/policy/interfaces.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/packed_tensor.py tests/unit/algorithms/test_grpo.py tests/unit/distributed/test_mx_helpers.py tests/unit/distributed/test_mx_source_plan.py tests/unit/models/generation/test_vllm_mx_selection.py
  • git diff --cached --check
  • python -m pytest tests/unit/distributed/test_mx_source_plan.py tests/unit/distributed/test_mx_helpers.py tests/unit/models/generation/test_vllm_mx_selection.py tests/unit/algorithms/test_grpo.py::test_refit_policy_generation_mx_passes_kv_scales
  • uv run --no-project python -m py_compile sitecustomize.py
  • /opt/nemo_rl_venv/bin/ruff check sitecustomize.py
  • nrl-k8s check infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml --infra infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
  • k8s benchmark rerun: training-1782275425

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>