perf(mx): improve Dynamo weight refit fanout #8
Draft
jthomson04 wants to merge 2 commits into
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Summary
- stripe NIC pinning support/fallback plus temporary weight-sync timing telemetry for the MX/NCCL comparison work
- sitecustomize.py overlay and exact tiny k8s/DGD benchmark configs needed to reproduce the PR benchmark

Benchmark
Same tiny 4 trainer GPU / 4 Dynamo decode worker config, with MX_POOL_REG=1 and striped RoCE on both sides.

| Baseline | New | Delta | Speedup |
| --- | --- | --- | --- |
| 1.128s | 0.498s | -0.630s / -55.9% | 2.27x |
| 1.176s | 0.506s | -0.670s / -57.0% | 2.32x |

Baseline run: training-1782262066
New run: training-1782275425

Repro files now included in this branch:
- sitecustomize.py provides the temporary Dynamo worker-side source-plan consumer, direct metadata receive path, and benchmark telemetry.
- infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml
- infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
- infra/nrl_k8s/examples_dgd/k8s_exemplars/V2/llama3_1_8b_instruct_gb300_mx_tree4_samever_poolreg_debug.yaml

Expected reproduction command from a checkout of this branch at /mnt/rl-workspace/$USER/nemo-rl:

A valid reproduction should show DGD logs with receive_path=source_plan, received_bytes=15.010GB, and client_get_metadata_count=0 for warm transfers.

Draft Notes
This PR still contains temporary benchmark telemetry and a repo-root sitecustomize.py overlay. That overlay is intentionally included now so reviewers can reproduce the benchmark result from this branch, but it is not the desired final production shape. The corresponding DGD/ModelExpress-side source-plan support should still land upstream or be vendored properly before this is ready to merge.

Credit to @KavinKrishnan for the strided/striped NIC handling that this PR builds on.
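For reviewers who want to sanity-check the Benchmark figures, here is a minimal sketch of the arithmetic behind the delta, percent change, and speedup values. The function name `compare` is only illustrative and is not part of this branch; the timings are copied from the Benchmark section.

```python
# Timings in seconds, copied from the Benchmark section: (baseline, new).
ROWS = [(1.128, 0.498), (1.176, 0.506)]


def compare(baseline_s, new_s):
    """Return (delta seconds, percent change, speedup factor), rounded."""
    delta = new_s - baseline_s          # negative means the new run is faster
    pct = 100.0 * delta / baseline_s    # change relative to the baseline
    speedup = baseline_s / new_s        # how many times faster the new run is
    return round(delta, 3), round(pct, 1), round(speedup, 2)


if __name__ == "__main__":
    for baseline_s, new_s in ROWS:
        delta, pct, speedup = compare(baseline_s, new_s)
        print(
            f"{baseline_s:.3f}s -> {new_s:.3f}s  "
            f"{delta:+.3f}s / {pct:+.1f}%  {speedup:.2f}x"
        )
```

Both pairs of timings reproduce the reported deltas and speedups under this rounding.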
Validation
- python -m py_compile nemo_rl/algorithms/grpo.py nemo_rl/distributed/mx_helpers.py nemo_rl/distributed/mx_source_plan.py nemo_rl/models/generation/dynamo/dynamo_generation.py nemo_rl/models/generation/interfaces.py nemo_rl/models/generation/vllm/vllm_backend.py nemo_rl/models/generation/vllm/vllm_generation.py nemo_rl/models/policy/interfaces.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/packed_tensor.py tests/unit/algorithms/test_grpo.py tests/unit/distributed/test_mx_helpers.py tests/unit/distributed/test_mx_source_plan.py tests/unit/models/generation/test_vllm_mx_selection.py
- python -m ruff check nemo_rl/algorithms/grpo.py nemo_rl/distributed/mx_helpers.py nemo_rl/distributed/mx_source_plan.py nemo_rl/models/generation/dynamo/dynamo_generation.py nemo_rl/models/generation/interfaces.py nemo_rl/models/generation/vllm/vllm_backend.py nemo_rl/models/generation/vllm/vllm_generation.py nemo_rl/models/policy/interfaces.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/packed_tensor.py tests/unit/algorithms/test_grpo.py tests/unit/distributed/test_mx_helpers.py tests/unit/distributed/test_mx_source_plan.py tests/unit/models/generation/test_vllm_mx_selection.py
- git diff --cached --check
- python -m pytest tests/unit/distributed/test_mx_source_plan.py tests/unit/distributed/test_mx_helpers.py tests/unit/models/generation/test_vllm_mx_selection.py tests/unit/algorithms/test_grpo.py::test_refit_policy_generation_mx_passes_kv_scales
- uv run --no-project python -m py_compile sitecustomize.py
- /opt/nemo_rl_venv/bin/ruff check sitecustomize.py
- nrl-k8s check infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.yaml --infra infra/nrl_k8s/examples/k8s_exemplars/V2/grpo_llama3_1_8b_instruct_megatron_dynamo_mx_tree_tiny_samever_poolreg_debug.gb300.infra.yaml
- training-1782275425
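
The reproduction criteria from the Benchmark section (receive_path=source_plan, received_bytes=15.010GB, client_get_metadata_count=0 for warm transfers) could be checked against a DGD log line with something like the sketch below. It assumes the telemetry is emitted as `key=value` tokens on one line, which is a guess based on how the fields are quoted in this description. The helper names and the sample lines are hypothetical and are not taken from real DGD output.

```python
import re

# Field values a valid warm transfer should report, per the Benchmark section.
EXPECTED = {
    "receive_path": "source_plan",
    "received_bytes": "15.010GB",
    "client_get_metadata_count": "0",
}

# Assumption: fields appear as key=value tokens separated by spaces or commas.
_PAIR = re.compile(r"(\w+)=([^\s,]+)")


def parse_fields(line):
    """Collect every key=value token on the line into a dict."""
    return dict(_PAIR.findall(line))


def is_valid_warm_transfer(line):
    """True only if every expected field is present with the expected value."""
    fields = parse_fields(line)
    return all(fields.get(key) == value for key, value in EXPECTED.items())


if __name__ == "__main__":
    # Made-up sample line for illustration; the real log layout may differ.
    sample = (
        "weight_sync receive_path=source_plan, "
        "received_bytes=15.010GB, client_get_metadata_count=0"
    )
    print(is_valid_warm_transfer(sample))
```

If the real telemetry uses a different layout (JSON, multi-line, or other separators), only `_PAIR` and `parse_fields` would need to change.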