[NPU][Example] Add Qwen3-32B GRPO training script for Ascend#164
CalvinXKY wants to merge 3 commits into
* update docker patch.
* fix mindspeed.patch try-except formatting per review: replace malformed features_manager hunks with proper try/except/pass blocks.
* add torch_npu.patch for NPU Docker build: wrap eager_connect_single_device in try/except to avoid RuntimeError on A3.
Add scripts/run-qwen3-32B-npu.sh for end-to-end GRPO on Atlas 800I A3 (16 NPUs).
Code Review
This pull request introduces a new bash script, scripts/run-qwen3-32B-npu.sh, designed to configure and launch Qwen3-32B training on NPU clusters using Ray. The code review feedback focuses on correcting a typo in the PYTHONBUFFERED environment variable (which should be PYTHONUNBUFFERED), improving script portability by replacing hardcoded /home/ma-user paths with the ${HOME} environment variable, and enhancing shell script robustness by properly double-quoting variables and array expansions to prevent word splitting.
    SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

    export PYTHONBUFFERED=16
The environment variable to disable Python's stdout/stderr buffering is PYTHONUNBUFFERED, not PYTHONBUFFERED. Because of this typo, Python output buffering will not be disabled, which can lead to delayed or missing logs during training. It should be set to PYTHONUNBUFFERED=1.
Suggested change:
    - export PYTHONBUFFERED=16
    + export PYTHONUNBUFFERED=1
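A misspelled environment variable fails silently, so a small guard near the top of the script can make the mistake visible. The following is a minimal sketch, not part of the PR; `check_unbuffered_env` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): warn about the misspelled variable
# and report whether the real one is set.
check_unbuffered_env() {
    if [[ -n "${PYTHONBUFFERED:-}" ]]; then
        echo "warning: Python does not read PYTHONBUFFERED; use PYTHONUNBUFFERED" >&2
    fi
    # Python treats any non-empty PYTHONUNBUFFERED value as "unbuffered".
    [[ -n "${PYTHONUNBUFFERED:-}" ]]
}

export PYTHONUNBUFFERED=1
if check_unbuffered_env; then
    echo "stdout/stderr buffering is disabled for child Python processes"
fi
```

Because Python only checks that the value is a non-empty string, `1` is the conventional choice; a value such as `16` would behave the same but suggests a buffer size that does not exist.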
    export RAY_DISABLE_SIGINT_OVERRIDE=1
    export HCCL_CONNECT_TIMEOUT=7200

    export PYTHONPATH="/home/ma-user/Megatron-LM:/home/ma-user/vllm:/home/ma-user/vime:${PYTHONPATH}"
Hardcoding /home/ma-user limits the script's portability to other environments or users. Consider using the ${HOME} environment variable instead.
Suggested change:
    - export PYTHONPATH="/home/ma-user/Megatron-LM:/home/ma-user/vllm:/home/ma-user/vime:${PYTHONPATH}"
    + export PYTHONPATH="${HOME}/Megatron-LM:${HOME}/vllm:${HOME}/vime:${PYTHONPATH}"
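One way to take the portability suggestion a step further is to derive all three checkout paths from a single root. This sketch assumes the three repositories sit side by side under one directory; `REPO_ROOT` and `build_pythonpath` are hypothetical names, not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the PYTHONPATH prefix from one root directory.
# $1 = directory containing Megatron-LM, vllm and vime; $2 = existing PYTHONPATH.
build_pythonpath() {
    local root="$1" existing="${2:-}"
    # ${existing:+:...} appends the old value only when it is non-empty,
    # so no trailing ':' is left behind.
    printf '%s' "${root}/Megatron-LM:${root}/vllm:${root}/vime${existing:+:${existing}}"
}

# REPO_ROOT is an assumed override; it falls back to the user's home directory.
REPO_ROOT="${REPO_ROOT:-${HOME}}"
PYTHONPATH="$(build_pythonpath "${REPO_ROOT}" "${PYTHONPATH:-}")"
export PYTHONPATH
echo "${PYTHONPATH}"
```

With this layout, a user whose checkouts live elsewhere can run the script as `REPO_ROOT=/some/dir bash scripts/run-qwen3-32B-npu.sh` without editing it.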
    PROMPT_SET=/data/nfs_87/xky/datasets/gsm8k/train.parquet

    ROLLOUT_ARGS=(
       --prompt-data ${PROMPT_SET}
    )

    export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
    ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 0 --resources '{"NPU": 16}' --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
Double-quote ${MASTER_ADDR} to prevent word splitting.
Suggested change:
    - ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 0 --resources '{"NPU": 16}' --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
    + ray start --head --node-ip-address "${MASTER_ADDR}" --num-gpus 0 --resources '{"NPU": 16}' --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
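The effect of the missing quotes can be seen by counting the arguments a command receives. This is an illustrative sketch; `count_args` and the malformed `addr` value are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Print how many arguments the function was called with.
count_args() { echo "$#"; }

# Hypothetical malformed value: an address variable that contains a space.
addr="10.0.0.1 10.0.0.2"

# Unquoted: the shell splits the value on whitespace into two words.
unquoted_count="$(count_args --node-ip-address ${addr})"
# Quoted: the value is passed through as a single argument.
quoted_count="$(count_args --node-ip-address "${addr}")"

echo "unquoted: ${unquoted_count} args, quoted: ${quoted_count} args"
# prints: unquoted: 3 args, quoted: 2 args
```

With the quoted form, a bad value reaches `ray start` intact and is rejected as one invalid address, rather than being silently reinterpreted as extra arguments.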
    RUNTIME_ENV_JSON='{
      "env_vars": {
        "PYTHONPATH": "/home/ma-user/Megatron-LM:/home/ma-user/vllm:/home/ma-user/vime:/usr/local/Ascend/ascend-toolkit/latest/tools/ms_fmk_transplt/torch_npu_bridge:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:'"$PYTHONPATH"'",
Hardcoding /home/ma-user in the PYTHONPATH inside RUNTIME_ENV_JSON limits portability. Consider using ${HOME} instead.
Suggested change:
    - "PYTHONPATH": "/home/ma-user/Megatron-LM:/home/ma-user/vllm:/home/ma-user/vime:/usr/local/Ascend/ascend-toolkit/latest/tools/ms_fmk_transplt/torch_npu_bridge:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:'"$PYTHONPATH"'",
    + "PYTHONPATH": "'"${HOME}"'/Megatron-LM:'"${HOME}"'/vllm:'"${HOME}"'/vime:/usr/local/Ascend/ascend-toolkit/latest/tools/ms_fmk_transplt/torch_npu_bridge:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:'"$PYTHONPATH"'",
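The repeated `'"${HOME}"'` quote switching in the suggestion is correct but hard to read. A possible alternative is to build the JSON with an unquoted here-document, where variables expand and double quotes stay literal. The sketch below abbreviates the `env_vars` list; `REPO_ROOT` and `ASCEND_TOOLKIT` are hypothetical variable names, and it assumes none of the expanded values contain double quotes or backslashes, which would produce invalid JSON.

```shell
#!/usr/bin/env bash
# Assumed names for illustration; the PR itself hardcodes these paths.
REPO_ROOT="${REPO_ROOT:-${HOME}}"
ASCEND_TOOLKIT="/usr/local/Ascend/ascend-toolkit/latest"

# Unquoted heredoc delimiter (EOF): ${...} expands, '"' is kept literally.
RUNTIME_ENV_JSON="$(cat <<EOF
{
  "env_vars": {
    "PYTHONPATH": "${REPO_ROOT}/Megatron-LM:${REPO_ROOT}/vllm:${REPO_ROOT}/vime:${ASCEND_TOOLKIT}/python/site-packages:${PYTHONPATH:-}",
    "ASCEND_HOME_PATH": "${ASCEND_TOOLKIT}/",
    "HYDRA_FULL_ERROR": "1"
  }
}
EOF
)"

printf '%s\n' "${RUNTIME_ENV_JSON}"
```

The resulting string can be passed unchanged to `--runtime-env-json="${RUNTIME_ENV_JSON}"`, as the script already does.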
        "ASCEND_HOME_PATH": "/usr/local/Ascend/ascend-toolkit/latest/",
        "HYDRA_FULL_ERROR": "1",
        "RAY_DEBUG_POST_MORTEM_DISABLED": "1",
        "LD_LIBRARY_PATH": "/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/cann-8.5.2/lib64:'"$LD_LIBRARY_PATH"'"
      }
    }'

    cd /home/ma-user/vime
    ray job submit --address="http://127.0.0.1:8265" \
       --runtime-env-json="${RUNTIME_ENV_JSON}" \
       -- python3 /home/ma-user/vime/train.py \
       ${MODEL_ARGS[@]} \
       ${CKPT_ARGS[@]} \
       ${ROLLOUT_ARGS[@]} \
       ${OPTIMIZER_ARGS[@]} \
       ${GRPO_ARGS[@]} \
       ${PERF_ARGS[@]} \
       ${VLLM_ARGS[@]} \
       ${MISC_ARGS[@]}
Array expansions should be double-quoted (e.g., "${MODEL_ARGS[@]}") to prevent word splitting and glob expansion of individual elements. This ensures that arguments containing spaces or special characters are passed correctly to the python script.
Suggested change:
    - ${MODEL_ARGS[@]} \
    - ${CKPT_ARGS[@]} \
    - ${ROLLOUT_ARGS[@]} \
    - ${OPTIMIZER_ARGS[@]} \
    - ${GRPO_ARGS[@]} \
    - ${PERF_ARGS[@]} \
    - ${VLLM_ARGS[@]} \
    - ${MISC_ARGS[@]}
    + "${MODEL_ARGS[@]}" \
    + "${CKPT_ARGS[@]}" \
    + "${ROLLOUT_ARGS[@]}" \
    + "${OPTIMIZER_ARGS[@]}" \
    + "${GRPO_ARGS[@]}" \
    + "${PERF_ARGS[@]}" \
    + "${VLLM_ARGS[@]}" \
    + "${MISC_ARGS[@]}"
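The difference between the two array expansions shows up as soon as an element contains whitespace. The sketch below uses a made-up `DEMO_ARGS` array and a hypothetical dataset path with a space in it; neither appears in the PR.

```shell
#!/usr/bin/env bash
# Print how many arguments the function was called with.
count_args() { echo "$#"; }

# Hypothetical argument group: two elements, the second contains a space.
DEMO_ARGS=(--prompt-data "/data/my datasets/train.parquet")
EMPTY_ARGS=()

# Unquoted: each element is re-split on whitespace (and subject to globbing).
unquoted_count="$(count_args ${DEMO_ARGS[@]})"
# Quoted: each element becomes exactly one argument.
quoted_count="$(count_args "${DEMO_ARGS[@]}")"
# A quoted empty array expands to zero arguments, not to one empty string.
empty_count="$(count_args "${EMPTY_ARGS[@]}")"

echo "unquoted=${unquoted_count} quoted=${quoted_count} empty=${empty_count}"
# prints: unquoted=3 quoted=2 empty=0
```

The last case matters for a launch script like this one: an argument group that happens to be empty adds nothing to the `train.py` command line when expanded as `"${NAME[@]}"`.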
Summary

Add scripts/run-qwen3-32B-npu.sh, an end-to-end GRPO training example for Qwen3-32B on Ascend NPU (Atlas 800I A3).

This PR is scoped to the run script only. The Docker build and NPU dependency patches are already on the ascend branch (#163).

The script reuses the existing model config at scripts/models/qwen3-32B.sh and follows the same layout as other scripts under scripts/ (e.g. run-qwen3-4B.sh).

What the script covers

* ASCEND_RT_VISIBLE_DEVICES, HCCL port ranges, RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES, CANN/Ascend runtime env for ray job submit
* (deepscaler reward), TP=8 + sequence parallel, CPU optimizer offload
* --rollout-num-gpus-per-engine 2, --vllm-enforce-eager

Prerequisites

* docker/Dockerfile.npu
* --hf-checkpoint: HuggingFace Qwen3-32B weights
* --ref-load: Megatron torch_dist checkpoint

Paths to customize before running

* --hf-checkpoint: /data/local_models/Qwen3-32B
* --ref-load: /data/local_models/Qwen3-32B_torch_dist
* --prompt-data: /data/nfs_87/xky/datasets/gsm8k/train.parquet
* ASCEND_RT_VISIBLE_DEVICES: 0–15

Test plan

Related