[NPU] support qwen3-30b model.#189

Open
momo609 wants to merge 6 commits into npu from upstream_main

momo609 (Collaborator) commented Jun 8, 2026

Adds support for the Qwen3-30B model.

Training script:

export PYTHONPATH="/root/Megatron-Bridge/src:/root/Megatron-LM:$PYTHONPATH"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export CUDA_DEVICE_MAX_CONNECTIONS=1
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
export HYDRA_FULL_ERROR=1
export MASTER_PORT=$(shuf -i 20000-65000 -n 1)  # or any free port
export DISABLE_L2_CACHE=1
export VLLM_ASCEND_ENABLE_NZ=0

source /usr/local/CANN_9.0.0.B160/ascend-toolkit/set_env.sh
source /usr/local/CANN_9.0.0.B160/nnal/atb/set_env.sh
unset http_proxy
unset https_proxy

SCRIPT_DIR="/root/vime/scripts"
source "${SCRIPT_DIR}/models/qwen3-30B-A3B.sh"

python train.py \
  --train-backend megatron \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --rollout-num-gpus-per-engine 8 \
  "${MODEL_ARGS[@]}" \
  \
  --hf-checkpoint /root/weight/Qwen3-30B-A3B/ \
  \
  --prompt-data /root/dataset/dapo-math-17k/dapo-math-17k.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-shuffle \
  --rm-type math \
  \
  --rollout-backend vllm \
  --vllm-weight-sync-mode native \
  --vllm-enforce-eager \
  --vllm-enable-sleep-mode \
  --vllm-max-model-len 8192 \
  --vllm-enable-expert-parallel \
  --vllm-max-num-seqs 200 \
  \
  --num-rollout 200 \
  --rollout-batch-size 4 \
  --n-samples-per-prompt 8 \
  --rollout-max-response-len 2048 \
  --rollout-temperature 1.0 \
  --global-batch-size 32 \
  --balance-data \
  \
  --advantage-estimator grpo \
  --kl-loss-coef 0.0 \
  --kl-loss-type low_var_kl \
  --kl-coef 0.00 \
  --entropy-coef 0.0 \
  --eps-clip 0.2 \
  --eps-clip-high 0.28 \
  \
  --optimizer adam \
  --lr 1e-6 \
  --lr-decay-style constant \
  --weight-decay 0.1 \
  --adam-beta1 0.9 \
  --adam-beta2 0.98 \
  --optimizer-cpu-offload \
  --overlap-cpu-optimizer-d2h-h2d \
  --use-precision-aware-optimizer \
  \
  --tensor-model-parallel-size 4 \
  --sequence-parallel \
  --pipeline-model-parallel-size 1 \
  --context-parallel-size 1 \
  --expert-model-parallel-size 8 \
  --expert-tensor-parallel-size 1 \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --use-dynamic-batch-size \
  --max-tokens-per-gpu 8192 \
  --load /root/weight/Qwen3-30B-A3B/  \
  --megatron-to-hf-mode bridge  \
  \
  --attention-dropout 0.0 \
  --hidden-dropout 0.0 \
  --accumulate-allreduce-grads-in-fp32 \
  --attention-softmax-in-fp32 \
  --attention-backend flash \
  --micro-batch-size 1 \
  --use-flash-attn \
  \
  --train-memory-margin-bytes 2147483648

gemini-code-assist (bot) left a comment

Code Review

This pull request adds Ascend NPU support to the Vime repository, introducing installation documentation, patches for Megatron-LM, Megatron-Bridge, and MindSpeed, and integrating NPU/HCCL-specific logic across training and rollout components. The review feedback highlights several critical issues to address: top-level imports of vllm_ascend and hardcoded "hccl" defaults will break compatibility for GPU users; a typo in the patch filename causes inconsistencies with the README; hardcoded CANN paths limit portability; and potential runtime errors exist due to immediate re-raising of connection errors in flush_cache and direct attribute access in model_provider.py.

Comment on lines 18 to +19
from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine
from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine

critical

Importing vllm_ascend at the top level will cause a ModuleNotFoundError on GPU (CUDA) environments where vllm_ascend is not installed. Please remove this top-level import and import it locally inside the if is_npu(): block in update_weights_from_distributed (and other functions where needed) to ensure compatibility for both GPU and NPU users.

Suggested change
- from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine
- from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine
+ from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine

Comment on lines +507 to +511
if is_npu():
    HCCLWeightTransferEngine.trainer_send_weights(
        named_gpu_iter,
        HCCLTrainerSendWeightsArgs(group=group, packed=packed),
    )

critical

Import HCCLTrainerSendWeightsArgs and HCCLWeightTransferEngine locally here to prevent ModuleNotFoundError on GPU systems.

Suggested change
- if is_npu():
-     HCCLWeightTransferEngine.trainer_send_weights(
-         named_gpu_iter,
-         HCCLTrainerSendWeightsArgs(group=group, packed=packed),
-     )
+ if is_npu():
+     from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine
+     HCCLWeightTransferEngine.trainer_send_weights(
+         named_gpu_iter,
+         HCCLTrainerSendWeightsArgs(group=group, packed=packed),
+     )
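
The deferred import requested in the two comments above can also be wrapped in a small helper that resolves the engine class only when a backend is actually selected. This is a minimal sketch, not code from the PR: `is_npu` here is a hypothetical stand-in that checks whether `torch_npu` is importable, and `load_weight_transfer_engine` is an illustrative name. Only the module paths and class names are taken from the diff.

```python
import importlib
import importlib.util


def is_npu() -> bool:
    """Hypothetical stand-in: treat the host as an NPU host if torch_npu is importable."""
    return importlib.util.find_spec("torch_npu") is not None


def load_weight_transfer_engine(use_npu: bool):
    """Import the engine module lazily so GPU-only hosts never import vllm_ascend."""
    if use_npu:
        module_name = "vllm_ascend.distributed.weight_transfer.hccl_engine"
        class_name = "HCCLWeightTransferEngine"
    else:
        module_name = "vllm.distributed.weight_transfer.nccl_engine"
        class_name = "NCCLWeightTransferEngine"
    # The import happens here, at call time, not when this file is loaded.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Defining the helper has no import side effects, so a missing `vllm_ascend` package only surfaces as an `ImportError` if the NPU branch is taken.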

Comment thread vime/utils/arguments.py
)

- reset_arg(parser, "--distributed-backend", type=str, default="nccl")
+ reset_arg(parser, "--distributed-backend", type=str, default="hccl")

high

Changing the default value of --distributed-backend to "hccl" will break training for GPU (CUDA) users who rely on the default "nccl" backend. We should dynamically set the default backend based on whether an NPU is being used.

Suggested change
- reset_arg(parser, "--distributed-backend", type=str, default="hccl")
+ from vime.utils.common import is_npu
+ reset_arg(parser, "--distributed-backend", type=str, default="hccl" if is_npu() else "nccl")
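
The effect of a platform-dependent default can be illustrated with plain `argparse`. This is a sketch under stated assumptions: it does not use the repository's `reset_arg` helper, and `is_npu` is a hypothetical stand-in for `vime.utils.common.is_npu` that only checks whether `torch_npu` is importable.

```python
import argparse
import importlib.util


def is_npu() -> bool:
    """Hypothetical stand-in for vime.utils.common.is_npu."""
    return importlib.util.find_spec("torch_npu") is not None


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--distributed-backend",
        type=str,
        # The default is chosen once, when the parser is built.
        default="hccl" if is_npu() else "nccl",
    )
    return parser


# No flag given: the default follows the detected platform.
args = build_parser().parse_args([])
print(args.distributed_backend)

# An explicit flag always overrides the platform default.
explicit = build_parser().parse_args(["--distributed-backend", "gloo"])
print(explicit.distributed_backend)
```

Existing GPU launch scripts that omit the flag keep getting `nccl`, while NPU hosts get `hccl` without any script changes.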

Comment thread docker/npu_patch/README.md Outdated

```shell
cd <WORKSPACE>/Megatron-LM
git apply ../npu_patch/megatron_common.patch
```

medium

The patch filename in the git apply command (megatron_common.patch) does not match the actual file added in this pull request (meagtron_comm.patch). Please update the command to use the correct filename, or rename the patch file to megatron_common.patch to fix the typo.

Suggested change
- git apply ../npu_patch/megatron_common.patch
+ git apply ../npu_patch/meagtron_comm.patch

@@ -0,0 +1,781 @@
diff --git a/megatron/core/dist_checkpointing/strategies/common.py b/megatron/core/dist_checkpointing/strategies/common.py

medium

There is a typo in the filename of this patch file (meagtron_comm.patch instead of megatron_common.patch). Please rename this file to megatron_common.patch (or megatron_comm.patch) to fix the typo and ensure consistency with the installation guide in README.md.

Comment thread vime/backends/vllm_utils/vllm_engine.py Outdated
Comment on lines +94 to +97
raise RuntimeError(
    f"GPU id {physical_gpu_id} is not valid under CUDA_VISIBLE_DEVICES={cvd}. "
    f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)."
)

medium

In _to_local_gpu_id, when is_npu() is True, the environment variable used is ASCEND_RT_VISIBLE_DEVICES, but the error message still hardcodes CUDA_VISIBLE_DEVICES. Let's update the error message to dynamically reflect the correct environment variable name.

Suggested change
- raise RuntimeError(
-     f"GPU id {physical_gpu_id} is not valid under CUDA_VISIBLE_DEVICES={cvd}. "
-     f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)."
- )
+ env_var = "ASCEND_RT_VISIBLE_DEVICES" if is_npu() else "CUDA_VISIBLE_DEVICES"
+ raise RuntimeError(
+     f"GPU id {physical_gpu_id} is not valid under {env_var}={cvd}. "
+     f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)."
+ )
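
For context, the kind of physical-to-local id mapping this error message belongs to might look like the sketch below. The body of `_to_local_gpu_id` is not shown in this thread, so everything here is an assumption: the function name `to_local_device_id`, the integer-only device list, and the choice to try a physical-id match before accepting a local index.

```python
import os


def to_local_device_id(device_id: int, use_npu: bool) -> int:
    """Map a physical or local device id to a local index (illustrative only)."""
    env_var = "ASCEND_RT_VISIBLE_DEVICES" if use_npu else "CUDA_VISIBLE_DEVICES"
    raw = os.environ.get(env_var, "")
    if not raw:
        # No visibility restriction set: treat the id as already local.
        return device_id
    # Assumption: the variable holds comma-separated integer ids (no UUIDs).
    visible = [int(item) for item in raw.split(",") if item.strip()]
    if device_id in visible:
        return visible.index(device_id)  # physical id -> position in the list
    if 0 <= device_id < len(visible):
        return device_id  # already a valid local index
    raise RuntimeError(
        f"Device id {device_id} is not valid under {env_var}={raw}. "
        f"Expected one of {visible} (physical) or 0..{len(visible) - 1} (local)."
    )
```

Deriving `env_var` once at the top keeps the lookup and the error message in sync, which is the point of the review comment.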

Comment thread vime/backends/vllm_utils/vllm_engine.py Outdated
Comment on lines +849 to +854
except requests.ConnectionError:
    raise
except Exception as e:
    logger.info("Error resetting vLLM prefix cache: %s", e)
    time.sleep(1)
    continue

medium

In flush_cache, catching requests.ConnectionError and immediately re-raising it will cause the training run to crash on transient connection issues, completely bypassing the 60-attempt retry loop. Removing this explicit raise allows ConnectionError to be caught by the general except Exception as e: block and retried safely.

Suggested change
- except requests.ConnectionError:
-     raise
- except Exception as e:
-     logger.info("Error resetting vLLM prefix cache: %s", e)
-     time.sleep(1)
-     continue
+ except Exception as e:
+     logger.info("Error resetting vLLM prefix cache: %s", e)
+     time.sleep(1)
+     continue
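
The retry behaviour the comment asks for can be sketched as a standalone loop. This is illustrative rather than the PR's `flush_cache`: `retry_call` is a hypothetical helper, the built-in `ConnectionError` stands in for `requests.ConnectionError` in the usage below, and raising `TimeoutError` after the last attempt is an assumption, since the thread does not show what happens when all 60 attempts fail.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def retry_call(action: Callable[[], None], attempts: int = 60, delay_s: float = 1.0) -> None:
    """Run action until it succeeds, retrying on any exception, connection errors included."""
    last_error = None
    for _ in range(attempts):
        try:
            action()
            return
        except Exception as e:  # no special case that re-raises connection errors
            last_error = e
            logger.info("Error resetting vLLM prefix cache: %s", e)
            time.sleep(delay_s)
    # Assumption: surface a clear failure once the retry budget is spent.
    raise TimeoutError(f"gave up after {attempts} attempts") from last_error
```

With this shape, a server that refuses connections for the first few seconds after waking up is retried instead of aborting the training run.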

Comment on lines +234 to +237
"ASCEND_TOOLKIT_HOME": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
"ASCEND_OPP_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/",
"ASCEND_AICPU_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
"ASCEND_HOME_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",

medium

The CANN installation paths are hardcoded to /usr/local/CANN9.0.0/.... This makes the script non-portable for environments where CANN is installed in other standard paths (like /usr/local/Ascend/...) or custom directories. We should check the environment variables first and fall back to these paths if they are not set.

Suggested change
- "ASCEND_TOOLKIT_HOME": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
- "ASCEND_OPP_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/",
- "ASCEND_AICPU_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
- "ASCEND_HOME_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
+ "ASCEND_TOOLKIT_HOME": os.environ.get("ASCEND_TOOLKIT_HOME", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"),
+ "ASCEND_OPP_PATH": os.environ.get("ASCEND_OPP_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/"),
+ "ASCEND_AICPU_PATH": os.environ.get("ASCEND_AICPU_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"),
+ "ASCEND_HOME_PATH": os.environ.get("ASCEND_HOME_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"),

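The environment-first lookup for the CANN paths can also be written table-driven, so the fallback root appears only once. This is a sketch: `cann_env` and `DEFAULT_CANN_ROOT` are hypothetical names, and the fallback paths are copied from the lines under review rather than verified against any installation.

```python
import os

# Fallback root copied from the reviewed lines; an assumption about the install layout.
DEFAULT_CANN_ROOT = "/usr/local/CANN9.0.0/ascend-toolkit/latest/"


def cann_env(environ=None) -> dict:
    """Return the CANN variables, preferring values already set in the environment."""
    env = os.environ if environ is None else environ
    defaults = {
        "ASCEND_TOOLKIT_HOME": DEFAULT_CANN_ROOT,
        "ASCEND_OPP_PATH": DEFAULT_CANN_ROOT + "opp/",
        "ASCEND_AICPU_PATH": DEFAULT_CANN_ROOT,
        "ASCEND_HOME_PATH": DEFAULT_CANN_ROOT,
    }
    return {name: env.get(name, fallback) for name, fallback in defaults.items()}
```

Passing a mapping explicitly, for example `cann_env({"ASCEND_HOME_PATH": "/usr/local/Ascend/ascend-toolkit/latest/"})`, makes the override behaviour easy to check without touching the real environment.
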
Comment on lines +110 to +114
provider.recompute_granularity = args.recompute_granularity
provider.recompute_method = args.recompute_method
provider.recompute_num_layers = args.recompute_num_layers
provider.moe_permute_fusion = args.moe_permute_fusion
provider.moe_aux_loss_coeff = args.moe_aux_loss_coeff

medium

Directly accessing attributes like args.recompute_granularity can raise an AttributeError if they are not defined in the parsed arguments namespace. Using getattr with a default value of None is safer and more robust.

Suggested change
- provider.recompute_granularity = args.recompute_granularity
- provider.recompute_method = args.recompute_method
- provider.recompute_num_layers = args.recompute_num_layers
- provider.moe_permute_fusion = args.moe_permute_fusion
- provider.moe_aux_loss_coeff = args.moe_aux_loss_coeff
+ provider.recompute_granularity = getattr(args, "recompute_granularity", None)
+ provider.recompute_method = getattr(args, "recompute_method", None)
+ provider.recompute_num_layers = getattr(args, "recompute_num_layers", None)
+ provider.moe_permute_fusion = getattr(args, "moe_permute_fusion", None)
+ provider.moe_aux_loss_coeff = getattr(args, "moe_aux_loss_coeff", None)
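
One caveat with a `None` default is that it overwrites whatever default the provider already carries. A variant that copies a field only when the namespace defines it is sketched below. This is not what the suggestion does, and `copy_optional_fields` and `OPTIONAL_FIELDS` are hypothetical names. `SimpleNamespace` stands in for the real provider object.

```python
from argparse import Namespace
from types import SimpleNamespace

OPTIONAL_FIELDS = (
    "recompute_granularity",
    "recompute_method",
    "recompute_num_layers",
    "moe_permute_fusion",
    "moe_aux_loss_coeff",
)


def copy_optional_fields(args: Namespace, provider) -> None:
    """Copy each field only if args defines it; otherwise keep the provider's own value."""
    for name in OPTIONAL_FIELDS:
        if hasattr(args, name):
            setattr(provider, name, getattr(args, name))


# Stand-in objects: args defines one field, the provider has its own defaults.
args = Namespace(recompute_granularity="full")
provider = SimpleNamespace(recompute_granularity=None, moe_aux_loss_coeff=0.0)
copy_optional_fields(args, provider)
```

After the call, `provider.recompute_granularity` is `"full"` and `provider.moe_aux_loss_coeff` is still `0.0`, because `args` never defined it.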

momo609 force-pushed the upstream_main branch 2 times, most recently from f55df8e to 548c8d8 (June 8, 2026 13:21)
aoshen02 changed the title from "support qwen3-30b model." to "[NPU] support qwen3-30b model." (Jun 8, 2026)
momo609 and others added 5 commits June 9, 2026 21:47
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
momo609 force-pushed the upstream_main branch 3 times, most recently from fcd6766 to b20d8ea (June 10, 2026 03:06)
support ipcweighttransfer.

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>