[NPU] support qwen3-30b model. by momo609 · Pull Request #189 · vllm-project/vime

momo609 · 2026-06-08T10:50:03Z

Support the Qwen3-30B-A3B model on NPU.

train script:

export PYTHONPATH="/root/Megatron-Bridge/src:/root/Megatron-LM:$PYTHONPATH"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export CUDA_DEVICE_MAX_CONNECTIONS=1
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
export HYDRA_FULL_ERROR=1
export MASTER_PORT=$(shuf -i 20000-65000 -n 1)  # or any free port
export DISABLE_L2_CACHE=1
export VLLM_ASCEND_ENABLE_NZ=0

source /usr/local/CANN_9.0.0.B160/ascend-toolkit/set_env.sh
source /usr/local/CANN_9.0.0.B160/nnal/atb/set_env.sh
unset http_proxy
unset https_proxy

SCRIPT_DIR="/root/vime/scripts"
source "${SCRIPT_DIR}/models/qwen3-30B-A3B.sh"

python train.py \
  --train-backend megatron \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --rollout-num-gpus-per-engine 8 \
  "${MODEL_ARGS[@]}" \
  \
  --hf-checkpoint /root/weight/Qwen3-30B-A3B/ \
  \
  --prompt-data /root/dataset/dapo-math-17k/dapo-math-17k.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-shuffle \
  --rm-type math \
  \
  --rollout-backend vllm \
  --vllm-weight-sync-mode native \
  --vllm-enforce-eager \
  --vllm-enable-sleep-mode \
  --vllm-max-model-len 8192 \
  --vllm-enable-expert-parallel \
  --vllm-max-num-seqs 200 \
  \
  --num-rollout 200 \
  --rollout-batch-size 4 \
  --n-samples-per-prompt 8 \
  --rollout-max-response-len 2048 \
  --rollout-temperature 1.0 \
  --global-batch-size 32 \
  --balance-data \
  \
  --advantage-estimator grpo \
  --kl-loss-coef 0.0 \
  --kl-loss-type low_var_kl \
  --kl-coef 0.00 \
  --entropy-coef 0.0 \
  --eps-clip 0.2 \
  --eps-clip-high 0.28 \
  \
  --optimizer adam \
  --lr 1e-6 \
  --lr-decay-style constant \
  --weight-decay 0.1 \
  --adam-beta1 0.9 \
  --adam-beta2 0.98 \
  --optimizer-cpu-offload \
  --overlap-cpu-optimizer-d2h-h2d \
  --use-precision-aware-optimizer \
  \
  --tensor-model-parallel-size 4 \
  --sequence-parallel \
  --pipeline-model-parallel-size 1 \
  --context-parallel-size 1 \
  --expert-model-parallel-size 8 \
  --expert-tensor-parallel-size 1 \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --use-dynamic-batch-size \
  --max-tokens-per-gpu 8192 \
  --load /root/weight/Qwen3-30B-A3B/  \
  --megatron-to-hf-mode bridge  \
  \
  --attention-dropout 0.0 \
  --hidden-dropout 0.0 \
  --accumulate-allreduce-grads-in-fp32 \
  --attention-softmax-in-fp32 \
  --attention-backend flash \
  --micro-batch-size 1 \
  --use-flash-attn \
  \
  --train-memory-margin-bytes 2147483648
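
The parallelism and batch flags in the command above have to be mutually consistent. The check below is a minimal sketch, assuming the usual Megatron-style layout in which the actor world size factors as TP x PP x CP x DP for dense layers and ETP x EP x PP x expert-DP for MoE layers. The variable names are illustrative and are not part of vime.

```python
# Values copied from the training command above.
actor_world_size = 1 * 8          # --actor-num-nodes * --actor-num-gpus-per-node
tp, pp, cp = 4, 1, 1              # tensor / pipeline / context parallel sizes
ep, etp = 8, 1                    # expert parallel / expert tensor parallel sizes
rollout_batch_size = 4            # prompts per rollout
n_samples_per_prompt = 8
global_batch_size = 32
memory_margin_bytes = 2147483648  # --train-memory-margin-bytes

# Dense layers: the world size must be divisible by TP * PP * CP.
dense_model_parallel = tp * pp * cp
assert actor_world_size % dense_model_parallel == 0
data_parallel = actor_world_size // dense_model_parallel

# MoE layers (assumed layout): the world size must be divisible by ETP * EP * PP.
expert_model_parallel = etp * ep * pp
assert actor_world_size % expert_model_parallel == 0
expert_data_parallel = actor_world_size // expert_model_parallel

# Each rollout yields prompts * samples-per-prompt training samples.
samples_per_rollout = rollout_batch_size * n_samples_per_prompt
margin_gib = memory_margin_bytes / 1024**3

print(f"data parallel size:        {data_parallel}")
print(f"expert data parallel size: {expert_data_parallel}")
print(f"samples per rollout:       {samples_per_rollout}")
print(f"memory margin (GiB):       {margin_gib}")
```

With these values each rollout yields 32 samples, which matches `--global-batch-size 32`, and the memory margin works out to 2 GiB.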

gemini-code-assist

Code Review

This pull request adds Ascend NPU support to the Vime repository, introducing installation documentation, patches for Megatron-LM, Megatron-Bridge, and MindSpeed, and integrating NPU/HCCL-specific logic across training and rollout components. The review feedback highlights several critical issues to address: top-level imports of vllm_ascend and hardcoded "hccl" defaults will break compatibility for GPU users; a typo in the patch filename causes inconsistencies with the README; hardcoded CANN paths limit portability; and potential runtime errors exist due to immediate re-raising of connection errors in flush_cache and direct attribute access in model_provider.py.

gemini-code-assist · 2026-06-08T10:52:25Z

 from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine
+from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine


Importing vllm_ascend at the top level will cause a ModuleNotFoundError on GPU (CUDA) environments where vllm_ascend is not installed. Please remove this top-level import and import it locally inside the if is_npu(): block in update_weights_from_distributed (and other functions where needed) to ensure compatibility for both GPU and NPU users.

Suggested change

 from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine
-from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine
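
The deferred-import pattern the comment asks for can be sketched as follows. This is an illustration rather than the PR's code: `load_attrs` and `pick_transfer_engine` are hypothetical helpers, and only the module paths and class names are taken from the diff above.

```python
import importlib


def load_attrs(module_name, attr_names):
    """Import `module_name` on first use and return the requested attributes."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            f"{module_name!r} is required for this code path but is not installed"
        ) from exc
    return tuple(getattr(module, name) for name in attr_names)


def pick_transfer_engine(on_npu):
    """Return (args_cls, engine_cls) for the active accelerator."""
    if on_npu:
        # Only NPU hosts reach this import, so CUDA hosts never need vllm_ascend.
        return load_attrs(
            "vllm_ascend.distributed.weight_transfer.hccl_engine",
            ("HCCLTrainerSendWeightsArgs", "HCCLWeightTransferEngine"),
        )
    return load_attrs(
        "vllm.distributed.weight_transfer.nccl_engine",
        ("NCCLTrainerSendWeightsArgs", "NCCLWeightTransferEngine"),
    )
```

Because the import happens inside the function, importing the enclosing module succeeds on any host. A missing backend only surfaces when its branch is actually taken.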

gemini-code-assist · 2026-06-08T10:52:25Z

+    if is_npu():
+        HCCLWeightTransferEngine.trainer_send_weights(
+            named_gpu_iter,
+            HCCLTrainerSendWeightsArgs(group=group, packed=packed),
+        )


Import HCCLTrainerSendWeightsArgs and HCCLWeightTransferEngine locally here to prevent ModuleNotFoundError on GPU systems.

Suggested change

     if is_npu():
+        from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine
         HCCLWeightTransferEngine.trainer_send_weights(
             named_gpu_iter,
             HCCLTrainerSendWeightsArgs(group=group, packed=packed),
         )

gemini-code-assist · 2026-06-08T10:52:25Z

            )

-            reset_arg(parser, "--distributed-backend", type=str, default="nccl")
+            reset_arg(parser, "--distributed-backend", type=str, default="hccl")


Changing the default value of --distributed-backend to "hccl" will break training for GPU (CUDA) users who rely on the default "nccl" backend. We should dynamically set the default backend based on whether an NPU is being used.

Suggested change

-            reset_arg(parser, "--distributed-backend", type=str, default="hccl")
+            from vime.utils.common import is_npu
+            reset_arg(parser, "--distributed-backend", type=str, default="hccl" if is_npu() else "nccl")
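
A standalone illustration of a platform-dependent default, using plain `argparse` instead of vime's `reset_arg`. The `is_npu` stand-in below is an assumption for the sketch (it treats an importable `torch_npu` package as the signal) and may differ from the real `vime.utils.common.is_npu`.

```python
import argparse
import importlib.util


def is_npu():
    """Hypothetical detection: assume an NPU host if torch_npu is importable."""
    return importlib.util.find_spec("torch_npu") is not None


def build_parser(on_npu=None):
    """Build a parser whose backend default follows the detected platform."""
    if on_npu is None:
        on_npu = is_npu()
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--distributed-backend",
        type=str,
        default="hccl" if on_npu else "nccl",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args([]).distributed_backend)
```

An explicit `--distributed-backend` on the command line still overrides the computed default on either platform.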

gemini-code-assist · 2026-06-08T10:52:25Z

+
+```shell
+cd <WORKSPACE>/Megatron-LM
+git apply ../npu_patch/megatron_common.patch


The patch filename in the git apply command (megatron_common.patch) does not match the actual file added in this pull request (meagtron_comm.patch). Please update the command to use the correct filename, or rename the patch file to megatron_common.patch to fix the typo.

Suggested change

-git apply ../npu_patch/megatron_common.patch
+git apply ../npu_patch/meagtron_comm.patch

gemini-code-assist · 2026-06-08T10:52:25Z

@@ -0,0 +1,781 @@
+diff --git a/megatron/core/dist_checkpointing/strategies/common.py b/megatron/core/dist_checkpointing/strategies/common.py


There is a typo in the filename of this patch file (meagtron_comm.patch instead of megatron_common.patch). Please rename this file to megatron_common.patch (or megatron_comm.patch) to fix the typo and ensure consistency with the installation guide in README.md.

gemini-code-assist · 2026-06-08T10:52:25Z

+    raise RuntimeError(
+        f"GPU id {physical_gpu_id} is not valid under CUDA_VISIBLE_DEVICES={cvd}. "
+        f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)."
+    )


In _to_local_gpu_id, when is_npu() is True, the environment variable used is ASCEND_RT_VISIBLE_DEVICES, but the error message still hardcodes CUDA_VISIBLE_DEVICES. Let's update the error message to dynamically reflect the correct environment variable name.

Suggested change

-    raise RuntimeError(
-        f"GPU id {physical_gpu_id} is not valid under CUDA_VISIBLE_DEVICES={cvd}. "
-        f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)."
-    )
+    env_var = "ASCEND_RT_VISIBLE_DEVICES" if is_npu() else "CUDA_VISIBLE_DEVICES"
+    raise RuntimeError(
+        f"GPU id {physical_gpu_id} is not valid under {env_var}={cvd}. "
+        f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)."
+    )
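
For context, a physical-to-local device id mapping with a platform-aware error message might look like the sketch below. The body of `_to_local_gpu_id` is not shown in this review, so `to_local_device_id` is a hypothetical reconstruction that only mirrors the behaviour implied by the error text.

```python
def to_local_device_id(physical_id, environ, on_npu):
    """Map a physical device id to its index among the visible devices."""
    env_var = "ASCEND_RT_VISIBLE_DEVICES" if on_npu else "CUDA_VISIBLE_DEVICES"
    raw = environ.get(env_var, "")
    visible = [int(tok) for tok in raw.split(",") if tok.strip()]
    if not visible:
        # No restriction is set, so physical and local ids coincide.
        return physical_id
    if physical_id in visible:
        return visible.index(physical_id)
    if 0 <= physical_id < len(visible):
        # The caller already passed a local index.
        return physical_id
    raise RuntimeError(
        f"Device id {physical_id} is not valid under {env_var}={raw}. "
        f"Expected one of {visible} (physical) or 0..{len(visible) - 1} (local)."
    )


if __name__ == "__main__":
    env = {"ASCEND_RT_VISIBLE_DEVICES": "8,9,10,11"}
    print(to_local_device_id(10, env, on_npu=True))
```

Passing the environment as a mapping keeps the function easy to test. Production code would typically pass `os.environ`.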

gemini-code-assist · 2026-06-08T10:52:25Z

+            except requests.ConnectionError:
+                raise
+            except Exception as e:
+                logger.info("Error resetting vLLM prefix cache: %s", e)
+                time.sleep(1)
+                continue


In flush_cache, catching requests.ConnectionError and immediately re-raising it will cause the training run to crash on transient connection issues, completely bypassing the 60-attempt retry loop. Removing this explicit raise allows ConnectionError to be caught by the general except Exception as e: block and retried safely.

Suggested change

-            except requests.ConnectionError:
-                raise
             except Exception as e:
                 logger.info("Error resetting vLLM prefix cache: %s", e)
                 time.sleep(1)
                 continue
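
The retry behaviour the comment argues for can be sketched in isolation. `call_with_retry` is a hypothetical helper, not vime's `flush_cache`, and the built-in `ConnectionError` stands in for `requests.ConnectionError` so the example needs only the standard library.

```python
import logging
import time

logger = logging.getLogger(__name__)


def call_with_retry(call, attempts=60, delay=1.0, sleep=time.sleep):
    """Run `call` until it succeeds, retrying on any exception."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:  # connection errors are retried like any other
            last_error = exc
            logger.info("Attempt %d/%d failed: %s", attempt, attempts, exc)
            sleep(delay)
    raise RuntimeError(f"Gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("server not ready")
        return "ok"

    print(call_with_retry(flaky, attempts=5, delay=0.0))
```

The loop still fails loudly once the attempt budget is exhausted, so a server that never comes up is reported rather than retried forever.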

gemini-code-assist · 2026-06-08T10:52:25Z

+                "ASCEND_TOOLKIT_HOME": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
+                "ASCEND_OPP_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/",
+                "ASCEND_AICPU_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
+                "ASCEND_HOME_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",


The CANN installation paths are hardcoded to /usr/local/CANN9.0.0/.... This makes the script non-portable for environments where CANN is installed in other standard paths (like /usr/local/Ascend/...) or custom directories. We should check the environment variables first and fall back to these paths if they are not set.
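
The environment-first lookup described here can be sketched as a small helper. `ascend_env` and `DEFAULT_TOOLKIT` are hypothetical names; the fallback paths are the ones hardcoded in the diff above.

```python
import os

# Fallback taken from the hardcoded paths in the diff.
DEFAULT_TOOLKIT = "/usr/local/CANN9.0.0/ascend-toolkit/latest/"


def ascend_env(environ=None):
    """Return the Ascend variables, preferring values already set by the user."""
    environ = os.environ if environ is None else environ
    defaults = {
        "ASCEND_TOOLKIT_HOME": DEFAULT_TOOLKIT,
        "ASCEND_OPP_PATH": DEFAULT_TOOLKIT + "opp/",
        "ASCEND_AICPU_PATH": DEFAULT_TOOLKIT,
        "ASCEND_HOME_PATH": DEFAULT_TOOLKIT,
    }
    return {name: environ.get(name, fallback) for name, fallback in defaults.items()}


if __name__ == "__main__":
    for name, value in ascend_env().items():
        print(f"{name}={value}")
```

A host that has already sourced its own `set_env.sh` keeps its values, and only unset variables fall back to the defaults.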

Suggested change

-                "ASCEND_TOOLKIT_HOME": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
-                "ASCEND_OPP_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/",
-                "ASCEND_AICPU_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
-                "ASCEND_HOME_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/",
+                "ASCEND_TOOLKIT_HOME": os.environ.get("ASCEND_TOOLKIT_HOME", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"),
+                "ASCEND_OPP_PATH": os.environ.get("ASCEND_OPP_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/"),
+                "ASCEND_AICPU_PATH": os.environ.get("ASCEND_AICPU_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"),
+                "ASCEND_HOME_PATH": os.environ.get("ASCEND_HOME_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"),

gemini-code-assist · 2026-06-08T10:52:26Z

+        provider.recompute_granularity = args.recompute_granularity
+        provider.recompute_method = args.recompute_method
+        provider.recompute_num_layers = args.recompute_num_layers
+        provider.moe_permute_fusion = args.moe_permute_fusion
+        provider.moe_aux_loss_coeff = args.moe_aux_loss_coeff


Directly accessing attributes like args.recompute_granularity can raise an AttributeError if they are not defined in the parsed arguments namespace. Using getattr with a default value of None is safer and more robust.

Suggested change

-        provider.recompute_granularity = args.recompute_granularity
-        provider.recompute_method = args.recompute_method
-        provider.recompute_num_layers = args.recompute_num_layers
-        provider.moe_permute_fusion = args.moe_permute_fusion
-        provider.moe_aux_loss_coeff = args.moe_aux_loss_coeff
+        provider.recompute_granularity = getattr(args, "recompute_granularity", None)
+        provider.recompute_method = getattr(args, "recompute_method", None)
+        provider.recompute_num_layers = getattr(args, "recompute_num_layers", None)
+        provider.moe_permute_fusion = getattr(args, "moe_permute_fusion", None)
+        provider.moe_aux_loss_coeff = getattr(args, "moe_aux_loss_coeff", None)
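
A self-contained illustration of the `getattr` approach, with `argparse.Namespace` standing in for the parsed arguments and `SimpleNamespace` standing in for the provider object. `copy_optional_args` is a hypothetical helper, not code from `model_provider.py`.

```python
from argparse import Namespace
from types import SimpleNamespace

OPTIONAL_FIELDS = (
    "recompute_granularity",
    "recompute_method",
    "recompute_num_layers",
    "moe_permute_fusion",
    "moe_aux_loss_coeff",
)


def copy_optional_args(args, provider, fields=OPTIONAL_FIELDS):
    """Copy each field onto the provider, using None when args lacks it."""
    for name in fields:
        setattr(provider, name, getattr(args, name, None))
    return provider


if __name__ == "__main__":
    # The two MoE fields are deliberately missing from this namespace.
    args = Namespace(
        recompute_granularity="full",
        recompute_method="uniform",
        recompute_num_layers=1,
    )
    print(copy_optional_args(args, SimpleNamespace()))
```

One trade-off to keep in mind: writing `None` replaces whatever default the provider already had. If the provider's own defaults should win, skipping missing attributes instead of assigning `None` may be the better choice.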

momo609 force-pushed the upstream_main branch from 4815af0 to 95c4567 Compare June 8, 2026 10:52

gemini-code-assist Bot reviewed Jun 8, 2026

momo609 force-pushed the upstream_main branch 2 times, most recently from f55df8e to 548c8d8 Compare June 8, 2026 13:21

aoshen02 changed the title ~~support qwen3-30b model.~~ [NPU] support qwen3-30b model. Jun 8, 2026

This was referenced Jun 9, 2026

[NPU] Steps and Test Result for Running Qwen3-30B on NPU(A3) #205

Open

[RFC] NPU support roadmap. #51

Open

momo609 and others added 5 commits June 9, 2026 21:47

add npu support. (#179)

e71ab3a

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

fix npu_4b. (#194)

682ce56

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

fix vllm_engine bug.

465f8d6

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

[NPU]add Ascend A3 Dockerfile (#204)

d2ec346

Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>

[NPU] Add Ascend NPU setup guide and Qwen3-4B training script (#212)

3182a34

Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>

Meihan-chen force-pushed the npu branch from 72d519b to 3182a34 Compare June 9, 2026 13:47

momo609 force-pushed the upstream_main branch 3 times, most recently from fcd6766 to b20d8ea Compare June 10, 2026 03:06

support qwen3-30b.

f00d783

support ipcweighttransfer. Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

momo609 force-pushed the upstream_main branch from e6cbe1b to f00d783 Compare June 10, 2026 03:23

Meihan-chen force-pushed the npu branch from 45a82a4 to 00ccc21 Compare June 17, 2026 09:24

momo609 wants to merge 6 commits into npu from upstream_main

2 participants
