[NPU] support qwen3-30b model.#189
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds Ascend NPU support to the Vime repository, introducing installation documentation, patches for Megatron-LM, Megatron-Bridge, and MindSpeed, and integrating NPU/HCCL-specific logic across training and rollout components. The review feedback highlights several critical issues to address: top-level imports of vllm_ascend and hardcoded "hccl" defaults will break compatibility for GPU users; a typo in the patch filename causes inconsistencies with the README; hardcoded CANN paths limit portability; and potential runtime errors exist due to immediate re-raising of connection errors in flush_cache and direct attribute access in model_provider.py.
| from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine | ||
| from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine |
There was a problem hiding this comment.
Importing vllm_ascend at the top level will cause a ModuleNotFoundError on GPU (CUDA) environments where vllm_ascend is not installed. Please remove this top-level import and import it locally inside the if is_npu(): block in update_weights_from_distributed (and other functions where needed) to ensure compatibility for both GPU and NPU users.
| from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine | |
| from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine | |
| from vllm.distributed.weight_transfer.nccl_engine import NCCLTrainerSendWeightsArgs, NCCLWeightTransferEngine |
| if is_npu(): | ||
| HCCLWeightTransferEngine.trainer_send_weights( | ||
| named_gpu_iter, | ||
| HCCLTrainerSendWeightsArgs(group=group, packed=packed), | ||
| ) |
There was a problem hiding this comment.
Import HCCLTrainerSendWeightsArgs and HCCLWeightTransferEngine locally here to prevent ModuleNotFoundError on GPU systems.
| if is_npu(): | |
| HCCLWeightTransferEngine.trainer_send_weights( | |
| named_gpu_iter, | |
| HCCLTrainerSendWeightsArgs(group=group, packed=packed), | |
| ) | |
| if is_npu(): | |
| from vllm_ascend.distributed.weight_transfer.hccl_engine import HCCLTrainerSendWeightsArgs, HCCLWeightTransferEngine | |
| HCCLWeightTransferEngine.trainer_send_weights( | |
| named_gpu_iter, | |
| HCCLTrainerSendWeightsArgs(group=group, packed=packed), | |
| ) |
| ) | ||
|
|
||
| reset_arg(parser, "--distributed-backend", type=str, default="nccl") | ||
| reset_arg(parser, "--distributed-backend", type=str, default="hccl") |
There was a problem hiding this comment.
Changing the default value of --distributed-backend to "hccl" will break training for GPU (CUDA) users who rely on the default "nccl" backend. We should dynamically set the default backend based on whether an NPU is being used.
| reset_arg(parser, "--distributed-backend", type=str, default="hccl") | |
| from vime.utils.common import is_npu | |
| reset_arg(parser, "--distributed-backend", type=str, default="hccl" if is_npu() else "nccl") |
|
|
||
| ```shell | ||
| cd <WORKSPACE>/Megatron-LM | ||
| git apply ../npu_patch/megatron_common.patch |
There was a problem hiding this comment.
The patch filename in the git apply command (megatron_common.patch) does not match the actual file added in this pull request (meagtron_comm.patch). Please update the command to use the correct filename, or rename the patch file to megatron_common.patch to fix the typo.
| git apply ../npu_patch/megatron_common.patch | |
| git apply ../npu_patch/meagtron_comm.patch |
| @@ -0,0 +1,781 @@ | |||
| diff --git a/megatron/core/dist_checkpointing/strategies/common.py b/megatron/core/dist_checkpointing/strategies/common.py | |||
There was a problem hiding this comment.
| raise RuntimeError( | ||
| f"GPU id {physical_gpu_id} is not valid under CUDA_VISIBLE_DEVICES={cvd}. " | ||
| f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)." | ||
| ) |
There was a problem hiding this comment.
In _to_local_gpu_id, when is_npu() is True, the environment variable used is ASCEND_RT_VISIBLE_DEVICES, but the error message still hardcodes CUDA_VISIBLE_DEVICES. Let's update the error message to dynamically reflect the correct environment variable name.
| raise RuntimeError( | |
| f"GPU id {physical_gpu_id} is not valid under CUDA_VISIBLE_DEVICES={cvd}. " | |
| f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)." | |
| ) | |
| env_var = "ASCEND_RT_VISIBLE_DEVICES" if is_npu() else "CUDA_VISIBLE_DEVICES" | |
| raise RuntimeError( | |
| f"GPU id {physical_gpu_id} is not valid under {env_var}={cvd}. " | |
| f"Expected one of {visible} (physical) or 0..{len(visible)-1} (local)." | |
| ) |
| except requests.ConnectionError: | ||
| raise | ||
| except Exception as e: | ||
| logger.info("Error resetting vLLM prefix cache: %s", e) | ||
| time.sleep(1) | ||
| continue |
There was a problem hiding this comment.
In flush_cache, catching requests.ConnectionError and immediately re-raising it will cause the training run to crash on transient connection issues, completely bypassing the 60-attempt retry loop. Removing this explicit raise allows ConnectionError to be caught by the general except Exception as e: block and retried safely.
| except requests.ConnectionError: | |
| raise | |
| except Exception as e: | |
| logger.info("Error resetting vLLM prefix cache: %s", e) | |
| time.sleep(1) | |
| continue | |
| except Exception as e: | |
| logger.info("Error resetting vLLM prefix cache: %s", e) | |
| time.sleep(1) | |
| continue |
| "ASCEND_TOOLKIT_HOME": "/usr/local/CANN9.0.0/ascend-toolkit/latest/", | ||
| "ASCEND_OPP_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/", | ||
| "ASCEND_AICPU_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/", | ||
| "ASCEND_HOME_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/", |
There was a problem hiding this comment.
The CANN installation paths are hardcoded to /usr/local/CANN9.0.0/.... This makes the script non-portable for environments where CANN is installed in other standard paths (like /usr/local/Ascend/...) or custom directories. We should check the environment variables first and fall back to these paths if they are not set.
| "ASCEND_TOOLKIT_HOME": "/usr/local/CANN9.0.0/ascend-toolkit/latest/", | |
| "ASCEND_OPP_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/", | |
| "ASCEND_AICPU_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/", | |
| "ASCEND_HOME_PATH": "/usr/local/CANN9.0.0/ascend-toolkit/latest/", | |
| "ASCEND_TOOLKIT_HOME": os.environ.get("ASCEND_TOOLKIT_HOME", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"), | |
| "ASCEND_OPP_PATH": os.environ.get("ASCEND_OPP_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/opp/"), | |
| "ASCEND_AICPU_PATH": os.environ.get("ASCEND_AICPU_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"), | |
| "ASCEND_HOME_PATH": os.environ.get("ASCEND_HOME_PATH", "/usr/local/CANN9.0.0/ascend-toolkit/latest/"), |
| provider.recompute_granularity = args.recompute_granularity | ||
| provider.recompute_method = args.recompute_method | ||
| provider.recompute_num_layers = args.recompute_num_layers | ||
| provider.moe_permute_fusion = args.moe_permute_fusion | ||
| provider.moe_aux_loss_coeff = args.moe_aux_loss_coeff |
There was a problem hiding this comment.
Directly accessing attributes like args.recompute_granularity can raise an AttributeError if they are not defined in the parsed arguments namespace. Using getattr with a default value of None is safer and more robust.
| provider.recompute_granularity = args.recompute_granularity | |
| provider.recompute_method = args.recompute_method | |
| provider.recompute_num_layers = args.recompute_num_layers | |
| provider.moe_permute_fusion = args.moe_permute_fusion | |
| provider.moe_aux_loss_coeff = args.moe_aux_loss_coeff | |
| provider.recompute_granularity = getattr(args, "recompute_granularity", None) | |
| provider.recompute_method = getattr(args, "recompute_method", None) | |
| provider.recompute_num_layers = getattr(args, "recompute_num_layers", None) | |
| provider.moe_permute_fusion = getattr(args, "moe_permute_fusion", None) | |
| provider.moe_aux_loss_coeff = getattr(args, "moe_aux_loss_coeff", None) |
f55df8e to
548c8d8
Compare
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
fcd6766 to
b20d8ea
Compare
support ipcweighttransfer. Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
support qwen3-30b model.
train script: