Skip to content

feat(npu): Enhance Dockerfile.npu with verified vllm commit and run script#287

Open
miracle0517 wants to merge 2 commits into
vllm-project:ascendfrom
miracle0517:feat/ascend-dockerfile-verified-vllm
Open

feat(npu): Enhance Dockerfile.npu with verified vllm commit and run script#287
miracle0517 wants to merge 2 commits into
vllm-project:ascendfrom
miracle0517:feat/ascend-dockerfile-verified-vllm

Conversation

@miracle0517

@miracle0517 miracle0517 commented Jun 23, 2026

Copy link
Copy Markdown
  • Add compatibility handling to Dockerfile.npu for broader environment support.
  • Add a training script that uses Ray job submit to launch distributed training jobs.

This improves the build robustness on NPU platforms and provides a unified entry point for job scheduling via Ray.

Verification with qwen3-4b image on A3 machine:
image

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new script run-qwen3-4B-npu-ray.sh to run Qwen3-4B training on NPU using Ray, and updates Dockerfile.npu to refine the installation of vLLM and its Ascend backend. Feedback on the changes includes a critical fix for a missing file operand in a sed command in the Dockerfile, and a security concern regarding globally disabling SSL verification. For the new script, recommendations include using pkill for cleaner process termination, avoiding hardcoded script directories to maintain portability, and adding the --wait flag to the Ray job submission to stream logs synchronously.

Comment thread docker/Dockerfile.npu Outdated
# System and pip dependencies.
RUN sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' \
/etc/apt/sources.list && \
RUN sed -i '/ports\.ubuntu\.com/ {h;s|ports\.ubuntu\.com|mirrors.tuna.tsinghua.edu.cn|g;G}' && \

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The sed command is missing the target file operand /etc/apt/sources.list. This will cause the Docker build to fail with a sed: no input files error.

RUN sed -i '/ports\.ubuntu\.com/ {h;s|ports\.ubuntu\.com|mirrors.tuna.tsinghua.edu.cn|g;G}' /etc/apt/sources.list && \

# cleanup
pkill -9 -f "vllm serve" 2>/dev/null || true
sleep 2
npu-smi info 2>/dev/null | grep rayWorker | awk '{print $4}' | xargs -r kill -9 2>/dev/null || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The awk '{print $4}' command extracts the process name (or type) instead of the PID from the npu-smi info output, which will cause the subsequent kill -9 command to fail since it expects a numeric PID. Additionally, npu-smi info output format can vary. A more robust and standard way to clean up the Ray worker processes is to use pkill -9 -f rayWorker.

Suggested change
npu-smi info 2>/dev/null | grep rayWorker | awk '{print $4}' | xargs -r kill -9 2>/dev/null || true
pkill -9 -f rayWorker 2>/dev/null || true

Comment thread docker/Dockerfile.npu
# builds use the same files as the build context (not the later git-cloned vime tree).
COPY docker/npu_patch /tmp/npu_patch

RUN git config --global http.sslVerify false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Disabling SSL verification globally (git config --global http.sslVerify false) is a security risk as it exposes the build process to Man-in-the-Middle (MITM) attacks when cloning repositories. Since all the cloned repositories are from public hosts like GitHub, SSL verification should remain enabled. If a specific proxy or environment requires it, consider configuring the specific CA certificate instead.

Comment thread scripts/run-qwen3-4B-npu-ray.sh Outdated
Comment on lines +44 to +45
# SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
SCRIPT_DIR="/root/vime/scripts/"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding SCRIPT_DIR to /root/vime/scripts/ reduces the portability of the script. It is better to use the dynamic directory resolution logic (which is currently commented out) so that the script works regardless of where the repository is cloned.

Suggested change
# SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
SCRIPT_DIR="/root/vime/scripts/"
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

Comment on lines +147 to +150
ray job submit --address="http://127.0.0.1:${RAY_DASHBOARD_PORT}" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
--working-dir="/root/vime" \
-- python3 train.py \

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

By default, ray job submit submits the job asynchronously and returns immediately. To ensure the script blocks until the training job completes and streams the logs to stdout, you should pass the --wait flag.

Suggested change
ray job submit --address="http://127.0.0.1:${RAY_DASHBOARD_PORT}" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
--working-dir="/root/vime" \
-- python3 train.py \
ray job submit --address="http://127.0.0.1:${RAY_DASHBOARD_PORT}" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
--working-dir="/root/vime" \
--wait \
-- python3 train.py \

@read-the-docs-community

read-the-docs-community Bot commented Jun 23, 2026

Copy link
Copy Markdown

@miracle0517 miracle0517 force-pushed the feat/ascend-dockerfile-verified-vllm branch from 3d3bd78 to 515acf1 Compare June 23, 2026 08:38
@CalvinXKY

Copy link
Copy Markdown
Collaborator

Conflicts need fixing; DCO error — commit with sign-off.

@miracle0517 miracle0517 force-pushed the feat/ascend-dockerfile-verified-vllm branch from 515acf1 to 5d22d7f Compare June 23, 2026 08:49
@@ -0,0 +1,165 @@
#!/bin/bash

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To limit the number of SHs, we need to modify the existing script based on the GPU example and use Ray for submission.
TODO left: keep only one SH per model type.

Image

@CalvinXKY CalvinXKY requested a review from Meihan-chen June 23, 2026 08:55
@CalvinXKY

Copy link
Copy Markdown
Collaborator

List verification tests performed in the description.

Comment thread docker/Dockerfile.npu
Comment thread docker/Dockerfile.npu
…cript

Signed-off-by: wuxiang <498160096@qq.com>
@miracle0517 miracle0517 force-pushed the feat/ascend-dockerfile-verified-vllm branch from e905637 to 25a1cb1 Compare June 24, 2026 02:12
@Meihan-chen

Copy link
Copy Markdown
Collaborator

LGTM

@miracle0517 miracle0517 force-pushed the feat/ascend-dockerfile-verified-vllm branch from 25a1cb1 to 315794f Compare June 24, 2026 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants