feat(npu): Enhance Dockerfile.npu with verified vllm commit and run script #287
miracle0517 wants to merge 2 commits into
Code Review
This pull request introduces a new script run-qwen3-4B-npu-ray.sh to run Qwen3-4B training on NPU using Ray, and updates Dockerfile.npu to refine the installation of vLLM and its Ascend backend. Feedback on the changes includes a critical fix for a missing file operand in a sed command in the Dockerfile, and a security concern regarding globally disabling SSL verification. For the new script, recommendations include using pkill for cleaner process termination, avoiding hardcoded script directories to maintain portability, and adding the --wait flag to the Ray job submission to stream logs synchronously.
  # System and pip dependencies.
- RUN sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' \
- /etc/apt/sources.list && \
+ RUN sed -i '/ports\.ubuntu\.com/ {h;s|ports\.ubuntu\.com|mirrors.tuna.tsinghua.edu.cn|g;G}' && \
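For context on the missing file operand: sed -i edits files in place, so it needs at least one file argument, and the new line drops the /etc/apt/sources.list operand that the old command had. The following is a minimal sketch of the corrected invocation, assuming GNU sed (as shipped in Ubuntu-based images). It runs against a scratch file rather than the real sources list, and the sample repository lines are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: exercises the sed expression from the diff on a scratch file.
# The file contents below are illustrative, not taken from the real image.
sources_list="$(mktemp)"
cat > "$sources_list" <<'EOF'
deb http://ports.ubuntu.com/ubuntu-ports jammy main
deb http://example.invalid/other jammy main
EOF

# For each line mentioning ports.ubuntu.com: save the original line (h),
# rewrite the host to the mirror (s), then append the saved original (G),
# so the mirror entry comes first and the upstream entry follows it.
# The trailing file operand is the part the new Dockerfile line is missing.
sed -i '/ports\.ubuntu\.com/ {h;s|ports\.ubuntu\.com|mirrors.tuna.tsinghua.edu.cn|g;G}' \
    "$sources_list"

cat "$sources_list"
```

In the Dockerfile, the equivalent change would be to put /etc/apt/sources.list back after the quoted expression and before the && continuation.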
# cleanup
pkill -9 -f "vllm serve" 2>/dev/null || true
sleep 2
npu-smi info 2>/dev/null | grep rayWorker | awk '{print $4}' | xargs -r kill -9 2>/dev/null || true
The awk '{print $4}' command extracts the process name (or type) instead of the PID from the npu-smi info output, which will cause the subsequent kill -9 command to fail since it expects a numeric PID. Additionally, npu-smi info output format can vary. A more robust and standard way to clean up the Ray worker processes is to use pkill -9 -f rayWorker.
- npu-smi info 2>/dev/null | grep rayWorker | awk '{print $4}' | xargs -r kill -9 2>/dev/null || true
+ pkill -9 -f rayWorker 2>/dev/null || true
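If a parsing-based cleanup is kept for any reason, one defensive option is to forward only purely numeric tokens to kill, so a process name can never be passed as a PID. This is a sketch under that assumption, not part of the suggestion above: only_pids is a hypothetical helper, and the sample tokens are invented rather than real npu-smi info output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pass through only lines that consist entirely of digits,
# so a process name or type can never reach `kill` as an argument.
only_pids() {
    grep -E '^[0-9]+$' || true   # grep exits 1 when nothing matches; ignore that
}

# Invented sample tokens standing in for a column extracted from tool output.
printf '%s\n' rayWorker 12345 python3 678 | only_pids
```

Such a filter would sit between the column extraction and xargs -r kill -9. It does not make the column choice correct; it only keeps non-numeric values away from kill.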
# builds use the same files as the build context (not the later git-cloned vime tree).
COPY docker/npu_patch /tmp/npu_patch

RUN git config --global http.sslVerify false
Disabling SSL verification globally (git config --global http.sslVerify false) is a security risk as it exposes the build process to Man-in-the-Middle (MITM) attacks when cloning repositories. Since all the cloned repositories are from public hosts like GitHub, SSL verification should remain enabled. If a specific proxy or environment requires it, consider configuring the specific CA certificate instead.
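The CA-certificate alternative can be expressed with git's http.sslCAInfo setting, which points git at a CA bundle while leaving verification enabled. This is a minimal sketch, assuming a hypothetical bundle path /etc/ssl/certs/corp-proxy-ca.pem. It writes to a scratch config file so that it does not touch any real global configuration.

```shell
#!/usr/bin/env bash
# Sketch: trust one extra CA bundle instead of turning verification off.
# The bundle path is hypothetical; substitute the proxy's actual CA file.
ca_bundle="/etc/ssl/certs/corp-proxy-ca.pem"
scratch_cfg="$(mktemp)"

# --file keeps this demo away from ~/.gitconfig; a Dockerfile would use --global.
git config --file "$scratch_cfg" http.sslCAInfo "$ca_bundle"
git config --file "$scratch_cfg" http.sslVerify true

# Show what was written.
git config --file "$scratch_cfg" --list
```

In the image build, the CA file would first need to be copied into the image (for example with a COPY step) before git is pointed at it.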
# SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
SCRIPT_DIR="/root/vime/scripts/"
Hardcoding SCRIPT_DIR to /root/vime/scripts/ reduces the portability of the script. It is better to use the dynamic directory resolution logic (which is currently commented out) so that the script works regardless of where the repository is cloned.
- # SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
- SCRIPT_DIR="/root/vime/scripts/"
+ SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
ray job submit --address="http://127.0.0.1:${RAY_DASHBOARD_PORT}" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
--working-dir="/root/vime" \
-- python3 train.py \
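The script should block until the training job completes and stream the job's logs to stdout. Note that the Ray Jobs CLI documentation describes this as the default behavior of ray job submit, with --no-wait as the opt-out, so check ray job submit --help on the installed Ray version to confirm that a --wait flag exists before adding it.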
- ray job submit --address="http://127.0.0.1:${RAY_DASHBOARD_PORT}" \
- --runtime-env-json="${RUNTIME_ENV_JSON}" \
- --working-dir="/root/vime" \
- -- python3 train.py \
+ ray job submit --address="http://127.0.0.1:${RAY_DASHBOARD_PORT}" \
+ --runtime-env-json="${RUNTIME_ENV_JSON}" \
+ --working-dir="/root/vime" \
+ --wait \
+ -- python3 train.py \
3d3bd78 to 515acf1
The conflicts need to be resolved, and the DCO check is failing: please commit with sign-off.
515acf1 to 5d22d7f
@@ -0,0 +1,165 @@
#!/bin/bash
Please list the verification tests you performed in the PR description.
…cript Signed-off-by: wuxiang <498160096@qq.com>
e905637 to 25a1cb1
LGTM
Signed-off-by: wuxiang <498160096@qq.com>
25a1cb1 to 315794f

This improves build robustness on NPU platforms and provides a unified entry point for job scheduling via Ray.
Verification with the qwen3-4b image on an A3 machine:
