fix(npu): rewrite Dockerfile.npu and fix corrupt patches for verl base image#230
fix(npu): rewrite Dockerfile.npu and fix corrupt patches for verl base image#230CalvinXKY wants to merge 5 commits into
Conversation
* update docker patch. * fix mindspeed.patch try-except formatting per review Replace malformed features_manager hunks with proper try/except/pass blocks. * add torch_npu.patch for NPU Docker build Wrap eager_connect_single_device in try/except to avoid RuntimeError on A3.
…e image The original Dockerfile.npu failed to build due to: - git clone on pre-existing repos in verl base image - No network access during docker build - torch_memory_saver requires CUDA (unavailable on NPU) - Corrupt vllm-ascend.patch and mindspeed.patch (missing trailing context) - 3 megatron patches fail on verl base image (shallow clone + pre-existing modifications) Changes: - Rewrite Dockerfile.npu to work with verl base image (no git clone, no network) - Replace 3 megatron patches with single megatron-all-changes.patch from working container - Fix vllm-ascend.patch: add missing trailing context, fix import line position - Fix mindspeed.patch: add missing trailing context, fix fake index hashes, consolidate features_manager changes - Make torch_npu.patch optional (version mismatch) - Add git safe.directory config for vime - Use --no-deps for Megatron-LM pip install (base image has deps)
There was a problem hiding this comment.
Code Review
This pull request simplifies the NPU Dockerfile by consolidating patches, copying local files directly, and adjusting dependency installations. However, several critical issues were identified in the review: using || true or || echo when applying patches in the Dockerfile can silently ignore severe errors (such as missing files or permission issues) instead of only skipping on expected version mismatches or conflicts. Additionally, catching AttributeError in the mindspeed patch to handle missing methods can mask actual AttributeErrors raised inside those validation methods; using hasattr is recommended instead.
| echo "Patch failed to apply cleanly. Please resolve conflicts." && \ | ||
| exit 1; \ | ||
| git update-index --refresh || true && \ | ||
| git apply --3way /tmp/megatron-all-changes.patch || true && \ |
There was a problem hiding this comment.
Using || true with git apply will silently ignore severe errors (such as a missing patch file, permission issues, or a completely corrupt patch format) because grep will find no conflict markers and the build will succeed with an unpatched repository. We should only ignore the exit code if it is 1 (which indicates clean/unclean patch application with conflicts/3-way fallback), and fail on any other exit code.
git apply --3way /tmp/megatron-all-changes.patch || [ $? -eq 1 ] && \
| exit 1; \ | ||
| fi && \ | ||
| rm mindspeed.patch | ||
| git apply --3way /tmp/mindspeed.patch || true && \ |
There was a problem hiding this comment.
Using || true with git apply will silently ignore severe errors (such as a missing patch file or corrupt format). We should only ignore the exit code if it is 1 (which indicates patch application with conflicts/3-way fallback), and fail on any other exit code.
git apply --3way /tmp/mindspeed.patch || [ $? -eq 1 ] && \
| cd ${TORCH_NPU_DIR} && \ | ||
| patch -p2 < /tmp/torch_npu.patch && \ | ||
| rm /tmp/torch_npu.patch | ||
| patch -p2 < /tmp/torch_npu.patch || echo "torch_npu patch skipped (version mismatch)" |
There was a problem hiding this comment.
Using || echo ... silently ignores all severe errors from the patch command (such as patch command not found, permission denied, or missing patch file). According to POSIX standards, patch exits with 1 if hunks failed (e.g., due to version mismatch or already applied), and >1 for severe errors. We should only skip the patch if the exit code is 1, and fail on any other non-zero exit code.
patch -p2 < /tmp/torch_npu.patch; \
ret=$?;
if [ $ret -eq 1 ]; then \
echo "torch_npu patch skipped (version mismatch)"; \
elif [ $ret -ne 0 ]; then \
echo "Failed to apply torch_npu patch with error code $ret" && exit $ret; \
fi
| + try: | ||
| + feature.pre_validate_args(args) | ||
| + except AttributeError: | ||
| + pass |
There was a problem hiding this comment.
Catching AttributeError silently can mask actual AttributeErrors raised inside the pre_validate_args method itself (e.g., if there is a typo or a missing attribute access inside the feature's validation logic). It is much safer to use hasattr to check if the method exists before calling it.
+ if hasattr(feature, 'pre_validate_args'):
+ feature.pre_validate_args(args)
| + try: | ||
| + feature.post_validate_args(args) | ||
| + except AttributeError: | ||
| + pass |
There was a problem hiding this comment.
| + try: | ||
| + feature.validate_args(args) | ||
| + except AttributeError: | ||
| + pass |
There was a problem hiding this comment.
Remove 1706 irrelevant files (docs, examples, .github, workflows) and pure mode-only changes (old mode/new mode) that were from verl base image's Megatron-LM version difference, not vime modifications. The slim patch contains only 69 files with actual code changes: - cuda->npu adaptations (torch.cuda -> torch.npu) - verl-specific modifications to Megatron-LM - No megatron/bridge/ changes (already in verl base image)
…ibility git apply fails silently on verl base image's shallow clones with 'does not match index' for all files, causing patches to not take effect. patch -p1 works reliably regardless of git index state.
- Rewrite Dockerfile.npu for verl base image (no git clone, COPY instead) - Use patch -p1 instead of git apply for shallow clone compatibility - Add megatron-all-changes.patch (slim 2203 lines, replaces 3 separate patches) - Fix corrupt vllm-ascend.patch and mindspeed.patch - Adapt convert_hf_to_torch_dist.py with is_npu() conditional and mindspeed import
…ch robustness flags - convert_hf_to_torch_dist.py: add back is_npu() + mindspeed.megatron_adaptor (lost during PR #230 merge), use conditional cuda/npu paths - Dockerfile.npu: add -f --no-backup-if-mismatch and || true to patch commands to handle already-modified files in base image
- Rewrite Dockerfile.npu for verl base image (no git clone, COPY instead) - Use patch -p1 instead of git apply for shallow clone compatibility - Add megatron-all-changes.patch (slim 2203 lines, replaces 3 separate patches) - Fix corrupt vllm-ascend.patch and mindspeed.patch - Adapt convert_hf_to_torch_dist.py with is_npu() conditional and mindspeed import
…ch robustness flags - convert_hf_to_torch_dist.py: add back is_npu() + mindspeed.megatron_adaptor (lost during PR #230 merge), use conditional cuda/npu paths - Dockerfile.npu: add -f --no-backup-if-mismatch and || true to patch commands to handle already-modified files in base image
Summary
Dockerfile.nputo work with the verl base image (nogit clone, no network access during build)vllm-ascend.patchandmindspeed.patch(missing trailing context, fake index hashes)megatron-all-changes.patchthat applies cleanly on the verl base imagetorch_npu.patchoptional (version mismatch with base image)Problem
The original
Dockerfile.npufailed to build with 6 distinct issues:git cloneon directories already pre-existing in the verl base imagedocker build(proxy resolution fails)torch_memory_saverrequires CUDA C++ extensions (unavailable on NPU)vllm-ascend.patchcorrupt at line 78 (missing trailing context)mindspeed.patchcorrupt at line 107 (missing trailing context + fake index hash1234567..abcdef0)--3way, plus verl already modified files)Changes
docker/Dockerfile.npugit cloneandpip install git+steps (base image has repos pre-installed)torch_memory_saverinstall (vime handles missing import gracefully)COPYinstead ofgit clonefor vime sourcemegatron-all-changes.patch|| trueforgit apply --3way(returns non-zero on 3way fallback) with conflict marker checkgit config --global --add safe.directoryfor vimetorch_npu.patchoptional with fallbackdocker/patch/latest/megatron-all-changes.patch(NEW)megatron.patch,megatron-npu.patch,megatron-bridge.patchgit diff HEADin Megatron-LMdocker/patch/latest/vllm-ascend.patch(FIXED)error: corrupt patch at line 78)docker/patch/latest/mindspeed.patch(FIXED)error: corrupt patch at line 107)1234567..abcdef0→ proper hashes)features_manager.pychanges into single diff hunkTesting
Built and verified
vime-npu:testimage (26.2GB) on 400t-server (aarch64, Ascend 910B1):import vimesucceedsRelated: #157