update gkm-extract with unified (NVIDIA/AMD) support#145
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
ChangesUnified GPU Runtime Image
Estimated code review effort: 3 (Moderate) | ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
Containerfile.gkm-extract (2)
95-95: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winRemove the duplicate binary copy layer.
Line 103 copies the same builder output to
/gkm-extractagain, so line 95 adds a redundant image layer with the same binary bytes.Proposed fix
-COPY --from=builder /workspace/bin/gkm-extract /gkm-extract - LABEL description="MCV Unified - includes both NVIDIA (CUDA/NVML) and AMD (ROCm) GPU support for auto-detection"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Containerfile.gkm-extract` at line 95, The Containerfile currently copies the gkm-extract binary from the builder stage twice, creating a redundant image layer with the same bytes. Remove the duplicate COPY for /gkm-extract so the final image only includes a single copy of the binary, and keep the remaining build steps in the Containerfile.gkm-extract stage consistent with the intended final artifact location.
48-50: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winSingle-source the ROCm version used by install paths and metadata.
ROCM_VERSION,OPT_ROCM_VERSION, and the hard-codedrocm-versionlabel can drift when build args are overridden, leaving broken SMI symlinks or incorrect image metadata. Use one version arg and fail fast if the expected binaries are missing.Proposed fix
ARG ROCM_VERSION=7.0.1 ARG AMDGPU_VERSION=7.0.1.70001 -ARG OPT_ROCM_VERSION=7.0.1 @@ - ln -s /opt/rocm-${OPT_ROCM_VERSION}/bin/amd-smi /usr/bin/amd-smi && \ - ln -s /opt/rocm-${OPT_ROCM_VERSION}/bin/rocm-smi /usr/bin/rocm-smi && \ + test -x /opt/rocm-${ROCM_VERSION}/bin/amd-smi && \ + test -x /opt/rocm-${ROCM_VERSION}/bin/rocm-smi && \ + ln -sf /opt/rocm-${ROCM_VERSION}/bin/amd-smi /usr/bin/amd-smi && \ + ln -sf /opt/rocm-${ROCM_VERSION}/bin/rocm-smi /usr/bin/rocm-smi && \ @@ -LABEL rocm-version="7.0.1" +LABEL rocm-version="${ROCM_VERSION}"Also applies to: 85-86, 101-101
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Containerfile.gkm-extract` around lines 48 - 50, The ROCm version is currently duplicated across ROCM_VERSION, OPT_ROCM_VERSION, and the hard-coded rocm-version label, which can drift when build args change. Update the Containerfile to single-source the version from one build arg and reuse it everywhere the install paths, symlinks, and metadata are defined. In the ROCm install/setup steps and the label assignment, reference the same version symbol consistently, and add a fail-fast check in the relevant setup stage to ensure the expected ROCm binaries are present before continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Containerfile.gkm-extract`:
- Around line 75-77: The ROCm installer download in the Containerfile still
points to the ubuntu/jammy channel, which can mismatch the ubuntu24.04 base
image. Update the download URL used in the AMDGPU install RUN step to the Noble
(ubuntu/noble) path so the installer matches this image and ROCm/GPU detection
can work correctly.
---
Nitpick comments:
In `@Containerfile.gkm-extract`:
- Line 95: The Containerfile currently copies the gkm-extract binary from the
builder stage twice, creating a redundant image layer with the same bytes.
Remove the duplicate COPY for /gkm-extract so the final image only includes a
single copy of the binary, and keep the remaining build steps in the
Containerfile.gkm-extract stage consistent with the intended final artifact
location.
- Around line 48-50: The ROCm version is currently duplicated across
ROCM_VERSION, OPT_ROCM_VERSION, and the hard-coded rocm-version label, which can
drift when build args change. Update the Containerfile to single-source the
version from one build arg and reuse it everywhere the install paths, symlinks,
and metadata are defined. In the ROCm install/setup steps and the label
assignment, reference the same version symbol consistently, and add a fail-fast
check in the relevant setup stage to ensure the expected ROCm binaries are
present before continuing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 20571795-f39b-4ac0-acd9-c5f3ec6d5907
📒 Files selected for processing (1)
Containerfile.gkm-extract
| # Install ROCm tools for AMD GPU detection | ||
| # Note: Using Ubuntu 22.04 (jammy) packages as ROCm doesn't officially support Ubuntu 24.04 yet | ||
| RUN wget https://repo.radeon.com/amdgpu-install/${ROCM_VERSION}/ubuntu/jammy/amdgpu-install_${AMDGPU_VERSION}-1_all.deb && \ |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the file and inspect the relevant section with line numbers.
git ls-files 'Containerfile.gkm-extract'
echo '---'
cat -n Containerfile.gkm-extract | sed -n '65,85p'
# Show the surrounding ROCm-related build args and copy steps.
echo '---'
cat -n Containerfile.gkm-extract | sed -n '40,110p'Repository: redhat-et/GKM
Length of output: 4047
🌐 Web query:
ROCm 7.0.1 Ubuntu 24.04 quick start amdgpu-install ubuntu noble path
💡 Result:
To install ROCm 7.0.1 on Ubuntu 24.04 (Noble Numbat) using the amdgpu-install package, follow these official steps derived from the ROCm 7.0.1 documentation [1]. 1. Register the AMDGPU repository Download and install the amdgpu-install package for Ubuntu 24.04: wget https://repo.radeon.com/amdgpu-install/7.0.1/ubuntu/noble/amdgpu-install_7.0.1.70001-1_all.deb sudo apt install./amdgpu-install_7.0.1.70001-1_all.deb sudo apt update 2. Configure Repository and Install After registering the repository, you can proceed with the installation. Note that for ROCm 7.0.1, the repository is located at the path associated with 'noble' (the codename for Ubuntu 24.04) [1][2]. For a complete installation including the necessary kernel drivers and ROCm stack, refer to the official ROCm 7.0.1 Quick Start Guide [1] or the Ubuntu-specific native installation guide [3]. Important installation notes: - System Requirements: Ensure your system meets the documented ROCm hardware and kernel requirements before beginning [1]. - GPG Keys: If performing a manual repository setup (native method), ensure you correctly add the ROCm GPG key to /etc/apt/keyrings/rocm.gpg to verify package authenticity [3]. - Conflicts: If you are upgrading or switching versions, remove previous ROCm installations to avoid dependency conflicts [4]. - Permissions: After installation, remember to add your user to the render and video groups to access GPU resources: sudo usermod -a -G render,video $LOGNAME [5].
Citations:
- 1: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.1/install/quick-start.html
- 2: http://repo.radeon.com/amdgpu-install/7.0.1/ubuntu/noble/
- 3: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.1/install/install-methods/package-manager/package-manager-ubuntu.html
- 4: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.1/install/install-methods/multi-version-install/multi-version-install-ubuntu.html
- 5: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
Use the Noble AMDGPU installer path for this Ubuntu 24.04 image. The current ubuntu/jammy URL can pull the wrong package channel for the ubuntu24.04 base and break the ROCm install or GPU detection.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Containerfile.gkm-extract` around lines 75 - 77, The ROCm installer download
in the Containerfile still points to the ubuntu/jammy channel, which can
mismatch the ubuntu24.04 base image. Update the download URL used in the AMDGPU
install RUN step to the Noble (ubuntu/noble) path so the installer matches this
image and ROCm/GPU detection can work correctly.
| libbtrfs0 \ | ||
| libffi8 \ | ||
| libc6 \ | ||
| buildah \ |
There was a problem hiding this comment.
@maryamtahhan I took this list straight from #138. Do we need buildah in the running stage? Are there others that maybe were needed in #138 but aren't really needed in the extraction container?
cmagina
left a comment
There was a problem hiding this comment.
Other than my 2 small comments, lgtm
| amd-smi-lib \ | ||
| rocm-smi-lib \ | ||
| libdrm2 && \ | ||
| ln -s /opt/rocm-${OPT_ROCM_VERSION}/bin/amd-smi /usr/bin/amd-smi && \ |
There was a problem hiding this comment.
Should use /usr/local for this to avoid package clashes.
| rocm-smi-lib \ | ||
| libdrm2 && \ | ||
| ln -s /opt/rocm-${OPT_ROCM_VERSION}/bin/amd-smi /usr/bin/amd-smi && \ | ||
| ln -s /opt/rocm-${OPT_ROCM_VERSION}/bin/rocm-smi /usr/bin/rocm-smi && \ |
039aac0 to
6271b63
Compare
Follow PR#138 by adding NVIDIA support to the gkm-extract image, AMD support already there. Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Follow #138 by adding NVIDIA support to the gkm-extract image, AMD support already there.
Summary by CodeRabbit