Add nvidia-container-toolkit install and CDI generation #22
Walkthrough: Adds two default variables. Changes: NVIDIA Container Toolkit extraction and CDI configuration.
I am re-verifying this for L4 (non-MIG) on my downstream test project again, now that the recent MIG changes have landed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@gpu-validation/tasks/nvidia-container-toolkit.yaml`:
- Around line 34-36: The task "Configure NVIDIA container runtime"
unconditionally runs the nvidia-ctk command for containerd, which will fail on
systems where containerd is not installed or in use, particularly in this
Podman-focused role. Add a conditional to the task using the when clause that
checks whether containerd is actually available or in use before executing the
ansible.builtin.command step. This ensures the configuration only runs on
systems where containerd is relevant, allowing the role to work correctly on
Podman-only systems.
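A minimal sketch of how this suggestion could look, under stated assumptions: only the task name comes from the review comment. The service-facts lookup, the `containerd.service` key, and the exact `nvidia-ctk` arguments are illustrative, since the real task body is not shown in this thread.

```yaml
# Sketch only: the actual task in nvidia-container-toolkit.yaml may differ.
# Collect systemd unit information so the condition below can inspect it.
- name: Gather service facts
  ansible.builtin.service_facts:

# Run the containerd configuration step only where a containerd unit exists.
- name: Configure NVIDIA container runtime
  ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
  when: "'containerd.service' in ansible_facts.services"
```

A unit that exists is not necessarily in use. A stricter condition could also check `ansible_facts.services['containerd.service'].state == 'running'`.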
In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around line 16-17: The unconditional import of nvidia-container-toolkit.yaml
at line 16-17 bypasses the nvidia_container_toolkit_install gate, causing the
toolkit installation to run even when the flag is false and potentially creating
duplicate executions. Remove the import statement for
nvidia-container-toolkit.yaml from this location entirely and rely on the single
gated import path in gpu-validation/tasks/main.yaml to control toolkit
installation based on the nvidia_container_toolkit_install flag.
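The single gated import path described above might look like the following in `gpu-validation/tasks/main.yaml`. This is a sketch, not the role's actual content: the task name is hypothetical, and it assumes `nvidia_container_toolkit_install` is a boolean defined in `gpu-validation/defaults/main.yaml`.

```yaml
# Sketch only: the task name is hypothetical; the variable and file name
# are the ones mentioned in the review comment.
- name: Install and configure NVIDIA Container Toolkit
  ansible.builtin.import_tasks: nvidia-container-toolkit.yaml
  when: nvidia_container_toolkit_install | bool
```

With `import_tasks`, the `when` condition is applied to every imported task, so leaving the flag at `false` skips the whole file, provided no other file imports it unconditionally.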
📒 Files selected for processing (4)
- gpu-validation/defaults/main.yaml
- gpu-validation/tasks/main.yaml
- gpu-validation/tasks/nvidia-container-toolkit.yaml
- gpu-validation/tasks/nvidia-cuda-repos.yaml
In an L4 GPU passthrough setup (vllm-cuda-rhel9:3.3.3, vLLM v0.13.0, supports CUDA 12.8 / GPU driver 570+), GPU validation could not pass unless CDI was generated (even though there is no MIG mode to apply).
The role vars accept nvidia_container_toolkit_install, but the role never acts on it unless the setup is MIG-enabled.
Make NVIDIA Container Toolkit installation and CDI config/generation a dedicated task gated on the new default variable.
This fixes the "unresolvable CDI devices nvidia.com/gpu=all" error when running vLLM containers on CentOS Stream 9 VMs with passthrough GPUs.
Signed-off-by: Bohdan Dobrelia <bdobreli@redhat.com>
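For illustration, CDI generation and a check for the `nvidia.com/gpu=all` device named in the error could be sketched as the tasks below. This is an assumption-laden sketch rather than the PR's actual tasks: the task names, the `/etc/cdi/nvidia.yaml` output path, and the `cdi_devices` variable are hypothetical, and it assumes `nvidia-ctk cdi list` prints device names on stdout.

```yaml
# Sketch only: task names, output path, and variable name are assumptions.
# Write a CDI specification describing the GPUs visible on this host.
- name: Generate CDI specification
  ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  args:
    creates: /etc/cdi/nvidia.yaml

# List the devices that container engines can now resolve.
- name: List CDI devices
  ansible.builtin.command: nvidia-ctk cdi list
  register: cdi_devices
  changed_when: false

# Fail early if the device requested by the vLLM container is missing.
- name: Verify that nvidia.com/gpu=all is resolvable
  ansible.builtin.assert:
    that:
      - "'nvidia.com/gpu=all' in cdi_devices.stdout"
```

The `creates` argument keeps the first task idempotent, but it also means the specification is not regenerated after a driver or device change. Dropping it trades idempotence for freshness.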
I tested it for the L4 non-MIG case. It would be nice to re-verify this for a MIG setup as well to catch possible regressions.