
Add nvidia-container-toolkit install and CDI generation #22

Open
bogdando wants to merge 1 commit into rhos-vaf:main from bogdando:dev

@bogdando

Contributor

In an L4 GPU passthrough setup (vllm-cuda-rhel9:3.3.3, vLLM v0.13.0, supports CUDA 12.8 / GPU driver 570+), GPU validation could not pass unless the CDI spec was generated (even though there is no MIG mode to apply).

The role vars accept nvidia_container_toolkit_install, but the role never acts on it unless the setup is MIG-enabled.

Make the container toolkit installation and CDI config/generation a dedicated task gated on the new default variable.

This fixes the "unresolvable CDI devices nvidia.com/gpu=all" error when running vLLM containers on CentOS Stream 9 VMs with passthrough GPUs.
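
One way to confirm the fix on a target host is to check that the toolkit can resolve the device name from the error message. The tasks below are a minimal sketch, not part of this PR: the task names and the `cdi_list` variable are hypothetical, and the check looks at both stdout and stderr because the exact output stream of `nvidia-ctk cdi list` is an assumption here.

```yaml
# Hypothetical verification tasks; not taken from the role.
- name: List CDI devices known to the NVIDIA container toolkit
  ansible.builtin.command: nvidia-ctk cdi list
  register: cdi_list  # hypothetical variable name
  changed_when: false

- name: Fail early when nvidia.com/gpu=all cannot be resolved
  ansible.builtin.assert:
    that:
      - "'nvidia.com/gpu=all' in (cdi_list.stdout + cdi_list.stderr)"
    fail_msg: "No CDI spec for nvidia.com/gpu=all; generate one with nvidia-ctk cdi generate"
```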

@coderabbitai

coderabbitai Bot commented Jun 15, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4678c980-0900-4392-8015-ce95921f30b0

📥 Commits

Reviewing files that changed from the base of the PR and between f667a01 and a5e5d64.

📒 Files selected for processing (4)
  • gpu-validation/defaults/main.yaml
  • gpu-validation/tasks/main.yaml
  • gpu-validation/tasks/nvidia-container-toolkit.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
💤 Files with no reviewable changes (1)
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • gpu-validation/tasks/main.yaml
  • gpu-validation/defaults/main.yaml

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added optional NVIDIA container toolkit installation for GPU container support.
    • Added automated generation of CDI specifications to enable GPU access in container runtimes.
    • Introduced configuration parameters to control whether the toolkit is installed and which repo is used.
  • Refactor / Workflow
    • Updated the CUDA repository setup to perform an immediate reboot to pick up newly installed drivers.
    • Moved container toolkit/CDI setup into its own conditional step.

Walkthrough

Adds two default variables (nvidia_container_toolkit_install, nvidia_container_toolkit_repo_url) to gate optional toolkit setup, extracts the NVIDIA Container Toolkit repo fetch, package install, and idempotent CDI generation into a new nvidia-container-toolkit.yaml task file, wires it into main.yaml, and removes the previously inlined equivalent steps from nvidia-cuda-repos.yaml.
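
The gating described above could look roughly like the following. This is a sketch under assumptions rather than the PR's actual diff: the two variable names and the `false` default come from the walkthrough, while the empty repo URL default, the task name, and the use of `import_tasks` are guesses.

```yaml
# gpu-validation/defaults/main.yaml (sketch)
nvidia_container_toolkit_install: false
# Assumed empty default; the real value is not shown on this page.
nvidia_container_toolkit_repo_url: ""
---
# gpu-validation/tasks/main.yaml (sketch; task name is hypothetical)
- name: Install NVIDIA container toolkit and generate CDI specs
  ansible.builtin.import_tasks: nvidia-container-toolkit.yaml
  when: nvidia_container_toolkit_install | bool
```

With this shape, a non-MIG passthrough host only needs `nvidia_container_toolkit_install: true` in its inventory to get the toolkit and the CDI spec.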

Changes

NVIDIA Container Toolkit extraction and CDI configuration

| Layer / File(s) | Summary |
| --- | --- |
| Default variables and main.yaml wiring: gpu-validation/defaults/main.yaml, gpu-validation/tasks/main.yaml | Adds nvidia_container_toolkit_install (bool, default false) and nvidia_container_toolkit_repo_url (string) to defaults. Adds a conditional import of nvidia-container-toolkit.yaml in main.yaml gated on nvidia_container_toolkit_install. |
| nvidia-container-toolkit.yaml: repo setup, install, and CDI generation: gpu-validation/tasks/nvidia-container-toolkit.yaml | New task file: conditionally fetches and registers the NVIDIA toolkit yum repo when nvidia_container_toolkit_repo_url is set, installs nvidia-container-toolkit via dnf, stats /etc/cdi/nvidia.yaml, and if absent creates /etc/cdi, runs nvidia-ctk runtime configure for containerd, and generates CDI YAML via nvidia-ctk cdi generate. |
| nvidia-cuda-repos.yaml: remove inlined toolkit steps and reboot: gpu-validation/tasks/nvidia-cuda-repos.yaml | Removes ~32 lines of inlined toolkit install, CDI stat, and CDI generation steps; replaces them with a standalone VM reboot task (600s timeout) to be executed earlier in the flow. |
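
For readers without the diff at hand, the new task file summarized above might be shaped like this. It is a sketch, not the file from the PR: task names, the `cdi_spec` variable, the repo file destination, and file modes are hypothetical, and privilege escalation (`become`) is left out.

```yaml
# Sketch of gpu-validation/tasks/nvidia-container-toolkit.yaml.
- name: Register the NVIDIA container toolkit yum repo
  ansible.builtin.get_url:
    url: "{{ nvidia_container_toolkit_repo_url }}"
    dest: /etc/yum.repos.d/nvidia-container-toolkit.repo  # assumed path
    mode: "0644"
  when: nvidia_container_toolkit_repo_url | default('') | length > 0

- name: Install nvidia-container-toolkit
  ansible.builtin.dnf:
    name: nvidia-container-toolkit
    state: present

- name: Check for an existing CDI spec
  ansible.builtin.stat:
    path: /etc/cdi/nvidia.yaml
  register: cdi_spec  # hypothetical variable name

- name: Generate the CDI spec only when it is missing
  when: not cdi_spec.stat.exists
  block:
    - name: Create /etc/cdi
      ansible.builtin.file:
        path: /etc/cdi
        state: directory
        mode: "0755"

    - name: Configure the NVIDIA runtime for containerd
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
      changed_when: true

    - name: Generate CDI YAML
      ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
      changed_when: true
```

The `stat` check is what keeps the run idempotent: once /etc/cdi/nvidia.yaml exists, the whole block is skipped on later runs.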

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding nvidia-container-toolkit installation and CDI generation support. |
| Description check | ✅ Passed | The description provides detailed context about the GPU validation issue, the problem with the previous implementation, and how this PR solves it. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@bogdando

Contributor Author

I am re-verifying this for L4 (non-MIG) on my downstream test project again, now that the recent MIG changes have landed.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gpu-validation/tasks/nvidia-container-toolkit.yaml`:
- Around line 34-36: The task "Configure NVIDIA container runtime"
unconditionally runs the nvidia-ctk command for containerd, which will fail on
systems where containerd is not installed or in use, particularly in this
Podman-focused role. Add a conditional to the task using the when clause that
checks whether containerd is actually available or in use before executing the
ansible.builtin.command step. This ensures the configuration only runs on
systems where containerd is relevant, allowing the role to work correctly on
Podman-only systems.

In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around line 16-17: The unconditional import of nvidia-container-toolkit.yaml
at line 16-17 bypasses the nvidia_container_toolkit_install gate, causing the
toolkit installation to run even when the flag is false and potentially creating
duplicate executions. Remove the import statement for
nvidia-container-toolkit.yaml from this location entirely and rely on the single
gated import path in gpu-validation/tasks/main.yaml to control toolkit
installation based on the nvidia_container_toolkit_install flag.
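
The first finding above asks for a `when` guard on the containerd configuration step. One possible shape is sketched below; it is not the fix applied in this PR. The probe task, its name, and the `containerd_check` variable are hypothetical, and detecting containerd through `command -v` is only one of several reasonable checks.

```yaml
# Hypothetical guard for the "Configure NVIDIA container runtime" task.
- name: Check whether containerd is available
  ansible.builtin.shell: command -v containerd
  register: containerd_check  # hypothetical variable name
  changed_when: false
  failed_when: false

- name: Configure NVIDIA container runtime
  ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
  when: containerd_check.rc == 0
  changed_when: true
```

On a Podman-only host the probe returns a non-zero code, so the configure step is skipped while CDI generation can still run.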

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: b211d255-db78-425c-bd39-7df856e8ea3e

📥 Commits

Reviewing files that changed from the base of the PR and between 565225d and f667a01.

📒 Files selected for processing (4)
  • gpu-validation/defaults/main.yaml
  • gpu-validation/tasks/main.yaml
  • gpu-validation/tasks/nvidia-container-toolkit.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml

Comment thread gpu-validation/tasks/nvidia-container-toolkit.yaml
Comment thread gpu-validation/tasks/nvidia-cuda-repos.yaml Outdated
In an L4 GPU passthrough setup (vllm-cuda-rhel9:3.3.3, vLLM v0.13.0,
supports CUDA 12.8 / GPU driver 570+), GPU validation could not pass
unless the CDI spec was generated (even though there is no MIG mode to
apply).

The role vars accept nvidia_container_toolkit_install, but the role
never acts on it unless the setup is MIG-enabled.

Make the container toolkit installation and CDI config/generation a
dedicated task gated on the new default variable.

This fixes the "unresolvable CDI devices nvidia.com/gpu=all" error
when running vLLM containers on CentOS Stream 9 VMs with
passthrough GPUs.

Signed-off-by: Bohdan Dobrelia <bdobreli@redhat.com>
@bogdando

Contributor Author

I tested it for the L4 non-MIG case. It would be nice to re-verify this on a MIG setup as well, to catch possible regressions.
