Add nvidia-container-toolkit install and CDI generation by bogdando · Pull Request #22 · rhos-vaf/gpu-validation

bogdando · 2026-06-15T08:12:30Z

In an L4 GPU passthrough setup (vllm-cuda-rhel9:3.3.3, vLLM v0.13.0, supports CUDA 12.8 / GPU driver 570+), GPU validation could not pass unless CDI was generated (even though there is no MIG mode to apply).

The role vars accept nvidia_container_toolkit_install, but the role never acts on it unless the setup is MIG-enabled.

Make the NVIDIA Container Toolkit installation and CDI config/generation a dedicated task gated on the new default variable.

This fixes the "unresolvable CDI devices nvidia.com/gpu=all" error when running vLLM containers on CentOS Stream 9 VMs with passthrough GPUs.
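
One way to check for the condition behind this error is sketched below as two Ansible tasks. This is an illustration rather than part of the PR: the task names and the `cdi_list` register variable are hypothetical, and it assumes `nvidia-ctk` is already installed and on the PATH of the target host.

```yaml
# Hypothetical verification tasks, not taken from the PR.
- name: List CDI devices known to the NVIDIA Container Toolkit
  ansible.builtin.command:
    cmd: nvidia-ctk cdi list
  register: cdi_list      # hypothetical variable name
  changed_when: false     # listing devices does not modify the host

- name: Assert that nvidia.com/gpu=all is resolvable
  ansible.builtin.assert:
    that:
      # Check both streams, since log lines and device names may be split.
      - "'nvidia.com/gpu=all' in (cdi_list.stdout + cdi_list.stderr)"
    fail_msg: "No CDI spec provides nvidia.com/gpu=all; generate one with 'nvidia-ctk cdi generate'."
```

If the assertion fails, a container started with the `nvidia.com/gpu=all` device would likely hit the same "unresolvable CDI devices" error.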

coderabbitai · 2026-06-15T08:12:41Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4678c980-0900-4392-8015-ce95921f30b0

📥 Commits

Reviewing files that changed from the base of the PR and between f667a01 and a5e5d64.

📒 Files selected for processing (4)

gpu-validation/defaults/main.yaml
gpu-validation/tasks/main.yaml
gpu-validation/tasks/nvidia-container-toolkit.yaml
gpu-validation/tasks/nvidia-cuda-repos.yaml

💤 Files with no reviewable changes (1)

gpu-validation/tasks/nvidia-cuda-repos.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

gpu-validation/tasks/main.yaml
gpu-validation/defaults/main.yaml

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added optional NVIDIA container toolkit installation for GPU container support.
- Added automated generation of CDI specifications to enable GPU access in container runtimes.
- Introduced configuration parameters to control whether the toolkit is installed and which repo is used.
Refactor / Workflow
- Updated the CUDA repository setup to perform an immediate reboot to pick up newly installed drivers.
- Moved container toolkit/CDI setup into its own conditional step.

Walkthrough

Adds two default variables (nvidia_container_toolkit_install, nvidia_container_toolkit_repo_url) to gate optional toolkit setup, extracts the NVIDIA Container Toolkit repo fetch, package install, and idempotent CDI generation into a new nvidia-container-toolkit.yaml task file, wires it into main.yaml, and removes the previously inlined equivalent steps from nvidia-cuda-repos.yaml.
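
The gating described above might look roughly like the following. This is a sketch based only on the walkthrough text: the task name is invented, the empty repo URL is a placeholder rather than the PR's actual default, and `import_tasks` is assumed as the import mechanism.

```yaml
# gpu-validation/defaults/main.yaml (sketch)
nvidia_container_toolkit_install: false
# Placeholder: the real default value is not shown in this thread.
nvidia_container_toolkit_repo_url: ""
---
# gpu-validation/tasks/main.yaml (sketch)
- name: Install NVIDIA Container Toolkit and generate CDI spec  # hypothetical name
  ansible.builtin.import_tasks: nvidia-container-toolkit.yaml
  when: nvidia_container_toolkit_install | bool
```

With a static import, the `when` condition is applied to every task in the imported file, so leaving the flag at `false` skips the whole toolkit setup.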

Changes

NVIDIA Container Toolkit extraction and CDI configuration

| Layer / File(s) | Summary |
| --- | --- |
| Default variables and main.yaml wiring: `gpu-validation/defaults/main.yaml`, `gpu-validation/tasks/main.yaml` | Adds `nvidia_container_toolkit_install` (bool, default `false`) and `nvidia_container_toolkit_repo_url` (string) to defaults. Adds a conditional import of `nvidia-container-toolkit.yaml` in `main.yaml` gated on `nvidia_container_toolkit_install`. |
| nvidia-container-toolkit.yaml: repo setup, install, and CDI generation: `gpu-validation/tasks/nvidia-container-toolkit.yaml` | New task file: conditionally fetches and registers the NVIDIA toolkit yum repo when `nvidia_container_toolkit_repo_url` is set, installs `nvidia-container-toolkit` via `dnf`, stats `/etc/cdi/nvidia.yaml`, and if absent creates `/etc/cdi`, runs `nvidia-ctk runtime configure` for containerd, and generates CDI YAML via `nvidia-ctk cdi generate`. |
| nvidia-cuda-repos.yaml: remove inlined toolkit steps and reboot: `gpu-validation/tasks/nvidia-cuda-repos.yaml` | Removes ~32 lines of inlined toolkit install, CDI stat, and CDI generation steps; replaces them with a standalone VM reboot task (600s timeout) to be executed earlier in the flow. |
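
The steps summarized for `nvidia-container-toolkit.yaml` can be sketched as below. This is a reconstruction from the summary, not the file from the PR: task names, the repo file destination, file modes, and the `cdi_spec` register variable are assumptions, and the containerd `runtime configure` step is left out here.

```yaml
# Sketch of gpu-validation/tasks/nvidia-container-toolkit.yaml (assumed layout).
- name: Fetch the NVIDIA Container Toolkit repo file
  ansible.builtin.get_url:
    url: "{{ nvidia_container_toolkit_repo_url }}"
    dest: /etc/yum.repos.d/nvidia-container-toolkit.repo  # assumed path
    mode: "0644"
  become: true
  when: nvidia_container_toolkit_repo_url | default('') | length > 0

- name: Install nvidia-container-toolkit
  ansible.builtin.dnf:
    name: nvidia-container-toolkit
    state: present
  become: true

- name: Check for an existing CDI spec
  ansible.builtin.stat:
    path: /etc/cdi/nvidia.yaml
  register: cdi_spec  # hypothetical variable name

- name: Generate the CDI spec when it is missing
  when: not cdi_spec.stat.exists
  become: true
  block:
    - name: Create /etc/cdi
      ansible.builtin.file:
        path: /etc/cdi
        state: directory
        mode: "0755"

    - name: Generate CDI YAML
      ansible.builtin.command:
        cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
        creates: /etc/cdi/nvidia.yaml  # skip if the spec already exists
```

The `stat` check plus `creates` keeps the generation idempotent: a second run finds `/etc/cdi/nvidia.yaml` and skips the block. One trade-off is that the spec is not regenerated after a driver or GPU topology change unless the file is removed first.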

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding nvidia-container-toolkit installation and CDI generation support. |
| Description check | ✅ Passed | The description provides detailed context about the GPU validation issue, the problem with the previous implementation, and how this PR solves it. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

bogdando · 2026-06-15T08:13:21Z

I am re-verifying this for L4 (non-MIG) on my downstream testproject again, now that the recent MIG changes have landed.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gpu-validation/tasks/nvidia-container-toolkit.yaml`:
- Around line 34-36: The task "Configure NVIDIA container runtime"
unconditionally runs the nvidia-ctk command for containerd, which will fail on
systems where containerd is not installed or in use, particularly in this
Podman-focused role. Add a conditional to the task using the when clause that
checks whether containerd is actually available or in use before executing the
ansible.builtin.command step. This ensures the configuration only runs on
systems where containerd is relevant, allowing the role to work correctly on
Podman-only systems.
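
A guard of the kind suggested in this finding could look like the sketch below. It is one possible approach, not the fix adopted in the PR: detecting containerd through `service_facts` is an assumption, as is the `containerd.service` unit name.

```yaml
# Hypothetical guard for the containerd-specific step.
- name: Gather service facts
  ansible.builtin.service_facts:

- name: Configure NVIDIA container runtime
  ansible.builtin.command:
    cmd: nvidia-ctk runtime configure --runtime=containerd
  become: true
  # Run only when a containerd unit is known to the service manager.
  when: "'containerd.service' in ansible_facts.services"
```

On a Podman-only host the unit is absent, so the task is skipped and the rest of the role proceeds to CDI generation.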

In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around lines 16-17: The unconditional import of nvidia-container-toolkit.yaml
at lines 16-17 bypasses the nvidia_container_toolkit_install gate, causing the
toolkit installation to run even when the flag is false and potentially creating
duplicate executions. Remove the import statement for
nvidia-container-toolkit.yaml from this location entirely and rely on the single
gated import path in gpu-validation/tasks/main.yaml to control toolkit
installation based on the nvidia_container_toolkit_install flag.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: b211d255-db78-425c-bd39-7df856e8ea3e

📥 Commits

Reviewing files that changed from the base of the PR and between 565225d and f667a01.

📒 Files selected for processing (4)

gpu-validation/defaults/main.yaml
gpu-validation/tasks/main.yaml
gpu-validation/tasks/nvidia-container-toolkit.yaml
gpu-validation/tasks/nvidia-cuda-repos.yaml

In an L4 GPU passthrough setup (vllm-cuda-rhel9:3.3.3, vLLM v0.13.0, supports CUDA 12.8 / GPU driver 570+), GPU validation could not pass unless CDI was generated (even though there is no MIG mode to apply).

The role vars accept nvidia_container_toolkit_install, but the role never acts on it unless the setup is MIG-enabled.

Make the NVIDIA Container Toolkit installation and CDI config/generation a dedicated task gated on the new default variable.

This fixes the "unresolvable CDI devices nvidia.com/gpu=all" error when running vLLM containers on CentOS Stream 9 VMs with passthrough GPUs.

Signed-off-by: Bohdan Dobrelia <bdobreli@redhat.com>

bogdando · 2026-06-17T09:25:57Z

I tested it for the L4 non-MIG case. It would be nice to re-verify this for a MIG setup as well, to catch possible regressions.

bogdando added the do-not-merge-work-in-progress label Jun 15, 2026

coderabbitai Bot reviewed Jun 15, 2026

bogdando force-pushed the dev branch from f667a01 to a5e5d64 Compare June 15, 2026 08:32

bogdando removed the do-not-merge-work-in-progress label Jun 17, 2026

bogdando wants to merge 1 commit into rhos-vaf:main from bogdando:dev
