Add MIG validations and RHEL setup #20
rlandy commented Jun 8, 2026
- flavor addition for MIG testing
- Update MIG checks and NVIDIA validations
- Update lsmod
- Add RHEL guest repo support
- Gather facts on the target nodes
- Add repo set up tasks to general VM set up
- Add dkms installation and reboot
- Disable GPG check for EPEL
- Reboot guest as root
- Install NVIDIA repos with custom RPM
- Make CUDA repo available for install
- Install nvidia-ctk
- Setup CDI and Management Library
- Configure CDI as root
- Include all RHEL NVIDIA libs
- Clean old CDI spec before generating
- Reboot VM before CDI install
- Add gpu utilization option in service
- Move CUDA install tasks to own file
- Rework task for NVIDIA CUDA
No actionable comments were generated in the recent review. 🎉
Files skipped from review as they are similar to previous changes (9)
Walkthrough
This PR adds a GPU validation mode switch (default pci_passthrough) and related defaults, detects NVIDIA A30 hardware, installs RHEL CUDA/driver prerequisites for MIG, performs mode-specific GPU checks and assertions, and parameterizes flavor and service templates for GPU validation.
Changes
GPU Validation with PCI Passthrough and MIG Modes
🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: 5 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gpu-validation/tasks/gpus.yaml (1)
7-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Initialize GPU detection facts before conditional set_fact
found_nvidia/found_a30 are only ever set to true. If these facts were previously set in the same host context, later runs can take the wrong execution path (false positives). Initialize both to false before the detection checks.
Proposed fix
+- name: Initialize GPU detection facts
+  ansible.builtin.set_fact:
+    found_nvidia: false
+    found_a30: false
+
 - name: Set found_nvidia to true if NVIDIA is found
   ansible.builtin.set_fact:
     found_nvidia: true
   when: lspci_output.stdout is search("NVIDIA")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@gpu-validation/tasks/gpus.yaml` around lines 7 - 15, The playbook only ever sets found_nvidia and found_a30 to true, which can lead to stale true values across runs; add an initial ansible.builtin.set_fact task that explicitly sets found_nvidia: false and found_a30: false before the detection tasks so the subsequent conditional set_fact tasks (the tasks currently named "Set found_nvidia to true if NVIDIA is found" and "Set found_A30 to true if A30 is found") will reliably reflect the current host state.
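An alternative to a separate initialization task is to assign the boolean result of the search directly, so the fact is rewritten on every run and cannot stay stale. This is only a sketch of that idea, not what the PR or the proposed fix does; it assumes `lspci_output` is the registered result of an earlier `lspci` command task, as in the snippet above.

```yaml
# Sketch: the fact is always overwritten with the current detection result,
# so a value left over from an earlier run cannot leak into this one.
- name: Record whether an NVIDIA device is present
  ansible.builtin.set_fact:
    found_nvidia: "{{ lspci_output.stdout is search('NVIDIA') }}"
```

The trade-off is that the task always runs, whereas the original pattern only touches the fact when the device is found.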
🧹 Nitpick comments (2)
requirements.yaml (1)
8-8: ⚖️ Poor tradeoff
Consider pinning the ci-framework collection to a specific version.
The git-sourced collection is added without a version, ref, or commit constraint, which can lead to non-deterministic builds if the upstream repository introduces breaking changes. While this is consistent with the existing edpm-ansible collection pattern (line 7), pinning to a specific commit, tag, or branch would improve build reproducibility and stability.
Example with version pinning
-  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git
+  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git
+    version: main  # or specify a commit SHA/tag
Alternatively, specify a ref:
  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git#<commit-sha-or-tag>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@requirements.yaml` at line 8, The ci-framework collection entry "git+https://github.com/openstack-k8s-operators/ci-framework.git" is unpinned which risks non-deterministic builds; update the requirements.yaml entry for that collection (the ci-framework line) to pin to a specific tag, branch, or commit (e.g., append #<commit-sha-or-tag> to the git URL or replace with a version spec) so the collection is reproducible and stable across installs.
gpu-validation/tasks/vm_image.yaml (1)
29-33: 💤 Low value
Minor: Fix grammar in task name.
The task name "Reset extra_specs based GPU mode" should be "Reset extra_specs based on GPU mode" for better readability.
✏️ Suggested fix
-- name: Reset extra_specs based GPU mode
+- name: Reset extra_specs based on GPU mode
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@gpu-validation/tasks/vm_image.yaml` around lines 29 - 33, Rename the Ansible task "Reset extra_specs based GPU mode" to "Reset extra_specs based on GPU mode" to correct the grammar; update the task name string where the play contains the task that sets gpu_validation_extra_specs (with "resources:VGPU": "1") and the when condition referencing gpu_validation_mode so the logic remains unchanged.
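For orientation, the renamed task might look roughly like the sketch below. Only the task name, the `gpu_validation_extra_specs` fact, the `"resources:VGPU": "1"` entry, and the `gpu_validation_mode` variable come from the review text. The comparison value `mig` is an assumption, since the review only states that the default mode is `pci_passthrough`.

```yaml
- name: Reset extra_specs based on GPU mode
  ansible.builtin.set_fact:
    gpu_validation_extra_specs:
      "resources:VGPU": "1"
  # Assumption: "mig" is the non-default mode value; the actual string may differ.
  when: gpu_validation_mode == "mig"
```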
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around line 23-47: The current playbook skips configuring/regenerating
/etc/cdi/nvidia.yaml when it exists (task "Check if CDI configfile exists" +
conditional on "Configure NVIDIA container runtime"), which can leave stale CDI
mappings; change the logic so that the block that runs "nvidia-ctk runtime
configure --runtime=containerd" and "nvidia-ctk cdi generate
--output=/etc/cdi/nvidia.yaml" always runs (or first remove /etc/cdi/nvidia.yaml
and then regenerate) instead of only running when
nvidia_driver_cdi_config_file.stat.exists is false; keep the directory-ensure
task ("Ensure CDI directory exists") and ensure idempotency by using a remove
step or unconditional regenerate to refresh CDI mappings each run.
---
Outside diff comments:
In `@gpu-validation/tasks/gpus.yaml`:
- Around line 7-15: The playbook only ever sets found_nvidia and found_a30 to
true, which can lead to stale true values across runs; add an initial
ansible.builtin.set_fact task that explicitly sets found_nvidia: false and
found_a30: false before the detection tasks so the subsequent conditional
set_fact tasks (the tasks currently named "Set found_nvidia to true if NVIDIA is
found" and "Set found_A30 to true if A30 is found") will reliably reflect the
current host state.
---
Nitpick comments:
In `@gpu-validation/tasks/vm_image.yaml`:
- Around line 29-33: Rename the Ansible task "Reset extra_specs based GPU mode"
to "Reset extra_specs based on GPU mode" to correct the grammar; update the task
name string where the play contains the task that sets
gpu_validation_extra_specs (with "resources:VGPU": "1") and the when condition
referencing gpu_validation_mode so the logic remains unchanged.
In `@requirements.yaml`:
- Line 8: The ci-framework collection entry
"git+https://github.com/openstack-k8s-operators/ci-framework.git" is unpinned
which risks non-deterministic builds; update the requirements.yaml entry for
that collection (the ci-framework line) to pin to a specific tag, branch, or
commit (e.g., append #<commit-sha-or-tag> to the git URL or replace with a
version spec) so the collection is reproducible and stable across installs.
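A pinned entry in `requirements.yaml` could look like the sketch below. It uses the `type: git` and `version` keys that ansible-galaxy documents for git-sourced collections; the version value is a placeholder, not a real ci-framework tag. For an inline reference in the URL, the ansible-galaxy documentation describes a comma-separated form (`git+https://...git,<ref>`) and uses the `#` fragment for a subdirectory path, so the `#<commit-sha-or-tag>` form is worth verifying before relying on it.

```yaml
collections:
  - name: https://github.com/openstack-k8s-operators/ci-framework.git
    type: git
    # Placeholder: replace with a real tag or full commit SHA.
    version: "<commit-sha-or-tag>"
```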
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: a3638003-edd2-4c8e-8b5a-fd9725d2a1fb
📒 Files selected for processing (11)
gpu-validation/defaults/main.yaml
gpu-validation/tasks/gpus.yaml
gpu-validation/tasks/main.yaml
gpu-validation/tasks/nvidia-cuda-repos.yaml
gpu-validation/tasks/nvidia.yaml
gpu-validation/tasks/nvidia_assertions.yaml
gpu-validation/tasks/setup.yaml
gpu-validation/tasks/vm_image.yaml
gpu-validation/templates/vllm-serve.service.j2
main.yaml
requirements.yaml
- name: Check if CDI configfile exists
  become: true
  ansible.builtin.stat:
    path: /etc/cdi/nvidia.yaml
  register: nvidia_driver_cdi_config_file

- name: Configure NVIDIA container runtime
  when: not nvidia_driver_cdi_config_file.stat.exists
  become: true
  block:
    - name: Ensure CDI directory exists
      ansible.builtin.file:
        path: /etc/cdi
        state: directory
        mode: "0755"
        owner: root

    - name: Configure NVIDIA container runtime
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
      changed_when: true

    - name: Generate NVIDIA CDI configuration
      ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
      changed_when: true
CDI config is not refreshed when it already exists
Current logic skips configuration if /etc/cdi/nvidia.yaml exists, which can leave stale CDI mappings after driver/runtime changes and cause incorrect GPU device exposure. Regenerate (or remove + regenerate) on each run for correctness.
Proposed fix
-- name: Check if CDI configfile exists
-  become: true
-  ansible.builtin.stat:
-    path: /etc/cdi/nvidia.yaml
-  register: nvidia_driver_cdi_config_file
-
 - name: Configure NVIDIA container runtime
-  when: not nvidia_driver_cdi_config_file.stat.exists
   become: true
   block:
     - name: Ensure CDI directory exists
       ansible.builtin.file:
         path: /etc/cdi
         state: directory
         mode: "0755"
         owner: root
+    - name: Remove existing NVIDIA CDI configuration
+      ansible.builtin.file:
+        path: /etc/cdi/nvidia.yaml
+        state: absent
     - name: Configure NVIDIA container runtime
       ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
       changed_when: true
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- name: Configure NVIDIA container runtime
  become: true
  block:
    - name: Ensure CDI directory exists
      ansible.builtin.file:
        path: /etc/cdi
        state: directory
        mode: "0755"
        owner: root
    - name: Remove existing NVIDIA CDI configuration
      ansible.builtin.file:
        path: /etc/cdi/nvidia.yaml
        state: absent
    - name: Configure NVIDIA container runtime
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
      changed_when: true
    - name: Generate NVIDIA CDI configuration
      ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
      changed_when: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@gpu-validation/tasks/nvidia-cuda-repos.yaml` around lines 23 - 47, The
current playbook skips configuring/regenerating /etc/cdi/nvidia.yaml when it
exists (task "Check if CDI configfile exists" + conditional on "Configure NVIDIA
container runtime"), which can leave stale CDI mappings; change the logic so
that the block that runs "nvidia-ctk runtime configure --runtime=containerd" and
"nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml" always runs (or first
remove /etc/cdi/nvidia.yaml and then regenerate) instead of only running when
nvidia_driver_cdi_config_file.stat.exists is false; keep the directory-ensure
task ("Ensure CDI directory exists") and ensure idempotency by using a remove
step or unconditional regenerate to refresh CDI mappings each run.
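The prompt also allows an unconditional regenerate instead of a remove step. A minimal sketch of that variant follows, assuming `nvidia-ctk cdi generate` overwrites an existing output file. The final read-only task and its registered variable name `nvidia_cdi_list` are additions for illustration and are not part of the PR.

```yaml
- name: Configure NVIDIA container runtime and refresh the CDI spec
  become: true
  block:
    - name: Ensure CDI directory exists
      ansible.builtin.file:
        path: /etc/cdi
        state: directory
        mode: "0755"
        owner: root

    - name: Configure NVIDIA container runtime
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
      changed_when: true

    # Runs on every play so the spec always reflects the current driver state.
    - name: Generate NVIDIA CDI configuration
      ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
      changed_when: true

    # Read-only check of the generated spec; the variable name is hypothetical.
    - name: List CDI devices found in the generated spec
      ansible.builtin.command: nvidia-ctk cdi list
      register: nvidia_cdi_list
      changed_when: false
```

Because the generate task reports `changed` on every run, the play is not idempotent in the strict Ansible sense. That is the cost of always refreshing the mappings.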
- name: Set found_A30 to true if A30 is found
  ansible.builtin.set_fact:
    found_a30: true
As I asked previously on another PR, please do not introduce device-specific facts.
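One possible way to meet this request is to drive detection from a variable rather than a hard-coded model name. The sketch below is only an illustration: `gpu_validation_expected_model` and `found_expected_gpu` are hypothetical names that do not appear in the PR, and `lspci_output` is assumed to be the registered `lspci` result used by the existing detection tasks.

```yaml
# Hypothetical default, e.g. in gpu-validation/defaults/main.yaml:
# gpu_validation_expected_model: "A30"

- name: Record whether the expected GPU model is present
  ansible.builtin.set_fact:
    # Plain substring match against the lspci output; no model name in the fact.
    found_expected_gpu: "{{ gpu_validation_expected_model in lspci_output.stdout }}"
```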
Closing - PR #21 should have these changes