Skip to content

Add MIG validations and RHEL setup#20

Closed
rlandy wants to merge 1 commit into
rhos-vaf:mainfrom
rlandy:rebased-mig
Closed

Add MIG validations and RHEL setup#20
rlandy wants to merge 1 commit into
rhos-vaf:mainfrom
rlandy:rebased-mig

Conversation

@rlandy

@rlandy rlandy commented Jun 8, 2026

Copy link
Copy Markdown
Contributor
  • flavor addition for MIG testing
  • Update MIG checks and NVIDIA validations
  • Update lsmod
  • Add RHEL guest repo support
  • Gather facts on the target nodes
  • Add repo set up tasks to generel VM set up
  • Add dkms installation and reboot
  • Disable GPG check for EPEL
  • Reboot guest as root
  • Install NVIDIA repos with custom RPM
  • Make Cuda repo availble for install
  • Install nvidia-ctk
  • Setup CDI and Management Library
  • Configure CDI as root
  • Include all RHEL NVIDIA libs
  • Clean old CDI spec before generating
  • Reboot VM before CDI install
  • Add gpu utilization option in service
  • Move CUDA install tasks to own file
  • Rework task for NVIDIA CUDA

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 186a0b26-1ccf-4b83-ac37-ff1a760477c0

📥 Commits

Reviewing files that changed from the base of the PR and between 10ed203 and 568b2b4.

📒 Files selected for processing (11)
  • gpu-validation/defaults/main.yaml
  • gpu-validation/tasks/gpus.yaml
  • gpu-validation/tasks/main.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
  • gpu-validation/tasks/nvidia.yaml
  • gpu-validation/tasks/nvidia_assertions.yaml
  • gpu-validation/tasks/setup.yaml
  • gpu-validation/tasks/vm_image.yaml
  • gpu-validation/templates/vllm-serve.service.j2
  • main.yaml
  • requirements.yaml
🚧 Files skipped from review as they are similar to previous changes (9)
  • main.yaml
  • gpu-validation/templates/vllm-serve.service.j2
  • requirements.yaml
  • gpu-validation/tasks/nvidia.yaml
  • gpu-validation/tasks/nvidia_assertions.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
  • gpu-validation/tasks/vm_image.yaml
  • gpu-validation/tasks/setup.yaml
  • gpu-validation/tasks/gpus.yaml

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added PCI passthrough and MIG GPU validation modes with A30 detection and mode-specific checks
    • Introduced configurable GPU extra-specs and optional GPU memory-utilization for validation workloads
    • Added configurable NVIDIA management library version and automated NVIDIA/CUDA tooling setup
  • Chores

    • Improved RHEL driver dependency handling and repository/setup tasks
    • Updated collection requirements and playbook fact-gathering behavior

Walkthrough

This PR adds a GPU validation mode switch (default pci_passthrough) and related defaults, detects NVIDIA A30 hardware, installs RHEL CUDA/driver prerequisites for MIG, performs mode-specific GPU checks and assertions, and parameterizes flavor and service templates for GPU validation.

Changes

GPU Validation with PCI Passthrough and MIG Modes

Layer / File(s) Summary
Configuration defaults and modes
gpu-validation/defaults/main.yaml
gpu_validation_mode defaults to pci_passthrough; gpu_validation_extra_specs defines PCI passthrough aliasing and KVM/NUMA settings; gpu_validation_libnvidia_ml_version pins NVIDIA management library package identifier.
GPU hardware detection and fact gathering
gpu-validation/tasks/gpus.yaml, main.yaml
A30 GPU detection via lspci output sets found_a30; main GPU Validation play uses full fact gathering (gather_facts: true) for conditional tasks.
RHEL driver dependencies and CUDA repositories
gpu-validation/tasks/setup.yaml, gpu-validation/tasks/main.yaml, gpu-validation/tasks/nvidia-cuda-repos.yaml
RHEL-only MIG-mode block sets up repos and installs DKMS/kernel headers; CUDA repo/task sequence installs nvidia-container-toolkit, reboots to apply drivers, configures/generates CDI if absent, installs NVIDIA management library pinned by gpu_validation_libnvidia_ml_version, and refreshes RPM facts.
GPU statistics collection and mode validation
gpu-validation/tasks/nvidia.yaml, gpu-validation/tasks/nvidia_assertions.yaml
GPU listing and stats tasks are gated on found_nvidia and not found_a30; MIG-mode tasks collect MIG status with nvidia-smi (non-A30) or lsmod (A30); assertions validate passthrough GPU presence, MIG Enabled state, or A30 module loading per mode.
OpenStack flavor extra_specs configuration
gpu-validation/tasks/vm_image.yaml
Flavor creation uses gpu_validation_extra_specs variable; MIG mode overrides it to resources:VGPU: 1.
vLLM service memory utilization configuration
gpu-validation/templates/vllm-serve.service.j2
Conditionally appends --gpu-memory-utilization to the podman run ExecStart when gpu_validation_mem_utilization is set.
Ansible collection dependencies
requirements.yaml
Adds ci-framework collection source from openstack-k8s-operators Git repository.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main changes: adding MIG validations and RHEL setup support for GPU validation.
Description check ✅ Passed The description is comprehensively related to the changeset, providing a detailed bulleted list of all major modifications across multiple configuration and task files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gpu-validation/tasks/gpus.yaml (1)

7-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize GPU detection facts before conditional set_fact

found_nvidia/found_a30 are only ever set to true. If these facts were previously set in the same host context, later runs can take the wrong execution path (false positives). Initialize both to false before the detection checks.

Proposed fix
+- name: Initialize GPU detection facts
+  ansible.builtin.set_fact:
+    found_nvidia: false
+    found_a30: false
+
 - name: Set found_nvidia to true if NVIDIA is found
   ansible.builtin.set_fact:
     found_nvidia: true
   when: lspci_output.stdout is search("NVIDIA")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/gpus.yaml` around lines 7 - 15, The playbook only ever
sets found_nvidia and found_a30 to true, which can lead to stale true values
across runs; add an initial ansible.builtin.set_fact task that explicitly sets
found_nvidia: false and found_a30: false before the detection tasks so the
subsequent conditional set_fact tasks (the tasks currently named "Set
found_nvidia to true if NVIDIA is found" and "Set found_A30 to true if A30 is
found") will reliably reflect the current host state.
🧹 Nitpick comments (2)
requirements.yaml (1)

8-8: ⚖️ Poor tradeoff

Consider pinning the ci-framework collection to a specific version.

The git-sourced collection is added without a version, ref, or commit constraint, which can lead to non-deterministic builds if the upstream repository introduces breaking changes. While this is consistent with the existing edpm-ansible collection pattern (line 7), pinning to a specific commit, tag, or branch would improve build reproducibility and stability.

Example with version pinning
-  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git
+  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git
+    version: main  # or specify a commit SHA/tag

Alternatively, specify a ref:

  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git#<commit-sha-or-tag>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.yaml` at line 8, The ci-framework collection entry
"git+https://github.com/openstack-k8s-operators/ci-framework.git" is unpinned
which risks non-deterministic builds; update the requirements.yaml entry for
that collection (the ci-framework line) to pin to a specific tag, branch, or
commit (e.g., append #<commit-sha-or-tag> to the git URL or replace with a
version spec) so the collection is reproducible and stable across installs.
gpu-validation/tasks/vm_image.yaml (1)

29-33: 💤 Low value

Minor: Fix grammar in task name.

The task name "Reset extra_specs based GPU mode" should be "Reset extra_specs based on GPU mode" for better readability.

✏️ Suggested fix
-- name: Reset extra_specs based GPU mode
+- name: Reset extra_specs based on GPU mode
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/vm_image.yaml` around lines 29 - 33, Rename the Ansible
task "Reset extra_specs based GPU mode" to "Reset extra_specs based on GPU mode"
to correct the grammar; update the task name string where the play contains the
task that sets gpu_validation_extra_specs (with "resources:VGPU": "1") and the
when condition referencing gpu_validation_mode so the logic remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around line 23-47: The current playbook skips configuring/regenerating
/etc/cdi/nvidia.yaml when it exists (task "Check if CDI configfile exists" +
conditional on "Configure NVIDIA container runtime"), which can leave stale CDI
mappings; change the logic so that the block that runs "nvidia-ctk runtime
configure --runtime=containerd" and "nvidia-ctk cdi generate
--output=/etc/cdi/nvidia.yaml" always runs (or first remove /etc/cdi/nvidia.yaml
and then regenerate) instead of only running when
nvidia_driver_cdi_config_file.stat.exists is false; keep the directory-ensure
task ("Ensure CDI directory exists") and ensure idempotency by using a remove
step or unconditional regenerate to refresh CDI mappings each run.

---

Outside diff comments:
In `@gpu-validation/tasks/gpus.yaml`:
- Around line 7-15: The playbook only ever sets found_nvidia and found_a30 to
true, which can lead to stale true values across runs; add an initial
ansible.builtin.set_fact task that explicitly sets found_nvidia: false and
found_a30: false before the detection tasks so the subsequent conditional
set_fact tasks (the tasks currently named "Set found_nvidia to true if NVIDIA is
found" and "Set found_A30 to true if A30 is found") will reliably reflect the
current host state.

---

Nitpick comments:
In `@gpu-validation/tasks/vm_image.yaml`:
- Around line 29-33: Rename the Ansible task "Reset extra_specs based GPU mode"
to "Reset extra_specs based on GPU mode" to correct the grammar; update the task
name string where the play contains the task that sets
gpu_validation_extra_specs (with "resources:VGPU": "1") and the when condition
referencing gpu_validation_mode so the logic remains unchanged.

In `@requirements.yaml`:
- Line 8: The ci-framework collection entry
"git+https://github.com/openstack-k8s-operators/ci-framework.git" is unpinned
which risks non-deterministic builds; update the requirements.yaml entry for
that collection (the ci-framework line) to pin to a specific tag, branch, or
commit (e.g., append #<commit-sha-or-tag> to the git URL or replace with a
version spec) so the collection is reproducible and stable across installs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a3638003-edd2-4c8e-8b5a-fd9725d2a1fb

📥 Commits

Reviewing files that changed from the base of the PR and between 6f0c09f and a2446e2.

📒 Files selected for processing (11)
  • gpu-validation/defaults/main.yaml
  • gpu-validation/tasks/gpus.yaml
  • gpu-validation/tasks/main.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
  • gpu-validation/tasks/nvidia.yaml
  • gpu-validation/tasks/nvidia_assertions.yaml
  • gpu-validation/tasks/setup.yaml
  • gpu-validation/tasks/vm_image.yaml
  • gpu-validation/templates/vllm-serve.service.j2
  • main.yaml
  • requirements.yaml

Comment on lines +23 to +47
- name: Check if CDI configfile exists
become: true
ansible.builtin.stat:
path: /etc/cdi/nvidia.yaml
register: nvidia_driver_cdi_config_file

- name: Configure NVIDIA container runtime
when: not nvidia_driver_cdi_config_file.stat.exists
become: true
block:
- name: Ensure CDI directory exists
ansible.builtin.file:
path: /etc/cdi
state: directory
mode: "0755"
owner: root

- name: Configure NVIDIA container runtime
ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
changed_when: true

- name: Generate NVIDIA CDI configuration
ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
changed_when: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CDI config is not refreshed when it already exists

Current logic skips configuration if /etc/cdi/nvidia.yaml exists, which can leave stale CDI mappings after driver/runtime changes and cause incorrect GPU device exposure. Regenerate (or remove + regenerate) on each run for correctness.

Proposed fix
-- name: Check if CDI configfile exists
-  become: true
-  ansible.builtin.stat:
-    path: /etc/cdi/nvidia.yaml
-  register: nvidia_driver_cdi_config_file
-
 - name: Configure NVIDIA container runtime
-  when: not nvidia_driver_cdi_config_file.stat.exists
   become: true
   block:
     - name: Ensure CDI directory exists
       ansible.builtin.file:
         path: /etc/cdi
         state: directory
         mode: "0755"
         owner: root
+    - name: Remove existing NVIDIA CDI configuration
+      ansible.builtin.file:
+        path: /etc/cdi/nvidia.yaml
+        state: absent
 
     - name: Configure NVIDIA container runtime
       ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
       changed_when: true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Check if CDI configfile exists
become: true
ansible.builtin.stat:
path: /etc/cdi/nvidia.yaml
register: nvidia_driver_cdi_config_file
- name: Configure NVIDIA container runtime
when: not nvidia_driver_cdi_config_file.stat.exists
become: true
block:
- name: Ensure CDI directory exists
ansible.builtin.file:
path: /etc/cdi
state: directory
mode: "0755"
owner: root
- name: Configure NVIDIA container runtime
ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
changed_when: true
- name: Generate NVIDIA CDI configuration
ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
changed_when: true
- name: Configure NVIDIA container runtime
become: true
block:
- name: Ensure CDI directory exists
ansible.builtin.file:
path: /etc/cdi
state: directory
mode: "0755"
owner: root
- name: Remove existing NVIDIA CDI configuration
ansible.builtin.file:
path: /etc/cdi/nvidia.yaml
state: absent
- name: Configure NVIDIA container runtime
ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
changed_when: true
- name: Generate NVIDIA CDI configuration
ansible.builtin.command: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
changed_when: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/nvidia-cuda-repos.yaml` around lines 23 - 47, The
current playbook skips configuring/regenerating /etc/cdi/nvidia.yaml when it
exists (task "Check if CDI configfile exists" + conditional on "Configure NVIDIA
container runtime"), which can leave stale CDI mappings; change the logic so
that the block that runs "nvidia-ctk runtime configure --runtime=containerd" and
"nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml" always runs (or first
remove /etc/cdi/nvidia.yaml and then regenerate) instead of only running when
nvidia_driver_cdi_config_file.stat.exists is false; keep the directory-ensure
task ("Ensure CDI directory exists") and ensure idempotency by using a remove
step or unconditional regenerate to refresh CDI mappings each run.

 - flavor addition for MIG testing
 - Update MIG checks and NVIDIA validations
 - Update lsmod
 - Add RHEL guest repo support
 - Gather facts on the target nodes
 - Add repo set up tasks to generel VM set up
 - Add dkms installation and reboot
 - Disable GPG check for EPEL
 - Reboot guest as root
 - Install NVIDIA repos with custom RPM
 - Make Cuda repo availble for install
 - Install nvidia-ctk
 - Setup CDI and Management Library
 - Configure CDI as root
 - Include all RHEL NVIDIA libs
 - Clean old  CDI spec before generating
 - Reboot VM before CDI install
 - Add gpu utilization option in service
 - Move CUDA install tasks to own file
 - Rework task for NVIDIA CUDA

- name: Set found_A30 to true if A30 is found
ansible.builtin.set_fact:
found_a30: true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I asked previously on another PR, please do not introduce a device specific facts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in PR #21

@bogdando bogdando left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not introduce devices specific logic, like a30 facts vs other cases

@rlandy rlandy mentioned this pull request Jun 9, 2026

@rlandy rlandy left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See PR #21 for updated code

@rlandy

rlandy commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Closing - PR #21 should have these changes

@rlandy rlandy closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants