Add reboot detection after GPU driver installation #18

Merged
MiguelCarpio merged 1 commit into rhos-vaf:main from MiguelCarpio:reboot
Jun 7, 2026

MiguelCarpio commented May 27, 2026

What does this PR do?
Uses needs-restarting to detect when kernel or driver updates require a reboot. Automatically reboots the test VM to load new kernel modules after driver installation.

Why do we need this PR?
After the edpm_accel_drivers role installs the driver, the next task, "Run nvidia-smi to list NVIDIA GPUs and count them", calls nvidia-smi, which fails because the NVIDIA driver is not loaded yet.
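
For orientation, a minimal sketch of what such a task file could look like. It is reconstructed only from the task names, the registered variable name, and the module arguments visible in the logs in this conversation, and is not the merged code. The `become` settings and the `reboot_timeout` value are assumptions, as is the exact `failed_when` expression (the logs only show that one is set).

```yaml
---
# Hypothetical reconstruction of gpu-validation/tasks/reboot_if_needed.yaml.
# Task names and the registered variable come from the logs; the rest is assumed.

- name: Make sure yum-utils is installed
  ansible.builtin.dnf:
    name: yum-utils        # provides the needs-restarting command on RHEL
    state: present
  become: true             # assumption: package installation needs root

- name: Check if reboot is required with needs-restarting
  ansible.builtin.command: needs-restarting -r
  register: _gpu_validation_needs_restarting
  changed_when: false      # read-only check, never reports a change
  # needs-restarting -r exits 0 when no reboot is needed and 1 when one is,
  # so only other return codes are treated as failures.
  failed_when: _gpu_validation_needs_restarting.rc not in [0, 1]

- name: Print return information from needs-restarting
  ansible.builtin.debug:
    var: _gpu_validation_needs_restarting.stdout

- name: Reboot to apply system updates
  ansible.builtin.reboot:
    reboot_timeout: 600    # assumption: the value is illustrative
  become: true             # assumption
  when: _gpu_validation_needs_restarting.rc == 1
```

Gating the reboot on the return code rather than on the text of stdout keeps the condition independent of the message wording.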

@MiguelCarpio

Error message

TASK [gpu-validation : Run nvidia-smi to list NVIDIA GPUs and count them] ******
task path: /var/lib/ansible/ansible/gpu-validation/tasks/nvidia.yaml:2
...
fatal: [gpu-validation-0]: FAILED! => {
    "changed": false,
    "cmd": "set -o pipefail && nvidia-smi --list-gpus | wc -l",
    "delta": "0:00:00.072636",
    "end": "2026-05-27 12:21:47.170860",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail && nvidia-smi --list-gpus | wc -l",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },
    "msg": "non-zero return code",
    "rc": 9,
    "start": "2026-05-27 12:21:47.098224",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "2",
    "stdout_lines": [
        "2"
    ]
}

PLAY RECAP *********************************************************************
gpu-validation-0           : ok=30   changed=11   unreachable=0    failed=1    skipped=23   rescued=0    ignored=0
localhost                  : ok=20   changed=3    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
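
A note on reading this failure, with a sketch of the failing task as it can be inferred from the `cmd` and `_uses_shell` fields above. The registered variable name `_nvidia_gpu_count` is hypothetical, and the real task may carry additional options. Return code 9 is the value nvidia-smi documents for "NVIDIA driver is not loaded"; because of `set -o pipefail` it becomes the exit status of the whole pipeline even though `wc -l` succeeds. The stdout of 2 is then presumably the line count of nvidia-smi's error text rather than a GPU count (an inference, since the log does not show that text).

```yaml
# Hypothetical sketch of the task in gpu-validation/tasks/nvidia.yaml.
- name: Run nvidia-smi to list NVIDIA GPUs and count them
  # pipefail makes the pipeline fail with nvidia-smi's exit status;
  # without it, only the status of wc -l (normally 0) would be reported.
  ansible.builtin.shell: set -o pipefail && nvidia-smi --list-gpus | wc -l
  register: _nvidia_gpu_count   # hypothetical variable name
  changed_when: false           # matches "changed": false in the log
```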

Reboot justification

ok: [gpu-validation-0] => {
    "changed": false,
    "cmd": [
        "needs-restarting",
        "-r"
    ],
    "stdout_lines": [
        "Core libraries or services have been updated since boot-up:",
        "  * nvidia-driver",
        "  * nvidia-driver-cuda",
        "",
        "Reboot is required to fully utilize these updates.",
        "More information: https://access.redhat.com/solutions/27943"
    ]
}

Fix with this PR

TASK [gpu-validation : Make sure yum-utils is installed] ***********************
task path: /var/lib/ansible/ansible/gpu-validation/tasks/reboot_if_needed.yaml:2
..
ok: [gpu-validation-0] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "allow_downgrade": false,
            "allowerasing": false,
            "autoremove": false,
            "bugfix": false,
            "cacheonly": false,
            "conf_file": null,
            "disable_excludes": null,
            "disable_gpg_check": false,
            "disable_plugin": [],
            "disablerepo": [],
            "download_dir": null,
            "download_only": false,
            "enable_plugin": [],
            "enablerepo": [],
            "exclude": [],
            "install_repoquery": true,
            "install_weak_deps": true,
            "installroot": "/",
            "list": null,
            "lock_timeout": 30,
            "name": [
                "yum-utils"
            ],
            "nobest": false,
            "releasever": null,
            "security": false,
            "skip_broken": false,
            "sslverify": true,
            "state": null,
            "update_cache": false,
            "update_only": false,
            "use_backend": "auto",
            "validate_certs": true
        }
    },
    "msg": "Nothing to do",
    "rc": 0,
    "results": []
}
TASK [gpu-validation : Check if reboot is required with needs-restarting] ******
task path: /var/lib/ansible/ansible/gpu-validation/tasks/reboot_if_needed.yaml:7
..
ok: [gpu-validation-0] => {
    "changed": false,
    "cmd": [
        "needs-restarting",
        "-r"
    ],
    "delta": "0:00:00.398268",
    "end": "2026-05-27 12:57:48.239588",
    "failed_when_result": false,
    "invocation": {
        "module_args": {
            "_raw_params": "needs-restarting -r",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2026-05-27 12:57:47.841320",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "Core libraries or services have been updated since boot-up:\n  * nvidia-driver\n  * nvidia-driver-cuda\n\nReboot is required to fully utilize these updates.\nMore information: https://access.redhat.com/solutions/27943",
    "stdout_lines": [
        "Core libraries or services have been updated since boot-up:",
        "  * nvidia-driver",
        "  * nvidia-driver-cuda",
        "",
        "Reboot is required to fully utilize these updates.",
        "More information: https://access.redhat.com/solutions/27943"
    ]
}
TASK [gpu-validation : Print return information from needs-restarting] *********
task path: /var/lib/ansible/ansible/gpu-validation/tasks/reboot_if_needed.yaml:14
ok: [gpu-validation-0] => {
    "_gpu_validation_needs_restarting.stdout": "Core libraries or services have been updated since boot-up:\n  * nvidia-driver\n  * nvidia-driver-cuda\n\nReboot is required to fully utilize these updates.\nMore information: https://access.redhat.com/solutions/27943"
}
TASK [gpu-validation : Reboot to apply system updates] *************************
task path: /var/lib/ansible/ansible/gpu-validation/tasks/reboot_if_needed.yaml:18
..
changed: [gpu-validation-0] => {
    "changed": true,
    "elapsed": 32,
    "rebooted": true
}
TASK [gpu-validation : Run lspci command and save output] **********************
task path: /var/lib/ansible/ansible/gpu-validation/tasks/gpus.yaml:2
..
ok: [gpu-validation-0] => {
    "changed": false,
    "cmd": [
        "lspci",
        "-nn"
    ],
    "delta": "0:00:00.030867",
    "end": "2026-05-27 12:58:23.141233",
    "invocation": {
        "module_args": {
            "_raw_params": "lspci -nn",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },
    "msg": "",
    "rc": 0,
    "start": "2026-05-27 12:58:23.110366",
    "stderr": "",
    "stderr_lines": [],
...
    "stdout_lines": [
        "00:00.0 Host bridge [0600]: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller [8086:29c0]",
        "00:01.0 VGA compatible controller [0300]: Red Hat, Inc. Virtio 1.0 GPU [1af4:1050] (rev 01)",
        "00:02.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.1 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.2 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.3 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.4 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.5 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.6 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:02.7 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.1 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.2 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.3 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.4 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.5 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.6 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:03.7 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.1 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.2 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.3 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.4 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.5 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.6 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:04.7 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:05.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]",
        "00:1f.0 ISA bridge [0601]: Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918] (rev 02)",
        "00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)",
        "00:1f.3 SMBus [0c05]: Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930] (rev 02)",
        "01:00.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe-to-PCI bridge [1b36:000e]",
        "02:01.0 USB controller [0c03]: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] [8086:7020] (rev 01)",
        "03:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)",
        "04:00.0 SCSI storage controller [0100]: Red Hat, Inc. Virtio 1.0 block device [1af4:1042] (rev 01)",
        "05:00.0 3D controller [0302]: NVIDIA Corporation AD104GL [L4] [10de:27b8] (rev a1)",
        "06:00.0 Unclassified device [00ff]: Red Hat, Inc. Virtio 1.0 balloon [1af4:1045] (rev 01)",
        "07:00.0 Unclassified device [00ff]: Red Hat, Inc. Virtio 1.0 RNG [1af4:1044] (rev 01)"
    ]
}
TASK [gpu-validation : Set found_nvidia to true if NVIDIA is found] ************
task path: /var/lib/ansible/ansible/gpu-validation/tasks/gpus.yaml:7
ok: [gpu-validation-0] => {
    "ansible_facts": {
        "found_nvidia": true
    },
    "changed": false
}
TASK [gpu-validation : TEST[gpus] Check if GPUs in Passthrough mode are present in RHEL AI VM (lspci)] ***
task path: /var/lib/ansible/ansible/gpu-validation/tasks/gpus_assertions.yaml:2
..
ok: [gpu-validation-0] => (item={'key': '10de:27b8', 'value': 1}) => {
    "ansible_loop_var": "item",
    "changed": false,
...
    "delta": "0:00:00.012665",
    "end": "2026-05-27 12:58:23.950263",
    "failed_when_result": false,
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail\necho \"00:00.0 Host bridge [0600]: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller [8086:29c0]\n00:01.0 VGA compatible controller [0300]: Red Hat, Inc. Virtio 1.0 GPU [1af4:1050] (rev 01)\n00:02.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.1 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.2 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.3 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.4 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.5 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.6 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:02.7 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.1 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.2 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.3 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.4 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.5 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.6 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:03.7 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.1 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.2 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.3 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.4 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.5 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.6 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:04.7 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:05.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe Root port [1b36:000c]\n00:1f.0 ISA bridge [0601]: Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918] (rev 02)\n00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)\n00:1f.3 SMBus [0c05]: Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930] (rev 02)\n01:00.0 PCI bridge [0604]: Red Hat, Inc. QEMU PCIe-to-PCI bridge [1b36:000e]\n02:01.0 USB controller [0c03]: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] [8086:7020] (rev 01)\n03:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)\n04:00.0 SCSI storage controller [0100]: Red Hat, Inc. Virtio 1.0 block device [1af4:1042] (rev 01)\n05:00.0 3D controller [0302]: NVIDIA Corporation AD104GL [L4] [10de:27b8] (rev a1)\n06:00.0 Unclassified device [00ff]: Red Hat, Inc. Virtio 1.0 balloon [1af4:1045] (rev 01)\n07:00.0 Unclassified device [00ff]: Red Hat, Inc. Virtio 1.0 RNG [1af4:1044] (rev 01)\" | grep '10de:27b8' | wc -l\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },
    "item": {
        "key": "10de:27b8",
        "value": 1
    },
    "msg": "",
    "rc": 0,
    "start": "2026-05-27 12:58:23.937598",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "1",
    "stdout_lines": [
        "1"
    ]
}
TASK [gpu-validation : Run nvidia-smi to list NVIDIA GPUs and count them] ******
task path: /var/lib/ansible/ansible/gpu-validation/tasks/nvidia.yaml:2
..
ok: [gpu-validation-0] => {
    "changed": false,
    "cmd": "set -o pipefail && nvidia-smi --list-gpus | wc -l",
    "delta": "0:00:00.063196",
    "end": "2026-05-27 12:58:24.818226",
    "invocation": {
        "module_args": {
            "_raw_params": "set -o pipefail && nvidia-smi --list-gpus | wc -l",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },
    "msg": "",
    "rc": 0,
    "start": "2026-05-27 12:58:24.755030",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "1",
    "stdout_lines": [
        "1"
    ]
}
TASK [gpu-validation : Get some statistics about the NVIDIA GPU] ***************
task path: /var/lib/ansible/ansible/gpu-validation/tasks/nvidia.yaml:8
...
ok: [gpu-validation-0] => {
    "changed": false,
    "cmd": [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.free,temperature.gpu",
        "--format=csv,nounits,noheader"
    ],
    "delta": "0:00:00.045165",
    "end": "2026-05-27 12:58:25.670481",
    "invocation": {
        "module_args": {
            "_raw_params": "nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.free,temperature.gpu --format=csv,nounits,noheader",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },
    "msg": "",
    "rc": 0,
    "start": "2026-05-27 12:58:25.625316",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "0, 0, 22592, 44",
    "stdout_lines": [
        "0, 0, 22592, 44"
    ]
}
TASK [gpu-validation : Display NVIDIA GPU statistics] **************************
task path: /var/lib/ansible/ansible/gpu-validation/tasks/nvidia.yaml:14
ok: [gpu-validation-0] => {
    "msg": "GPU Utilization: 0%\nMemory Used: 0 MiB\nMemory Free: 22592 MiB\nGPU Temperature: 44 C\n"
}
TASK [gpu-validation : TEST[nvidia] Check if GPUs in Passthrough mode are present in RHEL AI VM (nvidia-smi)] ***
task path: /var/lib/ansible/ansible/gpu-validation/tasks/nvidia_assertions.yaml:2
ok: [gpu-validation-0] => {
    "changed": false,
    "msg": "All assertions passed"
}

MiguelCarpio marked this pull request as ready for review May 27, 2026 17:15
Comment thread gpu-validation/tasks/reboot_if_needed.yaml
Comment thread gpu-validation/tasks/reboot_if_needed.yaml Outdated

skovili left a comment

LGTM, just one minor nit.

Comment thread gpu-validation/tasks/reboot_if_needed.yaml
Uses needs-restarting to detect when kernel or driver updates
require a reboot. Automatically reboots the test VM to load new
kernel modules after driver installation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

bogdando left a comment

LGTM

MiguelCarpio merged commit 6f0c09f into rhos-vaf:main Jun 7, 2026
1 check passed