From d4ab140cf69947c4b7a7cfadf9c3ff854916610d Mon Sep 17 00:00:00 2001
From: Aleksei Sviridkin
Date: Thu, 28 May 2026 21:31:33 +0300
Subject: [PATCH 1/2] docs(gpu): drop manual KubeVirt patch step now that the platform auto-wires permittedHostDevices
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Step 2 of the GPU Passthrough guide instructed operators to `kubectl edit kubevirt -n cozy-kubevirt` and hand-paste a permittedHostDevices.pciHostDevices block. cozystack/cozystack#2768 removes the need for that step: when cozystack.gpu-operator is in bundles.enabledPackages, the platform now mirrors the chosen GPU variant into the KubeVirt CR automatically — appending HostDevices to the feature-gate list and rendering a starter NVIDIA pciHostDevices table covering Hopper, Ada Lovelace, Ampere, Turing and Volta.

The new step 2 documents the contract (what the platform auto-injects and why), the verification recipe, the escape hatch via .gpu.permittedHostDevices / .gpu.replaceDefaults, and the manual Package-CR override path used by operators who need overrides the bundle does not expose (driver settings, custom node selectors, validator / dcgmExporter tweaks) — in that flow they also hand-craft the matching cozystack.kubevirt Package CR.

Only next/virtualization/gpu.md is updated; v1.4 and earlier describe releases that still require the manual patch and stay as-is.

Assisted-By: Claude
Signed-off-by: Aleksei Sviridkin
---
 content/en/docs/next/virtualization/gpu.md | 50 ++++++++++++++--------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md
index bc71d894..745de72d 100644
--- a/content/en/docs/next/virtualization/gpu.md
+++ b/content/en/docs/next/virtualization/gpu.md
@@ -100,32 +100,46 @@ Allocatable:
 For example, the database entry for A10 reads `2236 GA102GL [A10]`, which results in a resource name `nvidia.com/GA102GL_A10`.
 {{% /alert %}}
 
-## 2. Update the KubeVirt Custom Resource
+## 2. KubeVirt is wired automatically
 
-Next, we will update the KubeVirt Custom Resource, as documented in the
-[KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices),
-so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM.
+When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors the chosen GPU variant into the `KubeVirt` Custom Resource for you. There is no `kubectl edit kubevirt` step.
 
-Adjust the `pciVendorSelector` and `resourceName` values to match your specific GPU model.
-Setting `externalResourceProvider=true` indicates that this resource is provided by an external device plugin,
-in this case the `sandbox-device-plugin` which is deployed by the Operator.
+Specifically, the platform injects:
+
+- `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it).
+- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `__ _` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin.
+
+Verify the resulting CR:
 
 ```bash
-kubectl edit kubevirt -n cozy-kubevirt
+kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
+  | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
 ```
-example config:
+
+### Extending or replacing the NVIDIA defaults
+
+If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node | grep nvidia.com/`), extend the defaults via platform values:
+
 ```yaml
-  ...
-  spec:
-    configuration:
-      permittedHostDevices:
-        pciHostDevices:
-        - externalResourceProvider: true
-          pciVendorSelector: 10DE:2236
-          resourceName: nvidia.com/GA102GL_A10
-  ...
+# Platform Package values
+gpu:
+  # Append (default) — your entries land alongside the NVIDIA table.
+  # Set to true to drop the NVIDIA table entirely (useful for non-NVIDIA-only
+  # clusters or strict allowlists). With replaceDefaults: true and an empty
+  # list below, the rendered CR carries no permittedHostDevices block at all
+  # and the admission webhook rejects every GPU VM — supply your own list.
+  replaceDefaults: false
+  permittedHostDevices:
+    pciHostDevices:
+      - pciVendorSelector: "10DE:2236"
+        resourceName: nvidia.com/GA102GL_A10
+        externalResourceProvider: true
 ```
 
+### Manual Package-CR override path
+
+If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.
+
 ## 3. Create a Virtual Machine
 
 We are now ready to create a VM.

From 5ba523c6cde01d02fee060511e52798021e72a78 Mon Sep 17 00:00:00 2001
From: Aleksei Sviridkin
Date: Wed, 3 Jun 2026 02:20:07 +0300
Subject: [PATCH 2/2] docs(gpu): add pre-upgrade migration steps for hand-edited permittedHostDevices

The bundle now owns spec.configuration.permittedHostDevices, so the first reconcile after upgrade overwrites manual kubectl-edit entries with the NVIDIA default table. Tell operators to move custom entries into .gpu.permittedHostDevices and verify each resourceName against node-advertised names before upgrading, since the default slugs (e.g. TU104GL_T4) differ from legacy names (e.g. TU104GL_TESLA_T4) and a mismatch silently rejects GPU VMs.

Assisted-By: Claude
Signed-off-by: Aleksei Sviridkin
---
 content/en/docs/next/virtualization/gpu.md | 23 ++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md
index 745de72d..3e25b017 100644
--- a/content/en/docs/next/virtualization/gpu.md
+++ b/content/en/docs/next/virtualization/gpu.md
@@ -136,6 +136,29 @@ gpu:
         externalResourceProvider: true
 ```
 
+### Upgrading from a hand-edited KubeVirt CR
+
+Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table.
+
+Before upgrading:
+
+1. Dump your current entries:
+
+   ```bash
+   kubectl get kubevirt -n cozy-kubevirt -o yaml \
+     | yq '.items[0].spec.configuration.permittedHostDevices'
+   ```
+
+2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults).
+
+3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`):
+
+   ```bash
+   kubectl describe node | grep nvidia.com/
+   ```
+
+A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at which point the admission webhook rejects it.
+
 ### Manual Package-CR override path
 
 If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.
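
Note on the pre-upgrade `resourceName` check in PATCH 2/2 (not part of either patch): the manual comparison of permitted entries against node-advertised names could be scripted. The following is only a minimal sketch under stated assumptions. The function names are hypothetical, and the inputs are assumed to be already-parsed JSON objects: one KubeVirt object, plus the `.items` list from `kubectl get nodes -o json`. It relies only on the `spec.configuration.permittedHostDevices.pciHostDevices` path shown in the patch and on the standard Node `status.allocatable` map.

```python
from typing import Any


def permitted_resource_names(kubevirt: dict[str, Any]) -> set[str]:
    """Return the resourceName values listed under permittedHostDevices.pciHostDevices."""
    config = kubevirt.get("spec", {}).get("configuration", {})
    # The block can be absent entirely, e.g. replaceDefaults: true with an empty list.
    permitted = config.get("permittedHostDevices") or {}
    return {
        entry["resourceName"]
        for entry in permitted.get("pciHostDevices") or []
        if "resourceName" in entry
    }


def advertised_resource_names(
    nodes: list[dict[str, Any]], prefix: str = "nvidia.com/"
) -> set[str]:
    """Return allocatable resource names with the given prefix across all nodes."""
    names: set[str] = set()
    for node in nodes:
        allocatable = node.get("status", {}).get("allocatable") or {}
        names.update(name for name in allocatable if name.startswith(prefix))
    return names


def unpermitted_resources(
    kubevirt: dict[str, Any], nodes: list[dict[str, Any]]
) -> set[str]:
    """Return resources that nodes advertise but the KubeVirt CR does not permit."""
    return advertised_resource_names(nodes) - permitted_resource_names(kubevirt)
```

A non-empty result names resources that a VM could request but that have no matching `permittedHostDevices` entry, which is the mismatch the upgrade section warns about. The reverse difference is expected to be non-empty with the default table, since it lists GPU models most clusters do not have.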