73 changes: 55 additions & 18 deletions content/en/docs/next/virtualization/gpu.md
@@ -100,32 +100,69 @@ Allocatable:
For example, the database entry for A10 reads `2236 GA102GL [A10]`, which results in a resource name `nvidia.com/GA102GL_A10`.
{{% /alert %}}

## 2. Update the KubeVirt Custom Resource
## 2. KubeVirt is wired automatically

Next, we will update the KubeVirt Custom Resource, as documented in the
[KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices),
so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM.
When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors the chosen GPU variant into the `KubeVirt` Custom Resource for you. There is no `kubectl edit kubevirt` step.

Adjust the `pciVendorSelector` and `resourceName` values to match your specific GPU model.
Setting `externalResourceProvider=true` indicates that this resource is provided by an external device plugin,
in this case the `sandbox-device-plugin` which is deployed by the Operator.
Specifically, the platform injects:

- `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it).
- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `<arch>_<model>_<form>_<mem>` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check the actual resource naming pattern in nvidia-sandbox-device-plugin documentation

# Search for resourceName examples and naming documentation
rg -C5 'resourceName.*nvidia\.com' --type=md

# Also check for any naming convention documentation
rg -C3 'arch.*model.*form.*mem|naming.*convention' --type=md

Repository: cozystack/website

Length of output: 42109


Fix resourceName naming-convention wording (line 110)

  • Line 110 states resourceName slugs follow <arch>_<model>_<form>_<mem>, but the document’s examples only use two components (e.g., nvidia.com/GA102GL_A10, nvidia.com/TU104GL_T4). Update the convention text to match the actual slug format (and clarify when/if <form> and <mem> are present) or adjust the examples accordingly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@content/en/docs/next/virtualization/gpu.md` at line 110, Update the wording
around resourceName in the
spec.configuration.permittedHostDevices.pciHostDevices paragraph to reflect the
actual slug format used by nvidia-sandbox-device-plugin (v25.x): state that
resourceName slugs are typically two-component identifiers like
`nvidia.com/GA102GL_A10` or `nvidia.com/TU104GL_T4` and clarify that optional
`<form>` and `<mem>` components may be appended for more specific devices (i.e.,
`<arch>_<model>` is the common case, with optional `_<form>_<mem>` when
present); keep the note about externalResourceProvider: true and mention the
plugin as the source of these resource names.


Verify the resulting CR:

```bash
kubectl edit kubevirt -n cozy-kubevirt
kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
| yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
```

Comment on lines +115 to +116

medium

Using yq can lead to compatibility issues, depending on whether the user has the Python-based yq (which supports full jq syntax) or the Go-based yq (which has a different expression syntax) installed.

Using kubectl ... -o json | jq ... is more portable, because jq's filter syntax does not vary between installations in this way.

Suggested change
kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
| yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \
| jq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
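
The verification command prints the relevant fields for a human to read. For a scripted pass/fail check, something like the following could work. It is a minimal sketch and not part of Cozystack: `has_feature_gate` is a hypothetical helper name, and it assumes the gate names reach it one per line on stdin. The commented wiring assumes the same CR name and namespace as the command above and uses `kubectl`'s `-o jsonpath` output.

```bash
# has_feature_gate NAME
# Reads feature-gate names from stdin, one per line, and exits 0 only if
# NAME appears as an exact, whole-line match (-F fixed string, -x whole line,
# -q quiet).
has_feature_gate() {
  grep -Fxq -- "$1"
}

# Example wiring (assumption: CR "kubevirt" in namespace "cozy-kubevirt"):
# kubectl -n cozy-kubevirt get kubevirt kubevirt \
#   -o jsonpath='{range .spec.configuration.developerConfiguration.featureGates[*]}{@}{"\n"}{end}' \
#   | has_feature_gate HostDevices && echo "HostDevices gate present"
```

Matching whole lines rather than substrings avoids a false positive if another gate name happens to contain `HostDevices`.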
example config:

### Extending or replacing the NVIDIA defaults

If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node <node> | grep nvidia.com/`), extend the defaults via platform values:

```yaml
...
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - externalResourceProvider: true
        pciVendorSelector: 10DE:2236
        resourceName: nvidia.com/GA102GL_A10
...
# Platform Package values
gpu:
  # Append (default): your entries land alongside the NVIDIA table.
  # Set to true to drop the NVIDIA table entirely (useful for non-NVIDIA-only
  # clusters or strict allowlists). With replaceDefaults: true and an empty
  # list below, the rendered CR carries no permittedHostDevices block at all
  # and the admission webhook rejects every GPU VM, so supply your own list.
  replaceDefaults: false
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "10DE:2236"
        resourceName: nvidia.com/GA102GL_A10
        externalResourceProvider: true
```

### Upgrading from a hand-edited KubeVirt CR

Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table.

Before upgrading:

1. Dump your current entries:

```bash
kubectl get kubevirt -n cozy-kubevirt -o yaml \
| yq '.items[0].spec.configuration.permittedHostDevices'
```

2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults).

3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`):

```bash
kubectl describe node <node> | grep nvidia.com/
```

A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at which point the admission webhook rejects it.
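
Because the mismatch is silent, it may be worth diffing the two lists before upgrading. The following is a minimal sketch under stated assumptions, not a Cozystack tool: `unadvertised_resources` is a hypothetical helper, and both inputs are plain-text files with one resource name per line. The commented commands show one possible way to produce those files from the `kubectl` commands already used in this section.

```bash
# unadvertised_resources PERMITTED_FILE ADVERTISED_FILE
# Prints every resource name listed in PERMITTED_FILE that does not appear
# in ADVERTISED_FILE. Prints nothing when every permitted name is advertised.
unadvertised_resources() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  # grep exits 1 when it prints nothing; "|| true" keeps that from being an error.
  grep -Fxv -f "$2" "$1" || true
}

# Possible inputs (assumption: CR "kubevirt" in namespace "cozy-kubevirt"):
# kubectl -n cozy-kubevirt get kubevirt kubevirt \
#   -o jsonpath='{range .spec.configuration.permittedHostDevices.pciHostDevices[*]}{.resourceName}{"\n"}{end}' \
#   > permitted.txt
# kubectl describe node <node> \
#   | grep -o 'nvidia\.com/[A-Za-z0-9_.-]*' | sort -u > advertised.txt
# unadvertised_resources permitted.txt advertised.txt
```

Any name this prints is permitted in the KubeVirt CR but not offered by the inspected node, which is the situation the paragraph above warns about. Repeat the node command for each GPU node, since different nodes may carry different GPU models.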

### Manual Package-CR override path

If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.

medium

When creating a standalone cozystack.kubevirt Package CR directly, the configuration values should be defined under spec.values rather than components.kubevirt.values. The components.<name>.values structure is used when configuring components within the umbrella cozystack-platform package.

Updating this path ensures the standalone Package CR is configured correctly.

Suggested change
If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.
If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `spec.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.


## 3. Create a Virtual Machine

We are now ready to create a VM.