docs(gpu): drop manual KubeVirt patch step now that the platform auto-wires permittedHostDevices #556
base: main
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -100,32 +100,69 @@ Allocatable: | |||||||||
| For example, the database entry for A10 reads `2236 GA102GL [A10]`, which results in a resource name `nvidia.com/GA102GL_A10`. | ||||||||||
| {{% /alert %}} | ||||||||||
|
|
||||||||||
| ## 2. Update the KubeVirt Custom Resource | ||||||||||
| ## 2. KubeVirt is wired automatically | ||||||||||
|
|
||||||||||
| Next, we will update the KubeVirt Custom Resource, as documented in the | ||||||||||
| [KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices), | ||||||||||
| so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM. | ||||||||||
| When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors the chosen GPU variant into the `KubeVirt` Custom Resource for you. There is no `kubectl edit kubevirt` step. | ||||||||||
|
|
||||||||||
| Adjust the `pciVendorSelector` and `resourceName` values to match your specific GPU model. | ||||||||||
| Setting `externalResourceProvider=true` indicates that this resource is provided by an external device plugin, | ||||||||||
| in this case the `sandbox-device-plugin` which is deployed by the Operator. | ||||||||||
| Specifically, the platform injects: | ||||||||||
|
|
||||||||||
| - `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it). | ||||||||||
| - A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `<arch>_<model>_<form>_<mem>` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. | ||||||||||
|
|
||||||||||
| Verify the resulting CR: | ||||||||||
|
|
||||||||||
| ```bash | ||||||||||
| kubectl edit kubevirt -n cozy-kubevirt | ||||||||||
| kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \ | ||||||||||
| | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' | ||||||||||
|
Comment on lines +115 to +116
Contributor
Using Using
Suggested change
|
||||||||||
| ``` | ||||||||||
| example config: | ||||||||||
|
|
||||||||||
| ### Extending or replacing the NVIDIA defaults | ||||||||||
|
|
||||||||||
| If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node <node> | grep nvidia.com/`), extend the defaults via platform values: | ||||||||||
|
|
||||||||||
| ```yaml | ||||||||||
| ... | ||||||||||
| spec: | ||||||||||
| configuration: | ||||||||||
| permittedHostDevices: | ||||||||||
| pciHostDevices: | ||||||||||
| - externalResourceProvider: true | ||||||||||
| pciVendorSelector: 10DE:2236 | ||||||||||
| resourceName: nvidia.com/GA102GL_A10 | ||||||||||
| ... | ||||||||||
| # Platform Package values | ||||||||||
| gpu: | ||||||||||
| # Append (default) — your entries land alongside the NVIDIA table. | ||||||||||
| # Set to true to drop the NVIDIA table entirely (useful for non-NVIDIA-only | ||||||||||
| # clusters or strict allowlists). With replaceDefaults: true and an empty | ||||||||||
| # list below, the rendered CR carries no permittedHostDevices block at all | ||||||||||
| # and the admission webhook rejects every GPU VM — supply your own list. | ||||||||||
| replaceDefaults: false | ||||||||||
| permittedHostDevices: | ||||||||||
| pciHostDevices: | ||||||||||
| - pciVendorSelector: "10DE:2236" | ||||||||||
| resourceName: nvidia.com/GA102GL_A10 | ||||||||||
| externalResourceProvider: true | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| ### Upgrading from a hand-edited KubeVirt CR | ||||||||||
|
|
||||||||||
| Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table. | ||||||||||
|
|
||||||||||
| Before upgrading: | ||||||||||
|
|
||||||||||
| 1. Dump your current entries: | ||||||||||
|
|
||||||||||
| ```bash | ||||||||||
| kubectl get kubevirt -n cozy-kubevirt -o yaml \ | ||||||||||
| | yq '.items[0].spec.configuration.permittedHostDevices' | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults). | ||||||||||
|
|
||||||||||
| 3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`): | ||||||||||
|
|
||||||||||
| ```bash | ||||||||||
| kubectl describe node <node> | grep nvidia.com/ | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at which point the admission webhook rejects it. | ||||||||||
|
|
||||||||||
| ### Manual Package-CR override path | ||||||||||
|
|
||||||||||
| If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist. | ||||||||||
|
Contributor
When creating a standalone Updating this path ensures the standalone Package CR is configured correctly.
Suggested change
|
||||||||||
|
|
||||||||||
| ## 3. Create a Virtual Machine | ||||||||||
|
|
||||||||||
| We are now ready to create a VM. | ||||||||||
|
|
||||||||||
Repository: cozystack/website
Fix `resourceName` naming-convention wording (line 110)

`resourceName` slugs follow `<arch>_<model>_<form>_<mem>`, but the document’s examples only use two components (e.g., `nvidia.com/GA102GL_A10`, `nvidia.com/TU104GL_T4`). Update the convention text to match the actual slug format (and clarify when/if `<form>` and `<mem>` are present) or adjust the examples accordingly.
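
A `resourceName` mismatch between the KubeVirt CR and the nodes could be caught ahead of time by comparing the two lists directly. The following is a minimal sketch, not part of the PR: `missing_resources` is a hypothetical helper name, and the commented wiring assumes `kubectl`, yq v4, and `jq` are installed (`jq` does not appear elsewhere in this document).

```bash
# missing_resources CONFIGURED_FILE ADVERTISED_FILE
# Hypothetical helper: each file holds one resource name per line.
# Prints every name found in CONFIGURED_FILE that is absent from ADVERTISED_FILE.
missing_resources() {
  # -v: invert match, -x: whole-line match, -F: fixed strings,
  # -f: read the patterns (advertised names) from a file.
  # grep exits 1 when nothing is printed, so mask that with `|| true`.
  grep -vxF -f "$2" "$1" || true
}

# Example wiring (assumed commands, adjust to your cluster):
#   kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \
#     | yq '.spec.configuration.permittedHostDevices.pciHostDevices[].resourceName' \
#     > configured.txt
#   kubectl get nodes -o json \
#     | jq -r '.items[].status.allocatable | keys[] | select(startswith("nvidia.com/"))' \
#     > advertised.txt
#   missing_resources configured.txt advertised.txt
```

Any name this prints is permitted in the CR but not advertised by a node, which is the case the documentation describes as silent until a GPU VM restarts or migrates.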