cozystack/cozystack#2780 makes GPU passthrough work out-of-the-box for tenant Kubernetes clusters (the `kubernetes` app), but the behavior is not documented on the site.
What now happens automatically:
- Node-groups that declare `gpus` are labeled `gpu=on` automatically, so HAMi's device plugin advertises `nvidia.com/gpu` without manual node labeling.
- The tenant gpu-operator loads the NVIDIA driver with `NVreg_NvLinkDisable=1` by default, so single-SXM-GPU passthrough (no NVSwitch in the VM) no longer hangs on `Fabric State: In Progress` / `system not yet initialized`.
- Both defaults are overridable via `addons.gpuOperator.valuesOverride`.
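
For the docs section, a minimal sketch of what the tenant cluster values might look like. The node-group name `gpu-pool`, the `name` sub-key with its placeholder resource name, and the `enabled` flag are illustrative assumptions and should be checked against the chart. The body of `valuesOverride` is left empty on purpose, because the exact keys that carry the `gpu=on` labeling and the `NVreg_NvLinkDisable=1` default are defined by #2780 and are not reproduced here.

```yaml
# Values sketch for the tenant `kubernetes` app (illustrative, not authoritative).
nodeGroups:
  gpu-pool:                      # hypothetical node-group name
    gpus:                        # declaring gpus is what triggers the automatic gpu=on label
      - name: "<gpu-resource-name>"   # placeholder; sub-key and value are assumptions
addons:
  gpuOperator:
    enabled: true                # assumed flag name; verify against the chart values
    # Escape hatch named in this issue. Put chart overrides here, for example
    # to re-enable NVLink on multi-GPU NVLink topologies. Key layout inside
    # valuesOverride is not shown because it depends on the chart.
    valuesOverride: {}
```

With an empty `valuesOverride`, the defaults described above apply unchanged.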
`content/en/docs/next/kubernetes/gpu-sharing.md` currently covers HAMi fractional sharing only. It does not mention the GPU node-group passthrough flow, the automatic `gpu=on` labeling, or the `NVreg_NvLinkDisable=1` default and when to override it (multi-GPU NVLink topologies).
Suggested: add a short section to the tenant-cluster GPU docs describing what works automatically and the override escape hatch. This complements the container-variant guide (#555) and the VM-passthrough KubeVirt auto-wiring (#556).
Non-blocking documentation follow-up surfaced while reviewing cozystack/cozystack#2780.