Document out-of-the-box GPU passthrough for tenant Kubernetes clusters (gpu=on auto-label + NvLinkDisable default) #561

@lexfrei

Description

cozystack/cozystack#2780 makes GPU passthrough work out-of-the-box for tenant Kubernetes clusters (the kubernetes app), but the behavior is not documented on the site.

What now happens automatically:

  • Node-groups that declare gpus are labeled gpu=on automatically, so HAMi's device plugin advertises nvidia.com/gpu without manual node labeling.
  • The tenant gpu-operator loads the NVIDIA driver with NVreg_NvLinkDisable=1 by default, so single-SXM-GPU passthrough (no NVSwitch in the VM) no longer hangs on Fabric State: In Progress / system not yet initialized.
  • Both defaults are overridable via addons.gpuOperator.valuesOverride.
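For the docs section, a values fragment along these lines might illustrate the three points above. It is only a sketch under stated assumptions: the node-group name `md0`, the GPU resource name, the `gpus[].name` layout, and the `enabled` flag are illustrative and not taken from #2780. The keys under `valuesOverride` use the upstream gpu-operator chart's `driver.kernelModuleConfig` value, and whether the tenant default is wired through that mechanism is an assumption that should be checked against the chart before publishing.

```yaml
# Sketch of tenant Kubernetes app values (names are illustrative assumptions).
nodeGroups:
  md0:                                  # hypothetical node-group name
    minReplicas: 1
    maxReplicas: 1
    gpus:                               # declaring gpus is what triggers the automatic gpu=on label
      - name: nvidia.com/AD102GL_L40S   # example resource name, assumed for illustration

addons:
  gpuOperator:
    enabled: true                       # assumed toggle for the tenant gpu-operator addon
    # Escape hatch named in this issue: values placed here override the defaults,
    # including the NVreg_NvLinkDisable=1 driver parameter.
    valuesOverride:
      driver:
        kernelModuleConfig:
          # Assumption: a ConfigMap carrying custom driver module parameters,
          # e.g. without NVreg_NvLinkDisable=1 for multi-GPU NVLink topologies.
          name: nvidia-kmod-params      # hypothetical ConfigMap name
```

With the defaults left alone, no manual node labeling or driver tuning should be needed. The override is only relevant when the VM actually sees an NVLink/NVSwitch fabric.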

content/en/docs/next/kubernetes/gpu-sharing.md currently covers only HAMi fractional sharing. It does not mention the GPU node-group passthrough flow, the automatic gpu=on labeling, or the NVreg_NvLinkDisable=1 default and when to override it (multi-GPU NVLink topologies).

Suggested: add a short section to the tenant-cluster GPU docs describing what works automatically and the override escape hatch. This complements the container-variant guide (#555) and the VM-passthrough KubeVirt auto-wiring (#556).

Non-blocking documentation follow-up surfaced while reviewing cozystack/cozystack#2780.

Metadata

Labels: documentation