[Bug]: bug(recipes): nvidia-dra-driver-gpu 0.4.0 fails to install via Flux #1289

@haarchri

Prerequisites

  • I searched existing issues and found no duplicates
  • I can reproduce this issue consistently
  • This is not a security vulnerability (use Security Advisories instead)

Bug Description

Since the registry.k8s.io migration in #1285, the nvidia-dra-driver-gpu
component fails helm install on every Flux deployment (aicr bundle --deployer flux, OCI or Git source). The chart renders a duplicate YAML
mapping key in the pod-template labels; plain helm install parses leniently
(last key wins) so the helm deployer is unaffected, but Flux's
helm-controller always runs a post-renderer whose strict parser rejects the
manifest:

Helm install failed for release nvidia-dra-driver/nvidia-dra-driver-nvidia-dra-driver-gpu
with chart dra-driver-nvidia-gpu@0.4.0: error while running post render on manifests:
map[string]interface {}(nil): yaml: unmarshal errors:
  line 29: mapping key "nvidia-dra-driver-gpu-component" already defined at line 28

Impact

Blocking (cannot proceed)

Component

Other / Unknown

Regression?

Yes, this worked before (please specify version below)

Steps to Reproduce

Upstream chart defect triggered by an AICR value:

  • The dra-driver-nvidia-gpu@0.4.0 chart writes TWO component labels into
    each workload's pod template, back to back
    (templates/kubeletplugin.yaml:40-41, templates/controller.yaml:37-38):
    the selectorLabels helper emits <nameOverride || .Chart.Name>-component: <name>, and the next line hardcodes nvidia-dra-driver-gpu-component: <name>. With upstream defaults these are two different keys — redundant
    but valid.
  • recipes/components/nvidia-dra-driver-gpu/values.yaml sets
    nameOverride: nvidia-dra-driver-gpu (added in #1285, "feat(recipes):
    migrate nvidia-dra-driver-gpu to registry.k8s.io v0.4.0", to keep the
    rendered workload names nvidia-dra-driver-gpu-* for the health check, the
    conformance validator, and the ai-conformance chainsaw assert). That makes
    the helper-derived key identical to the hardcoded key → duplicate mapping
    key in the same map → invalid YAML per spec.

Reproduce without a cluster:

helm template x oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
  --version 0.4.0 --set gpuResourcesEnabledOverride=true \
  --set nameOverride=nvidia-dra-driver-gpu \
  | grep -n "nvidia-dra-driver-gpu-component"
# the key appears twice inside the same labels block (both workloads)
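
The grep above only shows that the key occurs more than once; it does not tell you whether the two occurrences sit in the same mapping. As a rough sketch of the check a strict parser performs, the helper below flags a key that repeats at the same indentation under the same parent. The function name find_duplicate_keys is made up for this example, and the script is a line-based heuristic, not a YAML parser: it assumes space-indented block-style output and ignores quoted keys, flow mappings, and block scalars.

```shell
# Hypothetical helper: read rendered manifests on stdin and report keys that
# repeat inside the same block mapping. Exit status is 1 if any are found.
find_duplicate_keys() {
  awk '
    # Forget keys recorded at indentation >= level.
    function reset_from(level,   i) {
      for (i = level; i <= max_indent; i++) keys[i] = ""
    }
    /^---/   { reset_from(0); next }     # new YAML document
    /^[ ]*#/ { next }                    # comments such as "# Source: ..."
    /^[ ]*$/ { next }                    # blank lines
    {
      indent = 0
      while (substr($0, indent + 1, 1) == " ") indent++
      rest = substr($0, indent + 1)
      if (rest ~ "^- ") {                # list item: a fresh mapping starts
        indent += 2
        rest = substr(rest, 3)
        if (indent > max_indent) max_indent = indent
        reset_from(indent)
      } else {
        if (indent > max_indent) max_indent = indent
        reset_from(indent + 1)           # any deeper mapping has ended
      }
      if (rest !~ "^[A-Za-z0-9_./-]+:( |$)") next
      key = substr(rest, 1, index(rest, ":") - 1)
      if (index(keys[indent], "\n" key "\n")) {
        printf "line %d: duplicate key \"%s\"\n", NR, key
        dup = 1
      } else {
        keys[indent] = keys[indent] "\n" key "\n"
      }
    }
    END { exit dup + 0 }
  '
}

# Usage: pipe the helm template command above into the helper instead of grep:
#   helm template x oci://... --version 0.4.0 ... | find_duplicate_keys
```

Line numbers are counted over the whole rendered stream, so they will not match the per-manifest numbers (28 and 29) in the helm-controller error.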

Why CI didn't catch it

See #1288: the flux-oci KWOK lane's chainsaw HelmRelease gate has
exists-semantics (it passes when ANY HelmRelease is Ready), so the
InstallFailed dra-driver release never failed the lane; and a failed
install creates no pods, so verify_pods saw nothing Pending either.
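
To illustrate the difference between exists-semantics and the all-semantics the gate would need, here is a minimal sketch. It is not the chainsaw assert used in the lane: all_releases_ready is a hypothetical helper that reads one "namespace/name status" pair per line and fails if any release is not Ready, or if there are no releases at all.

```shell
# Hypothetical all-semantics gate. Input: one "<namespace>/<name> <Ready status>"
# pair per line on stdin. Passes only if every release reports Ready=True.
all_releases_ready() {
  awk '
    NF == 0 { next }                       # skip blank lines
    { total++ }
    $2 != "True" {
      printf "not ready: %s (Ready=%s)\n", $1, ($2 == "" ? "<unset>" : $2)
      bad++
    }
    END {
      if (total == 0) { print "no HelmReleases found"; exit 1 }
      exit (bad > 0)
    }
  '
}

# Possible way to feed it from a cluster (an assumption, adjust as needed):
#   kubectl get helmreleases.helm.toolkit.fluxcd.io -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
#     | all_releases_ready
```

With a gate of this shape, a single InstallFailed release (Ready=False) fails the lane even when every other HelmRelease is Ready.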

Expected Behavior

No values-level workaround exists: any nameOverride that yields the
nvidia-dra-driver-gpu-* workload names necessarily collides with the
hardcoded label. The fix therefore has three parts:

  1. Drop nameOverride from
    recipes/components/nvidia-dra-driver-gpu/values.yaml (keep
    fullnameOverride, which feeds the ServiceAccount/RoleBinding names and
    is not involved in the collision). Workload names become
    dra-driver-nvidia-gpu-controller / dra-driver-nvidia-gpu-kubelet-plugin.
  2. Update the name references in lockstep:
    • recipes/checks/nvidia-dra-driver-gpu/health-check.yaml
    • validators/conformance/dra_support_check.go (extract to named
      constants while touching it)
    • tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
    • prose mentions in docs/contributor/validator.md and
      docs/contributor/inference-perf-fluctuation.md
      (validators/deployment/expected_resources.go discovers the DaemonSet by
      role suffix and needs no change; historical conformance evidence under
      docs/conformance/** is captured output and must NOT be rewritten.)
  3. File the upstream bug against kubernetes-sigs/dra-driver-nvidia-gpu
    (duplicate label when nameOverride matches the hardcoded prefix) and
    restore the override + nvidia-dra-driver-gpu-* names once a fixed chart
    ships.
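
For step 2, a quick search can confirm that no reference to the old workload names is left behind. The sketch below makes two assumptions: that the old names are nvidia-dra-driver-gpu-controller and nvidia-dra-driver-gpu-kubelet-plugin (inferred from the new names in step 1), and that the top-level directories are the ones named in the list above. find_stale_names is a made-up helper name. Hits under docs/conformance/ are filtered out because that captured evidence must not be rewritten.

```shell
# Hypothetical helper: list remaining references to the old workload names.
# $1 is the repository root (defaults to the current directory).
# Exit status is 0 if stale references were found, non-zero if none remain.
find_stale_names() {
  (
    cd "${1:-.}" &&
    grep -rnE 'nvidia-dra-driver-gpu-(controller|kubelet-plugin)' \
      recipes validators tests docs 2>/dev/null |
    grep -v '^docs/conformance/'           # historical evidence stays as-is
  )
}

# Usage, from the repository root:
#   find_stale_names . || echo "no stale references"
```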

Acceptance criteria

  • A bundle built with aicr bundle --deployer flux from any recipe
    containing nvidia-dra-driver-gpu reconciles to Ready=True under
    helm-controller.
  • Health check, conformance validator, and chainsaw asserts pass against
    the renamed workloads.
  • Upstream issue filed and linked; tracking note in values.yaml for the
    revert.
