Since the registry.k8s.io migration in #1285, the nvidia-dra-driver-gpu
component fails helm install on every Flux deployment (aicr bundle --deployer flux, OCI or Git source). The chart renders a duplicate YAML
mapping key in the pod-template labels; plain helm install parses leniently
(last key wins) so the helm deployer is unaffected, but Flux's
helm-controller always runs a post-renderer whose strict parser rejects the
manifest:
Helm install failed for release nvidia-dra-driver/nvidia-dra-driver-nvidia-dra-driver-gpu
with chart dra-driver-nvidia-gpu@0.4.0: error while running post render on manifests:
map[string]interface {}(nil): yaml: unmarshal errors:
line 29: mapping key "nvidia-dra-driver-gpu-component" already defined at line 28
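The difference between lenient (last key wins) and strict duplicate-key handling can be sketched with Python's standard json module. This is only an analogy: it does not use Helm's or helm-controller's actual YAML parsers, and the helper name reject_duplicates and the values "first"/"second" are made up for illustration.

```python
import json

# Hypothetical document with the same key twice in one mapping.
DOC = (
    '{"nvidia-dra-driver-gpu-component": "first", '
    '"nvidia-dra-driver-gpu-component": "second"}'
)


def reject_duplicates(pairs):
    """Build a dict from (key, value) pairs, failing on a repeated key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f'mapping key "{key}" already defined')
        seen[key] = value
    return seen


# Lenient parse: the later value silently replaces the earlier one.
lenient = json.loads(DOC)

# Strict parse: the same input is rejected.
try:
    json.loads(DOC, object_pairs_hook=reject_duplicates)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient)
print(strict_error)
```

The same input is accepted by one code path and rejected by the other, which mirrors why the plain helm deployer is unaffected while the Flux path fails.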
Impact
Blocking (cannot proceed)
Component
Other / Unknown
Regression?
Yes, this worked before (please specify version below)
Steps to Reproduce
Upstream chart defect triggered by an AICR value:
1. The dra-driver-nvidia-gpu@0.4.0 chart writes TWO component labels into each workload's pod template, back to back (templates/kubeletplugin.yaml:40-41, templates/controller.yaml:37-38): the selectorLabels helper emits <nameOverride || .Chart.Name>-component: <name>, and the next line hardcodes nvidia-dra-driver-gpu-component: <name>. With upstream defaults these are two different keys: redundant but valid.
2. recipes/components/nvidia-dra-driver-gpu/values.yaml sets nameOverride: nvidia-dra-driver-gpu (added in feat(recipes): migrate nvidia-dra-driver-gpu to registry.k8s.io v0.4.0 #1285 to keep the rendered workload names nvidia-dra-driver-gpu-* for the health check, the conformance validator, and the ai-conformance chainsaw assert). That makes the helper-derived key identical to the hardcoded key → duplicate mapping key in the same map → invalid YAML per spec.
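The collision described above can be modelled in a few lines. This is a minimal sketch of the key derivation as the issue describes it, not the chart's actual template code; the function names are hypothetical, and the chart name is taken from the failing release message.

```python
from typing import List, Optional

CHART_NAME = "dra-driver-nvidia-gpu"        # .Chart.Name of the upstream chart
HARDCODED_PREFIX = "nvidia-dra-driver-gpu"  # prefix hardcoded in the templates


def component_label_keys(name_override: Optional[str]) -> List[str]:
    """Return the two label keys written back to back into the pod template."""
    # Models <nameOverride || .Chart.Name>-component from the selectorLabels helper.
    helper_key = f"{name_override or CHART_NAME}-component"
    # Models the hardcoded line that follows the helper.
    hardcoded_key = f"{HARDCODED_PREFIX}-component"
    return [helper_key, hardcoded_key]


def has_collision(name_override: Optional[str]) -> bool:
    """True when both label keys are identical, i.e. a duplicate mapping key."""
    keys = component_label_keys(name_override)
    return len(set(keys)) != len(keys)


print(component_label_keys(None), has_collision(None))
print(
    component_label_keys("nvidia-dra-driver-gpu"),
    has_collision("nvidia-dra-driver-gpu"),
)
```

With no override the two keys differ; with the AICR value they are the same string, which is the duplicate the strict parser rejects.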
Reproduce without a cluster:
helm template x oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
  --version 0.4.0 --set gpuResourcesEnabledOverride=true \
  --set nameOverride=nvidia-dra-driver-gpu \
  | grep -n "nvidia-dra-driver-gpu-component"
# the key appears twice inside the same labels block (both workloads)
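Where grep only shows that the key occurs more than once, a small script can confirm that the repeats sit in the same mapping. The following is a rough, standard-library-only sketch: it tracks mappings by indentation rather than parsing YAML properly, so block scalars, flow mappings, and other edge cases are not handled. The function name find_duplicate_keys and the SAMPLE manifest are hypothetical.

```python
import re

# Matches "key: value" or "key:" lines, optionally introduced by a list dash.
KEY_RE = re.compile(r"^(?P<indent> *)(?P<dash>- )?(?P<key>[^\s#\-][^:#]*?):(?: |$)")


def find_duplicate_keys(manifest: str):
    """Return (line_no, key, first_line_no) for keys repeated within one mapping."""
    duplicates = []
    stack = []  # open mappings as (indent, {key: first_line_no})
    for line_no, line in enumerate(manifest.splitlines(), start=1):
        if line.startswith("---"):
            stack = []  # a new YAML document starts with a clean slate
            continue
        match = KEY_RE.match(line)
        if not match:
            continue
        indent = len(match.group("indent"))
        if match.group("dash"):
            # "- key: value" opens a fresh mapping two columns deeper.
            indent += 2
            while stack and stack[-1][0] >= indent:
                stack.pop()
            stack.append((indent, {}))
        else:
            while stack and stack[-1][0] > indent:
                stack.pop()
            if not stack or stack[-1][0] < indent:
                stack.append((indent, {}))
        seen = stack[-1][1]
        key = match.group("key").strip()
        if key in seen:
            duplicates.append((line_no, key, seen[key]))
        else:
            seen[key] = line_no
    return duplicates


# Hypothetical excerpt shaped like the rendered pod-template labels.
SAMPLE = """\
spec:
  template:
    metadata:
      labels:
        nvidia-dra-driver-gpu-component: controller
        nvidia-dra-driver-gpu-component: controller
"""

for line_no, key, first in find_duplicate_keys(SAMPLE):
    print(f'line {line_no}: mapping key "{key}" already defined at line {first}')
```

To check real output, the helm template result could be saved to a file and its contents passed to find_duplicate_keys; a strict YAML parser remains the authoritative check.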
Why CI didn't catch it
#1288: the flux-oci KWOK lane's chainsaw HelmRelease gate has exists-semantics (it passes when ANY HelmRelease is Ready), so the InstallFailed dra-driver release never failed the lane. A failed install also creates no pods, so verify_pods saw nothing Pending either.
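The gap between exists-semantics and a gate that requires every release to be Ready can be illustrated as follows. This is a toy model, not the chainsaw assertion itself; the function names and the release name some-other-release are hypothetical.

```python
from typing import Dict


def exists_gate(ready_by_release: Dict[str, bool]) -> bool:
    """Pass when at least one HelmRelease is Ready (exists-semantics)."""
    return any(ready_by_release.values())


def all_gate(ready_by_release: Dict[str, bool]) -> bool:
    """Pass only when there is at least one release and every one is Ready."""
    return bool(ready_by_release) and all(ready_by_release.values())


# One healthy release is enough to mask the InstallFailed one.
releases = {
    "some-other-release": True,       # hypothetical healthy release
    "nvidia-dra-driver-gpu": False,   # InstallFailed, never becomes Ready
}

print("exists gate passes:", exists_gate(releases))
print("all gate passes:", all_gate(releases))
```

Under the first gate the lane stays green despite the failed install; under the second it would fail.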
Expected Behavior
No values-level workaround exists: any nameOverride that yields the nvidia-dra-driver-gpu-* workload names necessarily collides with the
hardcoded label.
1. Drop nameOverride from recipes/components/nvidia-dra-driver-gpu/values.yaml (keep fullnameOverride, which feeds the ServiceAccount/RoleBinding names and is not involved in the collision). Workload names become dra-driver-nvidia-gpu-controller / dra-driver-nvidia-gpu-kubelet-plugin.
2. Update the references to the old workload names in:
   - recipes/checks/nvidia-dra-driver-gpu/health-check.yaml
   - validators/conformance/dra_support_check.go (extract to named constants while touching it)
   - tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
   - prose mentions in docs/contributor/validator.md and docs/contributor/inference-perf-fluctuation.md
   (validators/deployment/expected_resources.go discovers the DaemonSet by role suffix and needs no change; historical conformance evidence under docs/conformance/** is captured output and must NOT be rewritten.)
3. File the upstream bug against kubernetes-sigs/dra-driver-nvidia-gpu (duplicate label when nameOverride matches the hardcoded prefix) and restore the override + nvidia-dra-driver-gpu-* names once a fixed chart ships.
Acceptance criteria
- A bundle produced by aicr bundle --deployer flux from any recipe containing nvidia-dra-driver-gpu reconciles to Ready=True under helm-controller.
- Health check, conformance validator, and chainsaw asserts pass against the renamed workloads.
- Upstream issue filed and linked; tracking note in values.yaml for the revert.
Actual Behavior
Environment
AICR version (CLI aicr version, API image tag, or commit SHA):
Command / Request Used
No response
Logs / Error Output
Additional Context