feat(helm): update chart gpu-operator-charts ( v1.4.1 ➔ v1.5.0 ) #3208

Open: renovate[bot] wants to merge 1 commit into main from renovate/gpu-operator-charts-1.x

renovate[bot] commented May 7, 2026

This PR contains the following updates:

Package              Update  Change
gpu-operator-charts  minor   v1.4.1 ➔ v1.5.0

Release Notes

ROCm/gpu-operator (gpu-operator-charts)

v1.5.0: gpu-operator-charts-v1.5.0

GPU Operator v1.5.0 Release Notes

The AMD GPU Operator v1.5.0 release introduces support for Dynamic Resource Allocation (DRA) as an alternative to the device plugin, Auto Node Remediation (ANR) for automated recovery of unhealthy GPU worker nodes, and Node Problem Detector (NPD) integration for surfacing GPU-related node conditions. The release also brings broader configurability across the operator (KMM, kubelet socket path, custom package repositories, host network configs, global image pull secrets) and updates the managed stack to ROCm 7.2.1.

Release Highlights
  • Dynamic Resource Allocation (DRA) Driver Support

    • The GPU Operator can now deploy and manage the AMD GPU DRA Driver as an alternative to the traditional Kubernetes Device Plugin for exposing AMD Instinct GPUs to workloads.
    • New spec.draDriver section in DeviceConfig enables and configures the DRA driver (image, command-line arguments, image pull policy, etc.). The default image is rocm/k8s-gpu-dra-driver:latest.
    • The operator enforces mutual exclusion between draDriver.enable=true and devicePlugin.enableDevicePlugin=true on the same DeviceConfig.
    • DRA DeviceClass is created automatically by the operator. The operator dynamically discovers the highest available DRA API version (resource.k8s.io v1, then v1beta2, then v1beta1), so it works across Kubernetes versions from 1.32 to 1.35 and OpenShift 4.21.
    • Required RBAC for the DRA driver on OpenShift (including SecurityContextConstraints) is created by the operator out-of-the-box.
    • Requirements: Kubernetes 1.32+ with DynamicResourceAllocation feature gate enabled, the amdgpu kernel module loaded on worker nodes, and CDI enabled in the container runtime (default in containerd 2.0+ / CRI-O).
    • Refer to the DRA Driver documentation for details.
  • Auto Node Remediation (ANR) for GPU Worker Nodes

    • The GPU Operator can now automatically remediate GPU worker nodes that become unhealthy due to GPU-related issues, restoring them to a healthy state without manual intervention.
    • Remediation is driven by Argo Workflows and configured through the DeviceConfig CR. Customizable behaviors include:
      • nodeRemediationLabels / node selector for which nodes ANR applies to
      • maxParallelWorkflows to throttle the number of concurrent remediations
      • recoveryPolicy and drainPolicy to control how nodes are drained, recovered, and reset
      • Sequential / time-window conditions, custom taints, and a TTL for cleaning up completed workflows
      • Reboot step is supported and reboot-step failures are now handled gracefully by the workflow
      • Workflow image / ConfigMapImage can be overridden, and global image pull secrets are honored throughout the workflow
    • Pulls workflow status directly from the Argo API server to avoid race conditions, and fails the workflow promptly when a step pod is in ImagePullBackOff.
    • OpenShift support for ANR is included in this release (in addition to vanilla Kubernetes).
    • Requires Argo Workflows v4.0.3.
    • Refer to the Auto Node Remediation documentation for details.
  • Node Problem Detector (NPD) Integration

    • The GPU Operator now integrates with Node Problem Detector to surface AMD GPU problems (e.g. inband-RAS errors reported by Device Metrics Exporter) as node conditions, which can in turn drive Auto Node Remediation.
    • Reference RBAC, custom problem rules, and authentication configurations are provided for bearer token, RBAC HTTP, root CA, and mTLS-protected metrics endpoints.
    • Refer to the Node Problem Detector documentation for details.
  • Enhanced KMM (Kernel Module Management) Configuration Control

    • Independent Control of KMM Installation and Usage
      • New helm parameters provide separate control over KMM installation and resource watching:
        • kmm.enabled: Controls KMM subchart installation (default: true)
        • kmm.watch: Controls GPU operator watching KMM resources (default: true)
      • Supports multiple deployment scenarios: use an existing KMM installation (enabled=false, watch=true), skip KMM entirely for alternative driver solutions (enabled=false, watch=false), or install KMM without having the GPU Operator use it (enabled=true, watch=false).
      • Fully backward compatible: existing configurations with kmm.enabled=false continue to work without changes
  • Custom Package Repository URLs for Driver Image Build

    • Two new optional fields under spec.driver.imageBuild allow specifying custom package repository and GPG key URLs for in-cluster amdgpu driver image builds:
      • packageRepoURL: full URL to the package repo
      • gpgKeyURL: full URL to the GPG key
    • Useful for environments where repo.radeon.com is mirrored, restructured, or unreachable.
  • Configurable Kubelet Socket Paths

    • New spec.devicePlugin.kubeletSocketPath field on the DeviceConfig CR allows users to override the default kubelet device-plugins directory (/var/lib/kubelet/device-plugins).
    • New spec.metricsExporter.podResourceAPISocketPath field allows users to override the default kubelet pod-resources directory (/var/lib/kubelet/pod-resources).
    • Enables the operator, device plugin, and metrics exporter to work on Kubernetes distributions such as MicroK8s, k3s, and other custom setups where the kubelet sockets live at non-standard paths.
  • Host Network Support for Device Plugin and Metrics Exporter

    • New optional hostNetwork field on spec.devicePlugin and spec.metricsExporter allows their respective DaemonSet pods to run in the host network namespace, simplifying integration with host-level monitoring and networking stacks.
  • Global Image Pull Secrets Injection

    • A new spec.commonConfig.imageRegistrySecrets field on DeviceConfig (and global.imagePullSecrets in the Helm chart) lets users specify image pull secrets that are injected into all operator-managed workloads (driver build/sign pods, device plugin, metrics exporter, node labeller, DCM, DRA driver, ANR workflow, test runner, etc.), reducing duplicated configuration in air-gapped and private-registry environments.
  • Open Source GPU Validation Cluster Example

    • The example/gpu-validation-cluster/ directory has been open-sourced, providing a turnkey reference for standing up a GPU validation cluster (Dockerfile, configs, and helper scripts) for users who want to evaluate the GPU / AINIC end-to-end.
  • Node Feature Discovery (NFD) Upgrade

    • Upgraded NFD helm chart dependency from v0.16.1 to v0.18.3
  • Device Metrics Exporter Enhancements

    • Unix Domain Socket for IPC: GPU Agent now communicates with the exporter over /var/run/gpuagent.sock, replacing TCP/IP for lower latency and improved security.
    • New Metrics:
      • GPU_PROCESS_CU_OCCUPANCY — per-process Compute Unit occupancy, with a process_id label.
      • GPU_ECC_DEFERRED_* — deferred ECC error counts for each ECC-supported block.
    • Profiler SamplingInterval: configurable profiler sampling window (default 1000 µs / 1 ms).
    • Configurable Health Polling Rate: new PollingRate field in HealthService (under CommonConfig) accepts duration strings (30s, 5m, 1h, 23h10m15s); default 30 s, min 30 s, max 24 h.
    • KFD_PROCESS_ID Label Now Optional: disabled by default to reduce metric cardinality and Prometheus storage cost. Users who need it can re-enable it via the exporter ConfigMap.
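
As a rough illustration of the configurability items in the highlights above, a DeviceConfig using the new fields might look like the sketch below. The field paths (spec.driver.imageBuild.packageRepoURL, spec.driver.imageBuild.gpgKeyURL, spec.devicePlugin.kubeletSocketPath, spec.metricsExporter.podResourceAPISocketPath, and hostNetwork) come from the release notes. The apiVersion, the mirror URLs, and the MicroK8s-style socket paths are assumptions chosen for illustration and should be checked against the installed CRD and the target distribution.

```yaml
# Sketch only. apiVersion, URLs, and socket paths are illustrative assumptions.
apiVersion: amd.com/v1alpha1        # assumed; confirm against the installed CRD
kind: DeviceConfig
metadata:
  name: default
  namespace: amd-gpu-operator
spec:
  driver:
    imageBuild:
      # Hypothetical internal mirror standing in for repo.radeon.com
      packageRepoURL: https://mirror.example.com/amdgpu/ubuntu
      gpgKeyURL: https://mirror.example.com/rocm.gpg.key
  devicePlugin:
    # Assumed MicroK8s-style location of the kubelet device-plugins directory
    kubeletSocketPath: /var/snap/microk8s/common/var/lib/kubelet/device-plugins
    hostNetwork: false
  metricsExporter:
    # Assumed MicroK8s-style location of the kubelet pod-resources directory
    podResourceAPISocketPath: /var/snap/microk8s/common/var/lib/kubelet/pod-resources
    hostNetwork: true               # run exporter pods in the host network namespace
```

Leaving these fields unset keeps the chart defaults shown in the rendered DeviceConfig diff (/var/lib/kubelet/device-plugins, /var/lib/kubelet/pod-resources, and hostNetwork: false).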
Platform Support
  • All v1.5.0 features are validated on vanilla Kubernetes 1.34 and 1.35, and on OpenShift 4.21.
Fixes
  1. Helm upgrade failure

    • Fixed a helm upgrade hang/failure caused by changes in the Argo Workflows v4 CRDs.
  2. DCM: missing default ConfigMap when spec.configManager.config is omitted

    • Device Config Manager now mounts a default ConfigMap when no custom configuration is provided, instead of failing to start.
  3. CVE remediations

    • Base image updated to reduce CVE exposure.
    • grpc upgraded to address CVE-2026-33186.
Known Limitations

Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.

  • DRA Driver and Device Plugin are mutually exclusive
    • The DRA driver and the traditional Device Plugin cannot be enabled simultaneously on the same DeviceConfig. The operator validates this and rejects configurations where both draDriver.enable=true and devicePlugin.enableDevicePlugin=true.
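
A minimal sketch of a DeviceConfig that satisfies this constraint by switching from the device plugin to the DRA driver. The draDriver and devicePlugin field names and the image value come from the release notes and the rendered chart defaults; the apiVersion is an assumption.

```yaml
# Sketch only: the apiVersion is an assumption; confirm it against the installed CRD.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: default
  namespace: amd-gpu-operator
spec:
  devicePlugin:
    enableDevicePlugin: false   # must be false while the DRA driver is enabled
  draDriver:
    enable: true
    image: docker.io/rocm/k8s-gpu-dra-driver:latest
    imagePullPolicy: IfNotPresent
```

Setting both enableDevicePlugin and draDriver.enable to true on the same DeviceConfig is the combination the operator rejects.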

Configuration

📅 Schedule: (in timezone America/New_York)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.


github-actions Bot commented May 7, 2026

qgr1-cluster-0 - kustomization

--- k8s/base/amd-gpu-operator/amd-gpu-operator Kustomization: flux-system/amd-gpu-operator-amd-gpu-operator HelmRelease: amd-gpu-operator/amd-gpu-operator

+++ k8s/base/amd-gpu-operator/amd-gpu-operator Kustomization: flux-system/amd-gpu-operator-amd-gpu-operator HelmRelease: amd-gpu-operator/amd-gpu-operator

@@ -13,13 +13,13 @@

     spec:
       chart: gpu-operator-charts
       sourceRef:
         kind: HelmRepository
         name: amd-gpu-operator-charts
         namespace: flux-system
-      version: v1.4.1
+      version: v1.5.0
   install:
     createNamespace: true
     remediation:
       retries: 50
     timeout: 15m
   interval: 30m
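
For clusters that already run KMM, the new kmm.enabled and kmm.watch chart parameters from the release notes could be set through the HelmRelease values. The fragment below is a sketch that assumes values are supplied inline under spec.values; this repository may supply them differently (for example through valuesFrom).

```yaml
# Fragment of the amd-gpu-operator HelmRelease; only the added keys are shown.
spec:
  values:
    kmm:
      enabled: false   # do not install the KMM subchart
      watch: true      # operator still watches KMM resources from the existing install
```

Both parameters default to true, so omitting them keeps the behaviour shown in the rendered diff (KMM_WATCH_ENABLED set to 'true').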


github-actions Bot commented May 7, 2026

qgr1-cluster-0 - helmrelease

--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-config-manager

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-config-manager

@@ -25,12 +25,13 @@

   - nodes
   verbs:
   - get
   - list
   - watch
   - update
+  - patch
 - apiGroups:
   - apps
   resources:
   - daemonsets
   verbs:
   - get
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-manager-role

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-manager-role

@@ -80,29 +80,32 @@

   - update
   - watch
 - apiGroups:
   - amd.com
   resources:
   - deviceconfigs
+  - remediationworkflowstatuses
   verbs:
   - create
   - get
   - list
   - patch
   - update
   - watch
 - apiGroups:
   - amd.com
   resources:
   - deviceconfigs/finalizers
+  - remediationworkflowstatuses/finalizers
   verbs:
   - update
 - apiGroups:
   - amd.com
   resources:
   - deviceconfigs/status
+  - remediationworkflowstatuses/status
   verbs:
   - get
   - patch
   - update
 - apiGroups:
   - apiextensions.k8s.io
@@ -132,12 +135,24 @@

   verbs:
   - create
   - get
   - update
   - watch
 - apiGroups:
+  - apps
+  resources:
+  - deployments
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
   - kmm.sigs.x-k8s.io
   resources:
   - modules
   verbs:
   - create
   - delete
@@ -197,7 +212,21 @@

   resources:
   - nodefeaturediscoveries/finalizers
   - nodefeaturediscoveries/status
   verbs:
   - get
   - update
+- apiGroups:
+  - resource.k8s.io
+  resources:
+  - deviceclasses
+  verbs:
+  - create
+- apiGroups:
+  - security.openshift.io
+  resourceNames:
+  - privileged
+  resources:
+  - securitycontextconstraints
+  verbs:
+  - use
 
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-controller

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-controller

@@ -41,24 +41,24 @@

       nodeSelector: {}
       containers:
       - args:
         - --config=controller_config.yaml
         env:
         - name: RELATED_IMAGE_WORKER
-          value: docker.io/rocm/kernel-module-management-worker:v1.4.1
+          value: docker.io/rocm/kernel-module-management-worker:v1.5.0
         - name: OPERATOR_NAMESPACE
           valueFrom:
             fieldRef:
               fieldPath: metadata.namespace
         - name: RELATED_IMAGE_BUILD
           value: gcr.io/kaniko-project/executor:v1.23.2
         - name: RELATED_IMAGE_SIGN
-          value: docker.io/rocm/kernel-module-management-signimage:v1.4.1
+          value: docker.io/rocm/kernel-module-management-signimage:v1.5.0
         - name: KUBERNETES_CLUSTER_DOMAIN
           value: cluster.local
-        image: docker.io/rocm/kernel-module-management-operator:v1.4.1
+        image: docker.io/rocm/kernel-module-management-operator:v1.5.0
         imagePullPolicy: Always
         livenessProbe:
           httpGet:
             path: /healthz
             port: 8081
           initialDelaySeconds: 15
@@ -89,12 +89,15 @@

           subPath: controller_config.yaml
       securityContext:
         runAsNonRoot: true
       serviceAccountName: amd-gpu-operator-kmm-controller
       terminationGracePeriodSeconds: 10
       tolerations:
+      - key: amd-gpu-unhealthy
+        operator: Exists
+        effect: NoSchedule
       - effect: NoSchedule
         key: node-role.kubernetes.io/master
         operator: Equal
         value: ''
       - effect: NoSchedule
         key: node-role.kubernetes.io/control-plane
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-webhook-server

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-webhook-server

@@ -45,13 +45,13 @@

         - --enable-module
         - --enable-namespace
         - --enable-preflightvalidation
         env:
         - name: KUBERNETES_CLUSTER_DOMAIN
           value: cluster.local
-        image: docker.io/rocm/kernel-module-management-webhook-server:v1.4.1
+        image: docker.io/rocm/kernel-module-management-webhook-server:v1.5.0
         imagePullPolicy: Always
         livenessProbe:
           httpGet:
             path: /healthz
             port: 8081
           initialDelaySeconds: 15
@@ -85,12 +85,15 @@

           subPath: controller_config.yaml
       securityContext:
         runAsNonRoot: true
       serviceAccountName: amd-gpu-operator-kmm-controller
       terminationGracePeriodSeconds: 10
       tolerations:
+      - key: amd-gpu-unhealthy
+        operator: Exists
+        effect: NoSchedule
       - effect: NoSchedule
         key: node-role.kubernetes.io/master
         operator: Equal
         value: ''
       - effect: NoSchedule
         key: node-role.kubernetes.io/control-plane
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-controller-manager

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-controller-manager

@@ -48,13 +48,15 @@

             fieldRef:
               fieldPath: metadata.namespace
         - name: KUBERNETES_CLUSTER_DOMAIN
           value: cluster.local
         - name: SIM_ENABLE
           value: 'false'
-        image: docker.io/rocm/gpu-operator:v1.4.1
+        - name: KMM_WATCH_ENABLED
+          value: 'true'
+        image: docker.io/rocm/gpu-operator:v1.5.0
         imagePullPolicy: Always
         livenessProbe:
           httpGet:
             path: /healthz
             port: 8081
           initialDelaySeconds: 15
@@ -81,12 +83,15 @@

           subPath: controller_manager_config.yaml
       securityContext:
         runAsNonRoot: true
       serviceAccountName: amd-gpu-operator-gpu-operator-charts-controller-manager
       terminationGracePeriodSeconds: 10
       tolerations:
+      - key: amd-gpu-unhealthy
+        operator: Exists
+        effect: NoSchedule
       - effect: NoSchedule
         key: node-role.kubernetes.io/master
         operator: Equal
         value: ''
       - effect: NoSchedule
         key: node-role.kubernetes.io/control-plane
--- HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceConfig: amd-gpu-operator/default

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceConfig: amd-gpu-operator/default

@@ -15,14 +15,17 @@

     driverType: container
     image: docker.io/myUserName/driverImage
     imageRegistryTLS:
       insecure: false
       insecureSkipTLSVerify: false
     version: 30.20.1
+    imageBuild:
+      gpgKeyURL: ''
+      packageRepoURL: ''
     upgradePolicy:
-      enable: true
+      enable: false
       maxParallelUpgrades: 3
       maxUnavailableNodes: 25%
       nodeDrainPolicy:
         force: true
         gracePeriodSeconds: -1
         timeoutSeconds: 300
@@ -34,31 +37,35 @@

   commonConfig:
     initContainerImage: busybox:1.37
     utilsContainer:
       image: docker.io/rocm/gpu-operator-utils:v1.4.1
       imagePullPolicy: IfNotPresent
   devicePlugin:
+    enableDevicePlugin: true
     devicePluginImage: rocm/k8s-device-plugin:latest
     devicePluginImagePullPolicy: IfNotPresent
     enableNodeLabeller: true
     nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
     nodeLabellerImagePullPolicy: IfNotPresent
     upgradePolicy:
       maxUnavailable: 1
       upgradeStrategy: RollingUpdate
+    kubeletSocketPath: /var/lib/kubelet/device-plugins
+    hostNetwork: false
   metricsExporter:
     enable: true
     serviceType: ClusterIP
     port: 5000
     nodePort: 32500
     image: docker.io/rocm/device-metrics-exporter:v1.4.2
     imagePullPolicy: IfNotPresent
     upgradePolicy:
       maxUnavailable: 1
       upgradeStrategy: RollingUpdate
     podResourceAPISocketPath: /var/lib/kubelet/pod-resources
+    hostNetwork: false
     rbacConfig:
       enable: false
       image: quay.io/brancz/kube-rbac-proxy:v0.18.1
       disableHttps: false
       staticAuthorization:
         clientName: ''
@@ -77,14 +84,27 @@

       hostPath: /var/log/amd-test-runner
       logsExportSecrets: []
       mountPath: /var/log/amd-test-runner
     upgradePolicy:
       maxUnavailable: 1
       upgradeStrategy: RollingUpdate
+  draDriver:
+    enable: false
+    image: docker.io/rocm/k8s-gpu-dra-driver:latest
+    imagePullPolicy: IfNotPresent
+    upgradePolicy:
+      maxUnavailable: 1
+      upgradeStrategy: RollingUpdate
   configManager:
     enable: false
     image: docker.io/rocm/device-config-manager:v1.4.1
     imagePullPolicy: IfNotPresent
     upgradePolicy:
       maxUnavailable: 1
       upgradeStrategy: RollingUpdate
+  remediationWorkflow:
+    enable: false
+    ttlForFailedWorkflows: 24h
+    testerImage: docker.io/rocm/test-runner:v1.5.0
+    autoStartWorkflow: true
+    rebootTimeout: 15m
 
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-custom-resource-definitions

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-custom-resource-definitions

@@ -17,19 +17,46 @@

   ttlSecondsAfterFinished: 60
   template:
     spec:
       serviceAccountName: amd-gpu-operator-gpu-operator-charts-prune
       containers:
       - name: delete-custom-resource-definitions
-        image: docker.io/rocm/gpu-operator:v1.4.1
+        image: docker.io/rocm/gpu-operator:v1.5.0
         command:
         - /bin/sh
         - -c
         - |
           if kubectl get crds deviceconfigs.amd.com > /dev/null 2>&1; then
             kubectl delete crds deviceconfigs.amd.com
+          fi
+          if kubectl get crds remediationworkflowstatuses.amd.com > /dev/null 2>&1; then
+            kubectl delete crds remediationworkflowstatuses.amd.com
+          fi
+          if kubectl get crds clusterworkflowtemplates.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds clusterworkflowtemplates.argoproj.io
+          fi
+          if kubectl get crds cronworkflows.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds cronworkflows.argoproj.io
+          fi
+          if kubectl get crds workflows.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds workflows.argoproj.io
+          fi
+          if kubectl get crds workflowartifactgctasks.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds workflowartifactgctasks.argoproj.io
+          fi
+          if kubectl get crds workfloweventbindings.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds workfloweventbindings.argoproj.io
+          fi
+          if kubectl get crds workflowtaskresults.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds workflowtaskresults.argoproj.io
+          fi
+          if kubectl get crds workflowtasksets.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds workflowtasksets.argoproj.io
+          fi
+          if kubectl get crds workflowtemplates.argoproj.io > /dev/null 2>&1; then
+            kubectl delete crds workflowtemplates.argoproj.io
           fi
           if kubectl get crds modules.kmm.sigs.x-k8s.io > /dev/null 2>&1; then
             kubectl delete crds modules.kmm.sigs.x-k8s.io
           fi
           if kubectl get crds nodemodulesconfigs.kmm.sigs.x-k8s.io > /dev/null 2>&1; then
             kubectl delete crds nodemodulesconfigs.kmm.sigs.x-k8s.io
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-leftover-deviceconfigs

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-leftover-deviceconfigs

@@ -17,13 +17,13 @@

   ttlSecondsAfterFinished: 60
   template:
     spec:
       serviceAccountName: amd-gpu-operator-gpu-operator-charts-pre-delete
       containers:
       - name: delete-leftover-deviceconfigs
-        image: docker.io/rocm/gpu-operator:v1.4.1
+        image: docker.io/rocm/gpu-operator:v1.5.0
         command:
         - /bin/sh
         - -c
         - |
           installed=$(kubectl api-resources -owide | grep -i amd.com | grep -i deviceconfig)
           if [ -z ${installed} ] ; then
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/pre-upgrade-check

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/pre-upgrade-check

@@ -17,13 +17,13 @@

   ttlSecondsAfterFinished: 60
   template:
     spec:
       serviceAccountName: pre-upgrade-check-sa
       containers:
       - name: pre-upgrade-check
-        image: docker.io/rocm/gpu-operator:v1.4.1
+        image: docker.io/rocm/gpu-operator:v1.5.0
         command:
         - /bin/sh
         - -c
         - |
           # Ignore the lack of CRDs, probably haven't actually been installed yet
           # this provides idempotentcy when "things" don't understand the difference between
@@ -37,13 +37,13 @@

           deviceconfigs=$(kubectl get deviceconfigs -n amd-gpu-operator -o json)
 
           echo "DeviceConfigs JSON:"
           echo "$deviceconfigs" | jq .
 
           # Check if any UpgradeState is in the blocked states
-          blocked_states='["Upgrade-Not-Started", "Upgrade-Started", "Install-In-Progress", "Upgrade-In-Progress"]'
+          blocked_states='["Upgrade-Not-Started", "Upgrade-Started", "Upgrade-In-Progress"]'
           if echo "$deviceconfigs" | jq --argjson blocked_states "$blocked_states" -e '
               .items[] |
               .status.nodeModuleStatus // {} |
               to_entries |
               any(.value.status as $state | ($blocked_states | index($state)))' > /dev/null; then
             echo "Upgrade blocked: Some DeviceConfigs are in a disallowed UpgradeState."
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/upgrade-crd

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/upgrade-crd

@@ -34,17 +34,26 @@

               matchExpressions:
               - key: node-role.kubernetes.io/control-plane
                 operator: Exists
             weight: 1
       containers:
       - name: upgrade-crd
-        image: docker.io/rocm/gpu-operator:v1.4.1
+        image: docker.io/rocm/gpu-operator:v1.5.0
         imagePullPolicy: Always
         command:
         - /bin/sh
         - -c
         - |
-          kubectl apply -f /opt/helm-charts-crds-k8s/deviceconfig-crd.yaml
-          kubectl apply -f /opt/helm-charts-crds-k8s/module-crd.yaml
-          kubectl apply -f /opt/helm-charts-crds-k8s/nodemodulesconfig-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/deviceconfig-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/remediationworkflowstatus-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/module-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/nodemodulesconfig-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/clusterworkflowtemplate-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/cronworkflow-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowartifactgctask-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflow-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workfloweventbinding-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowtaskresult-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowtaskset-crd.yaml
+          kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowtemplate-crd.yaml
       restartPolicy: OnFailure
 
--- HelmRelease: amd-gpu-operator/amd-gpu-operator PriorityClass: amd-gpu-operator/amd-gpu-operator-workflow-controller-pc

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator PriorityClass: amd-gpu-operator/amd-gpu-operator-workflow-controller-pc

@@ -0,0 +1,7 @@

+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: amd-gpu-operator-workflow-controller-pc
+value: 1000000
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/argo

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/argo

@@ -0,0 +1,7 @@

+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: argo
+  namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/amd-gpu-operator-dra-driver

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/amd-gpu-operator-dra-driver

@@ -0,0 +1,12 @@

+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: amd-gpu-operator-dra-driver
+  labels:
+    app.kubernetes.io/component: amd-gpu
+    app.kubernetes.io/part-of: amd-gpu
+    app.kubernetes.io/name: gpu-operator-charts
+    app.kubernetes.io/instance: amd-gpu-operator
+    app.kubernetes.io/managed-by: Helm
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ConfigMap: amd-gpu-operator/amd-gpu-operator-workflow-controller-config

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ConfigMap: amd-gpu-operator/amd-gpu-operator-workflow-controller-config

@@ -0,0 +1,9 @@

+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: amd-gpu-operator-workflow-controller-config
+  namespace: amd-gpu-operator
+data:
+  instanceID: amd-gpu-operator-remediation-workflow
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow

@@ -0,0 +1,39 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: amd-gpu-operator-gpu-operator-charts-argo-workflow
+  labels:
+    app.kubernetes.io/component: amd-gpu
+    app.kubernetes.io/part-of: amd-gpu
+    app.kubernetes.io/name: gpu-operator-charts
+    app.kubernetes.io/instance: amd-gpu-operator
+    app.kubernetes.io/managed-by: Helm
+rules:
+- apiGroups:
+  - argoproj.io
+  resources:
+  - '*'
+  verbs:
+  - '*'
+- apiGroups:
+  - batch
+  resources:
+  - jobs
+  verbs:
+  - '*'
+- apiGroups:
+  - ''
+  resources:
+  - pods/log
+  verbs:
+  - '*'
+- apiGroups:
+  - ''
+  resources:
+  - serviceaccounts
+  verbs:
+  - get
+  - list
+  - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver

@@ -0,0 +1,37 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: amd-gpu-operator-gpu-operator-charts-dra-driver
+  labels:
+    app.kubernetes.io/component: amd-gpu
+    app.kubernetes.io/part-of: amd-gpu
+    app.kubernetes.io/name: gpu-operator-charts
+    app.kubernetes.io/instance: amd-gpu-operator
+    app.kubernetes.io/managed-by: Helm
+rules:
+- apiGroups:
+  - resource.k8s.io
+  resources:
+  - resourceclaims
+  verbs:
+  - get
+- apiGroups:
+  - ''
+  resources:
+  - nodes
+  verbs:
+  - get
+- apiGroups:
+  - resource.k8s.io
+  resources:
+  - resourceslices
+  verbs:
+  - get
+  - list
+  - watch
+  - create
+  - update
+  - patch
+  - delete
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-admin

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-admin

@@ -0,0 +1,35 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  labels:
+    rbac.authorization.k8s.io/aggregate-to-admin: 'true'
+  name: argo-aggregate-to-admin
+rules:
+- apiGroups:
+  - argoproj.io
+  resources:
+  - workflows
+  - workflows/finalizers
+  - workfloweventbindings
+  - workfloweventbindings/finalizers
+  - workflowtemplates
+  - workflowtemplates/finalizers
+  - cronworkflows
+  - cronworkflows/finalizers
+  - clusterworkflowtemplates
+  - clusterworkflowtemplates/finalizers
+  - workflowtasksets
+  - workflowtasksets/finalizers
+  - workflowtaskresults
+  - workflowtaskresults/finalizers
+  verbs:
+  - create
+  - delete
+  - deletecollection
+  - get
+  - list
+  - patch
+  - update
+  - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-edit

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-edit

@@ -0,0 +1,33 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  labels:
+    rbac.authorization.k8s.io/aggregate-to-edit: 'true'
+  name: argo-aggregate-to-edit
+rules:
+- apiGroups:
+  - argoproj.io
+  resources:
+  - workflows
+  - workflows/finalizers
+  - workfloweventbindings
+  - workfloweventbindings/finalizers
+  - workflowtemplates
+  - workflowtemplates/finalizers
+  - cronworkflows
+  - cronworkflows/finalizers
+  - clusterworkflowtemplates
+  - clusterworkflowtemplates/finalizers
+  - workflowtaskresults
+  - workflowtaskresults/finalizers
+  verbs:
+  - create
+  - delete
+  - deletecollection
+  - get
+  - list
+  - patch
+  - update
+  - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-view

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-view

@@ -0,0 +1,28 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  labels:
+    rbac.authorization.k8s.io/aggregate-to-view: 'true'
+  name: argo-aggregate-to-view
+rules:
+- apiGroups:
+  - argoproj.io
+  resources:
+  - workflows
+  - workflows/finalizers
+  - workfloweventbindings
+  - workfloweventbindings/finalizers
+  - workflowtemplates
+  - workflowtemplates/finalizers
+  - cronworkflows
+  - cronworkflows/finalizers
+  - clusterworkflowtemplates
+  - clusterworkflowtemplates/finalizers
+  - workflowtaskresults
+  - workflowtaskresults/finalizers
+  verbs:
+  - get
+  - list
+  - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-cluster-role

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-cluster-role

@@ -0,0 +1,124 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: argo-cluster-role
+rules:
+- apiGroups:
+  - ''
+  resources:
+  - pods
+  - pods/exec
+  verbs:
+  - create
+  - get
+  - list
+  - watch
+  - update
+  - patch
+  - delete
+- apiGroups:
+  - ''
+  resources:
+  - configmaps
+  - nodes
+  - namespaces
+  verbs:
+  - get
+  - watch
+  - list
+- apiGroups:
+  - ''
+  resources:
+  - persistentvolumeclaims
+  - persistentvolumeclaims/finalizers
+  verbs:
+  - create
+  - update
+  - delete
+  - get
+- apiGroups:
+  - argoproj.io
+  resources:
+  - workflows
+  - workflows/finalizers
+  - workflowtasksets
+  - workflowtasksets/finalizers
+  - workflowartifactgctasks
+  verbs:
+  - get
+  - list
+  - watch
+  - update
+  - patch
+  - delete
+  - create
+- apiGroups:
+  - argoproj.io
+  resources:
+  - workflowtemplates
+  - workflowtemplates/finalizers
+  - clusterworkflowtemplates
+  - clusterworkflowtemplates/finalizers
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - argoproj.io
+  resources:
+  - workflowtaskresults
+  verbs:
+  - get
+  - list
+  - watch
+  - create
+  - update
+  - patch
+  - delete
+  - deletecollection
+- apiGroups:
+  - ''
+  resources:
+  - serviceaccounts
+  verbs:
+  - get
+  - list
+- apiGroups:
+  - argoproj.io
+  resources:
+  - cronworkflows
+  - cronworkflows/finalizers
+  verbs:
+  - get
+  - list
+  - watch
+  - update
+  - patch
+  - delete
+- apiGroups:
+  - ''
+  resources:
+  - events
+  verbs:
+  - create
+  - patch
+  - get
+  - list
+- apiGroups:
+  - policy
+  resources:
+  - poddisruptionbudgets
+  verbs:
+  - create
+  - get
+  - delete
+- apiGroups:
+  - ''
+  resourceNames:
+  - argo-workflows-agent-ca-certificates
+  resources:
+  - secrets
+  verbs:
+  - get
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow

@@ -0,0 +1,20 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: amd-gpu-operator-gpu-operator-charts-argo-workflow
+  labels:
+    app.kubernetes.io/component: amd-gpu
+    app.kubernetes.io/part-of: amd-gpu
+    app.kubernetes.io/name: gpu-operator-charts
+    app.kubernetes.io/instance: amd-gpu-operator
+    app.kubernetes.io/managed-by: Helm
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: amd-gpu-operator-gpu-operator-charts-argo-workflow
+subjects:
+- kind: ServiceAccount
+  name: amd-gpu-operator-gpu-operator-charts-controller-manager
+  namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver

@@ -0,0 +1,20 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: amd-gpu-operator-gpu-operator-charts-dra-driver
+  labels:
+    app.kubernetes.io/component: amd-gpu
+    app.kubernetes.io/part-of: amd-gpu
+    app.kubernetes.io/name: gpu-operator-charts
+    app.kubernetes.io/instance: amd-gpu-operator
+    app.kubernetes.io/managed-by: Helm
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: amd-gpu-operator-gpu-operator-charts-dra-driver
+subjects:
+- kind: ServiceAccount
+  name: amd-gpu-operator-dra-driver
+  namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/argo-binding

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/argo-binding

@@ -0,0 +1,14 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: argo-binding
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: argo-cluster-role
+subjects:
+- kind: ServiceAccount
+  name: argo
+  namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Role: amd-gpu-operator/argo-role

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Role: amd-gpu-operator/argo-role

@@ -0,0 +1,22 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: argo-role
+  namespace: amd-gpu-operator
+rules:
+- apiGroups:
+  - coordination.k8s.io
+  resources:
+  - leases
+  verbs:
+  - create
+  - get
+  - update
+- apiGroups:
+  - ''
+  resources:
+  - secrets
+  verbs:
+  - get
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator RoleBinding: amd-gpu-operator/argo-binding

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator RoleBinding: amd-gpu-operator/argo-binding

@@ -0,0 +1,15 @@

+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: argo-binding
+  namespace: amd-gpu-operator
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: argo-role
+subjects:
+- kind: ServiceAccount
+  name: argo
+  namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-workflow-controller

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-workflow-controller

@@ -0,0 +1,66 @@

+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: amd-gpu-operator-workflow-controller
+  namespace: amd-gpu-operator
+spec:
+  selector:
+    matchLabels:
+      app: amd-gpu-operator-workflow-controller
+  template:
+    metadata:
+      labels:
+        app: amd-gpu-operator-workflow-controller
+    spec:
+      affinity:
+        nodeAffinity:
+          preferredDuringSchedulingIgnoredDuringExecution:
+          - preference:
+              matchExpressions:
+              - key: node-role.kubernetes.io/control-plane
+                operator: Exists
+            weight: 1
+      nodeSelector: {}
+      containers:
+      - name: workflow-controller
+        command:
+        - workflow-controller
+        args:
+        - --configmap
+        - amd-gpu-operator-workflow-controller-config
+        env:
+        - name: LEADER_ELECTION_IDENTITY
+          valueFrom:
+            fieldRef:
+              apiVersion: v1
+              fieldPath: metadata.name
+        image: quay.io/argoproj/workflow-controller:v4.0.3
+        livenessProbe:
+          failureThreshold: 3
+          httpGet:
+            path: /healthz
+            port: 6060
+          initialDelaySeconds: 90
+          periodSeconds: 60
+          timeoutSeconds: 30
+        ports:
+        - containerPort: 9090
+          name: metrics
+        - containerPort: 6060
+        securityContext:
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+            - ALL
+          readOnlyRootFilesystem: true
+          runAsNonRoot: true
+      priorityClassName: amd-gpu-operator-workflow-controller-pc
+      securityContext:
+        runAsNonRoot: true
+      serviceAccountName: argo
+      tolerations:
+      - key: amd-gpu-unhealthy
+        operator: Exists
+        effect: NoSchedule
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceClass: amd-gpu-operator/gpu.amd.com

+++ HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceClass: amd-gpu-operator/gpu.amd.com

@@ -0,0 +1,16 @@

+---
+apiVersion: resource.k8s.io/v1
+kind: DeviceClass
+metadata:
+  name: gpu.amd.com
+  labels:
+    app.kubernetes.io/component: amd-gpu
+    app.kubernetes.io/part-of: amd-gpu
+    app.kubernetes.io/name: gpu-operator-charts
+    app.kubernetes.io/instance: amd-gpu-operator
+    app.kubernetes.io/managed-by: Helm
+spec:
+  selectors:
+  - cel:
+      expression: device.driver == 'gpu.amd.com'
+
