feat(helm): update chart gpu-operator-charts ( v1.4.1 ➔ v1.5.0 )#3208
Open
renovate[bot] wants to merge 1 commit into
qgr1-cluster-0 - kustomization
--- k8s/base/amd-gpu-operator/amd-gpu-operator Kustomization: flux-system/amd-gpu-operator-amd-gpu-operator HelmRelease: amd-gpu-operator/amd-gpu-operator
+++ k8s/base/amd-gpu-operator/amd-gpu-operator Kustomization: flux-system/amd-gpu-operator-amd-gpu-operator HelmRelease: amd-gpu-operator/amd-gpu-operator
@@ -13,13 +13,13 @@
spec:
chart: gpu-operator-charts
sourceRef:
kind: HelmRepository
name: amd-gpu-operator-charts
namespace: flux-system
- version: v1.4.1
+ version: v1.5.0
install:
createNamespace: true
remediation:
retries: 50
timeout: 15m
interval: 30m
qgr1-cluster-0 - helmrelease
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-config-manager
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-config-manager
@@ -25,12 +25,13 @@
- nodes
verbs:
- get
- list
- watch
- update
+ - patch
- apiGroups:
- apps
resources:
- daemonsets
verbs:
- get
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-manager-role
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-manager-role
@@ -80,29 +80,32 @@
- update
- watch
- apiGroups:
- amd.com
resources:
- deviceconfigs
+ - remediationworkflowstatuses
verbs:
- create
- get
- list
- patch
- update
- watch
- apiGroups:
- amd.com
resources:
- deviceconfigs/finalizers
+ - remediationworkflowstatuses/finalizers
verbs:
- update
- apiGroups:
- amd.com
resources:
- deviceconfigs/status
+ - remediationworkflowstatuses/status
verbs:
- get
- patch
- update
- apiGroups:
- apiextensions.k8s.io
@@ -132,12 +135,24 @@
verbs:
- create
- get
- update
- watch
- apiGroups:
+ - apps
+ resources:
+ - deployments
+ verbs:
+ - create
+ - delete
+ - get
+ - list
+ - patch
+ - update
+ - watch
+- apiGroups:
- kmm.sigs.x-k8s.io
resources:
- modules
verbs:
- create
- delete
@@ -197,7 +212,21 @@
resources:
- nodefeaturediscoveries/finalizers
- nodefeaturediscoveries/status
verbs:
- get
- update
+- apiGroups:
+ - resource.k8s.io
+ resources:
+ - deviceclasses
+ verbs:
+ - create
+- apiGroups:
+ - security.openshift.io
+ resourceNames:
+ - privileged
+ resources:
+ - securitycontextconstraints
+ verbs:
+ - use
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-controller
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-controller
@@ -41,24 +41,24 @@
nodeSelector: {}
containers:
- args:
- --config=controller_config.yaml
env:
- name: RELATED_IMAGE_WORKER
- value: docker.io/rocm/kernel-module-management-worker:v1.4.1
+ value: docker.io/rocm/kernel-module-management-worker:v1.5.0
- name: OPERATOR_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: RELATED_IMAGE_BUILD
value: gcr.io/kaniko-project/executor:v1.23.2
- name: RELATED_IMAGE_SIGN
- value: docker.io/rocm/kernel-module-management-signimage:v1.4.1
+ value: docker.io/rocm/kernel-module-management-signimage:v1.5.0
- name: KUBERNETES_CLUSTER_DOMAIN
value: cluster.local
- image: docker.io/rocm/kernel-module-management-operator:v1.4.1
+ image: docker.io/rocm/kernel-module-management-operator:v1.5.0
imagePullPolicy: Always
livenessProbe:
httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
@@ -89,12 +89,15 @@
subPath: controller_config.yaml
securityContext:
runAsNonRoot: true
serviceAccountName: amd-gpu-operator-kmm-controller
terminationGracePeriodSeconds: 10
tolerations:
+ - key: amd-gpu-unhealthy
+ operator: Exists
+ effect: NoSchedule
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Equal
value: ''
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-webhook-server
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-kmm-webhook-server
@@ -45,13 +45,13 @@
- --enable-module
- --enable-namespace
- --enable-preflightvalidation
env:
- name: KUBERNETES_CLUSTER_DOMAIN
value: cluster.local
- image: docker.io/rocm/kernel-module-management-webhook-server:v1.4.1
+ image: docker.io/rocm/kernel-module-management-webhook-server:v1.5.0
imagePullPolicy: Always
livenessProbe:
httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
@@ -85,12 +85,15 @@
subPath: controller_config.yaml
securityContext:
runAsNonRoot: true
serviceAccountName: amd-gpu-operator-kmm-controller
terminationGracePeriodSeconds: 10
tolerations:
+ - key: amd-gpu-unhealthy
+ operator: Exists
+ effect: NoSchedule
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Equal
value: ''
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-controller-manager
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-controller-manager
@@ -48,13 +48,15 @@
fieldRef:
fieldPath: metadata.namespace
- name: KUBERNETES_CLUSTER_DOMAIN
value: cluster.local
- name: SIM_ENABLE
value: 'false'
- image: docker.io/rocm/gpu-operator:v1.4.1
+ - name: KMM_WATCH_ENABLED
+ value: 'true'
+ image: docker.io/rocm/gpu-operator:v1.5.0
imagePullPolicy: Always
livenessProbe:
httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
@@ -81,12 +83,15 @@
subPath: controller_manager_config.yaml
securityContext:
runAsNonRoot: true
serviceAccountName: amd-gpu-operator-gpu-operator-charts-controller-manager
terminationGracePeriodSeconds: 10
tolerations:
+ - key: amd-gpu-unhealthy
+ operator: Exists
+ effect: NoSchedule
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Equal
value: ''
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
--- HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceConfig: amd-gpu-operator/default
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceConfig: amd-gpu-operator/default
@@ -15,14 +15,17 @@
driverType: container
image: docker.io/myUserName/driverImage
imageRegistryTLS:
insecure: false
insecureSkipTLSVerify: false
version: 30.20.1
+ imageBuild:
+ gpgKeyURL: ''
+ packageRepoURL: ''
upgradePolicy:
- enable: true
+ enable: false
maxParallelUpgrades: 3
maxUnavailableNodes: 25%
nodeDrainPolicy:
force: true
gracePeriodSeconds: -1
timeoutSeconds: 300
@@ -34,31 +37,35 @@
commonConfig:
initContainerImage: busybox:1.37
utilsContainer:
image: docker.io/rocm/gpu-operator-utils:v1.4.1
imagePullPolicy: IfNotPresent
devicePlugin:
+ enableDevicePlugin: true
devicePluginImage: rocm/k8s-device-plugin:latest
devicePluginImagePullPolicy: IfNotPresent
enableNodeLabeller: true
nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
nodeLabellerImagePullPolicy: IfNotPresent
upgradePolicy:
maxUnavailable: 1
upgradeStrategy: RollingUpdate
+ kubeletSocketPath: /var/lib/kubelet/device-plugins
+ hostNetwork: false
metricsExporter:
enable: true
serviceType: ClusterIP
port: 5000
nodePort: 32500
image: docker.io/rocm/device-metrics-exporter:v1.4.2
imagePullPolicy: IfNotPresent
upgradePolicy:
maxUnavailable: 1
upgradeStrategy: RollingUpdate
podResourceAPISocketPath: /var/lib/kubelet/pod-resources
+ hostNetwork: false
rbacConfig:
enable: false
image: quay.io/brancz/kube-rbac-proxy:v0.18.1
disableHttps: false
staticAuthorization:
clientName: ''
@@ -77,14 +84,27 @@
hostPath: /var/log/amd-test-runner
logsExportSecrets: []
mountPath: /var/log/amd-test-runner
upgradePolicy:
maxUnavailable: 1
upgradeStrategy: RollingUpdate
+ draDriver:
+ enable: false
+ image: docker.io/rocm/k8s-gpu-dra-driver:latest
+ imagePullPolicy: IfNotPresent
+ upgradePolicy:
+ maxUnavailable: 1
+ upgradeStrategy: RollingUpdate
configManager:
enable: false
image: docker.io/rocm/device-config-manager:v1.4.1
imagePullPolicy: IfNotPresent
upgradePolicy:
maxUnavailable: 1
upgradeStrategy: RollingUpdate
+ remediationWorkflow:
+ enable: false
+ ttlForFailedWorkflows: 24h
+ testerImage: docker.io/rocm/test-runner:v1.5.0
+ autoStartWorkflow: true
+ rebootTimeout: 15m
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-custom-resource-definitions
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-custom-resource-definitions
@@ -17,19 +17,46 @@
ttlSecondsAfterFinished: 60
template:
spec:
serviceAccountName: amd-gpu-operator-gpu-operator-charts-prune
containers:
- name: delete-custom-resource-definitions
- image: docker.io/rocm/gpu-operator:v1.4.1
+ image: docker.io/rocm/gpu-operator:v1.5.0
command:
- /bin/sh
- -c
- |
if kubectl get crds deviceconfigs.amd.com > /dev/null 2>&1; then
kubectl delete crds deviceconfigs.amd.com
+ fi
+ if kubectl get crds remediationworkflowstatuses.amd.com > /dev/null 2>&1; then
+ kubectl delete crds remediationworkflowstatuses.amd.com
+ fi
+ if kubectl get crds clusterworkflowtemplates.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds clusterworkflowtemplates.argoproj.io
+ fi
+ if kubectl get crds cronworkflows.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds cronworkflows.argoproj.io
+ fi
+ if kubectl get crds workflows.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds workflows.argoproj.io
+ fi
+ if kubectl get crds workflowartifactgctasks.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds workflowartifactgctasks.argoproj.io
+ fi
+ if kubectl get crds workfloweventbindings.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds workfloweventbindings.argoproj.io
+ fi
+ if kubectl get crds workflowtaskresults.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds workflowtaskresults.argoproj.io
+ fi
+ if kubectl get crds workflowtasksets.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds workflowtasksets.argoproj.io
+ fi
+ if kubectl get crds workflowtemplates.argoproj.io > /dev/null 2>&1; then
+ kubectl delete crds workflowtemplates.argoproj.io
fi
if kubectl get crds modules.kmm.sigs.x-k8s.io > /dev/null 2>&1; then
kubectl delete crds modules.kmm.sigs.x-k8s.io
fi
if kubectl get crds nodemodulesconfigs.kmm.sigs.x-k8s.io > /dev/null 2>&1; then
kubectl delete crds nodemodulesconfigs.kmm.sigs.x-k8s.io
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-leftover-deviceconfigs
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/delete-leftover-deviceconfigs
@@ -17,13 +17,13 @@
ttlSecondsAfterFinished: 60
template:
spec:
serviceAccountName: amd-gpu-operator-gpu-operator-charts-pre-delete
containers:
- name: delete-leftover-deviceconfigs
- image: docker.io/rocm/gpu-operator:v1.4.1
+ image: docker.io/rocm/gpu-operator:v1.5.0
command:
- /bin/sh
- -c
- |
installed=$(kubectl api-resources -owide | grep -i amd.com | grep -i deviceconfig)
if [ -z ${installed} ] ; then
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/pre-upgrade-check
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/pre-upgrade-check
@@ -17,13 +17,13 @@
ttlSecondsAfterFinished: 60
template:
spec:
serviceAccountName: pre-upgrade-check-sa
containers:
- name: pre-upgrade-check
- image: docker.io/rocm/gpu-operator:v1.4.1
+ image: docker.io/rocm/gpu-operator:v1.5.0
command:
- /bin/sh
- -c
- |
# Ignore the lack of CRDs, probably haven't actually been installed yet
# this provides idempotentcy when "things" don't understand the difference between
@@ -37,13 +37,13 @@
deviceconfigs=$(kubectl get deviceconfigs -n amd-gpu-operator -o json)
echo "DeviceConfigs JSON:"
echo "$deviceconfigs" | jq .
# Check if any UpgradeState is in the blocked states
- blocked_states='["Upgrade-Not-Started", "Upgrade-Started", "Install-In-Progress", "Upgrade-In-Progress"]'
+ blocked_states='["Upgrade-Not-Started", "Upgrade-Started", "Upgrade-In-Progress"]'
if echo "$deviceconfigs" | jq --argjson blocked_states "$blocked_states" -e '
.items[] |
.status.nodeModuleStatus // {} |
to_entries |
any(.value.status as $state | ($blocked_states | index($state)))' > /dev/null; then
echo "Upgrade blocked: Some DeviceConfigs are in a disallowed UpgradeState."
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/upgrade-crd
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Job: amd-gpu-operator/upgrade-crd
@@ -34,17 +34,26 @@
matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: Exists
weight: 1
containers:
- name: upgrade-crd
- image: docker.io/rocm/gpu-operator:v1.4.1
+ image: docker.io/rocm/gpu-operator:v1.5.0
imagePullPolicy: Always
command:
- /bin/sh
- -c
- |
- kubectl apply -f /opt/helm-charts-crds-k8s/deviceconfig-crd.yaml
- kubectl apply -f /opt/helm-charts-crds-k8s/module-crd.yaml
- kubectl apply -f /opt/helm-charts-crds-k8s/nodemodulesconfig-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/deviceconfig-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/remediationworkflowstatus-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/module-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/nodemodulesconfig-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/clusterworkflowtemplate-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/cronworkflow-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowartifactgctask-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflow-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workfloweventbinding-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowtaskresult-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowtaskset-crd.yaml
+ kubectl apply --server-side --force-conflicts -f /opt/helm-charts-crds-k8s/workflowtemplate-crd.yaml
restartPolicy: OnFailure
--- HelmRelease: amd-gpu-operator/amd-gpu-operator PriorityClass: amd-gpu-operator/amd-gpu-operator-workflow-controller-pc
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator PriorityClass: amd-gpu-operator/amd-gpu-operator-workflow-controller-pc
@@ -0,0 +1,7 @@
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+ name: amd-gpu-operator-workflow-controller-pc
+value: 1000000
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/argo
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/argo
@@ -0,0 +1,7 @@
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: argo
+ namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/amd-gpu-operator-dra-driver
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ServiceAccount: amd-gpu-operator/amd-gpu-operator-dra-driver
@@ -0,0 +1,12 @@
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: amd-gpu-operator-dra-driver
+ labels:
+ app.kubernetes.io/component: amd-gpu
+ app.kubernetes.io/part-of: amd-gpu
+ app.kubernetes.io/name: gpu-operator-charts
+ app.kubernetes.io/instance: amd-gpu-operator
+ app.kubernetes.io/managed-by: Helm
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ConfigMap: amd-gpu-operator/amd-gpu-operator-workflow-controller-config
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ConfigMap: amd-gpu-operator/amd-gpu-operator-workflow-controller-config
@@ -0,0 +1,9 @@
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: amd-gpu-operator-workflow-controller-config
+ namespace: amd-gpu-operator
+data:
+ instanceID: amd-gpu-operator-remediation-workflow
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow
@@ -0,0 +1,39 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ name: amd-gpu-operator-gpu-operator-charts-argo-workflow
+ labels:
+ app.kubernetes.io/component: amd-gpu
+ app.kubernetes.io/part-of: amd-gpu
+ app.kubernetes.io/name: gpu-operator-charts
+ app.kubernetes.io/instance: amd-gpu-operator
+ app.kubernetes.io/managed-by: Helm
+rules:
+- apiGroups:
+ - argoproj.io
+ resources:
+ - '*'
+ verbs:
+ - '*'
+- apiGroups:
+ - batch
+ resources:
+ - jobs
+ verbs:
+ - '*'
+- apiGroups:
+ - ''
+ resources:
+ - pods/log
+ verbs:
+ - '*'
+- apiGroups:
+ - ''
+ resources:
+ - serviceaccounts
+ verbs:
+ - get
+ - list
+ - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver
@@ -0,0 +1,37 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ name: amd-gpu-operator-gpu-operator-charts-dra-driver
+ labels:
+ app.kubernetes.io/component: amd-gpu
+ app.kubernetes.io/part-of: amd-gpu
+ app.kubernetes.io/name: gpu-operator-charts
+ app.kubernetes.io/instance: amd-gpu-operator
+ app.kubernetes.io/managed-by: Helm
+rules:
+- apiGroups:
+ - resource.k8s.io
+ resources:
+ - resourceclaims
+ verbs:
+ - get
+- apiGroups:
+ - ''
+ resources:
+ - nodes
+ verbs:
+ - get
+- apiGroups:
+ - resource.k8s.io
+ resources:
+ - resourceslices
+ verbs:
+ - get
+ - list
+ - watch
+ - create
+ - update
+ - patch
+ - delete
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-admin
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-admin
@@ -0,0 +1,35 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ labels:
+ rbac.authorization.k8s.io/aggregate-to-admin: 'true'
+ name: argo-aggregate-to-admin
+rules:
+- apiGroups:
+ - argoproj.io
+ resources:
+ - workflows
+ - workflows/finalizers
+ - workfloweventbindings
+ - workfloweventbindings/finalizers
+ - workflowtemplates
+ - workflowtemplates/finalizers
+ - cronworkflows
+ - cronworkflows/finalizers
+ - clusterworkflowtemplates
+ - clusterworkflowtemplates/finalizers
+ - workflowtasksets
+ - workflowtasksets/finalizers
+ - workflowtaskresults
+ - workflowtaskresults/finalizers
+ verbs:
+ - create
+ - delete
+ - deletecollection
+ - get
+ - list
+ - patch
+ - update
+ - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-edit
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-edit
@@ -0,0 +1,33 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ labels:
+ rbac.authorization.k8s.io/aggregate-to-edit: 'true'
+ name: argo-aggregate-to-edit
+rules:
+- apiGroups:
+ - argoproj.io
+ resources:
+ - workflows
+ - workflows/finalizers
+ - workfloweventbindings
+ - workfloweventbindings/finalizers
+ - workflowtemplates
+ - workflowtemplates/finalizers
+ - cronworkflows
+ - cronworkflows/finalizers
+ - clusterworkflowtemplates
+ - clusterworkflowtemplates/finalizers
+ - workflowtaskresults
+ - workflowtaskresults/finalizers
+ verbs:
+ - create
+ - delete
+ - deletecollection
+ - get
+ - list
+ - patch
+ - update
+ - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-view
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-aggregate-to-view
@@ -0,0 +1,28 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ labels:
+ rbac.authorization.k8s.io/aggregate-to-view: 'true'
+ name: argo-aggregate-to-view
+rules:
+- apiGroups:
+ - argoproj.io
+ resources:
+ - workflows
+ - workflows/finalizers
+ - workfloweventbindings
+ - workfloweventbindings/finalizers
+ - workflowtemplates
+ - workflowtemplates/finalizers
+ - cronworkflows
+ - cronworkflows/finalizers
+ - clusterworkflowtemplates
+ - clusterworkflowtemplates/finalizers
+ - workflowtaskresults
+ - workflowtaskresults/finalizers
+ verbs:
+ - get
+ - list
+ - watch
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-cluster-role
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRole: amd-gpu-operator/argo-cluster-role
@@ -0,0 +1,124 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+ name: argo-cluster-role
+rules:
+- apiGroups:
+ - ''
+ resources:
+ - pods
+ - pods/exec
+ verbs:
+ - create
+ - get
+ - list
+ - watch
+ - update
+ - patch
+ - delete
+- apiGroups:
+ - ''
+ resources:
+ - configmaps
+ - nodes
+ - namespaces
+ verbs:
+ - get
+ - watch
+ - list
+- apiGroups:
+ - ''
+ resources:
+ - persistentvolumeclaims
+ - persistentvolumeclaims/finalizers
+ verbs:
+ - create
+ - update
+ - delete
+ - get
+- apiGroups:
+ - argoproj.io
+ resources:
+ - workflows
+ - workflows/finalizers
+ - workflowtasksets
+ - workflowtasksets/finalizers
+ - workflowartifactgctasks
+ verbs:
+ - get
+ - list
+ - watch
+ - update
+ - patch
+ - delete
+ - create
+- apiGroups:
+ - argoproj.io
+ resources:
+ - workflowtemplates
+ - workflowtemplates/finalizers
+ - clusterworkflowtemplates
+ - clusterworkflowtemplates/finalizers
+ verbs:
+ - get
+ - list
+ - watch
+- apiGroups:
+ - argoproj.io
+ resources:
+ - workflowtaskresults
+ verbs:
+ - get
+ - list
+ - watch
+ - create
+ - update
+ - patch
+ - delete
+ - deletecollection
+- apiGroups:
+ - ''
+ resources:
+ - serviceaccounts
+ verbs:
+ - get
+ - list
+- apiGroups:
+ - argoproj.io
+ resources:
+ - cronworkflows
+ - cronworkflows/finalizers
+ verbs:
+ - get
+ - list
+ - watch
+ - update
+ - patch
+ - delete
+- apiGroups:
+ - ''
+ resources:
+ - events
+ verbs:
+ - create
+ - patch
+ - get
+ - list
+- apiGroups:
+ - policy
+ resources:
+ - poddisruptionbudgets
+ verbs:
+ - create
+ - get
+ - delete
+- apiGroups:
+ - ''
+ resourceNames:
+ - argo-workflows-agent-ca-certificates
+ resources:
+ - secrets
+ verbs:
+ - get
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-argo-workflow
@@ -0,0 +1,20 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+ name: amd-gpu-operator-gpu-operator-charts-argo-workflow
+ labels:
+ app.kubernetes.io/component: amd-gpu
+ app.kubernetes.io/part-of: amd-gpu
+ app.kubernetes.io/name: gpu-operator-charts
+ app.kubernetes.io/instance: amd-gpu-operator
+ app.kubernetes.io/managed-by: Helm
+roleRef:
+ apiGroup: rbac.authorization.k8s.io
+ kind: ClusterRole
+ name: amd-gpu-operator-gpu-operator-charts-argo-workflow
+subjects:
+- kind: ServiceAccount
+ name: amd-gpu-operator-gpu-operator-charts-controller-manager
+ namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/amd-gpu-operator-gpu-operator-charts-dra-driver
@@ -0,0 +1,20 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+ name: amd-gpu-operator-gpu-operator-charts-dra-driver
+ labels:
+ app.kubernetes.io/component: amd-gpu
+ app.kubernetes.io/part-of: amd-gpu
+ app.kubernetes.io/name: gpu-operator-charts
+ app.kubernetes.io/instance: amd-gpu-operator
+ app.kubernetes.io/managed-by: Helm
+roleRef:
+ apiGroup: rbac.authorization.k8s.io
+ kind: ClusterRole
+ name: amd-gpu-operator-gpu-operator-charts-dra-driver
+subjects:
+- kind: ServiceAccount
+ name: amd-gpu-operator-dra-driver
+ namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/argo-binding
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator ClusterRoleBinding: amd-gpu-operator/argo-binding
@@ -0,0 +1,14 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+ name: argo-binding
+roleRef:
+ apiGroup: rbac.authorization.k8s.io
+ kind: ClusterRole
+ name: argo-cluster-role
+subjects:
+- kind: ServiceAccount
+ name: argo
+ namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Role: amd-gpu-operator/argo-role
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Role: amd-gpu-operator/argo-role
@@ -0,0 +1,22 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+ name: argo-role
+ namespace: amd-gpu-operator
+rules:
+- apiGroups:
+ - coordination.k8s.io
+ resources:
+ - leases
+ verbs:
+ - create
+ - get
+ - update
+- apiGroups:
+ - ''
+ resources:
+ - secrets
+ verbs:
+ - get
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator RoleBinding: amd-gpu-operator/argo-binding
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator RoleBinding: amd-gpu-operator/argo-binding
@@ -0,0 +1,15 @@
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+ name: argo-binding
+ namespace: amd-gpu-operator
+roleRef:
+ apiGroup: rbac.authorization.k8s.io
+ kind: Role
+ name: argo-role
+subjects:
+- kind: ServiceAccount
+ name: argo
+ namespace: amd-gpu-operator
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-workflow-controller
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator Deployment: amd-gpu-operator/amd-gpu-operator-workflow-controller
@@ -0,0 +1,66 @@
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: amd-gpu-operator-workflow-controller
+ namespace: amd-gpu-operator
+spec:
+ selector:
+ matchLabels:
+ app: amd-gpu-operator-workflow-controller
+ template:
+ metadata:
+ labels:
+ app: amd-gpu-operator-workflow-controller
+ spec:
+ affinity:
+ nodeAffinity:
+ preferredDuringSchedulingIgnoredDuringExecution:
+ - preference:
+ matchExpressions:
+ - key: node-role.kubernetes.io/control-plane
+ operator: Exists
+ weight: 1
+ nodeSelector: {}
+ containers:
+ - name: workflow-controller
+ command:
+ - workflow-controller
+ args:
+ - --configmap
+ - amd-gpu-operator-workflow-controller-config
+ env:
+ - name: LEADER_ELECTION_IDENTITY
+ valueFrom:
+ fieldRef:
+ apiVersion: v1
+ fieldPath: metadata.name
+ image: quay.io/argoproj/workflow-controller:v4.0.3
+ livenessProbe:
+ failureThreshold: 3
+ httpGet:
+ path: /healthz
+ port: 6060
+ initialDelaySeconds: 90
+ periodSeconds: 60
+ timeoutSeconds: 30
+ ports:
+ - containerPort: 9090
+ name: metrics
+ - containerPort: 6060
+ securityContext:
+ allowPrivilegeEscalation: false
+ capabilities:
+ drop:
+ - ALL
+ readOnlyRootFilesystem: true
+ runAsNonRoot: true
+ priorityClassName: amd-gpu-operator-workflow-controller-pc
+ securityContext:
+ runAsNonRoot: true
+ serviceAccountName: argo
+ tolerations:
+ - key: amd-gpu-unhealthy
+ operator: Exists
+ effect: NoSchedule
+
--- HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceClass: amd-gpu-operator/gpu.amd.com
+++ HelmRelease: amd-gpu-operator/amd-gpu-operator DeviceClass: amd-gpu-operator/gpu.amd.com
@@ -0,0 +1,16 @@
+---
+apiVersion: resource.k8s.io/v1
+kind: DeviceClass
+metadata:
+ name: gpu.amd.com
+ labels:
+ app.kubernetes.io/component: amd-gpu
+ app.kubernetes.io/part-of: amd-gpu
+ app.kubernetes.io/name: gpu-operator-charts
+ app.kubernetes.io/instance: amd-gpu-operator
+ app.kubernetes.io/managed-by: Helm
+spec:
+ selectors:
+ - cel:
+ expression: device.driver == 'gpu.amd.com'
+
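Reviewer note on the new DeviceClass at the end of this diff: with the DRA driver enabled, a workload would request a GPU by referencing gpu.amd.com in a resource claim. A minimal sketch, assuming the cluster serves resource.k8s.io/v1 (the version rendered above). The names single-amd-gpu and dra-gpu-example are hypothetical, and busybox:1.37 is reused from the chart defaults purely as a placeholder image.

```yaml
# Hypothetical claim template that selects one device from the new DeviceClass.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-amd-gpu          # hypothetical name
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.amd.com   # DeviceClass added by this chart version
---
# Hypothetical pod that consumes the claim.
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-example         # hypothetical name
  namespace: default
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-amd-gpu
  containers:
  - name: workload
    image: busybox:1.37         # placeholder image, not a GPU workload
    command: ["sleep", "3600"]
    resources:
      claims:
      - name: gpu               # must match spec.resourceClaims[].name
```

With draDriver.enable left at false, as rendered in this diff, nothing is expected to publish devices for the class, so a claim like this would presumably stay unallocated until the DRA driver is turned on.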
This PR contains the following updates:
v1.4.1 → v1.5.0
Release Notes
ROCm/gpu-operator (gpu-operator-charts)
v1.5.0: gpu-operator-charts-v1.5.0
GPU Operator v1.5.0 Release Notes
The AMD GPU Operator v1.5.0 release introduces support for Dynamic Resource Allocation (DRA) as an alternative to the device plugin, Auto Node Remediation (ANR) for automated recovery of unhealthy GPU worker nodes, and Node Problem Detector (NPD) integration for surfacing GPU-related node conditions. The release also brings broader configurability across the operator (KMM, kubelet socket path, custom package repositories, host network configs, global image pull secrets) and updates the managed stack to ROCm 7.2.1.
Release Highlights
Dynamic Resource Allocation (DRA) Driver Support
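As a rough sketch of the settings this section covers, a DeviceConfig that switches from the device plugin to the DRA driver might look like the fragment below. The field names come from the rendered DeviceConfig in this PR's diff. The apiVersion is an assumption and should be checked against the installed CRD.

```yaml
# Hypothetical DeviceConfig fragment: DRA driver on, device plugin off.
apiVersion: amd.com/v1alpha1    # assumed version; only the amd.com group appears in this PR
kind: DeviceConfig
metadata:
  name: default
  namespace: amd-gpu-operator
spec:
  devicePlugin:
    enableDevicePlugin: false   # the notes say both cannot be true on the same DeviceConfig
  draDriver:
    enable: true                # rendered default in this PR is false
    image: docker.io/rocm/k8s-gpu-dra-driver:latest
    imagePullPolicy: IfNotPresent
```

Both flags are set explicitly because the rendered default for enableDevicePlugin is true, and the operator is described as rejecting a DeviceConfig that enables both.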
- spec.draDriver section in DeviceConfig enables and configures the DRA driver (image, command-line arguments, image pull policy, etc.). The default image is rocm/k8s-gpu-dra-driver:latest.
- [...] draDriver.enable=true and devicePlugin.enableDevicePlugin=true on the same DeviceConfig.
- DeviceClass is created automatically by the operator. The operator dynamically discovers the highest available DRA API version (resource.k8s.io v1 → v1beta2 → v1beta1) so it works across Kubernetes versions from 1.32 to 1.35 and OpenShift 4.21.
- [...] SecurityContextConstraints) is created by the operator out-of-the-box.
- [...] DynamicResourceAllocation feature gate enabled, the amdgpu kernel module loaded on worker nodes, and CDI enabled in the container runtime (default in containerd 2.0+ / CRI-O).
Auto Node Remediation (ANR) for GPU Worker Nodes
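A sketch of the remediationWorkflow block behind this feature, using the defaults rendered in this PR's DeviceConfig diff with only enable flipped to true. Placing the block directly under spec is an assumption, since the rendered diff has lost its indentation.

```yaml
# DeviceConfig fragment (not a complete manifest).
spec:
  remediationWorkflow:
    enable: true                  # rendered default in this PR is false
    ttlForFailedWorkflows: 24h
    testerImage: docker.io/rocm/test-runner:v1.5.0
    autoStartWorkflow: true
    rebootTimeout: 15m
```

The chart also ships an Argo workflow controller Deployment and its RBAC regardless of this flag, as the diff shows, so enabling the feature later should not require another chart change.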
- [...] DeviceConfig CR. Customizable behaviors include:
  - nodeRemediationLabels / node selector for which nodes ANR applies to
  - maxParallelWorkflows to throttle the number of concurrent remediations
  - recoveryPolicy and drainPolicy to control how nodes are drained, recovered, and reset
- ConfigMapImage can be overridden, and global image pull secrets are honored throughout the workflow
- [...] ImagePullBackOff.
Node Problem Detector (NPD) Integration
Enhanced KMM (Kernel Module Management) Configuration Control
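For this repository's Flux HelmRelease, the two chart values described in this section would be set under spec.values. The fragment below sketches the "skip KMM entirely" combination named in the notes. The apparent link between kmm.watch and the KMM_WATCH_ENABLED variable added to the controller-manager Deployment in this diff is an inference, not something the notes state.

```yaml
# HelmRelease fragment (not a complete manifest).
spec:
  values:
    kmm:
      enabled: false   # do not install the KMM subchart
      watch: false     # GPU operator does not watch KMM resources
```

Leaving both keys unset keeps the chart defaults (true for each), which matches what this PR renders.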
- kmm.enabled: Controls KMM subchart installation (default: true)
- kmm.watch: Controls GPU operator watching KMM resources (default: true)
- [...] enabled=false, watch=true), skip KMM entirely for alternative driver solutions (enabled=false, watch=false), or install KMM without asking for GPU Operator to use it (enabled=true, watch=false)
- [...] kmm.enabled=false continue to work without changes
Custom Package Repository URLs for Driver Image Build
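A sketch of the two imageBuild fields for a cluster that builds the driver image against an internal mirror. Both URLs are hypothetical placeholders. The rendered defaults in this PR are empty strings.

```yaml
# DeviceConfig fragment (not a complete manifest).
spec:
  driver:
    imageBuild:
      packageRepoURL: https://mirror.example.internal/amdgpu/         # hypothetical mirror URL
      gpgKeyURL: https://mirror.example.internal/amdgpu/rocm.gpg.key  # hypothetical key URL
```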
- [...] spec.driver.imageBuild allow specifying custom package repository and GPG key URLs for in-cluster amdgpu driver image builds:
  - packageRepoURL: full URL to the package repo
  - gpgKeyURL: full URL to the GPG key
- [...] repo.radeon.com is mirrored, restructured, or unreachable.
Configurable Kubelet Socket Paths
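A sketch for a node image whose kubelet root directory is not /var/lib/kubelet. The /data/kubelet prefix is a hypothetical example. The field names and their defaults are the ones rendered in this PR's DeviceConfig diff.

```yaml
# DeviceConfig fragment (not a complete manifest).
spec:
  devicePlugin:
    kubeletSocketPath: /data/kubelet/device-plugins        # default: /var/lib/kubelet/device-plugins
  metricsExporter:
    podResourceAPISocketPath: /data/kubelet/pod-resources  # default: /var/lib/kubelet/pod-resources
```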
- spec.devicePlugin.kubeletSocketPath field on the DeviceConfig CR allows users to override the default kubelet device-plugins directory (/var/lib/kubelet/device-plugins).
- spec.metricsExporter.podResourceAPISocketPath field allows users to override the default kubelet pod-resources directory (/var/lib/kubelet/pod-resources).
Host Network Support for Device Plugin and Metrics Exporter
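A sketch that turns on host networking for the metrics exporter only and leaves the device plugin on the pod network. This PR renders both flags as false. The choice of which one to enable here is purely illustrative.

```yaml
# DeviceConfig fragment (not a complete manifest).
spec:
  devicePlugin:
    hostNetwork: false   # rendered default
  metricsExporter:
    hostNetwork: true    # exporter pods share the node's network namespace
```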
- hostNetwork field on spec.devicePlugin and spec.metricsExporter allows their respective DaemonSet pods to run in the host network namespace, simplifying integration with host-level monitoring and networking stacks.
Global Image Pull Secrets Injection
- spec.commonConfig.imageRegistrySecrets field on DeviceConfig (and global.imagePullSecrets in the Helm chart) lets users specify image pull secrets that are injected into all operator-managed workloads (driver build/sign pods, device plugin, metrics exporter, node labeller, DCM, DRA driver, ANR workflow, test runner, etc.), reducing duplicated configuration in air-gapped and private-registry environments.
Open Source GPU Validation Cluster Example
- example/gpu-validation-cluster/ directory has been open-sourced, providing a turnkey reference for standing up a GPU validation cluster (Dockerfile, configs, and helper scripts) for users who want to evaluate the GPU / AINIC end-to-end.
Node Feature Discovery (NFD) Upgrade
Device Metrics Exporter Enhancements
- [...] /var/run/gpuagent.sock, replacing TCP/IP for lower latency and improved security.
- GPU_PROCESS_CU_OCCUPANCY: per-process Compute Unit occupancy, with a process_id label.
- GPU_ECC_DEFERRED_*: deferred ECC error counts for each ECC-supported block.
- SamplingInterval: configurable profiler sampling window (default 1000 µs / 1 ms).
- PollingRate field in HealthService (under CommonConfig) accepts duration strings (30s, 5m, 1h, 23h10m15s); default 30 s, min 30 s, max 24 h.
- KFD_PROCESS_ID Label Now Optional: disabled by default to reduce metric cardinality and Prometheus storage cost. Users who need it can re-enable it via the exporter ConfigMap.
Platform Support
Fixes
Helm upgrade failure
DCM: missing default ConfigMap when spec.configManager.config is omitted
- [...] ConfigMap when no custom configuration is provided, instead of failing to start.
CVE remediations
- grpc upgraded to address CVE-2026-33186.
Known Limitations
- [...] DeviceConfig. The operator validates this and rejects configurations where both draDriver.enable=true and devicePlugin.enableDevicePlugin=true.
Configuration
📅 Schedule: (in timezone America/New_York)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.