Summary
The DCM partition profile documentation (docs/dcm/applying-partition-profiles.rst) has three issues in the toleration step that can cause control plane disruptions on OpenShift clusters.
Bug 1 — Wrong namespace for OpenShift
The OpenShift commands target kube-system, which is nearly empty on OCP. Critical control plane components run in openshift-* namespaces (e.g. openshift-dns, openshift-ovn-kubernetes, openshift-multus, openshift-ingress).
Current (line ~49):
oc get deployments -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} oc patch deployment {} -n kube-system ...
Expected: Patch deployments/daemonsets across the relevant openshift-* namespaces where pods are scheduled on GPU nodes. The existing note (line ~33) mentions this but the actual commands don't implement it.
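One possible way to find the namespaces that need patching is to ask the cluster which openshift-* namespaces actually have pods on a given GPU node. The sketch below is an illustration, not the documented procedure. It assumes jq is installed, and the function names and the node-name argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the DCM docs.

# Read `oc get pods -o json` output on stdin and print the unique
# openshift-* namespaces that own at least one of the listed pods.
openshift_namespaces_from_pods() {
  jq -r '[.items[].metadata.namespace | select(startswith("openshift-"))] | unique | .[]'
}

# Print the openshift-* namespaces that have pods on the node named in $1.
namespaces_on_node() {
  oc get pods --all-namespaces --field-selector "spec.nodeName=$1" -o json |
    openshift_namespaces_from_pods
}

# Example (GPU_NODE is a placeholder for a real node name):
#   namespaces_on_node "$GPU_NODE" | while read -r ns; do
#     oc get deployments,daemonsets -n "$ns" -o name
#   done
```

Deriving the namespace list from the pods on the node avoids hard-coding namespace names, which can differ between cluster versions and configurations.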
Bug 2 — "op": "add" replaces existing tolerations
The JSON patch uses "op": "add" on /spec/template/spec/tolerations. Under RFC 6902, add on an object member that already exists replaces its value, so the entire tolerations array is overwritten with only the DCM toleration. Any pre-existing tolerations (e.g. node-role.kubernetes.io/master, node.kubernetes.io/unreachable) are silently dropped.
Current:
[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "amd-dcm", ...}]}]
Expected: Read existing tolerations, check if amd-dcm already exists, append if not, then write back the merged list.
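A minimal sketch of that read, merge, and write-back flow, assuming jq is available. The function names are hypothetical, and the toleration is passed in as a JSON argument because its full definition lives in the documentation rather than in this report. The read and the patch are two separate calls, so a concurrent change to the workload in between could still be overwritten.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the DCM docs.

# Read a workload's JSON on stdin (e.g. from `oc get daemonset NAME -o json`)
# and print a JSON Patch that keeps every existing toleration and appends the
# toleration given in $1 only if no toleration with the same key is present.
merged_tolerations_patch() {
  jq -c --argjson tol "$1" '
    (.spec.template.spec.tolerations // []) as $cur
    | (if any($cur[]; .key == $tol.key) then $cur else $cur + [$tol] end) as $merged
    | [{op: "add", path: "/spec/template/spec/tolerations", value: $merged}]'
}

# Apply the merged patch to one workload.
# Usage: add_toleration KIND NAME NAMESPACE TOLERATION_JSON
add_toleration() {
  local patch
  patch=$(oc get "$1" "$2" -n "$3" -o json | merged_tolerations_patch "$4") || return 1
  oc patch "$1" "$2" -n "$3" --type=json -p "$patch"
}
```

Because the patch value is the full merged list, "op": "add" is safe here: it creates the array when the workload has no tolerations and otherwise rewrites it with the existing entries intact. Running it twice leaves the list unchanged.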
Bug 3 — Cleanup blindly removes first toleration
The cleanup step (line ~355) uses "op": "remove", "path": "/spec/template/spec/tolerations/0" which removes the toleration at index 0 regardless of what it is. If the DCM toleration isn't first in the array, this deletes a legitimate toleration instead.
Current:
[{"op": "remove", "path": "/spec/template/spec/tolerations/0"}]
Expected: Filter out only the amd-dcm toleration by key, preserving all others.
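The key-based cleanup could look roughly like the following. This is a sketch under the same assumptions as the rest of this report (jq installed, hypothetical function names), not a drop-in replacement for the documented step.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the DCM docs.

# Read a workload's JSON on stdin and print a JSON Patch that rewrites the
# tolerations list without any entry whose key equals $1. Prints nothing
# when no toleration has that key, so the caller can skip the patch.
toleration_removal_patch() {
  jq -c --arg key "$1" '
    (.spec.template.spec.tolerations // []) as $cur
    | ($cur | map(select(.key != $key))) as $kept
    | if ($kept | length) == ($cur | length) then empty
      else [{op: "replace", path: "/spec/template/spec/tolerations", value: $kept}]
      end'
}

# Remove the toleration with the given key from one workload.
# Usage: remove_toleration KIND NAME NAMESPACE KEY
remove_toleration() {
  local patch
  patch=$(oc get "$1" "$2" -n "$3" -o json | toleration_removal_patch "$4") || return 1
  # Nothing to do when the key is absent.
  [ -z "$patch" ] && return 0
  oc patch "$1" "$2" -n "$3" --type=json -p "$patch"
}

# Example: remove_toleration daemonset NAME NAMESPACE amd-dcm
```

Matching on the key instead of an array index means the cleanup does not depend on the order of the tolerations and does nothing on a workload that was never patched.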
Impact
- Bug 1: On OpenShift, the toleration step is a no-op against kube-system, leaving openshift-* DaemonSets unprotected. Applying the NoExecute taint evicts DNS, networking, and ingress pods from the GPU node.
- Bug 2: Drops existing tolerations, which can prevent pods from scheduling on tainted nodes (e.g. control-plane nodes with NoSchedule).
- Bug 3: Silently removes the wrong toleration during cleanup whenever amd-dcm is not at index 0, potentially breaking pod scheduling.
Affected file
docs/dcm/applying-partition-profiles.rst — lines 43-65 (add tolerations) and lines 344-361 (remove tolerations).