From ed15c45cdb793154ab6e9d4c1ad9da0f9ca90530 Mon Sep 17 00:00:00 2001
From: Yuan Chen <yuanchen97@gmail.com>
Date: Wed, 10 Jun 2026 18:41:18 -0700
Subject: [PATCH] feat(recipes): add A100 GKE COS training Kubeflow overlay
 chain

Add A100 GKE overlays (issue #1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (#1294), AKS (#1295), and EKS (#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: #1002
---
 docs/integrator/gke-tcpxo-networking.md       |   4 +-
 recipes/overlays/a100-any.yaml                |  54 ++++++++++
 .../a100-gke-cos-training-kubeflow.yaml       |  49 +++++++++
 recipes/overlays/a100-gke-cos-training.yaml   | 102 ++++++++++++++++++
 4 files changed, 208 insertions(+), 1 deletion(-)
 create mode 100644 recipes/overlays/a100-any.yaml
 create mode 100644 recipes/overlays/a100-gke-cos-training-kubeflow.yaml
 create mode 100644 recipes/overlays/a100-gke-cos-training.yaml

diff --git a/docs/integrator/gke-tcpxo-networking.md b/docs/integrator/gke-tcpxo-networking.md
index 28b869867..6935f7b3a 100644
--- a/docs/integrator/gke-tcpxo-networking.md
+++ b/docs/integrator/gke-tcpxo-networking.md
@@ -1,6 +1,8 @@
 # GKE TCPXO Networking Prerequisites
 
-For `*-gke-cos-training*` recipes, GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).
+For the **H100 GKE COS training** recipes (`h100-gke-cos-training*`, on `a3-megagpu-8g` nodes), GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, the NVIDIA Collective Communications Library (NCCL) falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).
+
+> **A100 (a2) exception:** the `a100-gke-cos-training*` recipes intentionally omit the `gke-nccl-tcpxo` component — GPUDirect TCPXO targets H100 `a3-megagpu-8g` nodes, not the A100 `a2-highgpu`/`a2-ultragpu` machine family. The prerequisites below do **not** apply to A100 GKE recipes, and the generated A100 bundle does not install the TCPXO DaemonSets.
 
 ## Infrastructure Prerequisites
 
diff --git a/recipes/overlays/a100-any.yaml b/recipes/overlays/a100-any.yaml
new file mode 100644
index 000000000..7199be02c
--- /dev/null
+++ b/recipes/overlays/a100-any.yaml
@@ -0,0 +1,54 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Cross-cutting overlay applied via criteria-wildcard matching.
+#
+# Carries the deployment-phase floor (the 4 standard checks plus the
+# gpu-operator version pin) and applies to every A100 query regardless
+# of service or intent, so every concrete A100 leaf (training or
+# inference, any service) inherits the version pin.
+#
+# Per-field union merge (see pkg/recipe/metadata.go) means concrete leaves
+# that declare their own `deployment:` block add to or override this floor
+# without dropping its inherited checks — same-name constraints from the
+# leaf win, additional checks are appended.
+#
+# See docs/contributor/recipe.md#criteria-wildcard-overlays for details.
+
+kind: RecipeMetadata
+apiVersion: aicr.nvidia.com/v1alpha1
+metadata:
+  name: a100-any
+
+spec:
+  base: base
+
+  criteria:
+    service: any
+    accelerator: a100
+
+  validation:
+    deployment:
+      checks:
+        - operator-health
+        - expected-resources
+        - gpu-operator-version
+        - check-nvidia-smi
+      constraints:
+        # A100 has been supported since the early gpu-operator releases
+        # (v22.9). Floor at the same generation baseline as H100/H200
+        # (v24.6.0) rather than the Blackwell floor; concrete leaves can
+        # tighten if they pin to a later working version.
+        - name: Deployment.gpu-operator.version
+          value: ">= v24.6.0"
diff --git a/recipes/overlays/a100-gke-cos-training-kubeflow.yaml b/recipes/overlays/a100-gke-cos-training-kubeflow.yaml
new file mode 100644
index 000000000..22520f5d8
--- /dev/null
+++ b/recipes/overlays/a100-gke-cos-training-kubeflow.yaml
@@ -0,0 +1,49 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+kind: RecipeMetadata
+apiVersion: aicr.nvidia.com/v1alpha1
+metadata:
+  name: a100-gke-cos-training-kubeflow
+
+spec:
+  # Inherits from a100-gke-cos-training recipe (A100 + GKE COS + training settings)
+  # This overlay adds Kubeflow Training Operator for distributed training with TrainJob
+  base: a100-gke-cos-training
+
+  criteria:
+    service: gke
+    accelerator: a100
+    os: cos
+    intent: training
+    platform: kubeflow
+
+  # Constraints for A100 on GKE COS for Kubeflow training workloads
+  constraints:
+    - name: K8s.server.version
+      value: ">= 1.30"
+
+  # Kubeflow Training Operator for TrainJob support.
+  # Declared inline (not via the platform-kubeflow mixin) to match the GKE COS
+  # pattern in h100-gke-cos-training-kubeflow.
+  componentRefs:
+    - name: kubeflow-trainer
+      type: Helm
+      valuesFile: components/kubeflow-trainer/values.yaml
+      manifestFiles:
+        - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
+      dependencyRefs:
+        - cert-manager
+        - kube-prometheus-stack
+        - gpu-operator
diff --git a/recipes/overlays/a100-gke-cos-training.yaml b/recipes/overlays/a100-gke-cos-training.yaml
new file mode 100644
index 000000000..aa3dba504
--- /dev/null
+++ b/recipes/overlays/a100-gke-cos-training.yaml
@@ -0,0 +1,102 @@
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+kind: RecipeMetadata
+apiVersion: aicr.nvidia.com/v1alpha1
+metadata:
+  name: a100-gke-cos-training
+
+spec:
+  # Inherits from gke-cos-training recipe (GKE COS + training settings)
+  base: gke-cos-training
+
+  criteria:
+    service: gke
+    accelerator: a100
+    os: cos
+    intent: training
+
+  # Specific constraints for A100 on GKE COS training workloads.
+  # A100 has no IMEX/NVLink ComputeDomain requirement, so the recipe keeps
+  # the GKE COS training baseline rather than the H100 1.32 floor.
+  constraints:
+    - name: K8s.server.version
+      value: ">= 1.30"
+
+  componentRefs:
+    # A100-specific GPU Operator overrides (inherits valuesFile from gke-cos-training).
+    #
+    # gke-nccl-tcpxo is intentionally omitted: GPUDirect-TCPXO targets H100
+    # a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family.
+    # Note this is networking, separate from the nodewright tuning below.
+    - name: gpu-operator
+      type: Helm
+      dependencyRefs:
+        - nfd
+        - cert-manager
+        - kube-prometheus-stack
+        - nodewright-customizations
+      overrides:
+        cdi:
+          enabled: true
+
+    # A100 reuses the h100 nodewright tuning profile (tuning-gke.yaml). The
+    # nvidia-tuning-gke package ships baked-in profiles only for gke-h100 /
+    # gke-b200, with no separate A100 target — but per the nodewright maintainer
+    # the h100 tuning is the correct profile for A100 on EKS/GKE: the
+    # A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal
+    # and do not apply in GKE, so h100 is effectively the A100 tuning here. This
+    # mirrors h100-gke-cos-training. The recipe criteria above stays a100; only
+    # this tuning profile selector is h100.
+    - name: nodewright-customizations
+      type: Helm
+      manifestFiles:
+        - components/nodewright-customizations/manifests/tuning-gke.yaml
+      overrides:
+        accelerator: h100
+        intent: multiNodeTraining
+      dependencyRefs:
+        - nodewright-operator
+
+    - name: nfd
+      type: Helm
+      overrides:
+        topologyUpdater:
+          enable: true
+
+  # Validation checks for A100 on GKE COS training workloads.
+  # Defined at the intent layer (not OS-specific) so all variants inherit them.
+  #
+  # The deployment-phase floor (4 standard checks + gpu-operator version pin)
+  # is contributed by the a100-any cross-cutting overlay and is not duplicated
+  # here.
+  #
+  # Performance gating is intentionally omitted until an empirical A100-on-GKE
+  # NCCL baseline is established. The H100 GKE training overlay pins an absolute
+  # nccl-all-reduce-bw floor (>= 250) calibrated on 8-GPU H100 NVLink nodes;
+  # that value is neither fabric-class aware (https://github.com/NVIDIA/aicr/issues/1256)
+  # nor valid for A100, so carrying it would only false-fail healthy runs.
+  validation:
+    conformance:
+      checks:
+        - platform-health
+        - gpu-operator-health
+        - dra-support
+        - accelerator-metrics
+        - ai-service-metrics
+        - gang-scheduling
+        - pod-autoscaling
+        - cluster-autoscaling
+        - robust-controller
+        - secure-accelerator-access