Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion docs/integrator/gke-tcpxo-networking.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# GKE TCPXO Networking Prerequisites

For `*-gke-cos-training*` recipes, GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).
For the **H100 GKE COS training** recipes (`h100-gke-cos-training*`, on `a3-megagpu-8g` nodes), GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Define NCCL at first mention.

Line 3 introduces NCCL without expansion. Expand once (e.g., “NVIDIA Collective Communications Library (NCCL)”) at first use.

As per coding guidelines: “Define acronyms on first use in documentation.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/integrator/gke-tcpxo-networking.md` at line 3, Expand the acronym NCCL
at first mention in the sentence containing "NCCL falls back to TCP" by
replacing the first occurrence with its full form "NVIDIA Collective
Communications Library (NCCL)" so subsequent uses can keep the short form;
update the phrase in the line that reads "Without it, NCCL falls back to TCP (~4
GB/s vs ~340 GB/s with TCPXO)." to include the expanded form.

Source: Coding guidelines


> **A100 (a2) exception:** the `a100-gke-cos-training*` recipes intentionally omit the `gke-nccl-tcpxo` component — GPUDirect TCPXO targets H100 `a3-megagpu-8g` nodes, not the A100 `a2-highgpu`/`a2-ultragpu` machine family. The prerequisites below do **not** apply to A100 GKE recipes, and the generated A100 bundle does not install the TCPXO DaemonSets.

## Infrastructure Prerequisites

Expand Down
54 changes: 54 additions & 0 deletions recipes/overlays/a100-any.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Cross-cutting overlay applied via criteria-wildcard matching.
#
# Carries the deployment-phase floor (the 4 standard checks plus the
# gpu-operator version pin) and applies to every A100 query regardless
# of service or intent, so every concrete A100 leaf (training or
# inference, any service) inherits the version pin.
#
# Per-field union merge (see pkg/recipe/metadata.go) means concrete leaves
# that declare their own `deployment:` block add to or override this floor
# without dropping its inherited checks — same-name constraints from the
# leaf win, additional checks are appended.
#
# See docs/contributor/recipe.md#criteria-wildcard-overlays for details.

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
name: a100-any

spec:
base: base

criteria:
service: any
accelerator: a100

validation:
deployment:
checks:
- operator-health
- expected-resources
- gpu-operator-version
- check-nvidia-smi
constraints:
# A100 has been supported since the early gpu-operator releases
# (v22.9). Floor at the same generation baseline as H100/H200
# (v24.6.0) rather than the Blackwell floor; concrete leaves can
# tighten if they pin to a later working version.
- name: Deployment.gpu-operator.version
value: ">= v24.6.0"
Comment on lines +34 to +54

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required mixins to satisfy overlay schema conventions.

spec currently omits mixins. Per the repository rule for overlays, this file should explicitly include base, mixins, criteria, and constraints.

Suggested minimal fix
 spec:
   base: base
+  mixins: []
 
   criteria:
     service: any
     accelerator: a100

As per coding guidelines: “Recipe overlays must specify base, mixins, criteria, and constraints; criteria must match defined enums.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-any.yaml` around lines 34 - 54, The overlay is missing
the required mixins field and thus doesn't follow the overlay schema; update the
spec to include a mixins entry alongside base, criteria and constraints (e.g.,
add a top-level spec.mixins array with the appropriate mixin names or an empty
list if none are needed) so the file defines spec.base, spec.mixins,
spec.criteria and spec.constraints; ensure criteria.service and
criteria.accelerator remain unchanged and constraints (like
Deployment.gpu-operator.version) stay under
spec.validation.deployment.checks/constraints.

Source: Coding guidelines

49 changes: 49 additions & 0 deletions recipes/overlays/a100-gke-cos-training-kubeflow.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
name: a100-gke-cos-training-kubeflow

spec:
# Inherits from a100-gke-cos-training recipe (A100 + GKE COS + training settings)
# This overlay adds Kubeflow Training Operator for distributed training with TrainJob
base: a100-gke-cos-training

criteria:
service: gke
accelerator: a100
os: cos
intent: training
platform: kubeflow

# Constraints for A100 on GKE COS for Kubeflow training workloads
constraints:
- name: K8s.server.version
value: ">= 1.30"
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Comment on lines +20 to +35

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit mixins field to satisfy overlay schema requirement.

The overlay spec has base, criteria, and constraints, but the mixins field is missing. Per coding guidelines, "Recipe overlays must specify base, mixins, criteria, and constraints." Add mixins: [] (if intentionally empty) or list applicable mixins.

🛠️ Suggested fix
 spec:
   # Inherits from a100-gke-cos-training recipe (A100 + GKE COS + training settings)
   # This overlay adds Kubeflow Training Operator for distributed training with TrainJob
   base: a100-gke-cos-training
+
+  mixins: []

   criteria:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml` around lines 20 - 35,
The overlay spec is missing the required mixins field; update the recipe overlay
to include a mixins entry (either an explicit empty list or the applicable mixin
names) alongside the existing base: a100-gke-cos-training, criteria and
constraints so it conforms to the overlay schema — add mixins: [] if none apply
or list the relevant mixin identifiers.

Source: Coding guidelines


# Kubeflow Training Operator for TrainJob support.
# Declared inline (not via the platform-kubeflow mixin) to match the GKE COS
# pattern in h100-gke-cos-training-kubeflow.
componentRefs:
- name: kubeflow-trainer
type: Helm
valuesFile: components/kubeflow-trainer/values.yaml
manifestFiles:
Comment on lines +41 to +44

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin kubeflow-trainer with an explicit version in componentRefs.

On Lines 41-44, this componentRef omits version. Add the version pin alongside name, type, and valuesFile to align with overlay component reference requirements.

As per coding guidelines: “Reference components in recipe overlays with componentRefs containing name, type, version, and valuesFile.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml` around lines 41 - 44,
The kubeflow-trainer componentRef is missing an explicit version; update the
componentRefs entry for kubeflow-trainer (the block containing name:
kubeflow-trainer, type: Helm, valuesFile:
components/kubeflow-trainer/values.yaml) to include a version field (e.g.,
version: "<pin-version>")—fetch the correct version from the kubeflow-trainer
chart/registry or components metadata and add that version string so the
componentRef includes name, type, version, and valuesFile as required by overlay
guidelines.

Source: Coding guidelines

- components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
dependencyRefs:
- cert-manager
- kube-prometheus-stack
- gpu-operator
89 changes: 89 additions & 0 deletions recipes/overlays/a100-gke-cos-training.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
name: a100-gke-cos-training

spec:
# Inherits from gke-cos-training recipe (GKE COS + training settings)
base: gke-cos-training

criteria:
service: gke
accelerator: a100
os: cos
intent: training

# Specific constraints for A100 on GKE COS training workloads.
# A100 has no IMEX/NVLink ComputeDomain requirement, so the recipe keeps
# the GKE COS training baseline rather than the H100 1.32 floor.
constraints:
- name: K8s.server.version
value: ">= 1.30"
Comment thread
coderabbitai[bot] marked this conversation as resolved.

Comment on lines +20 to +36

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit mixins to satisfy overlay schema convention.

spec includes base, criteria, and constraints, but mixins is missing. On Line 20 onward, add mixins (use mixins: [] if intentionally empty) to match repository overlay requirements.

As per coding guidelines: “Recipe overlays must specify base, mixins, criteria, and constraints.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-gke-cos-training.yaml` around lines 20 - 36, The
overlay's spec is missing the required mixins field; in the spec block alongside
base: gke-cos-training, criteria and constraints add an explicit mixins entry
(e.g., mixins: [] if none) so the recipe includes base, mixins, criteria, and
constraints per repository convention; update the spec section (look for the
spec: block and the existing base/criteria/constraints keys) to insert mixins.

Source: Coding guidelines

componentRefs:
# A100-specific GPU Operator overrides (inherits valuesFile from gke-cos-training).
#
# Nodewright tuning is intentionally omitted. The nvidia-tuning-gke package
# ships baked-in profiles only for gke-h100 / gke-b200; there is no A100
# target. The EKS nvidia-tuned "generic" profile is not a fallback here: it
# applies reboot/bootloader changes, but GKE COS is immutable and
# tuning-gke.yaml deliberately limits itself to non-disruptive tuning. The
# nodewright-operator itself is still inherited from gke-cos.
#
# gke-nccl-tcpxo is also omitted: GPUDirect-TCPXO targets H100 a3-mega
# nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family.
- name: gpu-operator
type: Helm
dependencyRefs:
- nfd
- cert-manager
- kube-prometheus-stack
overrides:
cdi:
enabled: true

- name: nfd
type: Helm
overrides:
topologyUpdater:
enable: true

# Validation checks for A100 on GKE COS training workloads.
# Defined at the intent layer (not OS-specific) so all variants inherit them.
#
# The deployment-phase floor (4 standard checks + gpu-operator version pin)
# is contributed by the a100-any cross-cutting overlay and is not duplicated
# here.
#
# Performance gating is intentionally omitted until an empirical A100-on-GKE
# NCCL baseline is established. The H100 GKE training overlay pins an absolute
# nccl-all-reduce-bw floor (>= 250) calibrated on 8-GPU H100 NVLink nodes;
# that value is neither fabric-class aware (https://github.com/NVIDIA/aicr/issues/1256)
# nor valid for A100, so carrying it would only false-fail healthy runs.
validation:
conformance:
checks:
- platform-health
- gpu-operator-health
- dra-support
- accelerator-metrics
- ai-service-metrics
- gang-scheduling
- pod-autoscaling
- cluster-autoscaling
- robust-controller
- secure-accelerator-access
Loading