WIP: feat(recipes): add A100 GKE COS training Kubeflow overlay chain by yuanchen8911 · Pull Request #1306 · NVIDIA/aicr

yuanchen8911 · 2026-06-11T01:41:44Z

Summary

Add A100 GKE overlays (issue #1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment-phase floor. Modeled on the H100 GKE COS training overlays.

Motivation / Context

a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service gke ... cannot resolve. Companion to the A100 OKE (#1294), AKS (#1295), and EKS (#1305) PRs; this slice covers GKE.

GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training → a100-gke-cos-training → a100-gke-cos-training-kubeflow.

Fixes: N/A (incremental — part of #1002)
Related: #1002, #1294, #1295, #1305, #969, #1256

A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) · #1305 (EKS) · #1306 (GKE) ← this PR

Coordination: a100-any.yaml (the cross-service A100 deployment floor) is byte-identical across #1294, #1295, #1305, and this PR. Only one needs to introduce it — whichever lands first, the others drop the duplicate on rebase.

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe)
Docs/examples (docs/, examples/)

Implementation Notes

New overlays (reuse existing gke-cos/gke-cos-training parents and values-gke-cos-training.yaml — no new component values files):

a100-any — deployment-phase floor: 4 standard checks + Deployment.gpu-operator.version >= v24.6.0 (H100/H200-generation baseline; A100 operator-supported since v22.9).
a100-gke-cos-training — base: gke-cos-training; K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than H100's 1.32. gpu-operator cdi, nfd topologyUpdater. Conformance mirrors the H100 GKE COS training set.
a100-gke-cos-training-kubeflow — Kubeflow Trainer for distributed TrainJob, declared inline to match the GKE COS pattern (h100-gke-cos-training-kubeflow).

Key decisions documented in-file:

Nodewright tuning omitted. nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200; there is no A100 target. The EKS nvidia-tuned generic profile is not a fallback on GKE — it applies reboot/bootloader changes, but GKE COS is immutable and tuning-gke.yaml deliberately limits itself to non-disruptive tuning. The nodewright-operator is still inherited from gke-cos.
gke-nccl-tcpxo omitted. GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family.
TCPXO doc scoped. docs/integrator/gke-tcpxo-networking.md previously applied to all *-gke-cos-training* recipes; it now scopes the prerequisites to the H100 (a3-megagpu-8g) recipes and calls out the A100 (a2) exception, so users selecting a100-gke-cos-training are not directed to configure TCPXO the bundle never installs.
Performance gating omitted. The H100 GKE nccl-all-reduce-bw >= 250 floor is calibrated on 8-GPU H100 NVLink nodes — neither fabric-class aware (nccl-all-reduce-bw training gate is a fixed absolute fabric-specific busbw value applied to SKU-agnostic recipes → false-fails EKS/H100 small SKUs #1256) nor valid for A100. An A100-on-GKE NCCL baseline is deferred to a follow-up.

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestOverlayValidationPhaseFloor auto-discovery)
yamllint recipes/overlays/a100-*.yaml    # clean
# End-to-end resolution of the new leaf:
aicr recipe --service gke --accelerator a100 --os cos --intent training --platform kubeflow
#   -> components=12 overlays=7; K8s '>= 1.30'; Deployment.gpu-operator.version '>= v24.6.0';
#      kubeflow-trainer present; no nodewright-customizations, no gke-nccl-tcpxo;
#      gpu-operator inherits values-gke-cos-training.yaml

Full make qualify not required: this touches only YAML overlay files (zero .go changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by go test ./pkg/recipe/... (passes) and yamllint (clean). The only doc change is scoping docs/integrator/gke-tcpxo-networking.md (prose, no new anchors/links).

Risk Assessment

Low — Additive overlays only; no existing recipe or Go path changes. Easy to revert.

Rollout notes: Additive. Other A100 GKE leaves (inference, dynamo) and remaining work tracked under #1002.

Checklist

Tests pass locally (go test ./pkg/recipe/...)
Linter passes (yamllint clean; no .go files changed)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (covered by existing auto-discovery TestOverlayValidationPhaseFloor)
I updated docs if user-facing behavior changed (N/A — no leaf enumeration in docs)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-06-11T01:44:02Z

Recipe evidence check

Affected leaf overlays: 2

Recipe	Pointer	Verify	Digest match
`a100-gke-cos-training-kubeflow`	⚠️ missing	—	—
`a100-gke-cos-training`	⚠️ missing	—	—

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

coderabbitai · 2026-06-11T01:48:27Z

📝 Walkthrough

Walkthrough

Adds three new RecipeMetadata overlays for A100 GPU configurations: a cross-cutting validation floor (a100-any) that sets deployment checks and a GPU Operator version floor (>= v24.6.0); a GKE+COS+training overlay (a100-gke-cos-training) with Kubernetes >= 1.30 and GPU Operator/NFD Helm overrides; and a Kubeflow extension overlay (a100-gke-cos-training-kubeflow) that wires a helm-based kubeflow-trainer component. Also updates docs to note an A100 exception for TCPXO prerequisites.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

area/docs

Suggested reviewers

xdu31
mchmarny
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title references adding A100 GKE COS training Kubeflow overlays, which matches the primary change (three new YAML overlay files for A100 GKE). The title is slightly vague with 'WIP' and 'overlay chain' but clearly identifies the main feature.
Description check	✅ Passed	The description comprehensively explains the changes: three new overlay files for A100 GKE COS training, rationale (issue `#1002`), design decisions, testing performed, and risk assessment. It directly relates to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml`:
- Around line 34-35: Remove the redundant constraint named K8s.server.version
from this overlay (the value ">= 1.30" duplicates the parent overlay
a100-gke-cos-training); delete the K8s.server.version entry from the
a100-gke-cos-training-kubeflow overlay so the parent's constraint is inherited,
or if you intentionally need to pin it here, add an inline comment next to the
K8s.server.version entry explaining why this overlay must override the parent to
prevent accidental future drift.

In `@recipes/overlays/a100-gke-cos-training.yaml`:
- Around line 34-35: Remove the redundant K8s.server.version constraint from the
a100-gke-cos-training overlay: delete the duplicate key-value pair
`K8s.server.version: ">= 1.30"` in recipes/overlays/a100-gke-cos-training.yaml
(the same constraint is already provided by gke-cos-training.yaml), leaving only
one declaration to avoid unnecessary duplication while preserving behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4d75c63a-bf42-4529-a72e-d8acc1ddd699

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14530 and 769d36f.

📒 Files selected for processing (3)

recipes/overlays/a100-any.yaml
recipes/overlays/a100-gke-cos-training-kubeflow.yaml
recipes/overlays/a100-gke-cos-training.yaml

Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/gke-tcpxo-networking.md`:
- Line 3: Expand the acronym NCCL at first mention in the sentence containing
"NCCL falls back to TCP" by replacing the first occurrence with its full form
"NVIDIA Collective Communications Library (NCCL)" so subsequent uses can keep
the short form; update the phrase in the line that reads "Without it, NCCL falls
back to TCP (~4 GB/s vs ~340 GB/s with TCPXO)." to include the expanded form.

In `@recipes/overlays/a100-any.yaml`:
- Around line 34-54: The overlay is missing the required mixins field and thus
doesn't follow the overlay schema; update the spec to include a mixins entry
alongside base, criteria and constraints (e.g., add a top-level spec.mixins
array with the appropriate mixin names or an empty list if none are needed) so
the file defines spec.base, spec.mixins, spec.criteria and spec.constraints;
ensure criteria.service and criteria.accelerator remain unchanged and
constraints (like Deployment.gpu-operator.version) stay under
spec.validation.deployment.checks/constraints.

In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml`:
- Around line 41-44: The kubeflow-trainer componentRef is missing an explicit
version; update the componentRefs entry for kubeflow-trainer (the block
containing name: kubeflow-trainer, type: Helm, valuesFile:
components/kubeflow-trainer/values.yaml) to include a version field (e.g.,
version: "<pin-version>")—fetch the correct version from the kubeflow-trainer
chart/registry or components metadata and add that version string so the
componentRef includes name, type, version, and valuesFile as required by overlay
guidelines.

In `@recipes/overlays/a100-gke-cos-training.yaml`:
- Around line 20-36: The overlay's spec is missing the required mixins field; in
the spec block alongside base: gke-cos-training, criteria and constraints add an
explicit mixins entry (e.g., mixins: [] if none) so the recipe includes base,
mixins, criteria, and constraints per repository convention; update the spec
section (look for the spec: block and the existing base/criteria/constraints
keys) to insert mixins.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9dc3e55f-83f3-4df4-9f8f-5d9f8cf69a52

📥 Commits

Reviewing files that changed from the base of the PR and between 769d36f and 52e7c8b.

📒 Files selected for processing (4)

docs/integrator/gke-tcpxo-networking.md
recipes/overlays/a100-any.yaml
recipes/overlays/a100-gke-cos-training-kubeflow.yaml
recipes/overlays/a100-gke-cos-training.yaml

coderabbitai · 2026-06-11T01:59:23Z

 # GKE TCPXO Networking Prerequisites

-For `*-gke-cos-training*` recipes, GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).
+For the **H100 GKE COS training** recipes (`h100-gke-cos-training*`, on `a3-megagpu-8g` nodes), GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Define NCCL at first mention.

Line 3 introduces NCCL without expansion. Expand once (e.g., “NVIDIA Collective Communications Library (NCCL)”) at first use.

As per coding guidelines: “Define acronyms on first use in documentation.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/integrator/gke-tcpxo-networking.md` at line 3, Expand the acronym NCCL at first mention in the sentence containing "NCCL falls back to TCP" by replacing the first occurrence with its full form "NVIDIA Collective Communications Library (NCCL)" so subsequent uses can keep the short form; update the phrase in the line that reads "Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO)." to include the expanded form.

Source: Coding guidelines

coderabbitai · 2026-06-11T01:59:23Z

+spec:
+  base: base
+
+  criteria:
+    service: any
+    accelerator: a100
+
+  validation:
+    deployment:
+      checks:
+        - operator-health
+        - expected-resources
+        - gpu-operator-version
+        - check-nvidia-smi
+      constraints:
+        # A100 has been supported since the early gpu-operator releases
+        # (v22.9). Floor at the same generation baseline as H100/H200
+        # (v24.6.0) rather than the Blackwell floor; concrete leaves can
+        # tighten if they pin to a later working version.
+        - name: Deployment.gpu-operator.version
+          value: ">= v24.6.0"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required mixins to satisfy overlay schema conventions.

spec currently omits mixins. Per the repository rule for overlays, this file should explicitly include base, mixins, criteria, and constraints.

Suggested minimal fix

spec: base: base + mixins: [] criteria: service: any accelerator: a100

As per coding guidelines: “Recipe overlays must specify base, mixins, criteria, and constraints; criteria must match defined enums.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-any.yaml` around lines 34 - 54, The overlay is missing the required mixins field and thus doesn't follow the overlay schema; update the spec to include a mixins entry alongside base, criteria and constraints (e.g., add a top-level spec.mixins array with the appropriate mixin names or an empty list if none are needed) so the file defines spec.base, spec.mixins, spec.criteria and spec.constraints; ensure criteria.service and criteria.accelerator remain unchanged and constraints (like Deployment.gpu-operator.version) stay under spec.validation.deployment.checks/constraints.

Source: Coding guidelines

coderabbitai · 2026-06-11T01:59:24Z

+    - name: kubeflow-trainer
+      type: Helm
+      valuesFile: components/kubeflow-trainer/values.yaml
+      manifestFiles:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin kubeflow-trainer with an explicit version in componentRefs.

On Lines 41-44, this componentRef omits version. Add the version pin alongside name, type, and valuesFile to align with overlay component reference requirements.

As per coding guidelines: “Reference components in recipe overlays with componentRefs containing name, type, version, and valuesFile.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml` around lines 41 - 44, The kubeflow-trainer componentRef is missing an explicit version; update the componentRefs entry for kubeflow-trainer (the block containing name: kubeflow-trainer, type: Helm, valuesFile: components/kubeflow-trainer/values.yaml) to include a version field (e.g., version: "<pin-version>")—fetch the correct version from the kubeflow-trainer chart/registry or components metadata and add that version string so the componentRef includes name, type, version, and valuesFile as required by overlay guidelines.

Source: Coding guidelines

coderabbitai · 2026-06-11T01:59:24Z

+spec:
+  # Inherits from gke-cos-training recipe (GKE COS + training settings)
+  base: gke-cos-training
+
+  criteria:
+    service: gke
+    accelerator: a100
+    os: cos
+    intent: training
+
+  # Specific constraints for A100 on GKE COS training workloads.
+  # A100 has no IMEX/NVLink ComputeDomain requirement, so the recipe keeps
+  # the GKE COS training baseline rather than the H100 1.32 floor.
+  constraints:
+    - name: K8s.server.version
+      value: ">= 1.30"
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit mixins to satisfy overlay schema convention.

spec includes base, criteria, and constraints, but mixins is missing. On Line 20 onward, add mixins (use mixins: [] if intentionally empty) to match repository overlay requirements.

As per coding guidelines: “Recipe overlays must specify base, mixins, criteria, and constraints.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-gke-cos-training.yaml` around lines 20 - 36, The overlay's spec is missing the required mixins field; in the spec block alongside base: gke-cos-training, criteria and constraints add an explicit mixins entry (e.g., mixins: [] if none) so the recipe includes base, mixins, criteria, and constraints per repository convention; update the spec section (look for the spec: block and the existing base/criteria/constraints keys) to insert mixins.

Source: Coding guidelines

yuanchen8911 added area/recipes theme/recipes Recipe expansion, overlays, mixins, and component registry labels Jun 11, 2026

github-actions Bot added the size/M label Jun 11, 2026

This was referenced Jun 11, 2026

WIP: feat(recipes): add A100 OKE training Kubeflow overlay chain #1294

Draft

WIP: feat(recipes): add A100 AKS training Kubeflow overlay chain #1295

Draft

WIP: feat(recipes): add A100 EKS training Kubeflow overlay chain #1305

Draft

yuanchen8911 mentioned this pull request Jun 11, 2026

feat(recipes): add concrete A100 service-bound overlays #1002

Open

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread recipes/overlays/a100-gke-cos-training-kubeflow.yaml

Comment thread recipes/overlays/a100-gke-cos-training.yaml

yuanchen8911 force-pushed the feat/a100-gke-overlays branch from 769d36f to 52e7c8b Compare June 11, 2026 01:54

github-actions Bot added the area/docs label Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

mchmarny assigned yuanchen8911 Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: feat(recipes): add A100 GKE COS training Kubeflow overlay chain#1306

WIP: feat(recipes): add A100 GKE COS training Kubeflow overlay chain#1306
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/a100-gke-overlays

yuanchen8911 commented Jun 11, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Uh oh!

coderabbitai Bot Jun 11, 2026

Uh oh!

coderabbitai Bot Jun 11, 2026

Uh oh!

coderabbitai Bot Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yuanchen8911 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Recipe evidence check

How to refresh evidence

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yuanchen8911 commented Jun 11, 2026 •

edited

Loading

github-actions Bot commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading