WIP: feat(recipes): add A100 AKS training Kubeflow overlay chain #1295
yuanchen8911 wants to merge 1 commit into
📝 Walkthrough
Adds four new RecipeMetadata overlays:
- a100-any: cross-cutting baseline with deployment validation and gpu-operator version >= v24.6.0.
- a100-aks-training: AKS training overlay inheriting aks-training with K8s.server.version >= 1.30, gpu-operator gdrcopy enabled, nfd topologyUpdater enabled, and intent-level conformance checks.
- a100-aks-ubuntu-training: inherits a100-aks-training, applies the os-ubuntu mixin and Ubuntu criteria.
- a100-aks-ubuntu-training-kubeflow: adds platform=kubeflow and the platform-kubeflow mixin.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml`:
- Around line 32-35: The mixins list in the overlay
a100-aks-ubuntu-training-kubeflow currently repeats os-ubuntu; remove the
redundant entry so only platform-kubeflow remains in the mixins block. Edit the
mixins for the a100-aks-ubuntu-training-kubeflow recipe (spec.mixins / mixins)
and delete "os-ubuntu", leaving the inherited os-ubuntu from
a100-aks-ubuntu-training and keeping "platform-kubeflow" only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 02e03bf0-2daf-4ef7-811d-4e6e7215b0fd
📒 Files selected for processing (4)
- recipes/overlays/a100-aks-training.yaml
- recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
- recipes/overlays/a100-aks-ubuntu-training.yaml
- recipes/overlays/a100-any.yaml
    # Ubuntu OS constraints + Kubeflow trainer via mixins
    mixins:
      - os-ubuntu
      - platform-kubeflow
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Analyze mixin declaration patterns across overlay inheritance chains
# Find all overlays that declare os-ubuntu mixin
echo "=== Overlays declaring os-ubuntu mixin ==="
rg -l "os-ubuntu" recipes/overlays/*.yaml | while read file; do
  base=$(yq '.spec.base' "$file")
  mixins=$(yq '.spec.mixins[]' "$file" 2>/dev/null | tr '\n' ', ' | sed 's/,$//')
  echo "$file -> base: $base, mixins: [$mixins]"
done
# Check if base recipes that declare os-ubuntu have children that redeclare it
echo -e "\n=== Checking for mixin redeclaration in child overlays ==="
for base_file in recipes/overlays/*ubuntu*.yaml; do
  base_name=$(basename "$base_file" .yaml)
  has_os_ubuntu=$(yq '.spec.mixins[] | select(. == "os-ubuntu")' "$base_file" 2>/dev/null)
  if [ -n "$has_os_ubuntu" ]; then
    echo "Base: $base_name declares os-ubuntu"
    # Find children that inherit from this base
    rg -l "base: $base_name" recipes/overlays/*.yaml | while read child; do
      child_mixins=$(yq '.spec.mixins[] | select(. == "os-ubuntu")' "$child" 2>/dev/null)
      if [ -n "$child_mixins" ]; then
        echo " ↳ Child: $(basename "$child") REDECLARES os-ubuntu"
      else
        echo " ↳ Child: $(basename "$child") does NOT redeclare os-ubuntu"
      fi
    done
  fi
done
Repository: NVIDIA/aicr
Length of output: 2466
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import glob, os, sys, yaml, json
# Load all overlay YAMLs under recipes/overlays
paths = sorted(glob.glob("recipes/overlays/**/*.yaml", recursive=True))
def load_yaml(p):
    with open(p, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)
# 1) Find all overlays that explicitly mention os-ubuntu in spec.mixins
os_ubuntu_overlays = []
for p in paths:
    try:
        doc = load_yaml(p)
    except Exception:
        continue
    spec = doc.get("spec") if isinstance(doc, dict) else None
    mixins = spec.get("mixins") if isinstance(spec, dict) else None
    if isinstance(mixins, list) and "os-ubuntu" in mixins:
        base = spec.get("base") if isinstance(spec, dict) else None
        os_ubuntu_overlays.append((p, base, mixins))
print("=== Overlays where spec.mixins contains os-ubuntu ===")
for p, base, mixins in os_ubuntu_overlays:
    print(f"{p} -> base={base}, mixins={mixins}")
# 2) Trace the base chain for the specific overlay and report mixins along the chain
target = "recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml"
def build_base_map():
    m = {}
    for p in paths:
        try:
            doc = load_yaml(p)
        except Exception:
            continue
        spec = doc.get("spec") if isinstance(doc, dict) else None
        base = spec.get("base") if isinstance(spec, dict) else None
        name = doc.get("metadata", {}).get("name") if isinstance(doc.get("metadata"), dict) else None
        # map by metadata.name when available, else by filename
        if name:
            m[name] = p
        m[p] = p
        # also map base identifiers to paths later if they correspond to metadata.name
    return m
target_doc = load_yaml(target)
target_spec = target_doc.get("spec", {})
target_base = target_spec.get("base")
target_mixins = target_spec.get("mixins")
print("\n=== Target overlay info ===")
print(f"file={target}")
print(f"spec.base={target_base}")
print(f"spec.mixins={target_mixins}")
# Resolve base chain: base may be "some-overlay.yaml" or metadata.name or relative ref.
# We'll attempt common patterns:
# - If base endswith .yaml, resolve to that file path.
# - Else, treat base as a metadata.name and look up overlays with matching metadata.name.
name_to_path = {}
for p in paths:
    try:
        doc = load_yaml(p)
    except Exception:
        continue
    md = doc.get("metadata", {})
    if isinstance(md, dict) and "name" in md:
        name_to_path[md["name"]] = p
def resolve_base(base):
    if base is None:
        return None
    if isinstance(base, str) and base.endswith(".yaml"):
        # base might be relative to recipes/overlays
        candidate = os.path.normpath(os.path.join("recipes/overlays", base))
        if os.path.exists(candidate):
            return candidate
        if os.path.exists(base):
            return base
        # try absolute relative
        if os.path.exists(os.path.join("recipes/overlays", os.path.basename(base))):
            return os.path.join("recipes/overlays", os.path.basename(base))
        return None
    if isinstance(base, str):
        return name_to_path.get(base)
    return None
chain = []
seen = set()
cur = target
while cur and cur not in seen:
    seen.add(cur)
    doc = load_yaml(cur)
    spec = doc.get("spec", {}) if isinstance(doc, dict) else {}
    base = spec.get("base")
    mixins = spec.get("mixins")
    chain.append((cur, base, mixins))
    cur = resolve_base(base)
print("\n=== Inheritance chain (child -> ancestors via spec.base) ===")
for i, (f, b, mix) in enumerate(chain):
    print(f"{i}: {f}")
    print(f"  base: {b}")
    print(f"  mixins: {mix}")
# 3) For each overlay in the chain that contains os-ubuntu, show whether it is also declared in its immediate parent
# (i.e., whether redeclaration occurs at each step).
print("\n=== os-ubuntu redeclaration along chain ===")
for i in range(1, len(chain)):
    child = chain[i-0][0]  # current? careful
PY
# Print a small, readable excerpt of the target overlay around the relevant lines
echo
echo "=== Excerpt: target overlay ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml | nl -ba | sed -n '20,60p'
echo
echo "=== Excerpt: parent overlay candidate(s) based on spec.base ==="
# Extract spec.base from target quickly and then print that file if resolvable
base_line=$(python3 - <<'PY'
import yaml
p="recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml"
with open(p,'r',encoding='utf-8') as f:
    doc=yaml.safe_load(f)
print(doc.get('spec',{}).get('base'))
PY
)
echo "spec.base from target: $base_line"
if [[ "$base_line" == *.yaml ]]; then
  # try common resolution
  if [[ -f "recipes/overlays/$base_line" ]]; then
    sed -n '1,120p' "recipes/overlays/$base_line" | nl -ba | sed -n '20,60p'
  elif [[ -f "$base_line" ]]; then
    sed -n '1,120p' "$base_line" | nl -ba | sed -n '20,60p'
  fi
fi
Repository: NVIDIA/aicr
Length of output: 3007
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Show relevant YAML snippets (avoid `nl`)
echo "=== Target overlay (mixins + base) ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml | cat -n | sed -n '20,60p'
echo
echo "=== Parent overlay (mixins + base) ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training.yaml | cat -n | sed -n '20,60p'
# 2) Locate recipe overlay/mixin resolution logic in Go
echo
echo "=== Search for mixins resolution/merge logic ==="
rg -n --hidden --glob '!**/vendor/**' "mixins" pkg/recipe pkg/bundler pkg/recipe* || true
# 3) Narrow to likely overlay resolver files
echo
echo "=== Candidate files in pkg/recipe for overlay processing ==="
rg -n --hidden --glob '!**/vendor/**' "overlay" pkg/recipe || true
# 4) Search for YAML spec struct fields named Mixins
echo
echo "=== Search for struct fields named Mixins ==="
rg -n --hidden --glob '!**/vendor/**' "Mixins" pkg/recipe pkg || true
Repository: NVIDIA/aicr
Length of output: 50367
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== pkg/recipe/metadata.go (mixins merge/accumulation) ==="
sed -n '680,760p' pkg/recipe/metadata.go | nl -ba | sed -n '1,120p'
echo
echo "=== pkg/recipe/metadata_store.go (mergeMixins) ==="
sed -n '520,700p' pkg/recipe/metadata_store.go | nl -ba | sed -n '1,220p'
Repository: NVIDIA/aicr
Length of output: 175
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== pkg/recipe/metadata.go: mixins accumulation ==="
# show without nl
sed -n '680,760p' pkg/recipe/metadata.go
echo
echo "=== pkg/recipe/metadata_store.go: mergeMixins ==="
sed -n '520,700p' pkg/recipe/metadata_store.go
Repository: NVIDIA/aicr
Length of output: 10009
Remove redundant os-ubuntu from a100-aks-ubuntu-training-kubeflow mixins
a100-aks-ubuntu-training already declares spec.mixins: [os-ubuntu]; inheritance-chain mixins are accumulated into descendants with de-duplication, so a100-aks-ubuntu-training-kubeflow only needs to add platform-kubeflow (dropping the repeated os-ubuntu is safe/reduces noise). [pkg/recipe/metadata.go] [recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml, recipes/overlays/a100-aks-ubuntu-training.yaml]
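The accumulate-and-de-duplicate behaviour described in this finding can be illustrated with a minimal Python sketch. It is not the code in pkg/recipe/metadata.go: the `OVERLAYS` dictionary, the `effective_mixins` helper, and the ancestors-first ordering are assumptions made for illustration only.

```python
# Hypothetical in-memory view of the overlay chain discussed above.
# The real resolver lives in pkg/recipe (Go); this only mirrors the idea.
OVERLAYS = {
    "a100-aks-training": {"base": "aks-training", "mixins": []},
    "a100-aks-ubuntu-training": {
        "base": "a100-aks-training",
        "mixins": ["os-ubuntu"],
    },
    "a100-aks-ubuntu-training-kubeflow": {
        "base": "a100-aks-ubuntu-training",
        "mixins": ["os-ubuntu", "platform-kubeflow"],
    },
}


def effective_mixins(name, overlays):
    """Walk spec.base upward, then merge mixins ancestors-first without repeats."""
    chain, seen, cur = [], set(), name
    while cur in overlays and cur not in seen:  # stop at unknown bases or cycles
        seen.add(cur)
        chain.append(cur)
        cur = overlays[cur].get("base")
    merged = []
    for overlay_name in reversed(chain):  # oldest ancestor first (assumed order)
        for mixin in overlays[overlay_name].get("mixins") or []:
            if mixin not in merged:  # de-duplicate repeated declarations
                merged.append(mixin)
    return merged


print(effective_mixins("a100-aks-ubuntu-training-kubeflow", OVERLAYS))
```

Under these assumptions the leaf resolves to the same two mixins whether or not it repeats os-ubuntu, which is why the redeclaration is harmless either way.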
Keeping this for consistency with the established convention. Every *-ubuntu-training-kubeflow leaf in the repo (gb200-eks, gb200-oke, h100-aks, h100-eks) declares both os-ubuntu and platform-kubeflow in the leaf's mixins block. Removing it from only this leaf would diverge from that pattern, so leaving it as-is intentionally.
@yuanchen8911, understood — that's a sensible consistency rationale. Explicitly re-declaring both os-ubuntu and platform-kubeflow in every *-ubuntu-training-kubeflow leaf makes the mixin intent self-documenting and keeps the pattern uniform across all siblings. Leaving it as-is makes sense.
✏️ Learnings added
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1295
File: recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml:32-35
Timestamp: 2026-06-10T20:30:35.631Z
Learning: In NVIDIA/aicr recipe overlays, every `*-ubuntu-training-kubeflow` leaf overlay (e.g., gb200-eks, gb200-oke, h100-aks, h100-eks, a100-aks) explicitly declares both `os-ubuntu` and `platform-kubeflow` in its `spec.mixins` block, even though `os-ubuntu` is already inherited from the parent `*-ubuntu-training` overlay. This is an intentional convention for self-documentation and consistency across sibling overlays. Do not flag this as a redundant mixin redeclaration in `recipes/overlays/*-ubuntu-training-kubeflow.yaml` files.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
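The leaf convention recorded in this learning could be checked mechanically. The following is a rough sketch, assuming the overlay name and its `spec.mixins` list have already been parsed from YAML; `missing_leaf_mixins` is a hypothetical helper, not part of aicr.

```python
# Mixins that every *-ubuntu-training-kubeflow leaf declares explicitly,
# per the convention described in the learning above.
REQUIRED_LEAF_MIXINS = ("os-ubuntu", "platform-kubeflow")


def missing_leaf_mixins(name, mixins):
    """Return required mixins a kubeflow leaf does not declare itself."""
    if not name.endswith("-ubuntu-training-kubeflow"):
        return []  # the convention only applies to these leaves
    declared = set(mixins or [])
    return [m for m in REQUIRED_LEAF_MIXINS if m not in declared]


# A leaf that relies on inheritance alone would be reported:
print(missing_leaf_mixins("a100-aks-ubuntu-training-kubeflow", ["platform-kubeflow"]))
```

A check like this inverts the original nitpick: it reports a leaf that drops os-ubuntu rather than one that repeats it.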
🧠 Learnings used
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1046
File: recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml:46-53
Timestamp: 2026-05-28T15:05:30.166Z
Learning: When reviewing an overlay YAML under recipes/overlays, don’t require regenerating BOM/user docs (e.g., running `make bom-docs` and updating `docs/user/container-images.md`) if the overlay reuses chart+version pins that are already present in an existing overlay/registry entry. Concretely, if the overlay’s chart/version pin set (such as the chart name and semver/version pin) exactly matches what’s already reflected in `docs/user/container-images.md`, `make bom-docs` should be a no-op—so BOM regeneration/docs updates are not required. Only flag BOM/doc regeneration when the overlay introduces new (previously unseen) chart/version pins that would change the documented container image/version set.
2c2d1f2 to c3daa9f
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/a100-aks-training.yaml`:
- Around line 40-57: The componentRefs entries for gpu-operator and nfd are
missing required keys; update each componentRef (the entries with name:
gpu-operator and name: nfd) to include a version and a valuesFile key per
overlay guidelines; ensure you add the appropriate semantic version string for
the component (e.g., matching other overlays) to the version field and the path
or filename that inherits/overrides values (e.g., the aks-training values file)
to valuesFile so both entries contain name, type, version, and valuesFile.
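To verify this finding before acting on it, the literal key check it describes can be sketched as below. The four key names come from the quoted guideline; the `type` value `"Helm"` and the helper name are illustrative assumptions. The sketch inspects only the keys written in the overlay itself and does not model anything inherited from a parent overlay.

```python
# Keys the quoted guideline lists for each componentRef.
REQUIRED_KEYS = ("name", "type", "version", "valuesFile")


def missing_component_ref_keys(component_refs):
    """Map each componentRef (by name, or by index if unnamed) to its absent keys."""
    problems = {}
    for index, ref in enumerate(component_refs or []):
        missing = [key for key in REQUIRED_KEYS if key not in ref]
        if missing:
            problems[ref.get("name", f"#{index}")] = missing
    return problems


# Hypothetical entries shaped like the ones the finding describes.
refs = [
    {"name": "gpu-operator", "type": "Helm"},
    {"name": "nfd", "type": "Helm"},
]
print(missing_component_ref_keys(refs))
```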
c3daa9f to 286bda5
Recipe evidence check
Affected leaf overlays: 3
How to refresh evidence
Run on a cluster matching the recipe:
    aicr snapshot -o snapshot.yaml
    aicr validate \
      -r recipes/overlays/<slug>.yaml \
      -s snapshot.yaml \
      --emit-attestation ./out \
      --push ghcr.io/<your-fork>/aicr-evidence
    cp ./out/pointer.yaml recipes/evidence/<slug>.yaml
This gate is warning-only and never blocks merge. See ADR-007 for the trust model.
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment floor. Modeled on the H100/H200 EKS training overlays.
New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs; only one needs to land, and the others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy, nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob)
Nodewright tuning uses the nvidia-tuned generic profile (tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks. nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200; unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is Ampere with no matching profile. The generic profile is the supported baseline for such targets: governor=performance, iommu=pt, BBR, without the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200 profiles bake in.
Performance gating is intentionally omitted: the H100/H200 EKS nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink nodes with EFA and is neither fabric-class aware nor valid for A100, so an A100-on-EKS NCCL baseline is deferred to a follow-up.
Refs: NVIDIA#1002
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment floor. Modeled on the H100 AKS training overlays (AKS has no GB200 reference).
New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE PR (NVIDIA#1294); only one needs to land, and the other drops it on rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the AKS training baseline rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd topologyUpdater; nodewright tuning omitted (packages do not target service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob)
AKS pre-installs the NVIDIA container toolkit and the recipe inherits the AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand network-operator stack unchanged from aks.yaml.
Performance gating is intentionally omitted: the H100 AKS nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes and is neither fabric-class aware nor valid for A100, so an A100-on-Azure NCCL baseline is deferred to a follow-up.
Refs: NVIDIA#1002
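The practical effect of keeping the `>= 1.30` floor instead of the H100 `1.32.4` floor can be shown with a small worked example. This is a sketch under stated assumptions: `parse_version` and `meets_floor` are hypothetical helpers that handle only the `>= <version>` form seen in this PR, and the real aicr constraint parser may behave differently.

```python
import re


def parse_version(text):
    """Turn '1.30' or 'v24.6.0' into a tuple of ints, ignoring a leading 'v'."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"not a version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_floor(actual, constraint):
    """Evaluate a '>= X' constraint; other operators are out of scope here."""
    operator, _, floor = constraint.strip().partition(" ")
    if operator != ">=":
        raise ValueError(f"only '>=' is handled in this sketch: {constraint!r}")
    have, want = parse_version(actual), parse_version(floor)
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))  # pad so 1.30 compares as 1.30.0
    want += (0,) * (width - len(want))
    return have >= want


# A hypothetical cluster on Kubernetes 1.31.2:
print(meets_floor("1.31.2", ">= 1.30"))    # A100 AKS training floor
print(meets_floor("1.31.2", ">= 1.32.4"))  # H100 floor mentioned above
```

In this sketch a 1.31.2 cluster satisfies the A100 floor but not the H100 one, which is the gap the lower floor is meant to cover.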
286bda5 to d915f05
Actionable comments posted: 1
♻️ Duplicate comments (1)
recipes/overlays/a100-aks-training.yaml (1)
40-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add required `version` and `valuesFile` keys to each componentRef.
Both `gpu-operator` and `nfd` componentRefs are missing the required `version` and `valuesFile` fields specified in the overlay guidelines.
📋 Required fields per coding guidelines
As per coding guidelines: "Reference components in recipe overlays with `componentRefs` containing name, type, version, and valuesFile"
Add the appropriate version (matching other overlays) and valuesFile path (e.g., `values-aks-training.yaml` as mentioned in the PR description) to both entries.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-aks-training.yaml` around lines 40 - 57, The componentRefs entries for gpu-operator and nfd are missing required keys; update each componentRef (gpu-operator and nfd) to include a version field (use the same version pattern used in other overlays) and a valuesFile field pointing to the overlay values file (e.g., values-aks-training.yaml) so both entries include name, type, version, and valuesFile as required by the recipe overlay guidelines.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/a100-aks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant constraints entry for
K8s.server.version in the a100-aks-ubuntu-training overlay: since
K8s.server.version is already set to ">= 1.30" in the a100-aks-training parent
and the overlay doesn’t change the value, delete the K8s.server.version
constraint block from a100-aks-ubuntu-training.yaml (unless your intent is to
override or clear the parent value, in which case replace it with a different
value or an explicit clear directive).
---
Duplicate comments:
In `@recipes/overlays/a100-aks-training.yaml`:
- Around line 40-57: The componentRefs entries for gpu-operator and nfd are
missing required keys; update each componentRef (gpu-operator and nfd) to
include a version field (use the same version pattern used in other overlays)
and a valuesFile field pointing to the overlay values file (e.g.,
values-aks-training.yaml) so both entries include name, type, version, and
valuesFile as required by the recipe overlay guidelines.
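The redundant-constraint finding on a100-aks-ubuntu-training.yaml in the prompt above amounts to comparing a child's constraints with its parent's. Here is a minimal sketch, assuming constraints are parsed into a list of `name`/`value` pairs (that shape and the helper name are assumptions, not the confirmed overlay schema):

```python
def redundant_constraints(child, parent):
    """Names of child constraints whose value is identical to the parent's."""
    parent_values = {c["name"]: c["value"] for c in parent}
    return [
        c["name"]
        for c in child
        if c["name"] in parent_values and parent_values[c["name"]] == c["value"]
    ]


# Hypothetical parsed constraints mirroring the finding.
parent = [{"name": "K8s.server.version", "value": ">= 1.30"}]
child = [{"name": "K8s.server.version", "value": ">= 1.30"}]
print(redundant_constraints(child, parent))
```

A child that sets a different value for the same name is an override rather than a repeat, so this sketch does not report it.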
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.
New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern).
Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos.
gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs.
Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up.
Refs: NVIDIA#1002
Summary
Add A100 AKS overlays (issue #1002): the AKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor. Modeled on the H100 AKS training overlays (AKS has no GB200 reference).
Motivation / Context
`a100` is a declared accelerator in `pkg/recipe/criteria.go` but has zero overlays in `recipes/overlays/`, so `aicr recipe --accelerator a100 --service aks ...` cannot resolve. Companion to the A100 OKE PR (#1294); this slice covers AKS.
Fixes: N/A (incremental; part of #1002)
Related: #1002, #1294, #1305, #1306, #969, #1256
A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) ← this PR · #1305 (EKS) · #1306 (GKE)
Type of Change
Component(s) Affected
- `pkg/recipe`
Implementation Notes
New overlays (reuse existing `aks/aks-training` parents and `values-aks-training.yaml`; no new component values files):
- `a100-any`: deployment-phase floor: 4 standard checks + `Deployment.gpu-operator.version >= v24.6.0` (H100/H200-generation baseline; A100 operator-supported since v22.9).
- `a100-aks-training`: `base: aks-training`; `K8s >= 1.30`. A100 has no NVLink ComputeDomain requirement, so it keeps the AKS training baseline rather than H100's `1.32.4`. gpu-operator deps + `gdrcopy`, nfd `topologyUpdater`. Nodewright tuning omitted (packages don't target `service: aks`), matching `h100-aks-training`. Conformance mirrors the H100 AKS training set.
- `a100-aks-ubuntu-training`: `+ os-ubuntu` mixin.
- `a100-aks-ubuntu-training-kubeflow`: `+ platform-kubeflow` (Kubeflow Trainer for distributed `TrainJob`).
AKS pre-installs the NVIDIA container toolkit; the recipe inherits the AKS gpu-operator values and the InfiniBand `network-operator` RDMA stack unchanged from `aks.yaml`.
Performance gating is intentionally omitted: the H100 AKS floor is `nccl-all-reduce-bw >= 100`, calibrated for H100 ND-series nodes, and it is neither fabric-class aware (#1256: nccl-all-reduce-bw training gate is a fixed absolute fabric-specific busbw value applied to SKU-agnostic recipes → false-fails EKS/H100 small SKUs) nor valid for A100. An A100-on-Azure NCCL baseline is deferred to a follow-up.
Testing
Full `make qualify` not required: this touches only YAML overlay files (zero `.go` changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by `go test ./pkg/recipe/...` (passes) and yamllint (clean). No `docs/` page enumerates individual overlay leaves, so no doc update is needed.
Risk Assessment
Rollout notes: Additive. Other A100 AKS leaves (plain training, inference, dynamo) and remaining services tracked under #1002.
Checklist
- … `go test ./pkg/recipe/...`)
- … `.go` files changed)
- … `TestOverlayValidationPhaseFloor`)
- … `git commit -S`)