WIP: feat(recipes): add A100 AKS training Kubeflow overlay chain #1295

Draft
yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:feat/a100-aks-overlays

Conversation

@yuanchen8911 yuanchen8911 commented Jun 10, 2026

Contributor

Summary

Add A100 AKS overlays (issue #1002): the AKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor. Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

Motivation / Context

a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service aks ... cannot resolve. Companion to the A100 OKE PR (#1294); this slice covers AKS.

Fixes: N/A (incremental — part of #1002)
Related: #1002, #1294, #1305, #1306, #969, #1256

A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) ← this PR · #1305 (EKS) · #1306 (GKE)

Coordination: a100-any.yaml (the cross-service A100 deployment floor) is shared with #1294. Only one of the two PRs needs to introduce it — whichever lands first, the other will drop the duplicate on rebase.

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

New overlays (reuse existing aks/aks-training parents and values-aks-training.yaml — no new component values files):

  • a100-any: deployment-phase floor. 4 standard checks + Deployment.gpu-operator.version >= v24.6.0 (H100/H200-generation baseline; A100 has been operator-supported since v22.9).
  • a100-aks-training (base: aks-training): K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the AKS training baseline rather than H100's 1.32.4 floor. gpu-operator deps + gdrcopy, nfd topologyUpdater. Nodewright tuning omitted (packages don't target service: aks), matching h100-aks-training. Conformance mirrors the H100 AKS training set.
  • a100-aks-ubuntu-training: + os-ubuntu mixin.
  • a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob).

AKS pre-installs the NVIDIA container toolkit; the recipe inherits the AKS gpu-operator values and the InfiniBand network-operator RDMA stack unchanged from aks.yaml.
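
As a rough illustration of the ancestor chain described above, the sketch below walks `base` links from the Kubeflow leaf up to the service root. It is not the aicr resolver: the `OVERLAYS` dictionary and the `base_chain` helper are hypothetical, the `aks-training -> aks` link is assumed from the "aks/aks-training parents" wording, and cross-cutting overlays such as a100-any (matched by criteria rather than by `base`) are not modelled.

```python
# Hypothetical in-memory stand-in for the overlay files; only `base` is modelled.
OVERLAYS = {
    "aks": {"base": None},
    "aks-training": {"base": "aks"},  # assumed parent link
    "a100-aks-training": {"base": "aks-training"},
    "a100-aks-ubuntu-training": {"base": "a100-aks-training"},
    "a100-aks-ubuntu-training-kubeflow": {"base": "a100-aks-ubuntu-training"},
}


def base_chain(name, overlays):
    """Return overlay names ordered from the root ancestor down to `name`."""
    chain, seen = [], set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cycle in base chain at {name!r}")
        if name not in overlays:
            raise KeyError(f"unknown overlay {name!r}")
        seen.add(name)
        chain.append(name)
        name = overlays[name]["base"]
    return chain[::-1]


print(" -> ".join(base_chain("a100-aks-ubuntu-training-kubeflow", OVERLAYS)))
```

Ordering the result root-first mirrors how an additive chain is usually applied: each descendant layers its settings on top of what its ancestors already provided.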

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestOverlayValidationPhaseFloor auto-discovery)
yamllint recipes/overlays/a100-*.yaml    # clean
# End-to-end resolution of the new leaf:
aicr recipe --service aks --accelerator a100 --os ubuntu --intent training --platform kubeflow
#   -> components=13 overlays=8; K8s '>= 1.30'; Deployment.gpu-operator.version '>= v24.6.0';
#      kubeflow-trainer + network-operator present;
#      gpu-operator inherits values-aks-training.yaml
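
To make the two floors in that output concrete (K8s '>= 1.30' and gpu-operator '>= v24.6.0'), here is a minimal sketch of checking a reported version against a `>=` constraint. The `parse_version` and `satisfies` helpers are hypothetical and are not aicr's constraint evaluator; only the `>=` operator and plain dotted numeric versions with an optional leading `v` are handled.

```python
import re


def parse_version(text):
    """Turn 'v24.6.0' or '1.30' into a tuple of ints such as (24, 6, 0)."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies(actual, constraint):
    """Check `actual` against a '>= X' floor (the only operator modelled here)."""
    operator, _, floor = constraint.strip().partition(" ")
    if operator != ">=":
        raise ValueError(f"unsupported operator in {constraint!r}")
    have, need = parse_version(actual), parse_version(floor)
    width = max(len(have), len(need))
    # Pad the shorter tuple with zeros so '1.30' compares like '1.30.0'.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


print(satisfies("v25.3.0", ">= v24.6.0"))
print(satisfies("1.29.9", ">= 1.30"))
```

Comparing integer tuples rather than strings matters here: a string comparison would rank "1.9" above "1.30".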

Full make qualify not required: this touches only YAML overlay files (zero .go changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by go test ./pkg/recipe/... (passes) and yamllint (clean). No docs/ page enumerates individual overlay leaves, so no doc update is needed.

Risk Assessment

  • Low — Additive overlays only; no existing recipe or Go path changes. Easy to revert.

Rollout notes: Additive. Other A100 AKS leaves (plain training, inference, dynamo) and remaining services tracked under #1002.

Checklist

  • Tests pass locally (go test ./pkg/recipe/...)
  • Linter passes (yamllint clean; no .go files changed)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (covered by existing auto-discovery TestOverlayValidationPhaseFloor)
  • I updated docs if user-facing behavior changed (N/A — no leaf enumeration in docs)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 added the area/recipes and theme/recipes (Recipe expansion, overlays, mixins, and component registry) labels Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Walkthrough

Adds four new RecipeMetadata overlays: a100-any (cross-cutting baseline with deployment validation and gpu-operator version >= v24.6.0), a100-aks-training (AKS training overlay inheriting aks-training with K8s.server.version >= 1.30, gpu-operator gdrcopy enabled, nfd topologyUpdater enabled, and intent-level conformance checks), a100-aks-ubuntu-training (inherits a100-aks-training, applies os-ubuntu mixin and Ubuntu criteria), and a100-aks-ubuntu-training-kubeflow (adds platform=kubeflow and platform-kubeflow mixin).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

size/M

Suggested reviewers

  • xdu31
  • mchmarny

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly describes the main change: adding A100 AKS training Kubeflow overlay chain, which matches the changeset of four new recipe overlay files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description comprehensively documents the changes, motivations, implementation details, testing, and risk assessment, all directly related to the four new YAML overlay files. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml`:
- Around line 32-35: The mixins list in the overlay
a100-aks-ubuntu-training-kubeflow currently repeats os-ubuntu; remove the
redundant entry so only platform-kubeflow remains in the mixins block. Edit the
mixins for the a100-aks-ubuntu-training-kubeflow recipe (spec.mixins / mixins)
and delete "os-ubuntu", leaving the inherited os-ubuntu from
a100-aks-ubuntu-training and keeping "platform-kubeflow" only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 02e03bf0-2daf-4ef7-811d-4e6e7215b0fd

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0da78 and 2c2d1f2.

📒 Files selected for processing (4)
  • recipes/overlays/a100-aks-training.yaml
  • recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-aks-ubuntu-training.yaml
  • recipes/overlays/a100-any.yaml

Comment on lines +32 to +35
# Ubuntu OS constraints + Kubeflow trainer via mixins
mixins:
- os-ubuntu
- platform-kubeflow

@coderabbitai coderabbitai Bot Jun 10, 2026


🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Analyze mixin declaration patterns across overlay inheritance chains

# Find all overlays that declare os-ubuntu mixin
echo "=== Overlays declaring os-ubuntu mixin ==="
rg -l "os-ubuntu" recipes/overlays/*.yaml | while read file; do
    base=$(yq '.spec.base' "$file")
    mixins=$(yq '.spec.mixins[]' "$file" 2>/dev/null | tr '\n' ', ' | sed 's/,$//')
    echo "$file -> base: $base, mixins: [$mixins]"
done

# Check if base recipes that declare os-ubuntu have children that redeclare it
echo -e "\n=== Checking for mixin redeclaration in child overlays ==="
for base_file in recipes/overlays/*ubuntu*.yaml; do
    base_name=$(basename "$base_file" .yaml)
    has_os_ubuntu=$(yq '.spec.mixins[] | select(. == "os-ubuntu")' "$base_file" 2>/dev/null)
    if [ -n "$has_os_ubuntu" ]; then
        echo "Base: $base_name declares os-ubuntu"
        # Find children that inherit from this base
        rg -l "base: $base_name" recipes/overlays/*.yaml | while read child; do
            child_mixins=$(yq '.spec.mixins[] | select(. == "os-ubuntu")' "$child" 2>/dev/null)
            if [ -n "$child_mixins" ]; then
                echo "  ↳ Child: $(basename "$child") REDECLARES os-ubuntu"
            else
                echo "  ↳ Child: $(basename "$child") does NOT redeclare os-ubuntu"
            fi
        done
    fi
done

Repository: NVIDIA/aicr

Length of output: 2466


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import glob, os, sys, yaml, json

# Load all overlay YAMLs under recipes/overlays
paths = sorted(glob.glob("recipes/overlays/**/*.yaml", recursive=True))

def load_yaml(p):
    with open(p, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

# 1) Find all overlays that explicitly mention os-ubuntu in spec.mixins
os_ubuntu_overlays = []
for p in paths:
    try:
        doc = load_yaml(p)
    except Exception:
        continue
    spec = doc.get("spec") if isinstance(doc, dict) else None
    mixins = spec.get("mixins") if isinstance(spec, dict) else None
    if isinstance(mixins, list) and "os-ubuntu" in mixins:
        base = spec.get("base") if isinstance(spec, dict) else None
        os_ubuntu_overlays.append((p, base, mixins))
print("=== Overlays where spec.mixins contains os-ubuntu ===")
for p, base, mixins in os_ubuntu_overlays:
    print(f"{p} -> base={base}, mixins={mixins}")

# 2) Trace the base chain for the specific overlay and report mixins along the chain
target = "recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml"

def build_base_map():
    m = {}
    for p in paths:
        try:
            doc = load_yaml(p)
        except Exception:
            continue
        spec = doc.get("spec") if isinstance(doc, dict) else None
        base = spec.get("base") if isinstance(spec, dict) else None
        name = doc.get("metadata", {}).get("name") if isinstance(doc.get("metadata"), dict) else None
        # map by metadata.name when available, else by filename
        if name:
            m[name] = p
        m[p] = p
        # also map base identifiers to paths later if they correspond to metadata.name
    return m

target_doc = load_yaml(target)
target_spec = target_doc.get("spec", {})
target_base = target_spec.get("base")
target_mixins = target_spec.get("mixins")

print("\n=== Target overlay info ===")
print(f"file={target}")
print(f"spec.base={target_base}")
print(f"spec.mixins={target_mixins}")

# Resolve base chain: base may be "some-overlay.yaml" or metadata.name or relative ref.
# We'll attempt common patterns:
# - If base endswith .yaml, resolve to that file path.
# - Else, treat base as a metadata.name and look up overlays with matching metadata.name.
name_to_path = {}
for p in paths:
    try:
        doc = load_yaml(p)
    except Exception:
        continue
    md = doc.get("metadata", {})
    if isinstance(md, dict) and "name" in md:
        name_to_path[md["name"]] = p

def resolve_base(base):
    if base is None:
        return None
    if isinstance(base, str) and base.endswith(".yaml"):
        # base might be relative to recipes/overlays
        candidate = os.path.normpath(os.path.join("recipes/overlays", base))
        if os.path.exists(candidate):
            return candidate
        if os.path.exists(base):
            return base
        # try absolute relative
        if os.path.exists(os.path.join("recipes/overlays", os.path.basename(base))):
            return os.path.join("recipes/overlays", os.path.basename(base))
        return None
    if isinstance(base, str):
        return name_to_path.get(base)
    return None

chain = []
seen = set()
cur = target
while cur and cur not in seen:
    seen.add(cur)
    doc = load_yaml(cur)
    spec = doc.get("spec", {}) if isinstance(doc, dict) else {}
    base = spec.get("base")
    mixins = spec.get("mixins")
    chain.append((cur, base, mixins))
    cur = resolve_base(base)

print("\n=== Inheritance chain (child -> ancestors via spec.base) ===")
for i, (f, b, mix) in enumerate(chain):
    print(f"{i}: {f}")
    print(f"   base: {b}")
    print(f"   mixins: {mix}")

# 3) For each overlay in the chain that contains os-ubuntu, show whether it is also declared in its immediate parent
# (i.e., whether redeclaration occurs at each step).
print("\n=== os-ubuntu redeclaration along chain ===")
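for i in range(len(chain) - 1):
    # chain is ordered child -> ancestors, so chain[i + 1] is the parent of chain[i]
    child_file, _, child_mixins = chain[i]
    parent_file, _, parent_mixins = chain[i + 1]
    child_has = isinstance(child_mixins, list) and "os-ubuntu" in child_mixins
    parent_has = isinstance(parent_mixins, list) and "os-ubuntu" in parent_mixins
    if child_has and parent_has:
        print(f"{child_file} redeclares os-ubuntu (already declared in {parent_file})")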
PY

# Print a small, readable excerpt of the target overlay around the relevant lines
echo
echo "=== Excerpt: target overlay ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml | nl -ba | sed -n '20,60p'

echo
echo "=== Excerpt: parent overlay candidate(s) based on spec.base ==="
# Extract spec.base from target quickly and then print that file if resolvable
base_line=$(python3 - <<'PY'
import yaml
p="recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml"
with open(p,'r',encoding='utf-8') as f:
    doc=yaml.safe_load(f)
print(doc.get('spec',{}).get('base'))
PY
)
echo "spec.base from target: $base_line"
if [[ "$base_line" == *.yaml ]]; then
  # try common resolution
  if [[ -f "recipes/overlays/$base_line" ]]; then
    sed -n '1,120p' "recipes/overlays/$base_line" | nl -ba | sed -n '20,60p'
  elif [[ -f "$base_line" ]]; then
    sed -n '1,120p' "$base_line" | nl -ba | sed -n '20,60p'
  fi
fi

Repository: NVIDIA/aicr

Length of output: 3007


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Show relevant YAML snippets (avoid `nl`)
echo "=== Target overlay (mixins + base) ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml | cat -n | sed -n '20,60p'
echo
echo "=== Parent overlay (mixins + base) ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training.yaml | cat -n | sed -n '20,60p'

# 2) Locate recipe overlay/mixin resolution logic in Go
echo
echo "=== Search for mixins resolution/merge logic ==="
rg -n --hidden --glob '!**/vendor/**' "mixins" pkg/recipe pkg/bundler pkg/recipe* || true

# 3) Narrow to likely overlay resolver files
echo
echo "=== Candidate files in pkg/recipe for overlay processing ==="
rg -n --hidden --glob '!**/vendor/**' "overlay" pkg/recipe || true

# 4) Search for YAML spec struct fields named Mixins
echo
echo "=== Search for struct fields named Mixins ==="
rg -n --hidden --glob '!**/vendor/**' "Mixins" pkg/recipe pkg || true

Repository: NVIDIA/aicr

Length of output: 50367


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/recipe/metadata.go (mixins merge/accumulation) ==="
sed -n '680,760p' pkg/recipe/metadata.go | nl -ba | sed -n '1,120p'

echo
echo "=== pkg/recipe/metadata_store.go (mergeMixins) ==="
sed -n '520,700p' pkg/recipe/metadata_store.go | nl -ba | sed -n '1,220p'

Repository: NVIDIA/aicr

Length of output: 175


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/recipe/metadata.go: mixins accumulation ==="
# show without nl
sed -n '680,760p' pkg/recipe/metadata.go

echo
echo "=== pkg/recipe/metadata_store.go: mergeMixins ==="
sed -n '520,700p' pkg/recipe/metadata_store.go

Repository: NVIDIA/aicr

Length of output: 10009


Remove redundant os-ubuntu from a100-aks-ubuntu-training-kubeflow mixins

a100-aks-ubuntu-training already declares spec.mixins: [os-ubuntu]; inheritance-chain mixins are accumulated into descendants with de-duplication, so a100-aks-ubuntu-training-kubeflow only needs to add platform-kubeflow (dropping the repeated os-ubuntu is safe/reduces noise). [pkg/recipe/metadata.go] [recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml, recipes/overlays/a100-aks-ubuntu-training.yaml]
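
The accumulate-and-de-duplicate behaviour claimed above can be sketched as follows. This is an illustration, not the code in pkg/recipe/metadata.go: `accumulate_mixins` is a hypothetical helper, and the empty mixin list for a100-aks-training is an assumption.

```python
def accumulate_mixins(chain):
    """Merge mixin lists from root ancestor to leaf, keeping first occurrences."""
    merged = {}  # dicts preserve insertion order, so this acts as an ordered set
    for mixins in chain:
        for mixin in mixins:
            merged.setdefault(mixin, None)
    return list(merged)


as_submitted = [
    [],                                  # a100-aks-training (assumed: no mixins)
    ["os-ubuntu"],                       # a100-aks-ubuntu-training
    ["os-ubuntu", "platform-kubeflow"],  # a100-aks-ubuntu-training-kubeflow
]
# Same chain with the leaf declaring only platform-kubeflow.
trimmed = as_submitted[:-1] + [["platform-kubeflow"]]

print(accumulate_mixins(as_submitted))
print(accumulate_mixins(as_submitted) == accumulate_mixins(trimmed))
```

Under this model both variants resolve to the same mixin set, which is why the redeclaration is a style question rather than a behavioural one.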

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml` around lines 32 -
35, The mixins list in the overlay a100-aks-ubuntu-training-kubeflow currently
repeats os-ubuntu; remove the redundant entry so only platform-kubeflow remains
in the mixins block. Edit the mixins for the a100-aks-ubuntu-training-kubeflow
recipe (spec.mixins / mixins) and delete "os-ubuntu", leaving the inherited
os-ubuntu from a100-aks-ubuntu-training and keeping "platform-kubeflow" only.

Contributor Author

Keeping this for consistency with the established convention. Every *-ubuntu-training-kubeflow leaf in the repo (gb200-eks, gb200-oke, h100-aks, h100-eks) declares both os-ubuntu and platform-kubeflow in the leaf's mixins block. Removing it from only this leaf would diverge from that pattern, so leaving it as-is intentionally.


@yuanchen8911, understood — that's a sensible consistency rationale. Explicitly re-declaring both os-ubuntu and platform-kubeflow in every *-ubuntu-training-kubeflow leaf makes the mixin intent self-documenting and keeps the pattern uniform across all siblings. Leaving it as-is makes sense.


✏️ Learnings added
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1295
File: recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml:32-35
Timestamp: 2026-06-10T20:30:35.631Z
Learning: In NVIDIA/aicr recipe overlays, every `*-ubuntu-training-kubeflow` leaf overlay (e.g., gb200-eks, gb200-oke, h100-aks, h100-eks, a100-aks) explicitly declares both `os-ubuntu` and `platform-kubeflow` in its `spec.mixins` block, even though `os-ubuntu` is already inherited from the parent `*-ubuntu-training` overlay. This is an intentional convention for self-documentation and consistency across sibling overlays. Do not flag this as a redundant mixin redeclaration in `recipes/overlays/*-ubuntu-training-kubeflow.yaml` files.



🧠 Learnings used
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1046
File: recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml:46-53
Timestamp: 2026-05-28T15:05:30.166Z
Learning: When reviewing an overlay YAML under recipes/overlays, don’t require regenerating BOM/user docs (e.g., running `make bom-docs` and updating `docs/user/container-images.md`) if the overlay reuses chart+version pins that are already present in an existing overlay/registry entry. Concretely, if the overlay’s chart/version pin set (such as the chart name and semver/version pin) exactly matches what’s already reflected in `docs/user/container-images.md`, `make bom-docs` should be a no-op—so BOM regeneration/docs updates are not required. Only flag BOM/doc regeneration when the overlay introduces new (previously unseen) chart/version pins that would change the documented container image/version set.

@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 2c2d1f2 to c3daa9f Compare June 10, 2026 20:27

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-aks-training.yaml`:
- Around line 40-57: The componentRefs entries for gpu-operator and nfd are
missing required keys; update each componentRef (the entries with name:
gpu-operator and name: nfd) to include a version and a valuesFile key per
overlay guidelines; ensure you add the appropriate semantic version string for
the component (e.g., matching other overlays) to the version field and the path
or filename that inherits/overrides values (e.g., the aks-training values file)
to valuesFile so both entries contain name, type, version, and valuesFile.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bd0945c0-3115-4f6a-9c43-a31a57227567

📥 Commits

Reviewing files that changed from the base of the PR and between 2c2d1f2 and c3daa9f.

📒 Files selected for processing (4)
  • recipes/overlays/a100-aks-training.yaml
  • recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-aks-ubuntu-training.yaml
  • recipes/overlays/a100-any.yaml

Comment thread recipes/overlays/a100-aks-training.yaml
@github-actions

github-actions Bot commented Jun 10, 2026

Contributor

Recipe evidence check

Affected leaf overlays: 3

| Recipe | Pointer | Verify | Digest match |
|---|---|---|---|
| a100-aks-training | ⚠️ missing | | |
| a100-aks-ubuntu-training-kubeflow | ⚠️ missing | | |
| a100-aks-ubuntu-training | ⚠️ missing | | |

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002

Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 286bda5 to d915f05 Compare June 11, 2026 01:34

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
recipes/overlays/a100-aks-training.yaml (1)

40-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required version and valuesFile keys to each componentRef.

Both gpu-operator and nfd componentRefs are missing the required version and valuesFile fields specified in the overlay guidelines.

📋 Required fields per coding guidelines

As per coding guidelines: "Reference components in recipe overlays with componentRefs containing name, type, version, and valuesFile"

Add the appropriate version (matching other overlays) and valuesFile path (e.g., values-aks-training.yaml as mentioned in the PR description) to both entries.
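
A minimal sketch of the check this comment implies, assuming the four keys quoted from the guidelines are mandatory on every entry. Whether that requirement applies to overlay entries that inherit from a parent is not settled in this thread. `missing_keys` is a hypothetical helper, and the sample entries (including the `Helm` type and the placeholder version) are invented for illustration; the real overlay content is not shown here.

```python
REQUIRED_KEYS = ("name", "type", "version", "valuesFile")


def missing_keys(component_refs):
    """Map each componentRef name to the required keys it lacks."""
    report = {}
    for ref in component_refs:
        absent = [key for key in REQUIRED_KEYS if key not in ref]
        if absent:
            report[ref.get("name", "<unnamed>")] = absent
    return report


# Hypothetical entries, not copied from a100-aks-training.yaml.
refs = [
    {"name": "gpu-operator", "type": "Helm"},
    {
        "name": "nfd",
        "type": "Helm",
        "version": "0.0.0",  # placeholder version
        "valuesFile": "values-aks-training.yaml",
    },
]

print(missing_keys(refs))
```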

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-aks-training.yaml` around lines 40 - 57, The
componentRefs entries for gpu-operator and nfd are missing required keys; update
each componentRef (gpu-operator and nfd) to include a version field (use the
same version pattern used in other overlays) and a valuesFile field pointing to
the overlay values file (e.g., values-aks-training.yaml) so both entries include
name, type, version, and valuesFile as required by the recipe overlay
guidelines.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-aks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant constraints entry for
K8s.server.version in the a100-aks-ubuntu-training overlay: since
K8s.server.version is already set to ">= 1.30" in the a100-aks-training parent
and the overlay doesn’t change the value, delete the K8s.server.version
constraint block from a100-aks-ubuntu-training.yaml (unless your intent is to
override or clear the parent value, in which case replace it with a different
value or an explicit clear directive).

---

Duplicate comments:
In `@recipes/overlays/a100-aks-training.yaml`:
- Around line 40-57: The componentRefs entries for gpu-operator and nfd are
missing required keys; update each componentRef (gpu-operator and nfd) to
include a version field (use the same version pattern used in other overlays)
and a valuesFile field pointing to the overlay values file (e.g.,
values-aks-training.yaml) so both entries include name, type, version, and
valuesFile as required by the recipe overlay guidelines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 28cd1b3d-310d-4325-968d-c2dbfb95cfff

📥 Commits

Reviewing files that changed from the base of the PR and between 286bda5 and d915f05.

📒 Files selected for processing (4)
  • recipes/overlays/a100-aks-training.yaml
  • recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-aks-ubuntu-training.yaml
  • recipes/overlays/a100-any.yaml

Comment thread recipes/overlays/a100-aks-ubuntu-training.yaml
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002

Labels

area/recipes size/L theme/recipes Recipe expansion, overlays, mixins, and component registry
