WIP: feat(recipes): add A100 EKS training Kubeflow overlay chain #1305
yuanchen8911 wants to merge 1 commit into
51ef12d to 6b4a4bd
📝 Walkthrough
This PR introduces four new recipe overlay manifests for A100 accelerator workloads in the AICR repository. A wildcard
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed
Recipe evidence check
Affected leaf overlays: 3
How to refresh evidence
Run on a cluster matching the recipe's
    aicr snapshot -o snapshot.yaml
    aicr validate \
      -r recipes/overlays/<slug>.yaml \
      -s snapshot.yaml \
      --emit-attestation ./out \
      --push ghcr.io/<your-fork>/aicr-evidence
    cp ./out/pointer.yaml recipes/evidence/<slug>.yaml
This gate is warning-only and never blocks merge. See ADR-007 for the trust model.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/a100-eks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant K8s.server.version constraint from the
a100-eks-ubuntu-training overlay: delete the duplicated constraints entry that
sets "K8s.server.version >= 1.30" in the a100-eks-ubuntu-training specialization
and rely on the version floor inherited from the ancestor
a100-eks-training.yaml; only keep an explicit constraint here if this
specialization needs a stricter (higher) minimum than the ancestor.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8c71c1c5-ab8f-479b-a9cc-e7093dd44fff
📒 Files selected for processing (4)
- recipes/overlays/a100-any.yaml
- recipes/overlays/a100-eks-training.yaml
- recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml
- recipes/overlays/a100-eks-ubuntu-training.yaml
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment floor. Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs; only one needs to land, and the others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy, nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile (tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks. nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200; unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is Ampere with no matching profile. The generic profile is the supported baseline for such targets: governor=performance, iommu=pt, BBR, without the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200 profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink nodes with EFA and is neither fabric-class aware nor valid for A100, so an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
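The version floors in this commit message (K8s >= 1.30 for A100, 1.32.4 for H100/H200, gpu-operator >= v24.6.0) can be illustrated with a small comparison sketch. This is not how aicr evaluates constraints; the function names `parse_version` and `satisfies` are hypothetical, and the parsing rule (leading `v` optional, missing components treated as 0, any suffix after the numeric part ignored) is an assumption made for the example.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption, not aicr's grammar).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Parse 'v1.32.4', '1.30' or '1.31.2-eks-abc' into a 3-tuple of ints.

    Missing minor/patch components default to 0; any suffix is ignored.
    """
    match = re.match(r"\s*v?(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unparseable version: {text!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def satisfies(version, constraint):
    """Return True if `version` meets a constraint such as '>= 1.30'."""
    match = re.match(r"\s*(>=|<=|==|>|<)\s*(\S+)\s*$", constraint)
    if match is None:
        raise ValueError(f"unparseable constraint: {constraint!r}")
    op, floor = match.groups()
    return _OPS[op](parse_version(version), parse_version(floor))
```

Under these assumptions, a 1.31 cluster would meet the A100 floor (`satisfies("1.31.2", ">= 1.30")` is true) while missing the H100/H200 floor (`satisfies("1.31.2", ">= 1.32.4")` is false).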
6b4a4bd to 331ae98
♻️ Duplicate comments (2)
recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml (1)
38-40: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Redundant constraint inherited from ancestor overlay.
The K8s.server.version >= 1.30 constraint is already defined in the ancestor a100-eks-training overlay (line 34-35) and inherited through the overlay chain (a100-eks-training → a100-eks-ubuntu-training → this file). Repeating it creates a maintenance burden: if the version floor is raised, multiple files must be updated. Remove this redundant constraint block and rely on inheritance unless this specialization requires a stricter (higher) minimum version.
♻️ Proposed cleanup
      mixins:
        - os-ubuntu
        - platform-kubeflow
    - # A100 + EKS specific constraints (not covered by mixin)
    - constraints:
    -   - name: K8s.server.version
    -     value: ">= 1.30"
      componentRefs: []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml` around lines 38 - 40, Remove the redundant K8s.server.version constraint block from the overlays/a100-eks-ubuntu-training-kubeflow overlay: locate the constraints: section and delete the entry with name: K8s.server.version and value: ">= 1.30" (it’s inherited from the a100-eks-training overlay); if this overlay truly needs a stricter minimum, replace the value instead of duplicating the same constraint.

recipes/overlays/a100-eks-ubuntu-training.yaml (1)
36-38: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Redundant constraint inherited from base overlay.
The K8s.server.version >= 1.30 constraint is already defined in the base a100-eks-training overlay (line 34-35). Due to overlay inheritance, this constraint is automatically inherited by all descendants. Repeating it here creates a maintenance burden: if the version floor is raised, multiple files must be updated. Remove this redundant constraint block and rely on inheritance unless this specialization requires a stricter (higher) minimum version.
♻️ Proposed cleanup
      mixins:
        - os-ubuntu
    - # A100 + EKS specific constraints (not covered by mixin)
    - constraints:
    -   - name: K8s.server.version
    -     value: ">= 1.30"
      componentRefs: []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-eks-ubuntu-training.yaml` around lines 36 - 38, Remove the redundant constraints block that defines "K8s.server.version >= 1.30" from the a100-eks-ubuntu-training overlay (the "- name: K8s.server.version / value: \">= 1.30\"" entry) so the overlay inherits the setting from the base a100-eks-training overlay; only reintroduce a constraints entry here if you intend to enforce a stricter minimum version, and after removal run the overlay validation to confirm no other duplicates remain.
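The redundancy these review comments describe can be checked mechanically. The following is a minimal sketch, assuming a simplified in-memory model in which each overlay has a `base` name and a flat `constraints` mapping; the `OVERLAYS` dictionary and both helper functions are hypothetical and do not reflect the real overlay schema or loader in pkg/recipe.

```python
# Hypothetical, simplified model of the chain named in the review:
# a100-eks-training -> a100-eks-ubuntu-training -> a100-eks-ubuntu-training-kubeflow
OVERLAYS = {
    "a100-eks-training": {
        "base": "eks-training",  # parent not modeled here, so the walk stops
        "constraints": {"K8s.server.version": ">= 1.30"},
    },
    "a100-eks-ubuntu-training": {
        "base": "a100-eks-training",
        "constraints": {"K8s.server.version": ">= 1.30"},
    },
    "a100-eks-ubuntu-training-kubeflow": {
        "base": "a100-eks-ubuntu-training",
        "constraints": {"K8s.server.version": ">= 1.30"},
    },
}


def ancestors(name, overlays):
    """Return the ancestor names of `name`, nearest first."""
    chain = []
    seen = {name}  # guards against accidental cycles in `base` links
    current = overlays.get(name, {}).get("base")
    while current in overlays and current not in seen:
        chain.append(current)
        seen.add(current)
        current = overlays[current].get("base")
    return chain


def redundant_constraints(name, overlays):
    """List (constraint, ancestor) pairs where `name` repeats an inherited value.

    Only the nearest ancestor that defines the constraint is compared, since
    that is the value the overlay would inherit if it stayed silent.
    """
    found = []
    for key, value in overlays[name].get("constraints", {}).items():
        for ancestor in ancestors(name, overlays):
            inherited = overlays[ancestor].get("constraints", {}).get(key)
            if inherited is not None:
                if inherited == value:
                    found.append((key, ancestor))
                break
    return found
```

With this model, `redundant_constraints("a100-eks-ubuntu-training", OVERLAYS)` reports the `K8s.server.version` entry as a repeat of `a100-eks-training`, while a stricter value in the child would not be flagged.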
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 289c0246-c2c5-4d06-a957-ba6cf54c2d6f
📒 Files selected for processing (4)
- recipes/overlays/a100-any.yaml
- recipes/overlays/a100-eks-training.yaml
- recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml
- recipes/overlays/a100-eks-ubuntu-training.yaml
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
Summary
Add A100 EKS overlays (issue #1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor. Modeled on the H100/H200 EKS training overlays.
Motivation / Context
a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service eks ... cannot resolve. Companion to the A100 OKE (#1294) and AKS (#1295) PRs; this slice covers EKS.
Fixes: N/A (incremental, part of #1002)
Related: #1002, #1294, #1295, #1306, #969, #1256
A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) · #1305 (EKS) ← this PR · #1306 (GKE)
Type of Change
Component(s) Affected
pkg/recipe
Implementation Notes
New overlays (reuse existing eks/eks-training parents and values-eks-training.yaml; no new component values files):
- a100-any: deployment-phase floor: 4 standard checks + Deployment.gpu-operator.version >= v24.6.0 (H100/H200-generation baseline; A100 operator-supported since v22.9).
- a100-eks-training: base: eks-training; K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than H100/H200's 1.32.4. gpu-operator cdi + gdrcopy, nfd topologyUpdater. Conformance mirrors the H100/H200 EKS training set.
- a100-eks-ubuntu-training: + os-ubuntu mixin.
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob).
Key decisions documented in-file:
- Nodewright tuning uses the nvidia-tuned generic profile (tuning-generic.yaml, accelerator: generic), mirroring rtx-pro-6000-eks. nvidia-setup ships baked-in (service, accelerator) configs only for eks-h100 / eks-gb200; unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is Ampere with no matching profile. The generic profile is the supported baseline for such targets: governor=performance, iommu=pt, BBR, without the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200 profiles bake in. Note this is a safe baseline, not HGX-optimal: p4d/p4de A100 multi-node training would benefit from IB/NVLink tuning the generic profile omits; a dedicated A100 profile is a possible follow-up.
- Performance gating is intentionally omitted. The H100/H200 EKS floor is nccl-all-reduce-bw >= 300, calibrated on 8-GPU Hopper NVLink nodes with EFA, which is neither fabric-class aware (#1256: "nccl-all-reduce-bw training gate is a fixed absolute fabric-specific busbw value applied to SKU-agnostic recipes → false-fails EKS/H100 small SKUs") nor valid for A100. An A100-on-EKS NCCL baseline is deferred to a follow-up.
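The reasoning behind omitting the performance gate can be sketched as a lookup that refuses to borrow another target's floor. This is an illustrative model only, not aicr's validation logic: the `FLOORS` table and the `gate` function are hypothetical, the only thresholds used are the ones quoted in this PR series (300 for H100/H200 on EKS, 250 for H100 on GKE), and keying by (service, accelerator) is a simplification that does not address the fabric-class issue tracked in #1256.

```python
# nccl-all-reduce-bw floors quoted in this PR series; no A100 entry exists yet.
FLOORS = {
    ("eks", "h100"): 300.0,
    ("eks", "h200"): 300.0,
    ("gke", "h100"): 250.0,
}


def gate(service, accelerator, measured_busbw):
    """Return 'pass', 'fail', or 'skipped' for a measured all-reduce busbw.

    A target without a calibrated floor is skipped rather than judged
    against a floor calibrated on different hardware.
    """
    floor = FLOORS.get((service, accelerator))
    if floor is None:
        return "skipped"
    return "pass" if measured_busbw >= floor else "fail"
```

In this model an A100-on-EKS measurement returns "skipped" whatever its value, which mirrors the decision to defer the A100 baseline instead of reusing the Hopper number.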
Full
make qualifynot required: this touches only YAML overlay files (zero.gochanges), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised bygo test ./pkg/recipe/...(passes) and yamllint (clean). Nodocs/page enumerates individual overlay leaves, so no doc update is needed.Risk Assessment
Rollout notes: Additive. Other A100 EKS leaves (plain training, inference, dynamo) and remaining services tracked under #1002.
Checklist
go test ./pkg/recipe/...)
.go files changed)
TestOverlayValidationPhaseFloor)
git commit -S)