feat(recipes): add A100 GKE COS training Kubeflow overlay chain #1306
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| | ||
| # Cross-cutting overlay applied via criteria-wildcard matching. | ||
| # | ||
| # Carries the deployment-phase floor (the 4 standard checks plus the | ||
| # gpu-operator version pin) and applies to every A100 query regardless | ||
| # of service or intent, so every concrete A100 leaf (training or | ||
| # inference, any service) inherits the version pin. | ||
| # | ||
| # Per-field union merge (see pkg/recipe/metadata.go) means concrete leaves | ||
| # that declare their own `deployment:` block add to or override this floor | ||
| # without dropping its inherited checks — same-name constraints from the | ||
| # leaf win, additional checks are appended. | ||
| # | ||
| # See docs/contributor/recipe.md#criteria-wildcard-overlays for details. | ||
| | ||
| kind: RecipeMetadata | ||
| apiVersion: aicr.nvidia.com/v1alpha1 | ||
| metadata: | ||
| name: a100-any | ||
| | ||
| spec: | ||
| base: base | ||
| | ||
| criteria: | ||
| service: any | ||
| accelerator: a100 | ||
| | ||
| validation: | ||
| deployment: | ||
| checks: | ||
| - operator-health | ||
| - expected-resources | ||
| - gpu-operator-version | ||
| - check-nvidia-smi | ||
| constraints: | ||
| # A100 has been supported since the early gpu-operator releases | ||
| # (v22.9). Floor at the same generation baseline as H100/H200 | ||
| # (v24.6.0) rather than the Blackwell floor; concrete leaves can | ||
| # tighten if they pin to a later working version. | ||
| - name: Deployment.gpu-operator.version | ||
| value: ">= v24.6.0" | ||
|
Comment on lines +34 to +54
Add required …
Suggested minimal fix:
  spec:
    base: base
+   mixins: []
    criteria:
      service: any
      accelerator: a100
As per coding guidelines: “Recipe overlays must specify base, mixins, criteria, and constraints; criteria must match defined enums.”
Source: Coding guidelines |
||
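The header comment of the `a100-any` overlay describes it as a cross-cutting overlay selected through criteria-wildcard matching: `service: any` is meant to apply to every A100 query regardless of service or intent. A minimal Python sketch of that rule follows. It assumes that `any` and an unset criterion both act as wildcards; the function name `criteria_match` and the dict representation are hypothetical illustrations, not the project's implementation.

```python
WILDCARD = "any"  # assumption: "any" or an unset field matches every query value


def criteria_match(overlay: dict, query: dict) -> bool:
    """Return True when every non-wildcard overlay criterion equals the query's value."""
    for field, wanted in overlay.items():
        if wanted == WILDCARD:
            continue  # wildcard criterion: accept whatever the query asks for
        if query.get(field) != wanted:
            return False
    return True


# Criteria copied from the overlays in this PR.
a100_any = {"service": "any", "accelerator": "a100"}
kubeflow_leaf = {
    "service": "gke", "accelerator": "a100", "os": "cos",
    "intent": "training", "platform": "kubeflow",
}

query = {
    "service": "gke", "accelerator": "a100", "os": "cos",
    "intent": "training", "platform": "kubeflow",
}

print(criteria_match(a100_any, query))       # the cross-cutting overlay applies
print(criteria_match(kubeflow_leaf, query))  # the concrete leaf applies as well
print(criteria_match(a100_any, {"service": "eks", "accelerator": "h100"}))
```

Under these assumptions both overlays match the Kubeflow query, which is why the leaf can rely on `a100-any` for the shared deployment-phase floor, while a query for a different accelerator is rejected.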
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| | ||
| kind: RecipeMetadata | ||
| apiVersion: aicr.nvidia.com/v1alpha1 | ||
| metadata: | ||
| name: a100-gke-cos-training-kubeflow | ||
| | ||
| spec: | ||
| # Inherits from a100-gke-cos-training recipe (A100 + GKE COS + training settings) | ||
| # This overlay adds Kubeflow Training Operator for distributed training with TrainJob | ||
| base: a100-gke-cos-training | ||
| | ||
| criteria: | ||
| service: gke | ||
| accelerator: a100 | ||
| os: cos | ||
| intent: training | ||
| platform: kubeflow | ||
| | ||
| # Constraints for A100 on GKE COS for Kubeflow training workloads | ||
| constraints: | ||
| - name: K8s.server.version | ||
| value: ">= 1.30" | ||
|
coderabbitai[bot] marked this conversation as resolved.
Comment on lines +20 to +35
Add explicit … The overlay spec has …
Suggested fix:
  spec:
    # Inherits from a100-gke-cos-training recipe (A100 + GKE COS + training settings)
    # This overlay adds Kubeflow Training Operator for distributed training with TrainJob
    base: a100-gke-cos-training
+
+   mixins: []
    criteria:
Source: Coding guidelines |
||
| | ||
| # Kubeflow Training Operator for TrainJob support. | ||
| # Declared inline (not via the platform-kubeflow mixin) to match the GKE COS | ||
| # pattern in h100-gke-cos-training-kubeflow. | ||
| componentRefs: | ||
| - name: kubeflow-trainer | ||
| type: Helm | ||
| valuesFile: components/kubeflow-trainer/values.yaml | ||
| manifestFiles: | ||
|
Comment on lines +41 to +44
Pin … On Lines 41-44, this componentRef omits …
As per coding guidelines: “Reference components in recipe overlays with …”
Source: Coding guidelines |
||
| - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml | ||
| dependencyRefs: | ||
| - cert-manager | ||
| - kube-prometheus-stack | ||
| - gpu-operator | ||
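The overlays express version floors as a name plus a comparison expression, such as `K8s.server.version` with `>= 1.30` and `Deployment.gpu-operator.version` with `>= v24.6.0`. The sketch below shows one way such an expression could be evaluated. It assumes dotted numeric versions with an optional leading `v` and ignores any suffix such as `-gke.100`; the helpers `parse_version` and `satisfies` are hypothetical and are not the project's evaluator.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption, not a documented list).
_OPS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq,
}


def parse_version(text: str) -> tuple:
    """Turn 'v24.6.0' or '1.30' into a tuple of ints such as (24, 6, 0)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"not a version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies(actual: str, constraint: str) -> bool:
    """Check an observed version against an expression like '>= 1.30'."""
    op_text, wanted_text = constraint.split(maxsplit=1)
    actual_v, wanted_v = parse_version(actual), parse_version(wanted_text)
    # Pad the shorter tuple with zeros so 1.30 compares equal to 1.30.0.
    width = max(len(actual_v), len(wanted_v))
    actual_v += (0,) * (width - len(actual_v))
    wanted_v += (0,) * (width - len(wanted_v))
    return _OPS[op_text](actual_v, wanted_v)


print(satisfies("1.31.4-gke.100", ">= 1.30"))
print(satisfies("1.9", ">= 1.30"))
print(satisfies("v24.6.0", ">= v24.6.0"))
```

Comparing integer tuples rather than strings matters here: as plain strings `"1.9"` sorts after `"1.30"`, which would wrongly accept a Kubernetes 1.9 server against the 1.30 floor.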
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| | ||
| kind: RecipeMetadata | ||
| apiVersion: aicr.nvidia.com/v1alpha1 | ||
| metadata: | ||
| name: a100-gke-cos-training | ||
| | ||
| spec: | ||
| # Inherits from gke-cos-training recipe (GKE COS + training settings) | ||
| base: gke-cos-training | ||
| | ||
| criteria: | ||
| service: gke | ||
| accelerator: a100 | ||
| os: cos | ||
| intent: training | ||
| | ||
| # Specific constraints for A100 on GKE COS training workloads. | ||
| # A100 has no IMEX/NVLink ComputeDomain requirement, so the recipe keeps | ||
| # the GKE COS training baseline rather than the H100 1.32 floor. | ||
| constraints: | ||
| - name: K8s.server.version | ||
| value: ">= 1.30" | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| | ||
|
Comment on lines +20 to +36
Add explicit …
As per coding guidelines: “Recipe overlays must specify base, mixins, criteria, and constraints.”
Source: Coding guidelines |
||
| componentRefs: | ||
| # A100-specific GPU Operator overrides (inherits valuesFile from gke-cos-training). | ||
| # | ||
| # Nodewright tuning is intentionally omitted. The nvidia-tuning-gke package | ||
| # ships baked-in profiles only for gke-h100 / gke-b200; there is no A100 | ||
| # target. The EKS nvidia-tuned "generic" profile is not a fallback here: it | ||
| # applies reboot/bootloader changes, but GKE COS is immutable and | ||
| # tuning-gke.yaml deliberately limits itself to non-disruptive tuning. The | ||
| # nodewright-operator itself is still inherited from gke-cos. | ||
| # | ||
| # gke-nccl-tcpxo is also omitted: GPUDirect-TCPXO targets H100 a3-mega | ||
| # nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. | ||
| - name: gpu-operator | ||
| type: Helm | ||
| dependencyRefs: | ||
| - nfd | ||
| - cert-manager | ||
| - kube-prometheus-stack | ||
| overrides: | ||
| cdi: | ||
| enabled: true | ||
| | ||
| - name: nfd | ||
| type: Helm | ||
| overrides: | ||
| topologyUpdater: | ||
| enable: true | ||
| | ||
| # Validation checks for A100 on GKE COS training workloads. | ||
| # Defined at the intent layer (not OS-specific) so all variants inherit them. | ||
| # | ||
| # The deployment-phase floor (4 standard checks + gpu-operator version pin) | ||
| # is contributed by the a100-any cross-cutting overlay and is not duplicated | ||
| # here. | ||
| # | ||
| # Performance gating is intentionally omitted until an empirical A100-on-GKE | ||
| # NCCL baseline is established. The H100 GKE training overlay pins an absolute | ||
| # nccl-all-reduce-bw floor (>= 250) calibrated on 8-GPU H100 NVLink nodes; | ||
| # that value is neither fabric-class aware (https://github.com/NVIDIA/aicr/issues/1256) | ||
| # nor valid for A100, so carrying it would only false-fail healthy runs. | ||
| validation: | ||
| conformance: | ||
| checks: | ||
| - platform-health | ||
| - gpu-operator-health | ||
| - dra-support | ||
| - accelerator-metrics | ||
| - ai-service-metrics | ||
| - gang-scheduling | ||
| - pod-autoscaling | ||
| - cluster-autoscaling | ||
| - robust-controller | ||
| - secure-accelerator-access | ||
Define NCCL at first mention.
Line 3 introduces NCCL without expansion. Expand once (e.g., “NVIDIA Collective Communications Library (NCCL)”) at first use.
As per coding guidelines: “Define acronyms on first use in documentation.”
Source: Coding guidelines
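The `a100-gke-cos-training` overlay leaves its deployment-phase floor to `a100-any`, whose header comment describes a per-field union merge: additional checks from a leaf are appended, and same-name constraints from the leaf win. A minimal Python sketch of those semantics follows. The function `merge_phase` is a hypothetical illustration, not the code in `pkg/recipe/metadata.go`, and the leaf block (including the check name `example-extra-check` and the `>= v25.3.0` value) is invented for the example; no overlay in this PR declares it.

```python
def merge_phase(floor: dict, leaf: dict) -> dict:
    """Union-merge one validation phase: floor entries first, then the leaf's additions."""
    checks = list(floor.get("checks", []))
    for check in leaf.get("checks", []):
        if check not in checks:
            checks.append(check)  # additional checks are appended, never dropped

    # Index constraints by name so a same-name leaf constraint replaces the floor's.
    by_name = {c["name"]: c for c in floor.get("constraints", [])}
    for constraint in leaf.get("constraints", []):
        by_name[constraint["name"]] = constraint
    return {"checks": checks, "constraints": list(by_name.values())}


floor = {  # deployment block from the a100-any overlay
    "checks": [
        "operator-health", "expected-resources",
        "gpu-operator-version", "check-nvidia-smi",
    ],
    "constraints": [
        {"name": "Deployment.gpu-operator.version", "value": ">= v24.6.0"},
    ],
}
leaf = {  # hypothetical leaf that tightens the pin and adds one check
    "checks": ["example-extra-check"],
    "constraints": [
        {"name": "Deployment.gpu-operator.version", "value": ">= v25.3.0"},
    ],
}

merged = merge_phase(floor, leaf)
print(merged["checks"])
print(merged["constraints"])
```

With an empty leaf the floor passes through unchanged, which matches the case in this PR where the training overlay declares only `conformance` checks and inherits the deployment phase as is.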