Skip to content

ci(kwok): add flux-git lane with in-cluster gitea#1290

Open
haarchri wants to merge 2 commits into
NVIDIA:mainfrom
haarchri:feature/kwok-flux-git-lane
Open

ci(kwok): add flux-git lane with in-cluster gitea#1290
haarchri wants to merge 2 commits into
NVIDIA:mainfrom
haarchri:feature/kwok-flux-git-lane

Conversation

@haarchri

@haarchri haarchri commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds the flux-git KWOK deployer lane: an in-cluster Gitea receives the filesystem bundle via git push, and Flux reconciles it from a GitRepository source — closing the Git-source round-trip gap in the deployer matrix. Along the way this fixes the chainsaw sync-gate exists-semantics bug (#1288) and the nvidia-dra-driver-gpu duplicate-label install failure under Flux (#1289), both of which the new lane exposed.

Motivation / Context

The KWOK matrix only validated OCI round-trips (argocd-oci, argocd-helm-oci, flux-oci — ADR-008 / PR #956). The flux deployer's Git output shape (sources/gitrepo-*.yaml GitRepository CRs, branch: main default, GitRepository sourceRef on local-chart HelmReleases) had zero end-to-end coverage, so regressions in it would ship undetected. The first live run of the new lane immediately surfaced two such masked bugs — validating the premise. Design rationale in ADR-010 (docs/design/010-kwok-git-source-lanes.md).

Fixes: #1288, #1289
Related: #963 (flux half; argocd-git follows on the same Gitea infra), #843, #1285, kubernetes-sigs/dra-driver-nvidia-gpu#1184

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: KWOK test infra (kwok/, tests/chainsaw/kwok), CI workflows (.github/), conformance validators (validators/), component values (recipes/components, recipes/checks)

Implementation Notes

Testing

unset GITLAB_TOKEN
make qualify                                          # PASS ("Codebase qualification completed")
golangci-lint run -c .golangci.yaml ./validators/...  # 0 issues
go test -race ./validators/deployment/... ./validators/conformance/...  # ok
make bom-docs                                         # regenerated, no diff (values change is name-only)

# Lane end-to-end on a fresh Kind cluster (new 3300 port mapping):
kind delete cluster --name aicr-kwok-test
make kwok-test-deployer RECIPE=eks-training DEPLOYER=flux-git
kubectl get gitrepo -A
NAMESPACE     NAME                                                           URL                                                                       AGE     READY   STATUS
flux-system   aicr-bundle                                                    http://gitea.aicr-registry.svc.cluster.local:3000/aicr/eks-training.git   3m30s   True    stored artifact for revision 'main@sha1:5cfc60766d957eca628620afde8b6f27618fcecb'
flux-system   gitea-aicr-registry-svc-cluster-local-3000-aicr-eks-training   http://gitea.aicr-registry.svc.cluster.local:3000/aicr/eks-training.git   3m29s   True    stored artifact for revision 'main@sha1:5cfc60766d957eca628620afde8b6f27618fcecb'

[...]
INFO] ========================================
[INFO] Results
[INFO] ========================================
  ✓ eks-training
[INFO] Cleaning up for next test...
node "gpu-0" force deleted
node "gpu-1" force deleted
node "gpu-2" force deleted
node "gpu-3" force deleted
node "system-0" force deleted
node "system-1" force deleted
No resources found
[INFO] All 1 recipe(s) passed!
[INFO] Cluster preserved. Delete with: kind delete cluster --name aicr-kwok-test

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

  • Local KWOK clusters created before this PR must be recreated (kind delete cluster --name aicr-kwok-test) to pick up the 3300→30300 port mapping; install-infra.sh exit 71 is the telltale.
  • DRA driver workload names change (nvidia-dra-driver-gpu-*dra-driver-nvidia-gpu-*) for fresh installs; anyone selecting on the old names (dashboards, scripts) must update. Revert planned once the upstream chart fix ships.
  • CI matrix grows: Tier 1 +1 cell per generic overlay, Tier 3 +1 deployer per recipe (batching stays under GitHub's 256-config cap).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@haarchri haarchri requested review from a team as code owners June 10, 2026 12:22
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4ad53cf3-764d-45ec-a063-fca4e0ff8f97

📥 Commits

Reviewing files that changed from the base of the PR and between 5188632 and 807e52b.

📒 Files selected for processing (3)
  • recipes/components/nvidia-dra-driver-gpu/values-oke.yaml
  • recipes/components/nvidia-dra-driver-gpu/values.yaml
  • validators/deployment/expected_resources_test.go

📝 Walkthrough

Walkthrough

This PR implements a new KWOK flux-git deployer lane that installs an in-cluster Gitea, generates filesystem bundles as Git repositories, pushes them to Gitea, and deploys via Flux GitRepository + Kustomization wrappers. It adds install and diagnostic logic for Gitea, extends runner and CI matrices to include flux-git, and updates validate-scheduling.sh to support bundle generation, deployment, cleanup, and chainsaw-based sync waiting for flux-git. It also fixes Flux sync gates by replacing existence-only asserts with error-polarity polling to enforce "all resources converged", and aligns NVIDIA DRA driver naming by removing problematic nameOverride and updating health checks, validators, tests, and documentation to use chart-rendered dra-driver-nvidia-gpu-* names.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • #1289: Implements the DRA chart naming correction (removing nameOverride and updating validators/tests/docs).
  • #963: Implements the flux-git KWOK lane with in-cluster Gitea and flux-git sync tests.
  • #1172: Touches deployer-list locations (action inputs and workflow matrices) which relate to the deployer-list consolidation discussed.

Suggested reviewers

  • mchmarny
  • njhensley
  • xdu31
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly summarizes the main change: adding a flux-git lane with in-cluster Gitea to the KWOK test infrastructure.
Description check ✅ Passed The PR description comprehensively explains the motivations, implementation, testing, and rollout considerations related to the flux-git lane addition.
Linked Issues check ✅ Passed The PR successfully addresses issue #1288 by fixing chainsaw sync-gate exists-semantics bugs. The inverted-error-polarity changes in tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml and tests/chainsaw/kwok/flux-git-sync/chainsaw-test.yaml implement the required all-resources-converged validation.
Out of Scope Changes check ✅ Passed All code changes align with the stated objectives: KWOK flux-git lane infrastructure, chainsaw sync-gate bug fixes, nvidia-dra-driver-gpu name corrections, and supporting documentation updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haarchri — this is a lot of careful work landing in one PR, and the bundling makes sense (#1288 and #1289 only surfaced because the new lane existed). ADR-010 is comprehensive, the distinct exit codes 70/71/72 follow the existing pattern, and the DRA name rollback is documented in 5 places that cross-reference each other so it's hard to drift.

One actionable: add the upstream kubernetes-sigs/dra-driver-nvidia-gpu issue link in values.yaml so "revert when the chart fix ships" has a closeable trigger. Two nits inline. Nothing blocking.

The workload rename (nvidia-dra-driver-gpu-*dra-driver-nvidia-gpu-*) is a behavior change anyone selecting on the old names will feel — your rollout note already covers it, just flagging for visibility.

Comment thread recipes/components/nvidia-dra-driver-gpu/values.yaml Outdated
Comment thread validators/deployment/expected_resources_test.go
namespace: flux-system
status:
((conditions[?type=='Ready']|[0].status == 'True') || ((conditions[?type=='Released']|[0].status == 'True') && (history|[0].status == 'deployed') && (conditions[?type=='Stalled']|[0].status != 'True'))): true
((conditions[?type=='Ready']|[0].status != 'True') && ((conditions[?type=='Released']|[0].status != 'True') || (history|[0].status != 'deployed') || (conditions[?type=='Stalled']|[0].status == 'True'))): true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existence + error-polarity pair is exactly the right fix, and the WHY comment ("1 Ready + 12 dependency-blocked HRs passed instantly") is the kind of context that saves the next person hours. Quickly verified the De Morgan'd predicate against the positive form — matches. Nice catch surfacing this from the flux-git lane.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
recipes/components/nvidia-dra-driver-gpu/values.yaml (1)

59-59: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the upstream issue link to unblock restoration.

The comment references an upstream bug but provides no issue number or link. Without a tracking reference, the "restore the override once a fixed chart ships" guidance on lines 60-61 is unactionable — nobody will know when the fix is available.

📎 Suggested fix
-  kubernetes-sigs/dra-driver-nvidia-gpu; restore the override (and the
+  kubernetes-sigs/dra-driver-nvidia-gpu#<issue-number>; restore the override (and the
   nvidia-dra-driver-gpu-* workload names below) once a fixed chart
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/components/nvidia-dra-driver-gpu/values.yaml` at line 59, The comment
that says "bundle of this component. Upstream bug filed against" and the
follow-up guidance to "restore the override once a fixed chart ships" lacks an
upstream issue reference; update that comment in values.yaml to include the
upstream issue URL or issue/PR number (or a link to the vendor's issue tracker)
so anyone can track the fix, and if the exact issue is unknown add a TODO with a
clear placeholder like "UPSTREAM-ISSUE: <url-or-number>" so the restore action
is actionable and discoverable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Line 59: The comment that says "bundle of this component. Upstream bug filed
against" and the follow-up guidance to "restore the override once a fixed chart
ships" lacks an upstream issue reference; update that comment in values.yaml to
include the upstream issue URL or issue/PR number (or a link to the vendor's
issue tracker) so anyone can track the fix, and if the exact issue is unknown
add a TODO with a clear placeholder like "UPSTREAM-ISSUE: <url-or-number>" so
the restore action is actionable and discoverable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6295b579-ea28-4986-a7fb-7fa55ee74861

📥 Commits

Reviewing files that changed from the base of the PR and between 7bb7059 and 5188632.

📒 Files selected for processing (21)
  • .github/actions/kwok-test/action.yml
  • .github/workflows/kwok-recipes.yaml
  • .settings.yaml
  • Makefile
  • docs/contributor/inference-perf-fluctuation.md
  • docs/contributor/tests.md
  • docs/contributor/validator.md
  • docs/design/008-kwok-deployer-matrix.md
  • docs/design/010-kwok-git-source-lanes.md
  • kwok/README.md
  • kwok/kind-config.yaml
  • kwok/scripts/install-infra.sh
  • kwok/scripts/run-all-recipes.sh
  • kwok/scripts/validate-scheduling.sh
  • recipes/checks/nvidia-dra-driver-gpu/health-check.yaml
  • recipes/components/nvidia-dra-driver-gpu/values.yaml
  • tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
  • tests/chainsaw/kwok/flux-git-sync/chainsaw-test.yaml
  • tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml
  • validators/conformance/dra_support_check.go
  • validators/deployment/expected_resources_test.go

@mchmarny

mchmarny commented Jun 10, 2026

Copy link
Copy Markdown
Member

Looks like the root-caused the OKE flux failures (all 6 OKE flux-oci / flux-git cells) has one-line fix.

Symptom (from helm-controller logs on Tier 1: oke-inference (flux-oci)):

Helm install failed for release nvidia-dra-driver/... with chart dra-driver-nvidia-gpu@0.4.0:
  error while running post render on manifests: yaml: unmarshal errors:
  line 29: mapping key "nvidia-dra-driver-gpu-component" already defined at line 28

This is exactly the upstream duplicate-label bug this PR is fixing — but recipes/components/nvidia-dra-driver-gpu/values-oke.yaml still carries the old nameOverride: nvidia-dra-driver-gpu from #1285 (L23-25 on main):

# See values.yaml for why both nameOverride and fullnameOverride are pinned.
nameOverride: nvidia-dra-driver-gpu          # ← needs to go
fullnameOverride: nvidia-dra-driver-gpu

OKE recipes merge values-oke.yaml on top of values.yaml, so the override comes back, and Flux's strict YAML parser rejects the rendered pod template.

Why only OKE-flux fails:

  • helm / argocd-* parse YAML leniently and silently accept the duplicate (so they pass)
  • Flux's helm-controller post-renderer is strict (so it fails)
  • Other overlays don't include values-oke.yaml, so flux-* there is fine

Cascade: nvsentinel and prometheus-adapter HelmReleases fail with dependency 'nvidia-dra-driver-gpu' is not ready — same root cause via the dependsOn chain.

Fix: delete the nameOverride line in values-oke.yaml (keep fullnameOverride) and update the comment header to match the new rationale in values.yaml. The 5 cross-referencing comments elsewhere already imply OKE doesn't deviate, so no other changes needed.

Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
@haarchri

Copy link
Copy Markdown
Contributor Author

@mchmarny seeing upstream folks working on the issue regarding the label kubernetes-sigs/dra-driver-nvidia-gpu#1187 - we want to wait for a v0.4.1 or RC release and pin this ?

@haarchri haarchri requested a review from mchmarny June 11, 2026 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: ci(kwok): chainsaw sync gates pass when ANY resource converges, not ALL

2 participants