ci(kwok): add flux-git lane with in-cluster gitea #1290
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
8864e41 to
5188632
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID:
📒 Files selected for processing (3)
📝 Walkthrough
This PR implements a new KWOK flux-git deployer lane that installs an in-cluster Gitea, generates filesystem bundles as Git repositories, pushes them to Gitea, and deploys via Flux GitRepository + Kustomization wrappers. It adds install and diagnostic logic for Gitea, extends runner and CI matrices to include flux-git, and updates validate-scheduling.sh to support bundle generation, deployment, cleanup, and chainsaw-based sync waiting for flux-git. It also fixes Flux sync gates by replacing existence-only asserts with error-polarity polling to enforce "all resources converged", and aligns NVIDIA DRA driver naming by removing the problematic nameOverride and updating health checks, validators, tests, and documentation to use the chart-rendered dra-driver-nvidia-gpu-* names.
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related issues
Suggested reviewers
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
mchmarny left a comment
Thanks @haarchri — this is a lot of careful work landing in one PR, and the bundling makes sense (#1288 and #1289 only surfaced because the new lane existed). ADR-010 is comprehensive, the distinct exit codes 70/71/72 follow the existing pattern, and the DRA name rollback is documented in 5 places that cross-reference each other so it's hard to drift.
One actionable: add the upstream kubernetes-sigs/dra-driver-nvidia-gpu issue link in values.yaml so "revert when the chart fix ships" has a closeable trigger. Two nits inline. Nothing blocking.
The workload rename (nvidia-dra-driver-gpu-* → dra-driver-nvidia-gpu-*) is a behavior change anyone selecting on the old names will feel — your rollout note already covers it, just flagging for visibility.
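For anyone auditing dashboards or scripts for the old names, the rename can be sketched as a simple prefix swap. This is only an illustration: the helper name `migrate_workload_name` and the `kubelet-plugin` suffix in the usage note are hypothetical, and the only thing taken from this PR is the `nvidia-dra-driver-gpu-*` → `dra-driver-nvidia-gpu-*` mapping.

```python
# Prefixes taken from the rename described in this PR.
OLD_PREFIX = "nvidia-dra-driver-gpu-"
NEW_PREFIX = "dra-driver-nvidia-gpu-"


def migrate_workload_name(name: str) -> str:
    """Map an old workload name to its chart-rendered equivalent.

    Hypothetical helper: names that do not start with the old prefix
    are returned unchanged.
    """
    if name.startswith(OLD_PREFIX):
        return NEW_PREFIX + name[len(OLD_PREFIX):]
    return name
```

Under those assumptions, `migrate_workload_name("nvidia-dra-driver-gpu-kubelet-plugin")` would yield `dra-driver-nvidia-gpu-kubelet-plugin`, while unrelated names pass through untouched.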
namespace: flux-system
status:
((conditions[?type=='Ready']|[0].status == 'True') || ((conditions[?type=='Released']|[0].status == 'True') && (history|[0].status == 'deployed') && (conditions[?type=='Stalled']|[0].status != 'True'))): true
((conditions[?type=='Ready']|[0].status != 'True') && ((conditions[?type=='Released']|[0].status != 'True') || (history|[0].status != 'deployed') || (conditions[?type=='Stalled']|[0].status == 'True'))): true
The existence + error-polarity pair is exactly the right fix, and the WHY comment ("1 Ready + 12 dependency-blocked HRs passed instantly") is the kind of context that saves the next person hours. Quickly verified the De Morgan'd predicate against the positive form — matches. Nice catch surfacing this from the flux-git lane.
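The De Morgan check mentioned above can be reproduced with a small truth-table sketch. It models the four status inputs from the two predicates as plain booleans, which is a simplification of the JMESPath expressions (a missing condition is assumed to behave as `False`), and the function names `converged`, `not_converged`, `exists_gate`, and `all_gate` are illustrative, not names from the repository.

```python
from itertools import product


def converged(ready, released, deployed, stalled):
    """Positive form: Ready, or Released and deployed and not Stalled."""
    return ready or (released and deployed and not stalled)


def not_converged(ready, released, deployed, stalled):
    """Error-polarity form: the De Morgan'd negation of `converged`."""
    return (not ready) and ((not released) or (not deployed) or stalled)


# The two predicates should be exact complements for all 16 combinations.
complements = all(
    converged(*combo) != not_converged(*combo)
    for combo in product([False, True], repeat=4)
)


def exists_gate(resources):
    """Existence-only semantics: passes if any one resource converged."""
    return any(converged(*r) for r in resources)


def all_gate(resources):
    """Existence + error polarity: at least one resource, none unconverged."""
    return bool(resources) and not any(not_converged(*r) for r in resources)


# The "1 Ready + 12 dependency-blocked HRs" case from the WHY comment.
fleet = [(True, True, True, False)] + [(False, False, False, False)] * 12

print(complements)         # True
print(exists_gate(fleet))  # True  (the old gate passed instantly)
print(all_gate(fleet))     # False (the new gate keeps waiting)
```

The last two lines show why existence-only asserts were insufficient: a single Ready resource satisfies the old gate even though twelve others have not converged.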
♻️ Duplicate comments (1)
recipes/components/nvidia-dra-driver-gpu/values.yaml (1)
59-59: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add the upstream issue link to unblock restoration.
The comment references an upstream bug but provides no issue number or link. Without a tracking reference, the "restore the override once a fixed chart ships" guidance on lines 60-61 is unactionable — nobody will know when the fix is available.
📎 Suggested fix
- kubernetes-sigs/dra-driver-nvidia-gpu; restore the override (and the
+ kubernetes-sigs/dra-driver-nvidia-gpu#<issue-number>; restore the override (and the
  nvidia-dra-driver-gpu-* workload names below) once a fixed chart
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/components/nvidia-dra-driver-gpu/values.yaml` at line 59, The comment that says "bundle of this component. Upstream bug filed against" and the follow-up guidance to "restore the override once a fixed chart ships" lacks an upstream issue reference; update that comment in values.yaml to include the upstream issue URL or issue/PR number (or a link to the vendor's issue tracker) so anyone can track the fix, and if the exact issue is unknown add a TODO with a clear placeholder like "UPSTREAM-ISSUE: <url-or-number>" so the restore action is actionable and discoverable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6295b579-ea28-4986-a7fb-7fa55ee74861
📒 Files selected for processing (21)
.github/actions/kwok-test/action.yml
.github/workflows/kwok-recipes.yaml
.settings.yaml
Makefile
docs/contributor/inference-perf-fluctuation.md
docs/contributor/tests.md
docs/contributor/validator.md
docs/design/008-kwok-deployer-matrix.md
docs/design/010-kwok-git-source-lanes.md
kwok/README.md
kwok/kind-config.yaml
kwok/scripts/install-infra.sh
kwok/scripts/run-all-recipes.sh
kwok/scripts/validate-scheduling.sh
recipes/checks/nvidia-dra-driver-gpu/health-check.yaml
recipes/components/nvidia-dra-driver-gpu/values.yaml
tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
tests/chainsaw/kwok/flux-git-sync/chainsaw-test.yaml
tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml
validators/conformance/dra_support_check.go
validators/deployment/expected_resources_test.go
Looks like the root cause of the OKE flux failures (all 6 OKE flux-oci / flux-git cells) has a one-line fix.
Symptom (from helm-controller logs on
This is exactly the upstream duplicate-label bug this PR is fixing, but
# See values.yaml for why both nameOverride and fullnameOverride are pinned.
nameOverride: nvidia-dra-driver-gpu # ← needs to go
fullnameOverride: nvidia-dra-driver-gpu
OKE recipes merge
Why only OKE-flux fails:
Cascade:
Fix: delete the
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
@mchmarny I see upstream folks are working on the label issue in kubernetes-sigs/dra-driver-nvidia-gpu#1187. Do we want to wait for a v0.4.1 or RC release and pin to that?
Summary
Adds the `flux-git` KWOK deployer lane: an in-cluster Gitea receives the filesystem bundle via `git push`, and Flux reconciles it from a `GitRepository` source, closing the Git-source round-trip gap in the deployer matrix. Along the way this fixes the chainsaw sync-gate exists-semantics bug (#1288) and the `nvidia-dra-driver-gpu` duplicate-label install failure under Flux (#1289), both of which the new lane exposed.
Motivation / Context
The KWOK matrix only validated OCI round-trips (`argocd-oci`, `argocd-helm-oci`, `flux-oci`; see ADR-008 / PR #956). The flux deployer's Git output shape (`sources/gitrepo-*.yaml` GitRepository CRs, `branch: main` default, GitRepository `sourceRef` on local-chart HelmReleases) had zero end-to-end coverage, so regressions in it would ship undetected. The first live run of the new lane immediately surfaced two such masked bugs, validating the premise. Design rationale in ADR-010 (`docs/design/010-kwok-git-source-lanes.md`).
Fixes: #1288, #1289
Related: #963 (flux half; `argocd-git` follows on the same Gitea infra), #843, #1285, kubernetes-sigs/dra-driver-nvidia-gpu#1184
Type of Change
Component(s) Affected
cmd/aicr, pkg/cli)
cmd/aicrd, pkg/server)
pkg/recipe)
pkg/bundler, pkg/component/*)
pkg/collector, pkg/snapshotter)
pkg/validator)
pkg/errors, pkg/k8s)
docs/, examples/)
kwok/, tests/chainsaw/kwok), CI workflows (.github/), conformance validators (validators/), component values (recipes/components, recipes/checks)
Implementation Notes
Testing
Risk Assessment
Rollout notes:
kind delete cluster --name aicr-kwok-test) to pick up the 3300→30300 port mapping; install-infra.sh exit 71 is the telltale.
nvidia-dra-driver-gpu-* → dra-driver-nvidia-gpu-*) for fresh installs; anyone selecting on the old names (dashboards, scripts) must update. Revert planned once the upstream chart fix ships.
Checklist
make test with -race)
make lint)
git commit -S), GPG signing info