feat(infra): add layered ci validation for k8s infra (L1-L5 + L7) #3546
manamana32321 wants to merge 2 commits into
Catches prometheus-operator silent reconcile failures before merge by booting a kind cluster, installing kube-prometheus-stack with prod values, and asserting that:
- prometheus-operator log has no `failed to (initialize|provision|sync)`
- alertmanager-generated secret reflects our config (discord_configs + inhibit_rules present)

This class of failure (e.g. operator vendored alertmanager library lag on `webhook_url_file`) is invisible to helm template, kubeconform, and even amtool check-config because it depends on operator runtime reconciliation. Static analysis cannot catch it; only an actual install into a sandbox cluster can.

Verified locally that this workflow catches the exact silent fail that escaped review on the previous alertmanager wiring PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
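The two assertions above can be sketched as small shell helpers. This is a minimal sketch, not the workflow's actual steps: the function names are hypothetical, and both helpers read text on stdin so they stay independent of how the log and the decoded secret are fetched from the cluster.

```shell
# Sketch only: function names are illustrative assumptions, not taken from the PR.

# Reads operator log text on stdin.
# Fails if any line matches the reconcile-failure pattern from the PR description.
assert_no_reconcile_failures() {
  if grep -E 'failed to (initialize|provision|sync)' >&2; then
    echo "prometheus-operator reported a reconcile failure" >&2
    return 1
  fi
  return 0
}

# Reads the rendered alertmanager config on stdin.
# Fails if one of the keys we expect from our own config is missing.
assert_config_rendered() {
  rendered_config="$(cat)"
  for required_key in discord_configs inhibit_rules; do
    if ! printf '%s\n' "$rendered_config" | grep -q "$required_key"; then
      echo "generated config is missing: $required_key" >&2
      return 1
    fi
  done
  return 0
}
```

In a job, the first helper would presumably receive the output of `kubectl logs` for the operator deployment, and the second the decoded contents of the generated secret; the exact resource names depend on the Helm release and are not shown here.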
Note: Gemini is unable to generate a review for this pull request because the file types involved are not currently supported.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 78a39ea65e
    monitoring:
      - 'infra/k8s/monitoring/**'
      - '.github/workflows/ci-infra.yml'
Include Prometheus ApplicationSet changes in filter
This filter only watches infra/k8s/monitoring/**, but the live monitoring deployment inputs also change in infra/k8s/argocd/applications/monitoring/prometheus.yaml (for example targetRevision and valueFiles). A PR that updates that ApplicationSet can change the operator/chart behavior without running this new e2e job, so the guard can be bypassed exactly when chart-level regressions are introduced.
    # - ingress: depends on cert-manager + school DNS → disabled in CI
    # - storageSpec: avoids a conflict with kind's local-path
    helm install prom prometheus-community/kube-prometheus-stack \
      --version 79.5.0 \
Avoid pinning a duplicate chart version in CI
The workflow hardcodes --version 79.5.0, which duplicates the chart revision tracked in infra/k8s/argocd/applications/monitoring/prometheus.yaml. When ArgoCD targetRevision is bumped, this e2e job will still validate the old chart and can return a false green result while production deploys a different operator/schema combination.
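One way to avoid the duplicate pin is to read the chart version from the ArgoCD manifest at job time. The helper below is a minimal sketch under a loud assumption: the manifest contains a single relevant `targetRevision:` line, which the PR does not show. The function name is hypothetical; for a multi-source manifest a real YAML parser would be the safer choice.

```shell
# Sketch only: assumes the first `targetRevision:` line in the manifest is the
# kube-prometheus-stack chart version. The function name is illustrative.
chart_version_from_manifest() {
  # Strip leading indentation / list dashes and the key, keep the value,
  # then drop optional quotes around it.
  sed -n 's/^[[:space:]-]*targetRevision:[[:space:]]*//p' "$1" \
    | head -n 1 \
    | tr -d "\"'"
}

# Possible use in the e2e job (path taken from the review comment):
#   version="$(chart_version_from_manifest infra/k8s/argocd/applications/monitoring/prometheus.yaml)"
#   helm install prom prometheus-community/kube-prometheus-stack --version "$version" ...
```

With this, bumping `targetRevision` in the ApplicationSet would change the version the e2e job installs, so the two cannot drift apart.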
Adds a layered validation stack to ci-infra.yml that catches different classes of bugs at different cost tiers:
- L1 yaml-lint (blocking): basic YAML syntax
- L2 kustomize-build (blocking): overlay rendering, missing resources
- L3 helm-template (warn-only): Go template + required values
- L4 kubeconform (warn-only): CRD schema strict validation
- L5 component-dryrun (warn-only): amtool/promtool/promtail per-tool
- L7 monitoring-e2e (blocking): kind cluster + operator log assertion

Hybrid blocking strategy: cheap layers (L1, L2) and the catch-all (L7) block merge from day 1; semi-noisy layers (L3-L5) start as warn-only to surface existing baseline violations without immediately blocking unrelated PRs. Promote to blocking after baseline cleanup PRs land.

L5 amtool version is intentionally pinned to alertmanager 0.29.0 to match what prometheus-operator vendors. Mismatch causes false negatives of the exact silent-fail class this PR is meant to catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
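The hybrid blocking strategy can be illustrated with a small wrapper. This is a sketch rather than the workflow's implementation: `run_layer` and its mode names are hypothetical, and in GitHub Actions the same effect is often achieved with `continue-on-error: true` on the warn-only jobs. The `::warning::` and `::error::` lines are GitHub Actions workflow commands that create annotations.

```shell
# Sketch only: run_layer and the mode names are illustrative assumptions.
# Usage: run_layer NAME MODE CMD [ARGS...]   where MODE is "blocking" or "warn-only"
run_layer() {
  layer_name="$1"
  layer_mode="$2"
  shift 2

  # Run the layer's command; success needs no further handling.
  if "$@"; then
    return 0
  fi

  if [ "$layer_mode" = "warn-only" ]; then
    # Surface the failure as an annotation but keep the job green.
    echo "::warning::${layer_name} failed (warn-only, not blocking merge)"
    return 0
  fi

  echo "::error::${layer_name} failed (blocking)"
  return 1
}
```

Promoting a layer after its baseline cleanup then amounts to changing its mode from `warn-only` to `blocking`.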
Description
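Adds a layered validation stack for K8s infra changes to ci-infra.yml.
Hybrid blocking strategy (why)
The cheap layers (L1, L2) and the catch-all (L7) are blocking from the start. The semi-noisy layers L3-L5 stay warn-only, for visibility only, until the baseline debt is cleaned up, so that they do not unintentionally block other PRs. They are promoted to blocking after the cleanup PRs land.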
Unique class that L7 (E2E) catches
Blocks prometheus-operator silent reconcile failures before the PR is merged:
- no `failed to (initialize|provision|sync)` in the prometheus-operator log

When #3543 was merged, a silent reconcile failure occurred in alertmanager (`field webhook_url_file not found in type alertmanager.discordConfig`). This class:
- is not caught even by L4 (kubeconform): `Alertmanager.spec.config` is an inline string blob.

The schema lag between the operator's vendored library and upstream cannot be found by static analysis; only reconcile tells the truth. L7 runs that reconcile ahead of time in a sandbox.
Additional context
L5 amtool version note
The amtool in L5 is explicitly pinned to alertmanager 0.29.0. It must exactly match the alertmanager library version that prometheus-operator vendors. On a mismatch, a silent fail like #3543 is not caught (false negative).
On future chart upgrades, sync the amtool version as well (adding it to renovate.json is recommended).
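A guard against version drift could look like the sketch below. It is illustrative only: the function names are hypothetical, and the parser assumes `amtool --version` prints a first line of the form `amtool, version X.Y.Z (...)`, which should be verified against the installed binary. How the expected version is derived from the operator's vendored library is left open here.

```shell
# Sketch only: function names and the assumed `amtool --version` output
# format ("amtool, version X.Y.Z (...)") are assumptions to verify locally.

# Reads version output on stdin and prints just the X.Y.Z part.
parse_amtool_version() {
  sed -n 's/^amtool, version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage: amtool --version | assert_amtool_version 0.29.0
assert_amtool_version() {
  expected_version="$1"
  actual_version="$(parse_amtool_version)"
  if [ "$actual_version" != "$expected_version" ]; then
    echo "amtool version mismatch: expected ${expected_version}, got ${actual_version:-unknown}" >&2
    return 1
  fi
  return 0
}
```

Running this check before the L5 dry-run would make a stale amtool pin fail loudly instead of producing a false negative.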
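Chicken-and-egg notice
The L7 E2E introduced by this PR catches the existing alertmanager silent fail on main, so this PR fails the CI it introduces. This is intended behavior and demonstrates the exact case the PR is meant to protect against.
Resolution: stack the follow-up fix PR (AlertmanagerConfig CRD approach) on top of this PR and merge them together. Alternatively, merge this PR with an admin override and follow up quickly with the fix PR.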
Baseline cleanup work (separate PRs expected)
Violations that L3-L5 will surface as warn-only:
Each is handled in its own cleanup PR, then promoted from warn to blocking.
Future extensions
Before submitting the PR, please make sure you do the following
🤖 Generated with Claude Code