feat(infra): add layered ci validation for k8s infra (L1-L5 + L7) #3546

Draft
manamana32321 wants to merge 2 commits into main from feat/ci-monitoring-e2e

@manamana32321 (Member) commented Apr 27, 2026

Description

Adds a layered validation stack for K8s infra changes to ci-infra.yml.

| Layer | Job | Mode | Catches | Cost |
|---|---|---|---|---|
| L1 | yaml-lint | blocking | YAML syntax | <5s |
| L2 | kustomize-build | blocking | overlay build, missing resources | <30s |
| L3 | helm-template | warn-only | Go templates, required values | <30s |
| L4 | kubeconform | warn-only | strict CRD schema | <30s |
| L5 | component-dryrun | warn-only | per-tool checks (amtool/promtool/promtail) | ~1min |
| L7 | monitoring-e2e | blocking | operator vendored-library lag, silent reconcile failures | ~5min |

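Hybrid blocking strategy (why)

The cheap layers (L1, L2) and the catch-all (L7) are blocking from the start. The semi-noisy layers L3-L5 stay warn-only until the baseline debt is cleaned up, so that they surface violations without unintentionally blocking other PRs. They are promoted to blocking once the cleanup PRs land.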
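A minimal sketch of what warn-only could look like at the step level. The helper name `run_warn_only` is hypothetical, and the workflow in this PR may rely on a workflow-level setting instead; the sketch only illustrates the idea of reporting a failure as a GitHub Actions warning annotation while still exiting 0.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a check in warn-only mode.
# A failing command is reported as a GitHub Actions warning annotation,
# but the function still returns 0 so unrelated PRs are not blocked.
run_warn_only() {
  local label="$1"
  shift
  if "$@"; then
    echo "${label}: ok"
  else
    local status=$?
    echo "::warning::${label} failed (exit ${status}); warn-only, not blocking"
  fi
  return 0
}
```

Promoting a layer to blocking would then amount to calling the command directly instead of through the wrapper.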

Unique failure class that L7 (E2E) catches

Blocks prometheus-operator silent reconcile failures before the PR merges:

  1. Boot a kind cluster
  2. Inject a dummy Discord webhook secret (mock)
  3. Install kube-prometheus-stack with prod values
  4. Assert:
    • the prometheus-operator log contains no failed to (initialize|provision|sync)
    • the alertmanager-generated secret reflects our config (discord_configs, inhibit_rules)
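The first assertion in step 4 can be sketched as a small filter over the operator log. The function name `assert_no_reconcile_failure` is hypothetical, and how the log is collected (namespace, deployment name) is left to the caller; only the regular expression comes from the description above.

```shell
#!/usr/bin/env bash
# Sketch of the first L7 assertion: fail when the operator log (read from
# stdin) contains a reconcile failure message.
assert_no_reconcile_failure() {
  local hits
  # grep exits non-zero when nothing matches, which is the good case here.
  hits="$(grep -E 'failed to (initialize|provision|sync)' || true)"
  if [ -n "$hits" ]; then
    echo "prometheus-operator reported reconcile failures:" >&2
    echo "$hits" >&2
    return 1
  fi
  echo "operator log clean"
}
```

In a workflow step this would typically sit at the end of a pipe, for example `kubectl logs deploy/<operator-deployment> -n <namespace> | assert_no_reconcile_failure`, where both placeholders depend on the release name and namespace used in CI.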

Merging #3543 caused a silent reconcile failure in alertmanager (field webhook_url_file not found in type alertmanager.discordConfig). For this class of failure:

  • ArgoCD: status=Synced+Healthy
  • helm template / kubeconform / amtool check-config (upstream): all pass
  • alertmanager pod: Running normally (still loading the previous config)
  • The only trace of the failure: the prometheus-operator pod log

L4 (kubeconform) cannot catch it either, because Alertmanager.spec.config is an inline string blob. Schema lag between the operator's vendored library and upstream cannot be detected by static analysis; only an actual reconcile tells the truth. L7 runs that reconcile ahead of time in a sandbox.
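Because the outcome of the reconcile is only visible in what the operator actually generated, the second L7 assertion can be sketched as a key check over the rendered Alertmanager config. The function name `assert_config_has_keys` is hypothetical, and fetching and decoding the generated secret (base64, and possibly gzip depending on the operator version) is assumed to happen before the text reaches stdin.

```shell
#!/usr/bin/env bash
# Sketch of the second L7 assertion: the rendered Alertmanager config (stdin)
# must contain every top-level or nested key passed as an argument.
assert_config_has_keys() {
  local config key missing=0
  config="$(cat)"
  for key in "$@"; do
    # Accept the key at any indentation, optionally as a list item ("- key:").
    if ! printf '%s\n' "$config" | grep -qE "^[[:space:]-]*${key}:"; then
      echo "missing key in generated config: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the config in this PR the call would be `assert_config_has_keys discord_configs inhibit_rules`. If the operator silently kept the previous config, at least one of the keys is absent and the step fails.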

Additional context

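Note on the L5 amtool version

The amtool used in L5 is explicitly pinned to alertmanager 0.29.0. It must exactly match the alertmanager library version that prometheus-operator vendors. On a mismatch, L5 cannot catch silent failures like #3543 (false negative).

On future chart upgrades, sync the amtool version as well (adding it to renovate.json is recommended).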
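Until the pin is automated, a guard step could at least fail loudly when the installed amtool drifts from the pinned version. This is a sketch only: the function name is hypothetical, and it assumes the version text has the usual `amtool, version X.Y.Z (...)` shape printed by `amtool --version`.

```shell
#!/usr/bin/env bash
# Sketch: compare a pinned version against the text printed by `amtool --version`.
# $1 = pinned version (e.g. 0.29.0), $2 = captured version output.
amtool_version_matches() {
  local pinned="$1" version_output="$2"
  case "$version_output" in
    # Trailing space (or end of string) prevents 0.29.0 from matching 0.29.01.
    *"version ${pinned} "* | *"version ${pinned}")
      return 0
      ;;
    *)
      echo "amtool version mismatch: want ${pinned}, got: ${version_output}" >&2
      return 1
      ;;
  esac
}
```

A step would call it as `amtool_version_matches 0.29.0 "$(amtool --version 2>&1)"` before running the L5 checks.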

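Chicken-and-egg notice

The L7 E2E introduced by this PR catches the existing alertmanager silent failure on main, so this PR fails the very CI it introduces. This is intended behavior and demonstrates exactly the case the PR is meant to protect against.

Resolution: stack the follow-up fix PR (the AlertmanagerConfig CRD approach) on top of this PR and merge them together. Alternatively, merge this PR with an admin override and follow up quickly with the fix PR.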

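Baseline cleanup work (expected in separate PRs)

Violations that L3-L5 will surface as warn-only:

  • existing helm template failures
  • existing CRD schema violations
  • existing alert rule / config problems caught by promtool/amtool

Each is handled in its own cleanup PR and then promoted from warn to blocking.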

Future extensions

  • L6 (conftest/OPA): to be added once Rego rules are defined
  • Expand the L7 E2E scope: grafana / loki / promtail / the entire infra stack

🤖 Generated with Claude Code

Catches prometheus-operator silent reconcile failures before merge by
booting a kind cluster, installing kube-prometheus-stack with prod
values, and asserting that:

- prometheus-operator log has no `failed to (initialize|provision|sync)`
- alertmanager-generated secret reflects our config (discord_configs +
  inhibit_rules present)

This class of failure (e.g. operator vendored alertmanager library lag
on `webhook_url_file`) is invisible to helm template, kubeconform, and
even amtool check-config because it depends on operator runtime
reconciliation. Static analysis cannot catch it; only an actual install
into a sandbox cluster can.

Verified locally that this workflow catches the exact silent fail that
escaped review on the previous alertmanager wiring PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist (Contributor)

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78a39ea65e

Comment on lines +118 to +120

  monitoring:
    - 'infra/k8s/monitoring/**'
    - '.github/workflows/ci-infra.yml'

P1: Include Prometheus ApplicationSet changes in filter

This filter only watches infra/k8s/monitoring/**, but the live monitoring deployment inputs also change in infra/k8s/argocd/applications/monitoring/prometheus.yaml (for example targetRevision and valueFiles). A PR that updates that ApplicationSet can change the operator/chart behavior without running this new e2e job, so the guard can be bypassed exactly when chart-level regressions are introduced.

# - ingress: depends on cert-manager + school DNS → disabled in CI
# - storageSpec: avoid conflicts with kind's local-path
helm install prom prometheus-community/kube-prometheus-stack \
  --version 79.5.0 \

P1: Avoid pinning a duplicate chart version in CI

The workflow hardcodes --version 79.5.0, which duplicates the chart revision tracked in infra/k8s/argocd/applications/monitoring/prometheus.yaml. When ArgoCD targetRevision is bumped, this e2e job will still validate the old chart and can return a false green result while production deploys a different operator/schema combination.
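One possible way to address this is to read the chart version from the ArgoCD manifest rather than hardcoding it. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it takes the first `targetRevision:` in the file to be the chart version, which would need adjusting if the ApplicationSet lists several sources.

```shell
#!/usr/bin/env bash
# Sketch: print the first `targetRevision:` value found in an ArgoCD manifest,
# with surrounding quotes stripped. $1 = path to the manifest.
chart_version_from_manifest() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "targetRevision:") { print $(i + 1); exit }
    }
  }' "$1" | tr -d "\"'"
}
```

The install step could then pass `--version "$(chart_version_from_manifest infra/k8s/argocd/applications/monitoring/prometheus.yaml)"`, so a `targetRevision` bump is validated by the same e2e run.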

Adds a layered validation stack to ci-infra.yml that catches different
classes of bugs at different cost tiers:

L1 yaml-lint        (blocking)    — basic YAML syntax
L2 kustomize-build  (blocking)    — overlay rendering, missing resources
L3 helm-template    (warn-only)   — Go template + required values
L4 kubeconform      (warn-only)   — CRD schema strict validation
L5 component-dryrun (warn-only)   — amtool/promtool/promtail per-tool
L7 monitoring-e2e   (blocking)    — kind cluster + operator log assertion

Hybrid blocking strategy: cheap layers (L1, L2) and the catch-all (L7)
block merge from day 1; semi-noisy layers (L3-L5) start as warn-only
to surface existing baseline violations without immediately blocking
unrelated PRs. Promote to blocking after baseline cleanup PRs land.

L5 amtool version is intentionally pinned to alertmanager 0.29.0 to
match what prometheus-operator vendors. Mismatch causes false negatives
of the exact silent-fail class this PR is meant to catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@manamana32321 changed the title from "feat(infra): add monitoring stack e2e validation to ci" to "feat(infra): add layered ci validation for k8s infra (L1-L5 + L7)" on Apr 27, 2026
@manamana32321 marked this pull request as draft on May 2, 2026 07:14