feat(infra): add layered ci validation for k8s infra (L1-L5 + L7) by manamana32321 · Pull Request #3546 · skkuding/codedang

manamana32321 · 2026-04-27T06:55:07Z

Description

Adds a layered validation stack for K8s infrastructure changes to ci-infra.yml.

Layer	Job	Mode	Catches	Cost
L1	yaml-lint	blocking	YAML syntax	<5s
L2	kustomize-build	blocking	overlay build, missing resources	<30s
L3	helm-template	warn-only	Go templates, required values	<30s
L4	kubeconform	warn-only	strict CRD schema	<30s
L5	component-dryrun	warn-only	per-tool checks (amtool/promtool/promtail)	~1min
L7	monitoring-e2e	blocking	operator vendored-library lag, silent reconcile failures	~5min

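Hybrid blocking strategy (why)

The cheap layers (L1, L2) and the catch-all (L7) are blocking from the start. The semi-noisy layers (L3-L5) stay warn-only until the baseline debt is cleaned up, so they only surface violations and do not unintentionally block unrelated PRs. They are promoted to blocking after the cleanup PRs land.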
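
The blocking versus warn-only split can be sketched as a small wrapper around each layer's command. This is only an illustration and not the contents of ci-infra.yml: the helper name `run_layer` and its arguments are hypothetical. The `::warning::` and `::error::` prefixes are the GitHub Actions workflow commands that turn a log line into an annotation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from ci-infra.yml): run one validation layer in
# either "blocking" or "warn-only" mode.
#   $1 = mode ("blocking" or "warn-only"), $2 = layer name, rest = command
run_layer() {
  local mode="$1" name="$2" status=0
  shift 2

  # Run the layer's command and remember its exit code without aborting.
  "$@" || status=$?

  if [ "$status" -eq 0 ]; then
    echo "[$name] ok"
    return 0
  fi

  if [ "$mode" = "warn-only" ]; then
    # Surface the failure as an annotation but keep the job green.
    echo "::warning::[$name] failed with exit code $status (warn-only)"
    return 0
  fi

  # Blocking layers propagate the failure so the job fails.
  echo "::error::[$name] failed with exit code $status"
  return "$status"
}
```

Promoting a layer from warn-only to blocking then becomes a one-word change at the call site, for example from `run_layer warn-only kubeconform ...` to `run_layer blocking kubeconform ...`.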

The unique failure class that L7 (E2E) catches

Blocks silent reconcile failures in prometheus-operator before a PR is merged:

Boot a kind cluster
Inject a dummy Discord webhook secret (mock)
Install kube-prometheus-stack with the prod values
Assert:
- the prometheus-operator log contains no failed to (initialize|provision|sync)
- the alertmanager-generated secret reflects our config (discord_configs, inhibit_rules)
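
The two assertions can be sketched as shell filters that read text from stdin. This is a minimal sketch under assumptions: the function names are hypothetical, and how the operator log and the decoded config are fetched (resource names, namespace, secret key, any decompression) depends on the release and operator version and is not shown in this PR description.

```shell
#!/usr/bin/env bash
# Hypothetical assertion helpers for the L7 job. Both read from stdin so the
# caller decides how the log or the decoded config is obtained.

# Succeeds when the operator log contains none of the failure phrases
# listed in the PR description.
operator_log_is_clean() {
  ! grep -Eq 'failed to (initialize|provision|sync)'
}

# Succeeds when the decoded Alertmanager config mentions both keys that
# the prod values are expected to produce.
generated_config_has_our_settings() {
  local config
  config="$(cat)"
  printf '%s\n' "$config" | grep -q 'discord_configs' &&
    printf '%s\n' "$config" | grep -q 'inhibit_rules'
}
```

In a workflow step, the operator log would typically be piped in from `kubectl logs` and the config from the base64-decoded generated secret. A substring check like this only shows that the keys are present, not that their values are correct.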

When #3543 was merged, Alertmanager hit a silent reconcile failure (field webhook_url_file not found in type alertmanager.discordConfig). With this class of failure:

ArgoCD: status=Synced+Healthy
helm template / kubeconform / amtool check-config (upstream): all pass
alertmanager pod: Running normally (still serving the previous config)
The only trace of the failure: the prometheus-operator pod log

L4 (kubeconform) cannot catch it either, because Alertmanager.spec.config is an inline string blob. Schema lag between the library vendored by the operator and upstream cannot be detected by static analysis; only an actual reconcile tells the truth. L7 runs that reconcile ahead of time in a sandbox.

Additional context

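L5 amtool version note

The amtool used in L5 is explicitly pinned to alertmanager 0.29.0. It must exactly match the version of the alertmanager library that prometheus-operator vendors. On a mismatch, a silent failure like #3543 goes undetected (false negative).

On future chart upgrades, sync the amtool version as well (adding it to renovate.json is recommended).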
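
A guard against the installed amtool drifting from the pin could look like the sketch below. The function names are hypothetical, and the helper assumes only that the tool's version output contains an `x.y.z` number somewhere. It checks the installed binary against the pin; it does not discover which alertmanager version the operator vendors, which still has to be looked up when the chart is upgraded.

```shell
#!/usr/bin/env bash
# Hypothetical guard: confirm the amtool binary in CI is the pinned version.

# Print the first x.y.z version number found in the text on stdin.
first_semver() {
  grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeeds when the version text on stdin matches the expected pin ($1).
amtool_matches_pin() {
  local expected="$1" actual
  actual="$(first_semver)"
  [ "$actual" = "$expected" ]
}
```

A step could then run `amtool --version 2>&1 | amtool_matches_pin 0.29.0` before the dry-run checks, so a mismatched binary fails loudly instead of producing a false negative.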

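Chicken-and-egg notice

The L7 E2E introduced by this PR catches the existing alertmanager silent failure on main, so this PR fails the CI that it introduces. This is intended behavior and demonstrates the exact case the PR is meant to protect against.

Resolution: the follow-up fix PR (using the AlertmanagerConfig CRD) is stacked on top of this PR and merged together with it. Alternatively, merge this PR with an admin override and follow up quickly with the fix PR.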

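Baseline cleanup work (separate PRs expected)

Violations that L3-L5 will surface as warn-only:

Existing helm template failures
Existing CRD schema violations
Existing alert rule and config problems caught by promtool/amtool

Each is handled in its own cleanup PR, after which the layer is promoted from warn-only to blocking.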

Future extensions

L6 (conftest/OPA): add once Rego rules are defined
Expand L7 E2E scope: grafana / loki / promtail / the entire infra stack

Before submitting the PR, please make sure you do the following

Read the Contributing Guidelines
Read the Contributing Guidelines and follow the Commit Convention
Provide a description in this PR that addresses what the PR is solving
Verified the L7 E2E on a local kind cluster (confirmed it catches the exact silent failure)

🤖 Generated with Claude Code

Catches prometheus-operator silent reconcile failures before merge by booting a kind cluster, installing kube-prometheus-stack with prod values, and asserting that:
- prometheus-operator log has no `failed to (initialize|provision|sync)`
- alertmanager-generated secret reflects our config (discord_configs + inhibit_rules present)

This class of failure (e.g. operator vendored alertmanager library lag on `webhook_url_file`) is invisible to helm template, kubeconform, and even amtool check-config because it depends on operator runtime reconciliation. Static analysis cannot catch it; only an actual install into a sandbox cluster can.

Verified locally that this workflow catches the exact silent fail that escaped review on the previous alertmanager wiring PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-04-27T06:55:13Z

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78a39ea65e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-27T06:58:41Z

+            monitoring:
+              - 'infra/k8s/monitoring/**'
+              - '.github/workflows/ci-infra.yml'


Include Prometheus ApplicationSet changes in filter

This filter only watches infra/k8s/monitoring/**, but the live monitoring deployment inputs also change in infra/k8s/argocd/applications/monitoring/prometheus.yaml (for example targetRevision and valueFiles). A PR that updates that ApplicationSet can change the operator/chart behavior without running this new e2e job, so the guard can be bypassed exactly when chart-level regressions are introduced.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-27T06:58:41Z

+          # - ingress: depends on cert-manager + the school DNS → disabled in CI
+          # - storageSpec: avoid conflicts with kind's local-path
+          helm install prom prometheus-community/kube-prometheus-stack \
+            --version 79.5.0 \


Avoid pinning a duplicate chart version in CI

The workflow hardcodes --version 79.5.0, which duplicates the chart revision tracked in infra/k8s/argocd/applications/monitoring/prometheus.yaml. When ArgoCD targetRevision is bumped, this e2e job will still validate the old chart and can return a false green result while production deploys a different operator/schema combination.

Useful? React with 👍 / 👎.
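
One way to address the duplicated chart version is to read it from the ApplicationSet instead of hardcoding it. The sketch below is not part of the PR: the helper name is hypothetical, and it assumes the file contains a plain `targetRevision: <version>` line and that the first such line is the chart version. If the ApplicationSet declares several sources, a YAML-aware tool would be the safer choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first targetRevision value found in the
# YAML text on stdin, with surrounding quotes and whitespace removed.
first_target_revision() {
  sed -n 's/^[[:space:]-]*targetRevision:[[:space:]]*//p' |
    head -n 1 |
    tr -d "\"'[:space:]"
}

# Possible wiring in the e2e job (left commented out so the sketch has no
# side effects; the path is the one named in the review comment):
#   chart_version="$(first_target_revision \
#     < infra/k8s/argocd/applications/monitoring/prometheus.yaml)"
#   helm install prom prometheus-community/kube-prometheus-stack \
#     --version "$chart_version"
```

With the version derived from a single source, bumping targetRevision in ArgoCD would make the e2e job test the new chart in the same PR.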

Adds a layered validation stack to ci-infra.yml that catches different classes of bugs at different cost tiers:

L1 yaml-lint (blocking): basic YAML syntax
L2 kustomize-build (blocking): overlay rendering, missing resources
L3 helm-template (warn-only): Go template + required values
L4 kubeconform (warn-only): CRD schema strict validation
L5 component-dryrun (warn-only): amtool/promtool/promtail per-tool
L7 monitoring-e2e (blocking): kind cluster + operator log assertion

Hybrid blocking strategy: cheap layers (L1, L2) and the catch-all (L7) block merge from day 1; semi-noisy layers (L3-L5) start as warn-only to surface existing baseline violations without immediately blocking unrelated PRs. Promote to blocking after baseline cleanup PRs land.

L5 amtool version is intentionally pinned to alertmanager 0.29.0 to match what prometheus-operator vendors. Mismatch causes false negatives of the exact silent-fail class this PR is meant to catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


manamana32321 changed the title ~~feat(infra): add monitoring stack e2e validation to ci~~ feat(infra): add layered ci validation for k8s infra (L1-L5 + L7) Apr 27, 2026

manamana32321 marked this pull request as draft May 2, 2026 07:14

manamana32321 wants to merge 2 commits into main from feat/ci-monitoring-e2e
