
feat: Add PrometheusRule CRD for Prometheus-to-CloudWatch alarm migration#63

Open
bonclay7 wants to merge 1 commit into
aws-controllers-k8s:mainfrom
bonclay7:feat/cloudwatch-prometheus-rules-crd

Conversation


@bonclay7 bonclay7 commented Apr 8, 2026


Add PrometheusRule CRD for Prometheus-to-CloudWatch alarm migration

Enables Kubernetes users to migrate Prometheus alerting rules to CloudWatch PromQL alarms by changing only the apiVersion/kind and adding a cloudWatch section. No PromQL rewrite required.

What it does

When a PrometheusRule CR is applied, the controller converts each alerting rule into a CloudWatch PromQL alarm via PutMetricAlarm with EvaluationCriteria. Recording rules are detected and skipped. The controller continuously reconciles the desired state and reports per-alarm status back to the CR.
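To make the recording-rule skip concrete, here is a minimal sketch, not the code in this PR. It relies on the Prometheus rule format, where an alerting rule sets `alert` and a recording rule sets `record`. The `Rule` type and the `splitRules` helper are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified view of one entry in spec.groups[].rules.
// In the Prometheus rule format an alerting rule sets Alert and a recording
// rule sets Record.
type Rule struct {
	Alert  string
	Record string
	Expr   string
	For    string
}

// splitRules returns the alerting rules and a count of everything skipped
// (recording rules, or entries that are not clearly alerting rules).
func splitRules(rules []Rule) (alerting []Rule, skipped int) {
	for _, r := range rules {
		if r.Alert != "" && r.Record == "" {
			alerting = append(alerting, r)
			continue
		}
		skipped++
	}
	return alerting, skipped
}

func main() {
	rules := []Rule{
		{
			Alert: "PodCrashLooping",
			Expr:  `max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1`,
			For:   "15m",
		},
		{
			Record: "job:http_requests:rate5m",
			Expr:   `sum by (job) (rate(http_requests_total[5m]))`,
		},
	}
	alerting, skipped := splitRules(rules)
	fmt.Println("alerting:", len(alerting), "skipped:", skipped)
}
```

The skipped count in this sketch corresponds in spirit to the `status.skippedRuleCount` field named in the commit message; how the controller actually populates it is not shown here.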

Example

   apiVersion: cloudwatch.services.k8s.aws/v1alpha1
   kind: PrometheusRule
   metadata:
     name: k8s-infra-alerts
   spec:
     groups:
     - name: pods
       rules:
       - alert: PodCrashLooping
         expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
         for: "15m"
         labels:
           severity: critical
     cloudWatch:
       alarmActions:
         - arn:aws:sns:us-east-1:123456789012:alerts
         - arn:aws:lambda:us-east-1:123456789012:function:create-ticket
       okActions:
         - arn:aws:sns:us-east-1:123456789012:resolved
       alarmNamePrefix: "cw-k8s"
       tags:
         team: platform

Changes

  • CRD types + deepcopy (apis/v1alpha1/)
  • Converter: PromQL expr → EvaluationCriteriaMemberPromQLCriteria, for → PendingPeriod/RecoveryPeriod, group interval → EvaluationInterval
  • Resource manager with create/update/delete reconciliation and garbage collection of removed rules
  • Action model: alarmActions[], okActions[], insufficientDataActions[] — supports SNS, Lambda, SSM Incident Manager, Auto Scaling
  • cloudWatch.tags for AWS resource tags on alarms
  • Helm CRD, RBAC, kustomize config
  • E2E test helpers + fixtures
  • 4 sample manifests + design doc
  • SDK bump: aws-sdk-go-v2/service/cloudwatch v1.43.12 → v1.56.0
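The `for` → PendingPeriod/RecoveryPeriod mapping listed above depends on parsing Prometheus-style durations such as `15m`, which Go's `time.ParseDuration` does not fully cover (it has no `d`, `w`, or `y` units). The following is a rough sketch under stated assumptions, not the converter from this PR: `parsePromDuration` is a hypothetical helper, it accepts a looser grammar than Prometheus itself (unit order and repetition are not checked), it treats a year as 365 days, and it ignores overflow.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

var (
	// durationFull validates the whole string; durationPart extracts each
	// number+unit pair. "ms" is listed before "m" and "s" so it wins.
	durationFull = regexp.MustCompile(`^(\d+(ms|s|m|h|d|w|y))+$`)
	durationPart = regexp.MustCompile(`(\d+)(ms|s|m|h|d|w|y)`)

	unitSize = map[string]time.Duration{
		"ms": time.Millisecond,
		"s":  time.Second,
		"m":  time.Minute,
		"h":  time.Hour,
		"d":  24 * time.Hour,
		"w":  7 * 24 * time.Hour,
		"y":  365 * 24 * time.Hour, // assumption: fixed 365-day year
	}
)

// parsePromDuration converts strings such as "15m" or "1h30m" into a
// time.Duration. It rejects empty or malformed input.
func parsePromDuration(s string) (time.Duration, error) {
	if !durationFull.MatchString(s) {
		return 0, fmt.Errorf("invalid duration %q", s)
	}
	var total time.Duration
	for _, m := range durationPart.FindAllStringSubmatch(s, -1) {
		n, err := strconv.ParseInt(m[1], 10, 64)
		if err != nil {
			return 0, err
		}
		total += time.Duration(n) * unitSize[m[2]]
	}
	return total, nil
}

func main() {
	for _, in := range []string{"15m", "1h30m", "2d", "bogus"} {
		d, err := parsePromDuration(in)
		if err != nil {
			fmt.Println(in, "->", err)
			continue
		}
		fmt.Println(in, "->", int64(d/time.Second), "seconds")
	}
}
```

How the resulting value is then written into PendingPeriod and RecoveryPeriod depends on the SDK types and is deliberately left out of this sketch.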

Tested

  • Unit tests: conversion logic, duration parsing, recording rule skip
  • Live on EKS cluster: 7 PromQL alarms created successfully in CloudWatch, all in OK state

@ack-prow ack-prow Bot requested a review from a-hilaly April 8, 2026 13:02

ack-prow Bot commented Apr 8, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bonclay7
Once this PR has been reviewed and has the lgtm label, please assign michaelhtm for approval by writing /assign @michaelhtm in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow Bot requested a review from knottnt April 8, 2026 13:02
@ack-prow ack-prow Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 8, 2026

ack-prow Bot commented Apr 8, 2026


Hi @bonclay7. Thanks for your PR.

I'm waiting for an aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…tion

Add a custom PrometheusRule resource that enables Kubernetes users to
migrate their existing Prometheus alerting rules to CloudWatch alarms
without rewriting PromQL expressions. Uses the native CloudWatch PromQL
EvaluationCriteria API for alarm creation.

Components:
- CRD types and deepcopy generation (apis/v1alpha1)
- PromQL-to-CloudWatch alarm converter using EvaluationCriteria API
- Full resource manager with create/update/delete reconciliation
- Per-alarm status feedback via status.alarmStatuses[]
- Recording rule detection and skip (status.skippedRuleCount)
- Helm CRD, RBAC rules, and kustomize configuration
- E2E test helpers and resource fixtures
- Sample PrometheusRule manifests for common use cases
- Design document with architecture and reconciliation flow
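The create/update/delete reconciliation and the cleanup of alarms for removed rules can be pictured as a set comparison between the alarm names the CR currently implies and the alarm names that already exist. This is only an illustrative sketch: `Plan` and `buildPlan` are hypothetical names, and it assumes the controller can obtain the list of existing alarm names (for example from the CR status), which the PR text does not spell out.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan groups alarm names by the action a reconcile pass would take.
type Plan struct {
	Create []string // desired, not yet present
	Update []string // desired and already present
	Delete []string // present, no longer desired (garbage collection)
}

// buildPlan compares desired alarm names against existing ones.
// It assumes names within each input slice are unique.
func buildPlan(desired, existing []string) Plan {
	have := make(map[string]bool, len(existing))
	for _, n := range existing {
		have[n] = true
	}
	want := make(map[string]bool, len(desired))
	var p Plan
	for _, n := range desired {
		want[n] = true
		if have[n] {
			p.Update = append(p.Update, n)
		} else {
			p.Create = append(p.Create, n)
		}
	}
	for _, n := range existing {
		if !want[n] {
			p.Delete = append(p.Delete, n)
		}
	}
	// Sort for deterministic output and stable status reporting.
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Delete)
	return p
}

func main() {
	p := buildPlan(
		[]string{"cw-k8s-pods-PodCrashLooping", "cw-k8s-nodes-NodeNotReady"},
		[]string{"cw-k8s-pods-PodCrashLooping", "cw-k8s-pods-OldRule"},
	)
	fmt.Printf("%+v\n", p)
}
```

Because PutMetricAlarm creates or updates an alarm by name, a real controller may not need to distinguish Create from Update when calling the API; the split is kept here because it is useful for per-alarm status reporting.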

Action model:
- alarmActions[]: SNS, Lambda, SSM Incident Manager, Auto Scaling
- okActions[]: actions on recovery
- insufficientDataActions[]: actions on missing data
- cloudWatch.tags: AWS resource tags on alarms

SDK: aws-sdk-go-v2/service/cloudwatch v1.43.12 → v1.56.0

The controller converts each alerting rule into a CloudWatch PromQL alarm
using the naming convention <prefix>-<group>-<alertName>, maps Prometheus
'for' duration to PendingPeriod/RecoveryPeriod, routes notifications via
configurable action ARNs, and continuously reconciles desired state.
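
As a small illustration of the <prefix>-<group>-<alertName> convention, the sketch below joins the three parts with hyphens. `alarmName` is a hypothetical helper; the commit message does not say how empty prefixes, over-long names, or characters that are invalid in alarm names are handled, so none of that is modeled here.

```go
package main

import (
	"fmt"
	"strings"
)

// alarmName builds a CloudWatch alarm name from the configured prefix, the
// rule group name, and the alert name, following <prefix>-<group>-<alertName>.
func alarmName(prefix, group, alert string) string {
	return strings.Join([]string{prefix, group, alert}, "-")
}

func main() {
	// Values taken from the example manifest in the PR description.
	fmt.Println(alarmName("cw-k8s", "pods", "PodCrashLooping"))
}
```

Under this convention, the example manifest in the PR description would yield the name cw-k8s-pods-PodCrashLooping.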
@bonclay7 bonclay7 force-pushed the feat/cloudwatch-prometheus-rules-crd branch from 394686d to e2fe78e Compare April 8, 2026 13:19
@michaelhtm

Member

Hey @bonclay7
This PR doesn't seem to have been generated with code-generator
https://aws-controllers-k8s.github.io/community/docs/contributor-docs/overview/

Feel free to reach out if you have any questions. Thanks!

@bonclay7

Author

So @michaelhtm, the PrometheusRule resource is marked is_custom: true in generator.yaml, which seems to be the standard ACK pattern for resources not backed by an AWS API operation. This PR proposes an adapter that lets Prometheus rules land as CloudWatch alarms, since there is no native path. Happy to discuss further.

@mhausenblas


Product here. I would appreciate it if we could prioritize this, since a number of customers have called it out as blocking.


knottnt commented Apr 27, 2026

Contributor

@bonclay7 @mhausenblas Looking over the PR, this new PrometheusRule CR deviates quite a bit from other custom resources (CRs) provided by ACK. For the ACK project, we aim to provide CRs that map to AWS control plane resources. For higher-level constructs like this, we generally look to tools such as KRO.

In this case I think a good potential solution would be to update our existing MetricAlarm resource to support the new EvaluationCriteria field and provide users a KRO RGD template that they can use to translate existing Prometheus CRs to ACK MetricAlarms. KRO's external references can be used to pull in target PrometheusRules and translate them into a collection of ACK MetricAlarms.



4 participants