
CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator #7749

Open
enxebre wants to merge 3 commits into openshift:main from enxebre:fix-CNTRLPLANE-2775

Conversation

@enxebre
Member

@enxebre enxebre commented Feb 19, 2026

Description

Instruments the existing healthCheckKASEndpoint() function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:

| Metric | Type | Description |
|---|---|---|
| `hypershift_kube_apiserver_available` | Gauge | 1 if `/healthz` returns HTTP 200, 0 otherwise |
| `hypershift_kube_apiserver_request_duration_seconds` | Histogram | Latency of the `/healthz` probe (buckets: 0.01–10s) |

Why

HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.

How

  • Metrics are registered with the controller-runtime metrics registry and automatically scraped by the existing PodMonitor for the CPO — no new monitoring infrastructure required
  • Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster
  • The existing HostedControlPlaneAvailable condition logic is unchanged — metrics are a side-effect, not a replacement
  • Works across all endpoint topologies: private clusters, public with Route, public with LoadBalancer, shared ingress (ARO HCP / ROSA HCP)

Key files

  • control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go — metric definitions and registration
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go — instrumented health check
  • control-plane-operator/main.go — metrics initialization

Testing

  • Unit tests verify gauge and histogram are correctly set for success (200), failure (503), and unreachable scenarios
  • E2e test validates metric presence on the CPO pod using GetMetricsFromPod/ValidateMetricPresence (Karpenter pattern)
  • All existing tests pass (make test exits 0)

Jira

CNTRLPLANE-2775

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)

Summary by CodeRabbit

  • New Features

    • Added KAS health metrics (availability gauge and request-duration histogram) and registered them for collection.
  • Tests

    • Added unit tests for KAS health metrics and end-to-end validation to check metrics are emitted during runs.
  • Chores

    • Metrics instrumentation initialized at startup so controller exposes KAS health metrics for monitoring.

enxebre and others added 2 commits February 19, 2026 13:12
Instrument the existing healthCheckKASEndpoint function in the
control-plane-operator to expose two new Prometheus metrics:

- hypershift_kube_apiserver_available: Gauge reporting 1 when the KAS
  /healthz endpoint returns HTTP 200, 0 otherwise.
- hypershift_kube_apiserver_request_duration_seconds: Histogram tracking
  the latency of the /healthz health check probe.

These metrics are registered with the controller-runtime metrics registry
and are automatically scraped by the existing PodMonitor for the
control-plane-operator. Each CPO pod runs in its own HCP namespace, so
metrics are naturally scoped per hosted cluster. The existing
HostedControlPlaneAvailable condition logic is unchanged — metrics are a
side-effect of the existing health check, not a replacement.

This eliminates the need for external monitoring tooling (e.g.
route-monitor-operator) to track KAS availability and latency for SLA
purposes in HCP offerings.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add tests for the healthCheckKASEndpoint function that verify metrics
are correctly recorded during health check probes:

- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 19, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 19, 2026

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 19, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Adds Prometheus metrics for KAS health (availability and request duration), threads an optional KASHealthMetrics through KAS health checks, initializes metrics at startup, and adds unit and E2E tests and helpers to validate metrics exposure. Some test and helper additions are duplicated in the diff.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Controller instrumentation**<br>`control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` | Added `HostedControlPlaneReconciler.KASHealthMetrics *kas.KASHealthMetrics`; changed `healthCheckKASEndpoint` to accept `m *kas.KASHealthMetrics`; record request duration and set availability (guarded by nil check). |
| **KAS metrics implementation**<br>`control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go` | New metrics: `KASAvailableMetricName`, `KASRequestDurationMetricName`, `KASRequestDurationBuckets`; `KASHealthMetrics` struct with `Available` gauge and `RequestDuration` histogram; `NewKASHealthMetrics()` registering metrics. |
| **Controller unit tests**<br>`control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go` | Added `TestHealthCheckKASEndpointMetrics` with subtests (200, 503, unreachable, nil-metrics), helpers `newTestKASHealthMetrics` and `parseHostPort`. Note: test and helpers appear duplicated in the diff; inspect for unintended repeats. |
| **KAS metrics unit tests**<br>`control-plane-operator/controllers/hostedcontrolplane/kas/metrics_test.go` | New test validating metric registration, gauge initial state, gauge update, and histogram observation via a Prometheus registry. |
| **Startup wiring**<br>`control-plane-operator/main.go` | Instantiates `kas.NewKASHealthMetrics()` and injects it into `HostedControlPlaneReconciler` during startup/setup. |
| **E2E helpers & integration**<br>`test/e2e/util/hypershift_framework.go`, `test/e2e/util/util.go` | Adds `ValidateCPOMetrics` E2E helper and invokes it in after-phase; helper polls control-plane-operator metrics for `kas.KASAvailableMetricName` and `kas.KASRequestDurationMetricName`. Note: `ValidateCPOMetrics` appears duplicated in the diff; verify and dedupe. |

Sequence Diagram(s)

sequenceDiagram
  participant Tests as Tests/E2E
  participant Reconciler as HostedControlPlaneReconciler
  participant KAS as KAS/API
  participant Metrics as Prometheus/Registry

  Tests->>Reconciler: trigger health check
  Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
  activate KAS
  KAS-->>Reconciler: HTTP response (200/503/timeout)
  deactivate KAS
  alt m != nil
    Reconciler->>Metrics: observe RequestDuration
    Reconciler->>Metrics: set Available = 1 or 0
  end
  Tests->>Metrics: query metrics endpoint to validate metrics present/values

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | PR contains duplicated test implementations in hostedcontrolplane_controller_test.go and util.go, lacks meaningful assertion messages, and has insufficient timeout safeguards in E2E polling operations. | Remove duplicate test functions, add descriptive failure messages to all assertions, and ensure explicit timeout configuration for all polling operations. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective: exposing KAS availability and latency metrics from the control-plane-operator, which aligns with the core implementation across all modified files. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR use stable, deterministic naming with no dynamic content, formatted strings, or variable substitution. |


@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Feb 19, 2026

@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@enxebre
Member Author

enxebre commented Feb 19, 2026

/cc @muraee @csrwng

@openshift-ci openshift-ci Bot requested review from csrwng and muraee February 19, 2026 12:36
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Feb 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Feb 19, 2026
@enxebre
Member Author

enxebre commented Feb 19, 2026

/auto-cc

@openshift-ci openshift-ci Bot requested review from jparrill and sjenning February 19, 2026 12:41
@typeid
Member

typeid commented Feb 19, 2026

Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network.

I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO.

That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures.

LGTM & thanks for the addition!

@typeid
Member

typeid commented Feb 19, 2026

Also cc @dustman9000 as a FYI that this exists and is now being extended with latency as well :)

@muraee
Contributor

muraee commented Feb 19, 2026

lgtm

@enxebre
Member Author

enxebre commented Feb 19, 2026

/test e2e-aws

@enxebre enxebre marked this pull request as ready for review February 19, 2026 13:59
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2026

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
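
The finding above amounts to checking for a metric family by name instead of matching on labels. A minimal sketch of that idea follows, assuming the metrics are available as Prometheus text exposition. `hasMetricFamily` and the sample scrape are made up for illustration; per the comment, the real helper would inspect the metric families returned by `GetMetricsFromPod` rather than raw text.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleScrape is an invented /metrics payload used only for this example.
const sampleScrape = `# TYPE hypershift_kube_apiserver_available gauge
hypershift_kube_apiserver_available 1
# TYPE hypershift_kube_apiserver_request_duration_seconds histogram
hypershift_kube_apiserver_request_duration_seconds_bucket{le="0.01"} 0
hypershift_kube_apiserver_request_duration_seconds_bucket{le="+Inf"} 1
hypershift_kube_apiserver_request_duration_seconds_sum 0.042
hypershift_kube_apiserver_request_duration_seconds_count 1
`

// hasMetricFamily reports whether body contains at least one sample for the
// family called name. No labels are required. Histogram samples carry a
// _bucket, _sum, or _count suffix, so those are accepted as well.
func hasMetricFamily(body, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and HELP/TYPE comments
		}
		// The sample name ends at the first '{' (labels) or space (value).
		sample := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			sample = line[:i]
		}
		if sample == name {
			return true
		}
		for _, suffix := range []string{"_bucket", "_sum", "_count"} {
			if sample == name+suffix {
				return true
			}
		}
	}
	return false
}

func main() {
	for _, name := range []string{
		"hypershift_kube_apiserver_available",
		"hypershift_kube_apiserver_request_duration_seconds",
		"hypershift_kube_apiserver_missing",
	} {
		fmt.Printf("%s present: %v\n", name, hasMetricFamily(sampleScrape, name))
	}
}
```

Inside a polling loop, a check like this would return true only once both family names are found, which is the condition the comment asks for.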

Comment thread test/e2e/util/util.go

Add ValidateCPOMetrics function that verifies both
hypershift_kube_apiserver_available and
hypershift_kube_apiserver_request_duration_seconds metrics are present
on the control-plane-operator pod's metrics endpoint (port 8080).

The validation is integrated into the e2e test framework's pre-teardown
phase, running alongside existing hypershift-operator metrics checks.
It follows the established pattern using GetMetricsFromPod and
ValidateMetricPresence with polling (10s interval, 5min timeout).

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@enxebre enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 51eb7d6 to e94ec9e Compare February 19, 2026 17:24
@enxebre
Member Author

enxebre commented Feb 19, 2026

/test e2e-aws
/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 19, 2026
@openshift-ci-robot

@enxebre: This PR has been marked as verified by e2e.


@enxebre enxebre added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Feb 19, 2026

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2026
@openshift-merge-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cwbotbot

Test Results

e2e-aws

Failed Tests

Total failed tests: 4

  • TestCreateClusterRequestServingIsolation
  • TestCreateClusterRequestServingIsolation/Teardown
  • TestCreateClusterRequestServingIsolation/ValidateHostedCluster
  • TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods

@openshift-bot

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 17, 2026
@openshift-bot

Stale PRs rot after 14d of inactivity.

Mark the PR as fresh by commenting /remove-lifecycle rotten.
Rotten PRs close after an additional 7d of inactivity.

If this PR is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2026
@dustman9000
Member

This is a great addition! Having a native KAS health signal per HCP from inside the CPO will be very useful.

One thing I want to flag from the ROSA HCP SRE monitoring side: the probe frequency here is tied to the reconcile loop (1 min when healthy, 15s when unhealthy), and the probe runs from inside the management cluster. We currently have two layers of external synthetic monitoring that validate the customer-facing network path (DNS, LB, ingress):

  • route-monitor-operator: creates Prometheus blackbox exporter probes per HCP endpoint on the management cluster
  • RHOBS synthetics agent: external blackbox probes monitoring HCP API endpoints from outside the MC, feeding into RHOBS for SLO tracking

Both probe at a higher frequency and validate reachability from outside the control plane. For SLO/SLA calculations like KubeAPIErrorBudgetBurn, that external perspective and tighter sampling interval matter.

So this is complementary to our synthetic monitoring, not a replacement. If we can, I'd suggest softening the "eliminate that dependency" language in the description to avoid confusion downstream. Something like "these native metrics reduce reliance on external probing for internal health checks" would be more accurate.

Contributor

@jparrill jparrill left a comment


Dropped some comments. Thanks!

}),
}

crmetrics.Registry.MustRegister(m.Available, m.RequestDuration)

Consider accepting prometheus.Registerer instead of hard-coding the global registry. This eliminates the 3 duplicate manual constructions in tests and avoids MustRegister panic on double-registration:

func NewKASHealthMetrics(reg prometheus.Registerer) *KASHealthMetrics {


// KASRequestDurationBuckets defines the histogram bucket boundaries for KAS
// health check latency measurements, ranging from 10ms to 10s.
var KASRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

nit: Exported mutable slice — any caller can corrupt the histogram buckets. Consider unexported kasRequestDurationBuckets.
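One possible shape for this, as a sketch rather than the PR's actual code: keep the slice unexported and, if other packages still need the boundaries, hand out a copy. The accessor name `RequestDurationBuckets` is hypothetical.

```go
package main

// kasRequestDurationBuckets holds the histogram bucket boundaries (10ms to 10s).
// It is unexported so that other packages cannot reassign or mutate it.
var kasRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// RequestDurationBuckets is a hypothetical read-only accessor. It returns a
// copy, so callers that modify the result do not affect the package state.
func RequestDurationBuckets() []float64 {
	out := make([]float64, len(kasRequestDurationBuckets))
	copy(out, kasRequestDurationBuckets)
	return out
}
```

If nothing outside the package reads the boundaries, the accessor can be dropped and only the rename is needed.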

if err != nil {
return err
}
defer resp.Body.Close()

Good fix for the pre-existing resp.Body leak. Placement is non-idiomatic though — resp is used in the if m != nil block before err is checked. Safer:

if resp != nil {
    defer resp.Body.Close()
}

placed right after httpClient.Get(), before the metrics block.

)

func TestNewKASHealthMetrics(t *testing.T) {
t.Run("When creating KAS health metrics, it should register both metrics", func(t *testing.T) {

This never calls NewKASHealthMetrics() — it tests Prometheus primitives, not the constructor. Rename to TestKASHealthMetricsOperations, or rewrite to call the real constructor (easier with the Registerer param suggested above). Also: missing t.Parallel(), uses t.Errorf instead of gomega (project convention).

}

ValidateMetrics(t, context.Background(), h.client, hostedCluster, metricsToValidate, true)
ValidateCPOMetrics(t, context.Background(), h.client, hostedCluster)

Runs in every test's after() — if metrics emission fails on any platform, all E2E tests fail. Suggest starting as a dedicated test, then promoting to after() once stable in CI.

Comment thread test/e2e/util/util.go
hcpNamespace := manifests.HostedControlPlaneNamespace(hc.Namespace, hc.Name)

err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
mf, err := GetMetricsFromPod(ctx, c, "control-plane-operator", "control-plane-operator", hcpNamespace, "8080")

nit: Magic string "8080" — CPO metrics port from main.go:207. Extract to a constant.

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | e94ec9e | link | true | /test e2e-aws |
| ci/prow/unit | e94ec9e | link | true | /test unit |
| ci/prow/e2e-azure-self-managed | e94ec9e | link | true | /test e2e-azure-self-managed |
| ci/prow/verify-workflows | e94ec9e | link | true | /test verify-workflows |
| ci/prow/security | e94ec9e | link | true | /test security |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

Rotten PRs close after 7d of inactivity.

Reopen the PR by commenting /reopen.
Mark the PR as fresh by commenting /remove-lifecycle rotten.

/close

@openshift-ci openshift-ci Bot closed this May 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

@openshift-ci[bot]: Closed this PR.


@iamkirkbater

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci Bot reopened this May 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

@iamkirkbater: Reopened this PR.


@openshift-ci-robot

openshift-ci-robot commented May 19, 2026

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 19, 2026
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 21, 2026


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

CONFLICT (content): Merge conflict in control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
CONFLICT (content): Merge conflict in test/e2e/util/util.go
Automatic merge failed; fix conflicts and then commit the result.
# Error: exit status 1

Summary

All four CI failures on PR #7749 share a single root cause: the PR branch is 3 months stale and has merge conflicts with main. The security job failed during the git merge step before any scanning ran. The two Konflux enterprise-contract checks failed because the branch predates PR #8557 (merged 2026-05-20) which updated 38 outdated Tekton task bundle references. Tide reports an error state because the needs-rebase label is present and mergeStateStatus is DIRTY. None of the failures are caused by the PR's code changes (KAS metrics exposure).

Root Cause

Single root cause: stale branch with merge conflicts.

PR #7749 was created on 2026-02-19 and has not been rebased since. In the 3 months since, main has diverged significantly, creating merge conflicts in two files:

  1. control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go — content conflict
  2. test/e2e/util/util.go — content conflict

This stale branch causes a cascade of failures across all four checks:

| Check | Failure Mechanism |
| --- | --- |
| ci/prow/security | Git merge of PR HEAD (e94ec9e) into main (f16ca0d) fails with exit status 1 during the clone/checkout phase. The security scanner never executes. |
| tide | Prow's merge controller detects mergeable: CONFLICTING and applies the needs-rebase label, setting the check to error state. The PR cannot enter the merge pool. |
| Konflux enterprise-contract (×2) | The branch still carries old .tekton/ pipeline definitions with 64 outdated Tekton task bundle references. PR #8557 ("Update Konflux Tekton task bundles"), merged 2026-05-20, fixed these on main but the stale branch doesn't have those updates. |

Evidence this is not a code issue: Other recent PRs (#8555, #8556, #8563) with current branches pass or skip these same checks. The PR's actual code changes (KAS metrics in the control-plane-operator) are not involved in any failure.

Recommendations
  1. Rebase the branch onto latest main:

    git fetch upstream
    git checkout fix-CNTRLPLANE-2775
    git rebase upstream/main
    # Resolve conflicts in:
    #   control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
    #   test/e2e/util/util.go
    git push --force-with-lease origin fix-CNTRLPLANE-2775
  2. No other action needed — rebasing will:

    • Resolve the merge conflicts → fixes ci/prow/security and tide
    • Pull in the updated Tekton task bundles from PR #8557 ("NO-JIRA: Update Konflux Tekton task bundles") → fixes both Konflux enterprise-contract checks
    • Remove the needs-rebase label automatically
  3. After rebase, all four checks should pass on the next CI run without any code changes to the KAS metrics implementation.

Evidence

| Evidence | Detail |
| --- | --- |
| PR age | Created 2026-02-19, last updated 2026-05-19 (3 months stale) |
| Merge state | mergeable: CONFLICTING, mergeStateStatus: DIRTY |
| Conflicting files | hostedcontrolplane_controller_test.go, test/e2e/util/util.go |
| Security job failure | Git merge exits status 1 at build-log.txt lines 134–142; scanner never runs |
| Tide error | needs-rebase label present; PR blocked from merge pool |
| Konflux failures | 64 outdated Tekton task bundle references in .tekton/ pipeline configs |
| Konflux fix | PR #8557 merged 2026-05-20 updated all 38 stale task bundles on main |
| Proof not code-related | Recent PRs #8555, #8556, #8563 pass the same checks with current branches |
| PR labels | approved, verified, jira/valid-reference present; only needs-rebase blocks |


Labels

  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • verified: Signifies that the PR passed pre-merge verification criteria


10 participants