CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator by enxebre · Pull Request #7749 · openshift/hypershift

enxebre · 2026-02-19T12:23:39Z

Description

Instruments the existing healthCheckKASEndpoint() function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:

| Metric | Type | Description |
| --- | --- | --- |
| `hypershift_kube_apiserver_available` | Gauge | 1 if `/healthz` returns HTTP 200, 0 otherwise |
| `hypershift_kube_apiserver_request_duration_seconds` | Histogram | Latency of the `/healthz` probe (buckets: 0.01–10s) |
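To make the histogram row concrete, here is a minimal stdlib-only sketch of how cumulative `le` buckets count observations. It uses the bucket boundaries quoted from the diff later in this thread; `cumulativeCounts` and the sample observations are hypothetical and are not part of the PR.

```go
package main

import "fmt"

// Bucket upper bounds in seconds, as quoted in the review of kas/metrics.go.
var bounds = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it. This mirrors the cumulative "le" semantics of a
// Prometheus histogram; observations above the last bound would only appear
// in the implicit +Inf bucket, which is not returned here.
func cumulativeCounts(upper, observations []float64) []int {
	counts := make([]int, len(upper))
	for _, v := range observations {
		for i, b := range upper {
			if v <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	// Hypothetical probe latencies: 4 ms, 30 ms, 300 ms and a 12 s outlier.
	fmt.Println(cumulativeCounts(bounds, []float64{0.004, 0.03, 0.3, 12}))
}
```

A probe slower than 10 s lands only in `+Inf`, so with these boundaries any latency beyond 10 s is indistinguishable from any other.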

Why

HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.

How

- Metrics are registered with the controller-runtime metrics registry and automatically scraped by the existing PodMonitor for the CPO, so no new monitoring infrastructure is required
- Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster
- The existing HostedControlPlaneAvailable condition logic is unchanged; metrics are a side-effect, not a replacement
- Works across all endpoint topologies: private clusters, public with Route, public with LoadBalancer, shared ingress (ARO HCP / ROSA HCP)
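The points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in this PR: `healthRecorder` is a hypothetical stand-in for the Prometheus gauge and histogram, `probeHealthz` is a hypothetical name, and recording the duration on every outcome is an assumption rather than something the description confirms.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthRecorder is a hypothetical stand-in for the gauge/histogram pair.
type healthRecorder struct {
	available float64         // last gauge value: 1 or 0
	durations []time.Duration // one entry per recorded probe
}

func (r *healthRecorder) record(ok bool, d time.Duration) {
	r.available = 0
	if ok {
		r.available = 1
	}
	r.durations = append(r.durations, d)
}

// probeHealthz issues GET <base>/healthz. Metrics are a side effect: the
// returned error is what the caller would still use for its own logic, and a
// nil recorder simply skips the recording.
func probeHealthz(client *http.Client, base string, rec *healthRecorder) error {
	start := time.Now()
	resp, err := client.Get(base + "/healthz")
	elapsed := time.Since(start)
	if err != nil {
		if rec != nil {
			rec.record(false, elapsed)
		}
		return err
	}
	defer resp.Body.Close()
	ok := resp.StatusCode == http.StatusOK
	if rec != nil {
		rec.record(ok, elapsed)
	}
	if !ok {
		return fmt.Errorf("healthz returned status %d", resp.StatusCode)
	}
	return nil
}

// runScenarios probes a healthy, a failing, and an unreachable endpoint and
// returns the gauge value after each probe plus the number of observations.
func runScenarios() ([]float64, int) {
	status := func(code int) *httptest.Server {
		return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(code)
		}))
	}
	healthy, failing, gone := status(200), status(503), status(200)
	defer healthy.Close()
	defer failing.Close()
	goneURL := gone.URL
	gone.Close() // nothing listens on this address any more

	client := &http.Client{Timeout: 2 * time.Second}
	rec := &healthRecorder{}
	var gauges []float64
	for _, base := range []string{healthy.URL, failing.URL, goneURL} {
		_ = probeHealthz(client, base, rec)
		gauges = append(gauges, rec.available)
	}
	_ = probeHealthz(client, healthy.URL, nil) // a nil recorder is tolerated
	return gauges, len(rec.durations)
}

func main() {
	gauges, observations := runScenarios()
	fmt.Println(gauges, observations)
}
```

Because the gauge is overwritten on every probe, it reflects only the most recent result; how often it is refreshed depends on how often the health check runs.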

Key files

- control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go: metric definitions and registration
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go: instrumented health check
- control-plane-operator/main.go: metrics initialization

Testing

- Unit tests verify gauge and histogram are correctly set for success (200), failure (503), and unreachable scenarios
- E2E test validates metric presence on the CPO pod using GetMetricsFromPod/ValidateMetricPresence (Karpenter pattern)
- All existing tests pass (make test exits 0)

Jira

CNTRLPLANE-2775

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)

Summary by CodeRabbit

New Features
- Added KAS health metrics (availability gauge and request-duration histogram) and registered them for collection.
Tests
- Added unit tests for KAS health metrics and end-to-end validation to check metrics are emitted during runs.
Chores
- Metrics instrumentation initialized at startup so controller exposes KAS health metrics for monitoring.

Instrument the existing healthCheckKASEndpoint function in the control-plane-operator to expose two new Prometheus metrics:

- hypershift_kube_apiserver_available: Gauge reporting 1 when the KAS /healthz endpoint returns HTTP 200, 0 otherwise.
- hypershift_kube_apiserver_request_duration_seconds: Histogram tracking the latency of the /healthz health check probe.

These metrics are registered with the controller-runtime metrics registry and are automatically scraped by the existing PodMonitor for the control-plane-operator. Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster.

The existing HostedControlPlaneAvailable condition logic is unchanged: metrics are a side-effect of the existing health check, not a replacement.

This eliminates the need for external monitoring tooling (e.g. route-monitor-operator) to track KAS availability and latency for SLA purposes in HCP offerings.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add tests for the healthCheckKASEndpoint function that verify metrics are correctly recorded during health check probes:

- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci-robot · 2026-02-19T12:23:41Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-02-19T12:23:43Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-02-19T12:23:49Z

No actionable comments were generated in the recent review. 🎉

Walkthrough

Adds Prometheus metrics for KAS health (availability and request duration), threads an optional KASHealthMetrics through KAS health checks, initializes metrics at startup, and adds unit and E2E tests and helpers to validate metrics exposure. Some test and helper additions are duplicated in the diff.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Controller instrumentation `control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` | Added `HostedControlPlaneReconciler.KASHealthMetrics kas.KASHealthMetrics`; changed `healthCheckKASEndpoint` to accept `m kas.KASHealthMetrics`; record request duration and set availability (guarded by nil check). |
| KAS metrics implementation `control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go` | New metrics: `KASAvailableMetricName`, `KASRequestDurationMetricName`, `KASRequestDurationBuckets`; `KASHealthMetrics` struct with `Available` gauge and `RequestDuration` histogram; `NewKASHealthMetrics()` registering metrics. |
| Controller unit tests `control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go` | Added `TestHealthCheckKASEndpointMetrics` with subtests (200, 503, unreachable, nil-metrics), helpers `newTestKASHealthMetrics` and `parseHostPort`. Note: test and helpers appear duplicated in the diff; inspect for unintended repeats. |
| KAS metrics unit tests `control-plane-operator/controllers/hostedcontrolplane/kas/metrics_test.go` | New test validating metric registration, gauge initial state, gauge update, and histogram observation via a Prometheus registry. |
| Startup wiring `control-plane-operator/main.go` | Instantiates `kas.NewKASHealthMetrics()` and injects it into `HostedControlPlaneReconciler` during startup/setup. |
| E2E helpers & integration `test/e2e/util/hypershift_framework.go`, `test/e2e/util/util.go` | Adds `ValidateCPOMetrics` E2E helper and invokes it in after-phase; helper polls control-plane-operator metrics for `kas.KASAvailableMetricName` and `kas.KASRequestDurationMetricName`. Note: `ValidateCPOMetrics` appears duplicated in the diff; verify and dedupe. |

Sequence Diagram(s)

sequenceDiagram
  participant Tests as Tests/E2E
  participant Reconciler as HostedControlPlaneReconciler
  participant KAS as KAS/API
  participant Metrics as Prometheus/Registry

  Tests->>Reconciler: trigger health check
  Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
  activate KAS
  KAS-->>Reconciler: HTTP response (200/503/timeout)
  deactivate KAS
  alt m != nil
    Reconciler->>Metrics: observe RequestDuration
    Reconciler->>Metrics: set Available = 1 or 0
  end
  Tests->>Metrics: query metrics endpoint to validate metrics present/values

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | PR contains duplicated test implementations in hostedcontrolplane_controller_test.go and util.go, lacks meaningful assertion messages, and has insufficient timeout safeguards in E2E polling operations. | Remove duplicate test functions, add descriptive failure messages to all assertions, and ensure explicit timeout configuration for all polling operations. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective: exposing KAS availability and latency metrics from the control-plane-operator, which aligns with the core implementation across all modified files. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR use stable, deterministic naming with no dynamic content, formatted strings, or variable substitution. |

openshift-ci-robot · 2026-02-19T12:24:59Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

openshift-ci · 2026-02-19T12:25:25Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

enxebre · 2026-02-19T12:35:56Z

/cc @muraee @csrwng

openshift-ci · 2026-02-19T12:37:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [enxebre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enxebre · 2026-02-19T12:40:52Z

/auto-cc

typeid · 2026-02-19T13:34:37Z

Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network.

I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO.

That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures.

LGTM & thanks for the addition!

typeid · 2026-02-19T13:35:19Z

Also cc @dustman9000 as an FYI that this exists and is now being extended with latency as well :)

muraee · 2026-02-19T13:56:59Z

lgtm

enxebre · 2026-02-19T13:59:24Z

/test e2e-aws

openshift-ci-robot · 2026-02-19T14:10:04Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

Release Notes

New Features

Added Kubernetes API Server (KAS) health metrics monitoring with Prometheus instrumentation, tracking request duration and availability status.

Tests

Added comprehensive validation tests for KAS health metrics functionality.

Integrated Control Plane Operator metrics validation into end-to-end test workflows.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
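A minimal sketch of the check described above: look a metric family up by name and require at least one sample, with no label matching. To stay self-contained it scans Prometheus text exposition output with the standard library instead of using the parsed MetricFamily map that `GetMetricsFromPod` returns; `sampleCounts`, `allPresent`, and the sample payload are hypothetical and not taken from the PR.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleCounts scans Prometheus text exposition output and returns, for each
// family declared by a "# TYPE" line, how many sample lines belong to it.
// Labels are never inspected.
func sampleCounts(exposition string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, "# TYPE "):
			if f := strings.Fields(line); len(f) >= 3 {
				counts[f[2]] += 0 // declare the family with zero samples so far
			}
			continue
		case strings.HasPrefix(line, "#"):
			continue
		}
		// The sample name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		// Histogram samples carry a _bucket, _sum, or _count suffix.
		for _, suffix := range []string{"", "_bucket", "_sum", "_count"} {
			if !strings.HasSuffix(name, suffix) {
				continue
			}
			family := strings.TrimSuffix(name, suffix)
			if _, declared := counts[family]; declared {
				counts[family]++
				break
			}
		}
	}
	return counts
}

// allPresent reports whether every named family has at least one sample.
func allPresent(counts map[string]int, names ...string) bool {
	for _, n := range names {
		if counts[n] == 0 {
			return false
		}
	}
	return true
}

// sample is an illustrative payload, not output captured from a CPO pod.
const sample = `# TYPE hypershift_kube_apiserver_available gauge
hypershift_kube_apiserver_available 1
# TYPE hypershift_kube_apiserver_request_duration_seconds histogram
hypershift_kube_apiserver_request_duration_seconds_bucket{le="0.01"} 0
hypershift_kube_apiserver_request_duration_seconds_bucket{le="+Inf"} 1
hypershift_kube_apiserver_request_duration_seconds_sum 0.042
hypershift_kube_apiserver_request_duration_seconds_count 1
`

func main() {
	counts := sampleCounts(sample)
	fmt.Println(allPresent(counts,
		"hypershift_kube_apiserver_available",
		"hypershift_kube_apiserver_request_duration_seconds"))
}
```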

openshift-ci-robot · 2026-02-19T17:19:00Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

New Features

Added KAS health metrics: availability and request-duration metrics exposed for monitoring.

Tests

Added unit tests for KAS health metrics and integrated control-plane metrics validation into end-to-end test flows.

Chores

Instrumentation initialized at startup so metrics are available from the controller runtime.

Add ValidateCPOMetrics function that verifies both hypershift_kube_apiserver_available and hypershift_kube_apiserver_request_duration_seconds metrics are present on the control-plane-operator pod's metrics endpoint (port 8080).

The validation is integrated into the e2e test framework's pre-teardown phase, running alongside existing hypershift-operator metrics checks. It follows the established pattern using GetMetricsFromPod and ValidateMetricPresence with polling (10s interval, 5min timeout).

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

enxebre · 2026-02-19T17:25:07Z

/test e2e-aws
/verified by e2e

openshift-ci-robot · 2026-02-19T17:25:22Z

@enxebre: This PR has been marked as verified by e2e.

openshift-ci-robot · 2026-02-19T17:34:09Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

openshift-merge-robot · 2026-02-21T20:43:13Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cwbotbot · 2026-02-23T17:57:33Z

Test Results

e2e-aws

Status: ❌ FAIL
Started: 2026-02-19T17:27:01Z
View Job
View Job History

Failed Tests

Total failed tests: 4

TestCreateClusterRequestServingIsolation
TestCreateClusterRequestServingIsolation/Teardown
TestCreateClusterRequestServingIsolation/ValidateHostedCluster
TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods

openshift-bot · 2026-04-17T01:30:16Z

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2026-05-02T02:00:38Z

Stale PRs rot after 14d of inactivity.

Mark the PR as fresh by commenting /remove-lifecycle rotten.
Rotten PRs close after an additional 7d of inactivity.

If this PR is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

dustman9000 · 2026-05-02T17:05:23Z

This is a great addition! Having a native KAS health signal per HCP from inside the CPO will be very useful.

One thing I want to flag from the ROSA HCP SRE monitoring side: the probe frequency here is tied to the reconcile loop (1 min when healthy, 15s when unhealthy), and the probe runs from inside the management cluster. We currently have two layers of external synthetic monitoring that validate the customer-facing network path (DNS, LB, ingress):

- route-monitor-operator: creates Prometheus blackbox exporter probes per HCP endpoint on the management cluster
- RHOBS synthetics agent: external blackbox probes monitoring HCP API endpoints from outside the MC, feeding into RHOBS for SLO tracking

Both probe at a higher frequency and validate reachability from outside the control plane. For SLO/SLA calculations like KubeAPIErrorBudgetBurn, that external perspective and tighter sampling interval matter.

So this is complementary to our synthetic monitoring, not a replacement. If we can, I'd suggest softening the "eliminate that dependency" language in the description to avoid confusion downstream. Something like "these native metrics reduce reliance on external probing for internal health checks" would be more accurate.

jparrill

Dropped some comments. Thanks!

jparrill · 2026-05-05T07:59:01Z

+		}),
+	}
+
+	crmetrics.Registry.MustRegister(m.Available, m.RequestDuration)


Consider accepting prometheus.Registerer instead of hard-coding the global registry. This eliminates the 3 duplicate manual constructions in tests and avoids MustRegister panic on double-registration:

func NewKASHealthMetrics(reg prometheus.Registerer) *KASHealthMetrics {

jparrill · 2026-05-05T07:59:01Z

+
+// KASRequestDurationBuckets defines the histogram bucket boundaries for KAS
+// health check latency measurements, ranging from 10ms to 10s.
+var KASRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}


nit: Exported mutable slice — any caller can corrupt the histogram buckets. Consider unexported kasRequestDurationBuckets.
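One way to act on this nit, sketched under assumptions: keep the slice unexported and hand out copies through an accessor. `requestDurationBuckets` is a hypothetical helper; whether any other package needs to read the boundaries at all is not established in this thread.

```go
package main

import "fmt"

// kasRequestDurationBuckets is unexported, so other packages can neither
// reassign it nor mutate its elements.
var kasRequestDurationBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// requestDurationBuckets returns a copy of the boundaries, so callers that
// modify the result do not affect the package-level slice.
func requestDurationBuckets() []float64 {
	out := make([]float64, len(kasRequestDurationBuckets))
	copy(out, kasRequestDurationBuckets)
	return out
}

func main() {
	b := requestDurationBuckets()
	b[0] = 99 // mutates only the copy
	fmt.Println(kasRequestDurationBuckets[0], b[0])
}
```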

jparrill · 2026-05-05T07:59:01Z

 	if err != nil {
 		return err
 	}
+	defer resp.Body.Close()


Good fix for the pre-existing resp.Body leak. Placement is non-idiomatic though — resp is used in the if m != nil block before err is checked. Safer:

if resp != nil { defer resp.Body.Close() }

placed right after httpClient.Get(), before the metrics block.
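The suggested placement can be sketched as follows, using only the standard library. `probe` and `probeBoth` are hypothetical names, and the comment marking where a metrics block would sit is an assumption about the surrounding code rather than a quote from the diff.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe returns the HTTP status (0 when no response arrived), the elapsed
// time, and the request error.
func probe(client *http.Client, url string) (int, time.Duration, error) {
	start := time.Now()
	resp, err := client.Get(url)
	elapsed := time.Since(start)
	// Guarded close right after Get: resp is nil when the connection fails,
	// so an unguarded `defer resp.Body.Close()` here would panic.
	if resp != nil {
		defer resp.Body.Close()
	}

	status := 0
	if resp != nil {
		status = resp.StatusCode
	}
	// A metrics block could use status and elapsed at this point, before the
	// caller ever looks at err.
	return status, elapsed, err
}

// probeBoth probes a live server and a closed one.
func probeBoth() (okStatus, downStatus int, downErr error) {
	up := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer up.Close()
	down := httptest.NewServer(http.NotFoundHandler())
	downURL := down.URL
	down.Close() // nothing listens on this address any more

	client := &http.Client{Timeout: 2 * time.Second}
	okStatus, _, _ = probe(client, up.URL)
	downStatus, _, downErr = probe(client, downURL)
	return okStatus, downStatus, downErr
}

func main() {
	ok, down, err := probeBoth()
	fmt.Println(ok, down, err != nil)
}
```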

jparrill · 2026-05-05T07:59:02Z

+)
+
+func TestNewKASHealthMetrics(t *testing.T) {
+	t.Run("When creating KAS health metrics, it should register both metrics", func(t *testing.T) {


This never calls NewKASHealthMetrics() — it tests Prometheus primitives, not the constructor. Rename to TestKASHealthMetricsOperations, or rewrite to call the real constructor (easier with the Registerer param suggested above). Also: missing t.Parallel(), uses t.Errorf instead of gomega (project convention).

jparrill · 2026-05-05T07:59:02Z

 		}

 		ValidateMetrics(t, context.Background(), h.client, hostedCluster, metricsToValidate, true)
+		ValidateCPOMetrics(t, context.Background(), h.client, hostedCluster)


Runs in every test's after() — if metrics emission fails on any platform, all E2E tests fail. Suggest starting as a dedicated test, then promoting to after() once stable in CI.

jparrill · 2026-05-05T07:59:02Z

+		hcpNamespace := manifests.HostedControlPlaneNamespace(hc.Namespace, hc.Name)
+
+		err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
+			mf, err := GetMetricsFromPod(ctx, c, "control-plane-operator", "control-plane-operator", hcpNamespace, "8080")


nit: Magic string "8080" — CPO metrics port from main.go:207. Extract to a constant.

openshift-ci · 2026-05-11T17:32:37Z

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | `e94ec9e` | link | true | `/test e2e-aws` |
| ci/prow/unit | `e94ec9e` | link | true | `/test unit` |
| ci/prow/e2e-azure-self-managed | `e94ec9e` | link | true | `/test e2e-azure-self-managed` |
| ci/prow/verify-workflows | `e94ec9e` | link | true | `/test verify-workflows` |
| ci/prow/security | `e94ec9e` | link | true | `/test security` |

Full PR test history. Your PR dashboard.

openshift-ci · 2026-05-19T02:30:05Z

Rotten PRs close after 7d of inactivity.

Reopen the PR by commenting /reopen.
Mark the PR as fresh by commenting /remove-lifecycle rotten.

/close

openshift-ci · 2026-05-19T02:30:44Z

@openshift-ci[bot]: Closed this PR.

iamkirkbater · 2026-05-19T13:33:28Z

/reopen
/remove-lifecycle rotten

openshift-ci · 2026-05-19T13:33:46Z

@iamkirkbater: Reopened this PR.

openshift-ci-robot · 2026-05-19T13:33:48Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

New Features

Added KAS health metrics (availability gauge and request-duration histogram) and registered them for collection.

Tests

Added unit tests for KAS health metrics and end-to-end validation to check metrics are emitted during runs.

Chores

Metrics instrumentation initialized at startup so controller exposes KAS health metrics for monitoring.

hypershift-jira-solve-ci · 2026-05-21T09:12:05Z

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-security, tide, Red Hat Konflux enterprise-contract (×2)
Build ID: 2053890895576567808 (security job)
PR: #7749, branch fix-CNTRLPLANE-2775 (CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator)
PR Created: 2026-02-19 (3 months ago)
Branch: fix-CNTRLPLANE-2775 → main

Test Failure Analysis

Error

CONFLICT (content): Merge conflict in control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
CONFLICT (content): Merge conflict in test/e2e/util/util.go
Automatic merge failed; fix conflicts and then commit the result.
# Error: exit status 1

Summary

All four CI failures on PR #7749 share a single root cause: the PR branch is 3 months stale and has merge conflicts with main. The security job failed during the git merge step before any scanning ran. The two Konflux enterprise-contract checks failed because the branch predates PR #8557 (merged 2026-05-20) which updated 38 outdated Tekton task bundle references. Tide reports an error state because the needs-rebase label is present and mergeStateStatus is DIRTY. None of the failures are caused by the PR's code changes (KAS metrics exposure).

Root Cause

Single root cause: stale branch with merge conflicts.

PR #7749 was created on 2026-02-19 and has not been rebased since. In the 3 months since, main has diverged significantly, creating merge conflicts in two files:

control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go — content conflict
test/e2e/util/util.go — content conflict

This stale branch causes a cascade of failures across all four checks:

Check	Failure Mechanism
ci/prow/security	Git merge of PR HEAD (`e94ec9e`) into `main` (`f16ca0d`) fails with exit status 1 during the clone/checkout phase. The security scanner never executes.
tide	Prow's merge controller detects `mergeable: CONFLICTING` and applies the `needs-rebase` label, setting the check to error state. The PR cannot enter the merge pool.
Konflux enterprise-contract (×2)	The branch still carries old `.tekton/` pipeline definitions with 64 outdated Tekton task bundle references. PR #8557 ("Update Konflux Tekton task bundles"), merged 2026-05-20, fixed these on `main` but the stale branch doesn't have those updates.

Evidence this is not a code issue: Other recent PRs (#8555, #8556, #8563) with current branches pass or skip these same checks. The PR's actual code changes (KAS metrics in the control-plane-operator) are not involved in any failure.

Recommendations

Rebase the branch onto latest main:

git fetch upstream
git checkout fix-CNTRLPLANE-2775
git rebase upstream/main
# Resolve conflicts in:
#   control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
#   test/e2e/util/util.go
git push --force-with-lease origin fix-CNTRLPLANE-2775

No other action needed — rebasing will:
- Resolve the merge conflicts → fixes ci/prow/security and tide
- Pull in the updated Tekton task bundles from PR #8557 (NO-JIRA: Update Konflux Tekton task bundles) → fixes both Konflux enterprise-contract checks
- Remove the needs-rebase label automatically
After rebase, all four checks should pass on the next CI run without any code changes to the KAS metrics implementation.

Evidence

Evidence	Detail
PR age	Created 2026-02-19, last updated 2026-05-19 — 3 months stale
Merge state	`mergeable: CONFLICTING`, `mergeStateStatus: DIRTY`
Conflicting files	`hostedcontrolplane_controller_test.go`, `test/e2e/util/util.go`
Security job failure	Git merge exits status 1 at build-log.txt line 134–142; scanner never runs
Tide error	`needs-rebase` label present; PR blocked from merge pool
Konflux failures	64 outdated Tekton task bundle references in `.tekton/` pipeline configs
Konflux fix PR	#8557 merged 2026-05-20 updated all 38 stale task bundles on `main`
Proof not code-related	Recent PRs #8555, #8556, #8563 pass the same checks with current branches
PR labels	`approved`, `verified`, `jira/valid-reference` present — only `needs-rebase` blocks

enxebre and others added 2 commits February 19, 2026 13:12

openshift-ci-robot added the jira/valid-reference label Feb 19, 2026

openshift-ci Bot added the do-not-merge/work-in-progress and do-not-merge/needs-area labels Feb 19, 2026

openshift-ci Bot requested review from csrwng and muraee February 19, 2026 12:36

openshift-ci Bot added the area/control-plane-operator label Feb 19, 2026

openshift-ci Bot added the area/testing and approved labels and removed the do-not-merge/needs-area label Feb 19, 2026

openshift-ci Bot requested review from jparrill and sjenning February 19, 2026 12:41

enxebre marked this pull request as ready for review February 19, 2026 13:59

openshift-ci Bot removed the do-not-merge/work-in-progress label Feb 19, 2026

coderabbitai Bot reviewed Feb 19, 2026

enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 51eb7d6 to e94ec9e Compare February 19, 2026 17:24

openshift-ci-robot added the verified label Feb 19, 2026

enxebre added the acknowledge-critical-fixes-only label Feb 19, 2026

openshift-merge-robot added the needs-rebase label Feb 21, 2026

openshift-ci Bot added the lifecycle/stale label Apr 17, 2026

openshift-ci Bot added the lifecycle/rotten label and removed the lifecycle/stale label May 2, 2026

jparrill reviewed May 5, 2026

openshift-ci Bot closed this May 19, 2026

openshift-ci Bot reopened this May 19, 2026

openshift-ci Bot removed the lifecycle/rotten label May 19, 2026
