GCP-841: remove ClusterResourceSet feature gate from CAPG manager args #8795

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from cristianoveiga:fix/gcp-841-remove-clusterresourceset-feature-gate
Jun 30, 2026

Conversation

@cristianoveiga

@cristianoveiga cristianoveiga commented Jun 22, 2026

Contributor

Summary

  • Removes ClusterResourceSet=false from the --feature-gates arg passed to the CAPG manager
  • The ClusterResourceSet feature was promoted to GA in CAPI 1.10, and its feature gate was removed in CAPI 1.12 (kubernetes-sigs/cluster-api#12950)
  • OCP 4.22+ ships CAPG built against CAPI 1.12.8, causing the capi-provider pod to crash at startup with: unrecognized feature gate: ClusterResourceSet
  • MachinePool=false is retained — still valid in CAPI 1.12 (Beta, default-on)

Fixes: https://redhat.atlassian.net/browse/GCP-841

Test plan

  • Existing unit tests pass (go test ./hypershift-operator/controllers/hostedcluster/internal/platform/gcp/)
  • periodic-ci-openshift-hypershift-release-4.23-periodics-e2e-v2-gke no longer fails due to capi-provider crash
  • capi-provider pod starts successfully on 4.22.x and 4.23.x without a CAPG image override

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Simplified GCP controller feature gate configuration by removing version-dependent logic, now using a fixed set of feature gates instead of conditionally adjusting based on payload version.

ClusterResourceSet was promoted to GA in CAPI 1.10 and removed entirely
in CAPI 1.12. OCP 4.22+ ships CAPG built against CAPI 1.12.8, causing
the capi-provider pod to crash at startup with:

  invalid argument "MachinePool=false,ClusterResourceSet=false" for
  "--feature-gates" flag: unrecognized feature gate: ClusterResourceSet

Fixes: GCP-841

Signed-off-by: Cristiano Veiga <cveiga@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
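
The startup error quoted in the commit message comes from flag validation: a gate name that the binary no longer registers is rejected outright, whatever value it is given. The sketch below is a simplified stand-in for that check, not the Kubernetes or CAPI implementation; `knownGates` and `parseFeatureGates` are hypothetical names, and the registry contents are an assumption based on the PR description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knownGates is a hypothetical registry standing in for the gates a
// CAPI 1.12-based binary still accepts. ClusterResourceSet is absent.
var knownGates = map[string]struct{}{"MachinePool": {}}

// parseFeatureGates validates a comma-separated list of name=bool pairs
// and rejects any gate name that is not registered.
func parseFeatureGates(value string) (map[string]bool, error) {
	parsed := make(map[string]bool)
	for _, pair := range strings.Split(value, ",") {
		name, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing bool value for %s", name)
		}
		if _, registered := knownGates[name]; !registered {
			return nil, fmt.Errorf("unrecognized feature gate: %s", name)
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid value of %s=%s", name, raw)
		}
		parsed[name] = enabled
	}
	return parsed, nil
}

func main() {
	for _, arg := range []string{
		"MachinePool=false,ClusterResourceSet=false", // value passed before this PR
		"MachinePool=false",                          // value passed after this PR
	} {
		_, err := parseFeatureGates(arg)
		fmt.Printf("%q -> err: %v\n", arg, err)
	}
}
```

Under these assumptions the first value fails on the unregistered name, and the second parses cleanly.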
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label Jun 22, 2026
@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

Walkthrough

In CAPIProviderDeploymentSpec within the GCP platform controller, the featureGates variable is now initialized with a single static entry (MachinePool=false). The previous conditional logic that parsed payloadVersion and appended ClusterResourceSet=false when the major version was 4 and the minor version was greater than 16 has been removed entirely.
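
The change described in the walkthrough can be sketched as a before/after pair. This is an illustration only, not the code from gcp.go: the function names are hypothetical, the version parsing uses the standard library rather than whatever the controller used, the fallback for an unparsable version is an assumption, and the `--feature-gates=` prefix is assumed from the error message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// legacyFeatureGates mirrors the removed behavior as the walkthrough
// describes it: ClusterResourceSet=false was appended when the payload
// major version was 4 and the minor version was greater than 16.
func legacyFeatureGates(payloadVersion string) []string {
	gates := []string{"MachinePool=false"}
	parts := strings.SplitN(payloadVersion, ".", 3)
	if len(parts) < 2 {
		return gates // assumption: unparsable versions get the base list
	}
	major, errMajor := strconv.Atoi(parts[0])
	minor, errMinor := strconv.Atoi(parts[1])
	if errMajor == nil && errMinor == nil && major == 4 && minor > 16 {
		gates = append(gates, "ClusterResourceSet=false")
	}
	return gates
}

// currentFeatureGates is the static list that remains after this PR.
func currentFeatureGates() []string {
	return []string{"MachinePool=false"}
}

// featureGatesArg renders the list as a single manager argument.
func featureGatesArg(gates []string) string {
	return "--feature-gates=" + strings.Join(gates, ",")
}

func main() {
	fmt.Println(featureGatesArg(legacyFeatureGates("4.22.0")))
	fmt.Println(featureGatesArg(currentFeatureGates()))
}
```

With a 4.22 payload the legacy path yields the two-gate value that the CAPI 1.12-based binary rejects, while the static list no longer depends on the payload version.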

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR does not contain Ginkgo test definitions. Modified file (gcp.go) is non-test code; codebase uses standard Go testing, not Ginkgo. |
| Test Structure And Quality | ✅ Passed | PR modifies only non-test code (gcp.go) and contains no Ginkgo tests. Custom check for Ginkgo test quality is not applicable to this pull request. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR only modifies feature gate configuration strings for CAPI 1.12 compatibility; it introduces no scheduling constraints, affinity rules, topology assumptions, or replica changes whatsoever. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add any Ginkgo e2e tests. It modifies only the GCP platform controller configuration to remove an obsolete feature gate, making this check not applicable. |
| No-Weak-Crypto | ✅ Passed | PR modifies GCP feature gate configuration, not cryptographic code. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or insecure secret comparisons detected in changes. |
| Container-Privileges | ✅ Passed | PR contains no container privilege escalations: AllowPrivilegeEscalation=false, RunAsNonRoot=true, all capabilities dropped. Changes are only to feature gates, not security configuration. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR removes a feature gate flag from CAPG controller configuration. No logging statements are added, modified, or exposed. No sensitive data (credentials, tokens, PII) is logged in this change. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: removing the ClusterResourceSet feature gate from CAPG manager arguments, which aligns with the core objective of the PR. |

@openshift-ci openshift-ci Bot added the area/hypershift-operator and area/platform/gcp labels and removed the do-not-merge/needs-area label Jun 22, 2026
@cristianoveiga cristianoveiga changed the title fix(gcp): remove ClusterResourceSet feature gate from CAPG manager args GCP-841: remove ClusterResourceSet feature gate from CAPG manager args Jun 22, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Jun 22, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 22, 2026


@cristianoveiga: This pull request references GCP-841 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

@codecov

codecov Bot commented Jun 22, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.09%. Comparing base (8019810) to head (f11fc38).
⚠️ Report is 143 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8795      +/-   ##
==========================================
- Coverage   42.09%   42.09%   -0.01%     
==========================================
  Files         766      766              
  Lines       95047    95043       -4     
==========================================
- Hits        40012    40008       -4     
  Misses      52221    52221              
  Partials     2814     2814              
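
The percentages in the diff above follow from hits divided by lines, and in both columns hits, misses, and partials sum to the line count (40012 + 52221 + 2814 = 95047 and 40008 + 52221 + 2814 = 95043). A small sketch of that arithmetic, with `coveragePercent` as a hypothetical helper name:

```go
package main

import "fmt"

// coveragePercent returns hits as a percentage of total lines.
func coveragePercent(hits, lines int) float64 {
	return 100 * float64(hits) / float64(lines)
}

func main() {
	base := coveragePercent(40012, 95047) // main column
	head := coveragePercent(40008, 95043) // #8795 column
	fmt.Printf("base=%.4f%% head=%.4f%% delta=%.4f points\n", base, head, head-base)
}
```

Both values round to 42.09% at two decimal places. The underlying change is a small negative fraction of a percentage point, because four covered lines were removed from both hits and lines.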
| Files with missing lines | Coverage Δ |
|---|---|
| ...rollers/hostedcluster/internal/platform/gcp/gcp.go | 83.67% <ø> (-0.20%) ⬇️ |

| Flag | Coverage Δ |
|---|---|
| cmd-support | 35.42% <ø> (ø) |
| cpo-hostedcontrolplane | 44.48% <ø> (ø) |
| cpo-other | 44.25% <ø> (ø) |
| hypershift-operator | 51.91% <ø> (-0.01%) ⬇️ |
| other | 31.56% <ø> (ø) |

@cristianoveiga cristianoveiga marked this pull request as ready for review June 22, 2026 15:47
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 22, 2026
@openshift-ci openshift-ci Bot requested review from clebs and jimdaga June 22, 2026 15:47
@cristianoveiga

Contributor Author

/test e2e-v2-gke

@clebs

clebs commented Jun 25, 2026

Member

@cristianoveiga HyperShift is still on CAPI 1.11. Since you are removing a feature that still exists in that version, we need to make sure this is fine.

@cristianoveiga

Contributor Author

@cristianoveiga HyperShift is still on CAPI 1.11. Since you are removing a feature that still exists in that version, we need to make sure this is fine.

Hi @clebs,

The deployed CAPG binary comes from the OCP payload image, which is built separately from HyperShift's own vendored dependencies. My understanding is that these versions are not required to match.

The OpenShift CAPG fork upgraded to CAPI 1.12.8 in openshift/cluster-api-provider-gcp@e049bbd, and the new payloads (GCP HCP minimum will be 4.23) ship that binary.

ClusterResourceSet doesn't exist in any supported CAPG binary, so the fix is safe.

@clebs

clebs commented Jun 26, 2026

Member

@cristianoveiga I see, if older CAPG versions that are still on CAPI 1.11 do not have that either, it should work fine.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label Jun 26, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2070428924499726336 | Cost: $2.93488025 | Failed step: hypershift-azure-run-e2e

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cristianoveiga

Contributor Author

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2070515044726083584 | Cost: $2.9783685 | Failed step: hypershift-azure-run-e2e

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cristianoveiga

Contributor Author

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2071244081966616576 | Cost: $3.4274917499999993 | Failed step: hypershift-azure-run-e2e

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cristianoveiga

Contributor Author

/retest-required

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2071613649356591104 | Cost: $4.784708499999997 | Failed step: hypershift-azure-run-e2e

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cristianoveiga

Contributor Author

/test e2e-aks

@cristianoveiga

Contributor Author

/verified later by @cristianoveiga

@openshift-ci-robot

@cristianoveiga: Only users can be targets for the /verified later command.

@cristianoveiga

Contributor Author

/verified bypass

@openshift-ci-robot openshift-ci-robot added the verified label Jun 29, 2026
@openshift-ci-robot

@cristianoveiga: The verified label has been added.

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2071668215800401920 | Cost: $5.610252449999998 | Failed step: hypershift-azure-run-e2e

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cristianoveiga

Contributor Author

/test e2e-aks

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 29, 2026


The e2e-aks step was never started (Started: None, Finished: None) because it depends on [release:initial] which failed. The failure is entirely in CI infrastructure — the release-images-initial pod could not be scheduled on the build01 cluster for the entire 1-hour timeout period. This is unrelated to the PR's code changes.

Test Failure Analysis

Error

step [release:initial] failed: release "release-images-initial" failed: pod pending for more than
1h0m0s: pod has not been scheduled in 1h0m0.000126666s:
0/47 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 12 node(s) didn't match
Pod's node affinity/selector, 2 node(s) were unschedulable, 32 node(s) had untolerated taint(s).
preemption: 0/47 nodes are available: No preemption victims found for incoming pod.
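
The four exclusion counts in the scheduler message above account for every node in the cluster at that moment, which is why no candidate remained. A quick check of that arithmetic, with the counts copied from the message and `excludedTotal` as a hypothetical helper name:

```go
package main

import "fmt"

// excludedTotal sums the per-reason exclusion counts reported by the
// scheduler in the error message.
func excludedTotal() int {
	excluded := map[string]int{
		"pod anti-affinity rules":     1,
		"node affinity/selector":      12,
		"unschedulable":               2,
		"untolerated taint(s)":        32,
	}
	total := 0
	for _, count := range excluded {
		total += count
	}
	return total
}

func main() {
	const nodes = 47 // from "0/47 nodes are available"
	fmt.Printf("excluded %d of %d nodes\n", excludedTotal(), nodes)
}
```

Since the sum equals the node count, each node was ruled out by one of the listed reasons in this snapshot.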

Summary

This failure is a CI infrastructure scheduling issue completely unrelated to the PR's code changes. The job failed before any test code ever executed. The release-images-initial pod — responsible for importing the OCP 5.0 initial release payload — could not be scheduled on the build01 cluster for the entire 1-hour timeout. All 47–55 available nodes were excluded due to a combination of untolerated taints (~30–40 nodes), node affinity/selector mismatches (12 nodes), unschedulable nodes (1–7 nodes), and pod anti-affinity rules (1 node). Because the e2e-aks test step depends on [release:initial], it was never started.

Root Cause

The root cause is CI build cluster resource exhaustion / scheduling constraints on build01. The release-images-initial pod was created at 20:58:51Z and remained in Pending state for the full 60-minute timeout until 21:58:51Z, when ci-operator aborted the job with reason executing_graph:step_failed:importing_release:running_pod:pod_pending.

Across the 78 scheduling events recorded, the scheduler consistently could not place the pod because:

  1. ~30–40 nodes had untolerated taints: These nodes are reserved for other workloads (e.g., different CI profiles or infrastructure components) and cannot accept this pod without matching tolerations.
  2. 12 nodes didn't match Pod's node affinity/selector: The pod has node affinity rules (likely requiring amd64 architecture and specific worker labels) that excluded these nodes.
  3. 1–7 nodes were unschedulable: These nodes were cordoned for maintenance or draining.
  4. 1 node didn't match pod anti-affinity rules: The pod has anti-affinity constraints preventing co-location with certain other pods.

The multiarch-tuning-operator processed the pod correctly (gated it, detected amd64 architecture, removed the gate at 20:58:54Z), but after that the Kubernetes scheduler was never able to find a suitable node. The node counts fluctuated throughout the hour (47–55 total nodes), but the combination of constraints always eliminated all candidates.

This is a transient infrastructure condition. The actual test step e2e-aks was never reached — it has a dependency on [release:initial] and its Started and Finished timestamps are both None.

The PR changes (removing the ClusterResourceSet feature gate from CAPG manager args) were never tested because the failure occurred during release image import, long before any HyperShift or CAPG code was executed.

Recommendations
  1. Retry the job — This is a transient CI infrastructure issue. Use /retest or /test e2e-aks to re-trigger the job. The cluster scheduling pressure is likely to have resolved.
  2. No code changes needed — The PR's changes to remove the ClusterResourceSet feature gate are not implicated in this failure in any way.
  3. If retries continue to fail — Escalate to the CI infrastructure team (#forum-ocp-crt on Slack) about scheduling capacity on the build01 cluster, particularly for release-image import pods that require amd64 workers without restrictive taints.

Evidence

| Evidence | Detail |
|---|---|
| Failed Step | [release:initial]: Import the release payload "initial" from an external source |
| Pod Name | release-images-initial |
| Pod Status | Pending for 60m0s (full timeout) |
| Build Cluster | build01 |
| Scheduling Events | 78 events, all showing no schedulable nodes |
| Node Exclusions | ~30-40 untolerated taints, 12 affinity mismatches, 1-7 unschedulable, 1 anti-affinity conflict |
| e2e-aks Step | Never started (Started: None, Finished: None) |
| Job Reason | executing_graph:step_failed:importing_release:running_pod:pod_pending |
| Release Being Imported | registry.ci.openshift.org/ocp/release-5:5.0.0-0.ci-2026-06-25-122017 |
| All Images Built | ✅ hypershift, hypershift-operator, hypershift-cli, hypershift-tests (all succeeded) |

@cristianoveiga

Contributor Author

/test e2e-aks

@csrwng csrwng added the approved label Jun 30, 2026
@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: cristianoveiga

@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Contributor

@cristianoveiga: all tests passed!

@openshift-merge-bot openshift-merge-bot Bot merged commit 4d087b3 into openshift:main Jun 30, 2026
41 checks passed

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

4 participants