CNTRLPLANE-2539: Move generation of the CAPI Provider Role#8305

Open
cwilkers wants to merge 1 commit into
openshift:mainfrom
cwilkers:CNTRLPLANE-2539

Conversation

@cwilkers cwilkers commented Apr 22, 2026

CNTRLPLANE-2539: Move generation of CAPI Provider Role from cli to operator

What this PR does / why we need it:

Supports work in ZTP deployment of HCP clusters by taking Role generation (which is a security risk in gitops workflows) out of the CLI or manual steps and into the operator. Now the operator will handle the role creation, and gitops based deployments do not require relaxing security restrictions to create Roles.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-2539

Special notes for your reviewer:

Checklist:

  • [ + ] Subject and description added to both, commit and PR.
  • [ + ] Relevant issues have been referenced.
  • [ ? ] This change includes docs.
  • [ + ] This change includes unit tests.

AI Assistance

  • Model Used: Claude with claude-sonnet-4@20250514 in CLI and in VS Code IDE integration
  • Scope: Planning, Test writing, and Code
  • Level: Code generation and Code assistance
  • Human Review: Quick code review by @cwilkers, manual testing of multiple use cases on local cluster

Summary by CodeRabbit

  • Refactor

    • RBAC role emission removed from the legacy creation path; agent credential lifecycle now centralizes role management and only deletes a shared role when no remaining bindings reference it.
  • Tests

    • Added and extended tests to verify shared role creation, idempotent reconciliation, preservation across multiple hosted clusters, and deletion once unreferenced.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 22, 2026

openshift-ci-robot commented Apr 22, 2026

@cwilkers: This pull request references CNTRLPLANE-2539 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

The PR removes RBAC Role emission from cmd/cluster/agent/create.go and moves Role management into the agent controller. The agent controller now creates/updates an rbacv1.Role named capi-provider-role granting * on agent-install.openshift.io/agents, creates/updates a RoleBinding referencing that Role, and on credential deletion deletes the Role only when no remaining RoleBindings in the agent namespace reference it. Tests were added/updated to cover creation, idempotency, and shared-role deletion semantics.
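As a rough illustration of what that Role grants, here is a minimal sketch that uses simplified stand-in types rather than the real `rbacv1` structs. The `allows` helper and the `agents-ns` namespace are hypothetical and exist only for this example; the matching logic is a simplified model of RBAC wildcard semantics, not the operator's code.

```go
package main

import "fmt"

// PolicyRule is a simplified stand-in for rbacv1.PolicyRule (illustrative only).
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// Role is a simplified stand-in for rbacv1.Role.
type Role struct {
	Name      string
	Namespace string
	Rules     []PolicyRule
}

// capiProviderRole builds the Role described in the walkthrough:
// all verbs on agents in the agent-install.openshift.io API group.
func capiProviderRole(agentNamespace string) Role {
	return Role{
		Name:      "capi-provider-role",
		Namespace: agentNamespace,
		Rules: []PolicyRule{{
			APIGroups: []string{"agent-install.openshift.io"},
			Resources: []string{"agents"},
			Verbs:     []string{"*"},
		}},
	}
}

// contains reports whether list holds want or the "*" wildcard.
func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want || v == "*" {
			return true
		}
	}
	return false
}

// allows is a simplified model of RBAC rule matching (hypothetical helper).
func allows(r Role, group, resource, verb string) bool {
	for _, rule := range r.Rules {
		if contains(rule.APIGroups, group) &&
			contains(rule.Resources, resource) &&
			contains(rule.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	role := capiProviderRole("agents-ns")
	fmt.Println(allows(role, "agent-install.openshift.io", "agents", "patch"))
	fmt.Println(allows(role, "agent-install.openshift.io", "infraenvs", "get"))
}
```

The first check passes because the wildcard verb covers `patch`; the second fails because the rule names only the `agents` resource.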

Sequence Diagram(s)

sequenceDiagram
    participant HostedCluster as HostedCluster
    participant AgentController as Agent Controller
    participant K8sAPI as Kubernetes API
    participant Role as Role (capi-provider-role)
    participant RoleBinding as RoleBinding

    HostedCluster->>AgentController: ReconcileCredentials(request)
    AgentController->>K8sAPI: Get HostedCluster & namespace info
    K8sAPI-->>AgentController: HostedCluster data
    AgentController->>K8sAPI: Create/Update Role (agents on agent-install.openshift.io, verbs: *)
    K8sAPI-->>Role: Role created/updated
    AgentController->>K8sAPI: Create/Update RoleBinding -> RoleRef: capi-provider-role
    K8sAPI-->>RoleBinding: RoleBinding created/updated

    alt Delete credentials flow
        AgentController->>K8sAPI: Delete RoleBinding(s)
        K8sAPI-->>AgentController: RoleBinding deletion results
        AgentController->>K8sAPI: List RoleBindings in agent namespace
        K8sAPI-->>AgentController: RoleBindings list
        alt No remaining RoleBindings referencing capi-provider-role
            AgentController->>K8sAPI: Delete Role (capi-provider-role)
            K8sAPI-->>Role: Role deleted
        else Remaining references exist
            Note right of AgentController: Preserve shared Role
        end
    end
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Tests lack descriptive assertion messages (93% missing), making failures difficult to diagnose. | Add descriptive failure messages to all assertions throughout the four test functions for improved maintainability and debugging. |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: moving CAPI Provider Role generation from CLI to operator, which is reflected in all three modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Tests use standard Go testing conventions with testing.T, not Ginkgo framework, so Ginkgo naming guidelines do not apply. |
| Microshift Test Compatibility | ✅ Passed | The custom check applies only to new Ginkgo e2e tests using Ginkgo patterns. The tests added are standard Go unit tests following the func TestXxx(t *testing.T) convention, not Ginkgo e2e tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Tests added are standard Go unit tests with fake clients, not Ginkgo e2e tests. No SNO compatibility check required. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR changes move RBAC Role generation from CLI to operator; this is authorization logic only, not scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | The pull request contains no stdout writes in process-level code. The three changed files are a CLI platform package, an operator reconciler, and standard Go unit tests with no fmt.Print*, log.Print*, or klog calls. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Custom check for IPv6 and disconnected network test compatibility applies only to Ginkgo e2e tests. The tests in this PR are standard Go unit tests using testing.T framework with a fake Kubernetes client, have no Ginkgo imports, and do not run in IPv6-only disconnected CI environments, making this check not applicable. |

openshift-ci Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cwilkers
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Apr 22, 2026
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 234-236: The current DeleteIfNeeded call in agent.go removes the
shared Role named by CAPIProviderRoleName in the agent namespace which can break
other HostedClusters; instead, change the cleanup to either (A) skip deleting
the Role entirely, (B) make the Role name unique per controlPlaneNamespace, or
(C) before calling hyperutil.DeleteIfNeeded for the Role (the call that
constructs &rbacv1.Role{... Name: CAPIProviderRoleName, Namespace:
hc.Spec.Platform.Agent.AgentNamespace}), list RoleBindings in
hc.Spec.Platform.Agent.AgentNamespace and only delete the Role if no
RoleBinding.RoleRef (or Subjects) references CAPIProviderRoleName; implement the
RoleBinding check using the client to List rbacv1.RoleBinding objects and
inspect RoleRef.Name/Kind and Subjects to decide safety of deletion.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f07eb43-14f4-4111-b02b-4c2718081b20

📥 Commits

Reviewing files that changed from the base of the PR and between f2fd2ca and 051ef9f.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

Comment thread hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go Outdated

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 78.33333% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.41%. Comparing base (36dfb1b) to head (5b096e7).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ers/hostedcluster/internal/platform/agent/agent.go | 83.33% | 6 Missing and 3 partials ⚠️ |
| ...trollers/hostedcluster/hostedcluster_controller.go | 20.00% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8305   +/-   ##
=======================================
  Coverage   40.40%   40.41%           
=======================================
  Files         755      755           
  Lines       93235    93268   +33     
=======================================
+ Hits        37675    37697   +22     
- Misses      52858    52866    +8     
- Partials     2702     2705    +3     

| Files with missing lines | Coverage Δ |
| --- | --- |
| cmd/cluster/agent/create.go | 49.38% <100.00%> (-9.21%) ⬇️ |
| ...trollers/hostedcluster/hostedcluster_controller.go | 43.67% <20.00%> (-0.02%) ⬇️ |
| ...ers/hostedcluster/internal/platform/agent/agent.go | 48.51% <83.33%> (+10.45%) ⬆️ |

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.40% <100.00%> (-0.04%) ⬇️ |
| cpo-hostedcontrolplane | 41.76% <ø> (ø) |
| cpo-other | 40.31% <ø> (ø) |
| hypershift-operator | 50.78% <77.96%> (+0.05%) ⬆️ |
| other | 31.54% <ø> (ø) |


@cwilkers cwilkers force-pushed the CNTRLPLANE-2539 branch 2 times, most recently from 668e301 to aa074db Compare April 22, 2026 13:11
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 231-245: The Role may be left orphaned because the cached List can
still return the just-deleted RoleBinding; update the loop that scans
roleBindings.Items (after DeleteIfNeeded) to ignore entries that are being
deleted or are the exact binding we just removed: skip items with non-nil
DeletionTimestamp and skip items whose ObjectMeta.Name equals
fmt.Sprintf("%s-%s", CredentialsRBACPrefix, controlPlaneNamespace); also tighten
the match to require roleBindings.Items[i].RoleRef.APIGroup ==
"rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and
RoleRef.Name == CAPIProviderRoleName so you only return early for real, active
RoleBindings. Ensure these checks are applied where roleBindings is iterated
before deciding not to delete the Role.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ca68a83e-fe57-41cf-983f-f50ed82118a8

📥 Commits

Reviewing files that changed from the base of the PR and between 051ef9f and 668e301.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

Comment thread hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go Outdated
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go (1)

234-245: ⚠️ Potential issue | 🟡 Minor

Skip stale/deleting RoleBindings before preserving the shared Role.

Line 239 can still match the RoleBinding just deleted on Line 231 when the cached client has not observed the deletion yet, causing the final cleanup to return early and leave capi-provider-role orphaned. Also include RoleRef.APIGroup so only active RBAC RoleRefs preserve the Role.

Suggested tightening of the cleanup guard
 	for i := range roleBindings.Items {
-		if roleBindings.Items[i].RoleRef.Kind == "Role" && roleBindings.Items[i].RoleRef.Name == CAPIProviderRoleName {
+		roleBinding := &roleBindings.Items[i]
+		if roleBinding.Name == fmt.Sprintf("%s-%s", CredentialsRBACPrefix, controlPlaneNamespace) || roleBinding.DeletionTimestamp != nil {
+			continue
+		}
+		if roleBinding.RoleRef.APIGroup == "rbac.authorization.k8s.io" &&
+			roleBinding.RoleRef.Kind == "Role" &&
+			roleBinding.RoleRef.Name == CAPIProviderRoleName {
 			return nil
 		}
 	}

Run this read-only check to confirm whether this path is using a cached manager client in production:

#!/bin/bash
# Description: Inspect controller wiring to see whether DeleteCredentials receives a cached controller-runtime client.
# Expect: If mgr.GetClient() or reconciler Client is passed through, cached List behavior should be assumed.

rg -n -C4 'DeleteCredentials\s*\(|ReconcileCredentials\s*\(|mgr\.GetClient\(\)|GetAPIReader\(\)|client\.New\(' .
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`
around lines 234 - 245, The loop that preserves the shared Role can erroneously
match RoleBindings that are being deleted or from other API groups; update the
check over roleBindings.Items in agent.go to skip items with a non-nil
metadata.DeletionTimestamp and require RoleRef.APIGroup ==
"rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and
RoleRef.Name == CAPIProviderRoleName (so the guard uses RoleBindingList /
roleBindings.Items[i].ObjectMeta.DeletionTimestamp and RoleRef.APIGroup checks);
leave the hyperutil.DeleteIfNeeded call for the Role (constructed with
CAPIProviderRoleName and hc.Spec.Platform.Agent.AgentNamespace) unchanged so the
role is only preserved by active, same-APIGroup RoleBindings.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 63d4fe37-2e1d-4837-be44-6aaf3ff1d451

📥 Commits

Reviewing files that changed from the base of the PR and between 668e301 and aa074db.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go (1)

228-254: Deletion guard looks good; consider logging the skip for observability.

The updated DeleteCredentials correctly addresses prior concerns: it skips the just-deleted RoleBinding by name, ignores bindings with a non-nil DeletionTimestamp, and tightens the match with RoleRef.APIGroup == "rbac.authorization.k8s.io" before short-circuiting. This correctly preserves the shared capi-provider-role when another HostedCluster in the same agent namespace still references it.

Two small follow-ups to consider (non-blocking):

  1. When skipping role deletion (Line 248), a debug/info log indicating "role retained — still referenced by RoleBinding X" would help operators diagnose why a Role lingers in the agent namespace after an HC teardown.
  2. The early return nil on Line 248 exits on the first referencing binding, which is correct, but since the iteration order of roleBindings.Items is not deterministic you won't be able to tell which binding held it. Capturing the name in a log as suggested above mitigates this.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`
around lines 228 - 254, Add an observability log before early-returning from
DeleteCredentials to show which RoleBinding prevented role deletion: inside the
loop over roleBindings.Items in DeleteCredentials, when you detect a live
referencing binding (rb.RoleRef.APIGroup == "rbac.authorization.k8s.io" &&
rb.RoleRef.Kind == "Role" && rb.RoleRef.Name == CAPIProviderRoleName) call the
controller logger (e.g., ctrl.LoggerFrom(ctx) or the project logger in context)
to emit an Info/Debug message like "retaining CAPI provider role; referenced by
RoleBinding <rb.Name>" and then return nil; also consider logging when you skip
the just-deleted binding (deletedBindingName) or when skipping due to
rb.DeletionTimestamp != nil for extra observability.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: e1264a9a-e9bb-4549-9298-afdeed2950c2

📥 Commits

Reviewing files that changed from the base of the PR and between aa074db and ceef355.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

@cwilkers cwilkers marked this pull request as ready for review April 22, 2026 15:58
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2026
@openshift-ci openshift-ci Bot requested review from clebs and jparrill April 22, 2026 15:59
Comment thread hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go (1)

73-106: Solid idempotency coverage; optionally assert rule content, not just length.

The test correctly verifies the second ReconcileCredentials call does not produce an error or duplicate rules. As a small hardening, consider also asserting the rule content on the second read (APIGroups/Resources/Verbs) so a regression that replaces the rule with a different-but-still-length-1 rule would be caught.

♻️ Optional strengthening of idempotency assertion
 	g.Expect(err).ToNot(HaveOccurred())
 	g.Expect(role.Rules).To(HaveLen(1))
+	g.Expect(role.Rules[0].APIGroups).To(Equal([]string{"agent-install.openshift.io"}))
+	g.Expect(role.Rules[0].Resources).To(Equal([]string{"agents"}))
+	g.Expect(role.Rules[0].Verbs).To(Equal([]string{"*"}))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go`
around lines 73 - 106, The test
TestReconcileCredentials_WhenCalledMultipleTimes_ItShouldBeIdempotent currently
only asserts role.Rules length; after fetching the Role (variable role from
client.Get using CAPIProviderRoleName) add assertions that the single rule's
fields match the expected APIGroups, Resources and Verbs (e.g. check
role.Rules[0].APIGroups, role.Rules[0].Resources and role.Rules[0].Verbs equal
the canonical slices expected by ReconcileCredentials) so a
replacement-with-different-rule regression is caught; keep these assertions
after the second ReconcileCredentials call to validate idempotent content as
well as count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 228-254: The DeleteCredentials function has a race where a
RoleBinding can be created after the cached List completes but before we
DeleteIfNeeded the Role; to fix, before calling hyperutil.DeleteIfNeeded for the
Role (in DeleteCredentials), perform either a second uncached server-side List
(for example, c.List with client.InNamespace through an uncached reader such as
the manager's GetAPIReader()) to re-check for any live RoleBindings
referencing CAPIProviderRoleName (skip deletedBindingName and DeletionTimestamp
as already done), or add and use a deterministic label on Role/RoleBinding in
ReconcileCredentials and then List using that label selector server-side so only
relevant bindings are considered; if the second check finds any matching
RoleBindings return nil, otherwise proceed to delete via
hyperutil.DeleteIfNeeded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f0b5551-1ad6-45ef-b804-379415b3ac99

📥 Commits

Reviewing files that changed from the base of the PR and between ceef355 and 01425d8.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go


jparrill commented May 4, 2026

Thanks for working on this @cwilkers — I understand the motivation for ZTP/gitops workflows. I have a few concerns about the current approach that I'd like to discuss before we settle on the right solution:

  1. Role reconciliation on every loop iteration

Moving the Role creation into ReconcileCredentials() means every reconcile of every Agent-platform HostedCluster will issue at minimum a GET against the API server for a Role whose content is static and never changes. Before this PR, the Role was created once by the CLI and left alone. The self-healing benefit is real, but the cost of reconciling a static resource on every iteration seems disproportionate, especially at scale with many Agent-platform HCs.

Would it make sense to either (a) check for existence before calling createOrUpdate, or (b) move the Role creation to a one-time setup in the operator startup (e.g., alongside other platform bootstrapping)?
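To make option (a) concrete, here is a minimal sketch against a hypothetical in-memory `fakeAPI` that counts calls. Nothing here is the operator's real client; `fakeAPI`, `ensureRole`, and the `agents-ns` namespace are assumptions made for illustration.

```go
package main

import "fmt"

// fakeAPI is a hypothetical in-memory stand-in for the Kubernetes API
// that counts how many reads and writes are issued.
type fakeAPI struct {
	roles  map[string]bool
	reads  int
	writes int
}

// roleExists models a GET for the Role.
func (a *fakeAPI) roleExists(key string) bool {
	a.reads++
	return a.roles[key]
}

// createRole models a CREATE for the Role.
func (a *fakeAPI) createRole(key string) {
	a.writes++
	a.roles[key] = true
}

// ensureRole writes the Role only when it is absent (option a).
func ensureRole(api *fakeAPI, namespace string) {
	key := namespace + "/capi-provider-role"
	if api.roleExists(key) {
		return
	}
	api.createRole(key)
}

func main() {
	api := &fakeAPI{roles: map[string]bool{}}
	for i := 0; i < 5; i++ { // five reconcile iterations
		ensureRole(api, "agents-ns")
	}
	fmt.Printf("reads=%d writes=%d\n", api.reads, api.writes)
}
```

In this model the write happens once, but each reconcile still performs one read, so only option (b) would remove the per-iteration lookup entirely.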

  2. Race condition in shared agent namespace teardown

When two HostedClusters share the same agentNamespace and are deleted concurrently, the following race can occur:

  1. HC1 and HC2 both enter DeleteCredentials
  2. Both delete their respective RoleBindings
  3. Both list remaining RoleBindings — each sees zero (or only the other's with a DeletionTimestamp)
  4. Both decide to delete the shared capi-provider-role
  5. While HC1's CAPI provider is still running its Cluster finalizer cleanup (which needs access to agents resources), HC2's DeleteCredentials deletes the Role

The double-delete itself is safe (DeleteIfNeeded is idempotent), but the problem is the Role disappearing while a CAPI provider still needs it to release agents during teardown. The deletion order in delete() (CAPI Cluster deleted → wait for completion → DeleteCredentials) protects the single-HC case, but not the concurrent multi-HC case.
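The decision made in steps 1 to 4 above can be replayed deterministically. The following is a simplified model, not the operator's code: `roleBinding`, `roleStillReferenced`, and `simulateConcurrentTeardown` are hypothetical names, and `Deleting` stands in for a non-nil DeletionTimestamp.

```go
package main

// roleBinding is a hypothetical, simplified stand-in for rbacv1.RoleBinding.
type roleBinding struct {
	Name     string
	RoleName string
	Deleting bool // models a non-nil DeletionTimestamp
}

// roleStillReferenced reports whether any binding that is not being
// deleted still points at roleName.
func roleStillReferenced(bindings []roleBinding, roleName string) bool {
	for _, b := range bindings {
		if b.RoleName == roleName && !b.Deleting {
			return true
		}
	}
	return false
}

// simulateConcurrentTeardown replays the interleaving in which every
// cluster deletes its own binding before any cluster lists. It returns
// the clusters that conclude the shared role is unreferenced.
func simulateConcurrentTeardown(clusters []string, roleName string) []string {
	bindings := make([]roleBinding, 0, len(clusters))
	for _, c := range clusters {
		bindings = append(bindings, roleBinding{Name: c, RoleName: roleName})
	}
	// Step 2: each cluster marks its own binding for deletion.
	for i := range bindings {
		bindings[i].Deleting = true
	}
	// Steps 3 and 4: each cluster lists and decides independently.
	var deleters []string
	for _, c := range clusters {
		if !roleStillReferenced(bindings, roleName) {
			deleters = append(deleters, c)
		}
	}
	return deleters
}
```

With two clusters, both end up in the returned slice, which is the situation in which one cluster can remove the Role while the other's CAPI provider is still cleaning up.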

  3. Some minor items:
  • The List in DeleteCredentials fetches all RoleBindings in the agent namespace (a user-controlled namespace that could have many unrelated bindings). Consider adding a HyperShift label to managed RoleBindings during reconciliation and filtering by that label during deletion.
  • The Role grants Verbs: []string{"*"} on agents — I know this is unchanged from the CLI version, but since we're moving it to the operator this would be a good opportunity to scope it down to the minimum required set (get, list, watch, update, patch).
  • Consider using rbacv1.GroupName instead of the hardcoded "rbac.authorization.k8s.io" string.
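A rough sketch of the three minor items follows. It uses plain structs rather than the real `k8s.io/api/rbac/v1` types so that it stands alone; `managedLabel` is a made-up label key, and `roleBinding`, `roleRef`, `managedBindings`, and `capiRoleRef` are hypothetical names. The `rbacGroup` constant mirrors the value of `rbacv1.GroupName`.

```go
package main

// managedLabel is a hypothetical label key used only for illustration.
const managedLabel = "example.com/hypershift-managed"

// rbacGroup mirrors the value of rbacv1.GroupName from k8s.io/api/rbac/v1.
const rbacGroup = "rbac.authorization.k8s.io"

// agentVerbs is the narrower verb set suggested in the review,
// replacing the wildcard "*".
var agentVerbs = []string{"get", "list", "watch", "update", "patch"}

// roleBinding is a simplified stand-in for rbacv1.RoleBinding.
type roleBinding struct {
	Name   string
	Labels map[string]string
}

// roleRef is a simplified stand-in for rbacv1.RoleRef.
type roleRef struct {
	APIGroup string
	Kind     string
	Name     string
}

// managedBindings keeps only the bindings carrying the managed label,
// so unrelated bindings in a user-controlled namespace are ignored.
func managedBindings(all []roleBinding) []roleBinding {
	var out []roleBinding
	for _, b := range all {
		if _, ok := b.Labels[managedLabel]; ok {
			out = append(out, b)
		}
	}
	return out
}

// capiRoleRef builds a reference to a Role using the shared group
// constant instead of a repeated string literal.
func capiRoleRef(name string) roleRef {
	return roleRef{APIGroup: rbacGroup, Kind: "Role", Name: name}
}
```

In the operator itself the filtering would more likely happen server-side through a label selector on the List call (for example controller-runtime's `client.MatchingLabels`) rather than in memory as shown here.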

I see your point on the ZTP use case, but concerns 1 and 2 would need to be addressed before this can go forward. Let me know your thoughts.

@cwilkers
Author

cwilkers commented May 4, 2026

With some planning help by Claude, I can move the Role creation out of the ReconcileCredentials function where it doesn't make that much sense, and into the reconcileCAPIProvider function where it can be gated to only create if the capi deployment does not exist yet. I'll need to test.

For the deletion race condition, would it be an option to simply not have the operator delete the Role? It's not as clean, but a NS deletion will clear it and it doesn't really grant access to anything we would worry about. The other option suggested by AI would be to add multiple finalizers to the Role, which seems clunky to me.

@cwilkers cwilkers force-pushed the CNTRLPLANE-2539 branch from 01425d8 to c2e85ea Compare May 6, 2026 14:41
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2026
@cwilkers cwilkers force-pushed the CNTRLPLANE-2539 branch from c2e85ea to edbb80e Compare May 8, 2026 11:27
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2026
@clebs
Member

clebs commented May 8, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwilkers
Author

cwilkers commented May 8, 2026

/test e2e-aws

@cwilkers
Author

cwilkers commented May 8, 2026

/test e2e-aws-4-22

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 18, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 18, 2026

New changes are detected. LGTM label has been removed.

@cwilkers
Author

The changes I just pushed fix the deletion race condition by creating a distinct Role for each hosted cluster, named according to its control plane namespace. The Role can now be cleaned up when the HostedCluster is finalized, as normal.

Supports work in ZTP deployment of HCP clusters by taking Role
generation (which is a security risk in gitops workflows) out of the CLI
and into the operator. Now the operator will handle the role creation,
and gitops based deployments need not be unconstrained to be able to
create Roles.

To avoid race conditions between multiple clusters using a single agent
NameSpace, each cluster creates a distinct role named according to its
control plane NameSpace.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: cwilkers <cwilkers@redhat.com>
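One way to picture the per-cluster naming described in the commit message is the sketch below. The exact name format used by the PR is not shown in this thread, so the prefix-plus-namespace scheme and the `capiProviderRoleName` helper are assumptions made purely for illustration.

```go
package main

// capiProviderRolePrefix reuses the shared role name mentioned earlier
// in the review as a prefix (the per-cluster format is an assumption).
const capiProviderRolePrefix = "capi-provider-role"

// capiProviderRoleName derives a role name that is unique per control
// plane namespace, so clusters sharing an agent namespace never
// create or delete each other's Role.
func capiProviderRoleName(controlPlaneNamespace string) string {
	return capiProviderRolePrefix + "-" + controlPlaneNamespace
}
```

Because each HostedCluster owns exactly one Role under this scheme, deleting it during finalization no longer requires listing the other clusters' RoleBindings.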
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 21, 2026

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

codecov/project: 40.11% (-0.30%) compared to 36dfb1b
Project coverage decreased by 0.30% (40.40% → 40.11%), exceeding the default threshold of 0%.
Patch coverage: 76.67% with 14 lines in changed files missing coverage.
Warning: "Current head 9319d0b differs from pull request most recent head 5b096e7"

Summary

The codecov/project check failed because overall project code coverage dropped by 0.30% (from 40.40% on main to 40.11% on the PR branch). The repository's codecov.yml does not configure an explicit coverage threshold, so Codecov applies its default rule: project coverage must not decrease. The failure is compounded by a stale coverage report — Codecov is evaluating coverage data from a previous commit (9319d0b, dated Apr 20) rather than the current PR head (5b096e7), making the comparison unreliable. Additionally, 14 files with indirect coverage changes (not touched by this PR) account for most of the 365-hit coverage loss against only 217 fewer lines, indicating base branch drift is the primary contributor to the coverage drop.

Root Cause

The failure has three contributing factors:

1. Stale coverage report (primary issue):
Codecov explicitly warns: "Current head 9319d0b differs from pull request most recent head 5b096e7. Please upload reports for the commit 5b096e7 to get more accurate results." The PR was force-pushed from 9319d0b (Apr 20) to 5b096e7, but no new coverage upload occurred for the latest commit. Codecov is comparing stale data against the base branch, producing an unreliable -0.30% delta.

2. Uncovered new code in the PR (secondary issue):
The PR adds new production code that is partially untested:

  • hostedcluster_controller.go (0% patch coverage): The 5 new lines added to reconcileCAPIProvider() — the if hcluster.Spec.Platform.Type == hyperv1.AgentPlatform guard and the call to platformagent.ReconcileCAPIProviderRole() — are not exercised by any test. This function takes a complex controlplanecomponent.ControlPlaneContext and is tested only through integration-level tests, not unit tests.

  • agent/agent.go (83.33% patch coverage): 6 lines missing + 3 partial lines in the new ReconcileCAPIProviderRole() function and updated DeleteCredentials(). While most of the new RBAC logic is tested (the PR adds 179 lines of test code in agent_test.go), some error-handling branches and label initialization paths are uncovered.

3. Indirect coverage changes (major contributor to the -0.30% delta):
14 files not touched by this PR show indirect coverage changes. The stats tell the story: 365 fewer "hits" against only 217 fewer lines (and 2 fewer files: 755→753). This means coverage was lost from code that existed before the PR — likely due to base branch changes that removed or refactored tested code, or test infrastructure changes affecting coverage collection across multiple flags (cmd-support: -0.20%, cpo-hostedcontrolplane: -1.19%, cpo-other: -0.18%).

Recommendations
  1. Rebase and re-trigger CI to generate a fresh coverage upload for the current commit 5b096e7. The stale report from 9319d0b is the primary reason the comparison is unreliable. A fresh upload will produce accurate numbers and may resolve the failure if indirect changes normalize.

  2. Add unit test coverage for the new reconcileCAPIProvider call path in hostedcluster_controller.go: The 5 new lines (the AgentPlatform type check and ReconcileCAPIProviderRole call) at line ~2588 have 0% coverage. Add a test case that exercises the Agent platform path with a mock createOrUpdate function.

  3. Consider adding an explicit threshold to codecov.yml: The current config has no coverage: section, so Codecov's strict default (0% decrease) applies. Adding a small tolerance (e.g., threshold: 0.5%) would prevent flaky failures from indirect/base-branch drift:

    coverage:
      status:
        project:
          default:
            threshold: 0.5%
  4. The PR's test additions are solid: The 179 new lines in agent_test.go cover the core ReconcileCAPIProviderRole and DeleteCredentials logic well (83.33% patch coverage for agent.go). The remaining uncovered lines are error-handling branches that are acceptable to leave uncovered.
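To make the effect of the suggested threshold concrete, here is a small sketch of the comparison, assuming the threshold is compared against the drop in percentage points. `projectCheckPasses` is a hypothetical helper and not part of Codecov; the inputs are the rounded figures quoted in this report.

```go
package main

// projectCheckPasses reports whether a coverage drop from base to head
// stays within the allowed threshold. All values are percentages.
func projectCheckPasses(base, head, threshold float64) bool {
	drop := base - head
	return drop <= threshold
}
```

With base 40.40 and head 40.11, the check fails at a threshold of 0 and passes at 0.5, which matches the behavior the recommendation is aiming for.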

Evidence
| Evidence | Detail |
| --- | --- |
| Check Run Conclusion | failure: codecov/project |
| Project Coverage | 40.11% (down from 40.40% on base 36dfb1b) |
| Coverage Delta | -0.30% (exceeds default 0% threshold) |
| Patch Coverage | 76.67% (14 lines missing in changed files) |
| Stale Report Warning | Codecov evaluated 9319d0b but PR head is 5b096e7 |
| hostedcluster_controller.go | 0% patch coverage: 4 missing + 1 partial line (new AgentPlatform guard at L2588) |
| agent/agent.go | 83.33% patch coverage: 6 missing + 3 partial lines (new ReconcileCAPIProviderRole, updated DeleteCredentials) |
| agent/agent_test.go | +179 lines of new tests added (good coverage effort) |
| cmd/cluster/agent/create.go | 100% patch coverage but overall file dropped -9.21% (49.38%), indirect |
| Indirect Changes | 14 files with indirect coverage changes not touched by PR |
| Coverage Hits Lost | 365 fewer hits vs only 217 fewer lines → indirect drift is the dominant factor |
| Files Changed | 755 → 753 (-2 files removed by base branch) |
| Codecov Config | No explicit coverage: thresholds in codecov.yml → strict default applies |
| Flag Breakdown | cpo-hostedcontrolplane: -1.19%, cmd-support: -0.20%, cpo-other: -0.18% |

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 21, 2026

@cwilkers: all tests passed!


Labels

area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
