
NO-JIRA: Fall back to AWS SDK default credential chain when no explicit credentials provided #8889

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from ironcladlou:aws-auth-refactor
Jul 3, 2026

Conversation

@ironcladlou

@ironcladlou ironcladlou commented Jul 1, 2026

Contributor

The CLI previously required either --aws-creds or --sts-creds with
--role-arn for all AWS operations. This prevented using standard SDK
authentication mechanisms like AWS_PROFILE, environment variables,
or SAML/SSO sessions. Make all credential flags optional and fall
back to the SDK default chain, with optional --role-arn for role
assumption.

The product CLI behavior remains unchanged (--sts-creds and --role-arn
are still required).
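
The fallback order described above can be sketched roughly as follows. This is a minimal model, not the PR's actual code: the `credentialFlags` and `credentialSource` types and the `resolveSource` function are hypothetical names introduced for illustration, and the sketch assumes the flags have already passed validation.

```go
package main

import "fmt"

// credentialFlags is a hypothetical stand-in for the CLI's credential flags.
type credentialFlags struct {
	AWSCreds string // value of --aws-creds
	STSCreds string // value of --sts-creds
	RoleARN  string // value of --role-arn
}

// credentialSource names where base credentials come from and whether a
// role is assumed on top of them.
type credentialSource struct {
	Base       string
	AssumeRole bool
}

// resolveSource picks a source from the flags, assuming they were already
// validated. Explicit files win; otherwise the SDK default chain is used.
func resolveSource(f credentialFlags) credentialSource {
	switch {
	case f.AWSCreds != "":
		return credentialSource{Base: "aws-creds file"}
	case f.STSCreds != "":
		return credentialSource{Base: "sts-creds file", AssumeRole: true}
	default:
		return credentialSource{Base: "SDK default chain", AssumeRole: f.RoleARN != ""}
	}
}

func main() {
	cases := []credentialFlags{
		{AWSCreds: "/path/to/creds"},
		{STSCreds: "/path/to/sts.json", RoleARN: "arn:aws:iam::123456789012:role/example"},
		{RoleARN: "arn:aws:iam::123456789012:role/example"},
		{},
	}
	for _, f := range cases {
		fmt.Printf("%+v -> %+v\n", f, resolveSource(f))
	}
}
```

In this model the default chain is only reached when neither credentials file is given, which is the case that lets AWS_PROFILE, environment variables, or an SSO session supply the credentials.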

Summary by CodeRabbit

  • New Features
    • Role assumption is now supported via the default cloud credential chain when no explicit credentials are provided.
    • Product credential information is validated during HostedCluster create/destroy pre-flight checks.
  • Bug Fixes
    • Tightened validation for --aws-creds, --sts-creds, and --role-arn to prevent invalid flag combinations and enforce required options.
    • Standardized AWS session creation and updated destroy behavior to treat per-component credential inputs as a complete mode.
  • Tests
    • Updated and added validation tests for new fallback and product-credential rules.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR updates AWS credential handling to allow default AWS SDK credential-chain fallback when explicit credentials are absent, and adds role-assumption support through role-arn. Validation for AWSCredentialsOptions now enforces exclusivity between global credentials and role-arn/sts-creds, and requires role-arn when sts-creds is set. AWS session creation in operator role and destroy flows now uses the revised credential logic, and related tests were updated. The CLI role policy gains iam:ListRolePolicies, and --aws-creds is no longer required.
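
The validation rules in this walkthrough could look something like the sketch below. It is an illustration under stated assumptions: the `awsCredentialsOptions` type, its field names, the method names, and the error messages are all hypothetical and are not taken from the PR's source.

```go
package main

import (
	"errors"
	"fmt"
)

// awsCredentialsOptions is a hypothetical mirror of the options described
// in the walkthrough; the real struct may differ.
type awsCredentialsOptions struct {
	AWSCredentialsFile string // --aws-creds
	STSCredentialsFile string // --sts-creds
	RoleArn            string // --role-arn
}

// validate models the dev CLI rules: --aws-creds excludes the other two
// flags, and --sts-creds needs --role-arn. No flags at all is valid and
// means "use the SDK default chain".
func (o awsCredentialsOptions) validate() error {
	if o.AWSCredentialsFile != "" && (o.STSCredentialsFile != "" || o.RoleArn != "") {
		return errors.New("--aws-creds cannot be combined with --sts-creds or --role-arn")
	}
	if o.STSCredentialsFile != "" && o.RoleArn == "" {
		return errors.New("--role-arn is required when --sts-creds is set")
	}
	return nil
}

// validateProduct models the stricter product CLI rule: both --sts-creds
// and --role-arn must be present, so there is no default-chain fallback.
func (o awsCredentialsOptions) validateProduct() error {
	if o.STSCredentialsFile == "" || o.RoleArn == "" {
		return errors.New("--sts-creds and --role-arn are both required")
	}
	return o.validate()
}

func main() {
	empty := awsCredentialsOptions{}
	fmt.Println("dev CLI, no flags:", empty.validate())
	fmt.Println("product CLI, no flags:", empty.validateProduct())
}
```

The point of keeping two entry points in this sketch is that the same options struct can serve both CLIs while only the dev CLI accepts an empty flag set.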

Sequence Diagram(s)

sequenceDiagram
  participant Command
  participant AWSCredentialsOptions
  participant supportawsutil
  participant AWS SDK

  Command->>AWSCredentialsOptions: Validate() / ValidateProduct()
  AWSCredentialsOptions-->>Command: validation result
  Command->>AWSCredentialsOptions: GetSession()
  AWSCredentialsOptions->>AWS SDK: build default-chain session
  AWSCredentialsOptions->>supportawsutil: AssumeRole(RoleArn)
  supportawsutil-->>AWSCredentialsOptions: assumed credentials
  AWSCredentialsOptions-->>Command: awsSession

Possibly related PRs

  • openshift/hypershift#8883: Updates cmd/infra/aws/create_cli_role.go to add the same iam:ListRolePolicies permission to the CLI role policy.

Suggested reviewers: devguyio, enxebre

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The added/modified tests use only static, deterministic names; no Ginkgo titles or dynamic identifiers, dates, UUIDs, or generated suffixes were found. |
| Test Structure And Quality | ✅ Passed | PASS: The PR adds plain table-driven unit tests, not Ginkgo; they’re isolated, use t.TempDir for cleanup, and contain no cluster waits/timeouts. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | No new pod affinity, node selectors, topology spread, or replica logic; changes are AWS auth/config wiring and controller cleanup, so topology assumptions unchanged. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; the changed test files are unit tests, so IPv4/disconnected-network compatibility isn’t implicated. |
| No-Weak-Crypto | ✅ Passed | PR changes AWS credential/session handling and IAM policy text only; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret/token comparisons found. |
| Container-Privileges | ✅ Passed | The PR only changes Go CLI/auth code; no YAML/JSON or k8s manifest files were modified, and no privilege-related fields were introduced. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No new logs print secrets/tokens/PII; touched changes only add generic validation errors and non-secret role/ARN output. |
| Title check | ✅ Passed | The title clearly summarizes the main change: falling back to the AWS SDK default credential chain when explicit credentials are absent. |

@openshift-ci openshift-ci Bot requested review from bryan-cox and muraee July 1, 2026 15:04
@openshift-ci openshift-ci Bot added the area/cli (Indicates the PR includes changes for CLI) and area/platform/aws (PR/issue for AWS (AWSPlatform) platform) labels and removed the do-not-merge/needs-area label Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/infra/aws/create_cli_role.go (1)

174-174: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Stale "(required)" in flag help text.

--aws-creds is no longer enforced as required (Line 178's MarkFlagRequired call was removed), but the flag description still states "(required)", which will mislead users.

📝 Proposed fix
-	cmd.Flags().StringVar(&opts.AWSCredentialsFile, "aws-creds", opts.AWSCredentialsFile, "Path to an AWS credentials file (required)")
+	cmd.Flags().StringVar(&opts.AWSCredentialsFile, "aws-creds", opts.AWSCredentialsFile, "Path to an AWS credentials file (optional; falls back to the AWS SDK default credential chain if not set)")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/aws/create_cli_role.go` at line 174, Update the help text for the
AWS credentials flag in the create CLI role setup so it no longer says
“(required)”; the requirement was removed, so the description on the StringVar
for opts.AWSCredentialsFile should reflect that the flag is optional. Keep the
behavior in the command setup consistent with the current validation by
adjusting the flag registration in the create CLI role command.
🧹 Nitpick comments (1)
cmd/infra/aws/util/util_test.go (1)

219-223: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Exercise the default credential chain in this test. In cmd/infra/aws/util/util_test.go:219-223, the no-credentials case only checks that GetSession returns a config; set AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY and call cfg.Credentials.Retrieve(ctx) so it proves the fallback returns usable credentials.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/infra/aws/util/util_test.go` around lines 219 - 223, The no-credentials
test case for GetSession only verifies that a config is returned, but it does
not exercise the SDK default credential chain. Update the test case in
util_test.go around GetSession so that after obtaining cfg you set
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the environment and call
cfg.Credentials.Retrieve(ctx), using the GetSession and config credential
symbols to confirm the fallback produces usable credentials.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5040903c-c966-429b-b30f-71879400dd01

📥 Commits

Reviewing files that changed from the base of the PR and between ce9dd2c and 116e7b0.

📒 Files selected for processing (3)
  • cmd/infra/aws/create_cli_role.go
  • cmd/infra/aws/util/util.go
  • cmd/infra/aws/util/util_test.go

@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 57.62712% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.34%. Comparing base (9aeb1f3) to head (5951a54).
⚠️ Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/infra/aws/destroy.go | 38.46% | 8 Missing ⚠️ |
| cmd/infra/aws/create_operator_roles.go | 0.00% | 6 Missing ⚠️ |
| cmd/infra/aws/util/util.go | 87.09% | 3 Missing and 1 partial ⚠️ |
| cmd/cluster/aws/destroy.go | 40.00% | 2 Missing and 1 partial ⚠️ |
| product-cli/cmd/cluster/aws/create.go | 0.00% | 3 Missing ⚠️ |
| product-cli/cmd/cluster/aws/destroy.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8889      +/-   ##
==========================================
+ Coverage   43.28%   43.34%   +0.06%     
==========================================
  Files         771      771              
  Lines       95503    95532      +29     
==========================================
+ Hits        41335    41409      +74     
+ Misses      51284    51239      -45     
  Partials     2884     2884              
| Files with missing lines | Coverage Δ |
|---|---|
| cmd/infra/aws/create_cli_role.go | 0.00% <ø> (ø) |
| product-cli/cmd/cluster/aws/destroy.go | 68.75% <0.00%> (ø) |
| cmd/cluster/aws/destroy.go | 9.09% <40.00%> (-0.99%) ⬇️ |
| product-cli/cmd/cluster/aws/create.go | 0.00% <0.00%> (ø) |
| cmd/infra/aws/util/util.go | 68.06% <87.09%> (+34.73%) ⬆️ |
| cmd/infra/aws/create_operator_roles.go | 55.45% <0.00%> (+1.41%) ⬆️ |
| cmd/infra/aws/destroy.go | 15.56% <38.46%> (+3.60%) ⬆️ |

| Flag | Coverage Δ |
|---|---|
| cmd-support | 36.87% <61.81%> (+0.20%) ⬆️ |
| cpo-hostedcontrolplane | 45.31% <ø> (ø) |
| cpo-other | 45.10% <ø> (ø) |
| hypershift-operator | 53.59% <ø> (ø) |
| other | 31.68% <0.00%> (-0.01%) ⬇️ |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/cluster/aws/destroy_test.go (1)

17-17: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the required “When ... it should ...” test-case format.

As per coding guidelines, **/*_test.go: Always use "When ... it should ..." format for describing test cases when creating unit tests.

Proposed fix
-		"when CredentialSecretName is blank and aws-creds is also blank it should fall back to SDK default chain": {
+		"When CredentialSecretName is blank and aws-creds is also blank, it should fall back to SDK default chain": {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/cluster/aws/destroy_test.go` at line 17, The test case name in
destroy_test.go does not follow the required “When ... it should ...” format.
Update the descriptive key used in the test table near the aws destroy tests so
it starts with “When” and includes “it should”, matching the existing unit test
naming convention for *_test.go files.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/infra/aws/destroy.go`:
- Around line 199-205: The destroy flow leaves vpcOwnerEC2Client nil whenever
useDelegatedCredentials is true, but later VPC cleanup helpers still receive it.
Update the initialization logic in destroy.go so the delegated branch also
creates the VPC-owner EC2 client, mirroring the non-delegated default, and
verify the value passed from the destroy path into the VPC/IGW/EIP/DHCP cleanup
calls is never nil for non-shared-VPC destroys.
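
One way to read this finding is as a defaulting rule for the VPC-owner client. The sketch below is only a model under assumptions: `ec2API`, `fakeEC2`, and `pickVPCOwnerClient` are hypothetical names, and the real fix may construct a separate client in the delegated branch rather than reuse an existing one.

```go
package main

import "fmt"

// ec2API is a hypothetical, minimal stand-in for an EC2 client interface.
type ec2API interface {
	Name() string
}

// fakeEC2 is a trivial implementation used only for this illustration.
type fakeEC2 struct{ name string }

func (f fakeEC2) Name() string { return f.name }

// pickVPCOwnerClient returns the client to hand to VPC-level cleanup.
// A dedicated VPC-owner client wins when one was built (shared-VPC case);
// otherwise the primary client is reused, so the result is non-nil
// whenever the primary client is non-nil.
func pickVPCOwnerClient(primary, vpcOwner ec2API) ec2API {
	if vpcOwner != nil {
		return vpcOwner
	}
	return primary
}

func main() {
	primary := fakeEC2{name: "primary"}
	fmt.Println(pickVPCOwnerClient(primary, nil).Name())
	fmt.Println(pickVPCOwnerClient(primary, fakeEC2{name: "vpc-owner"}).Name())
}
```

Applying the rule once, at initialization, would keep the nil check out of every cleanup helper.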


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 92e70707-327c-4408-aaf0-8ea62d65c133

📥 Commits

Reviewing files that changed from the base of the PR and between 116e7b0 and 0e0d7fc.

📒 Files selected for processing (6)
  • cmd/cluster/aws/destroy_test.go
  • cmd/infra/aws/create_cli_role.go
  • cmd/infra/aws/create_operator_roles.go
  • cmd/infra/aws/destroy.go
  • cmd/infra/aws/util/util.go
  • cmd/infra/aws/util/util_test.go
💤 Files with no reviewable changes (1)
  • cmd/infra/aws/create_cli_role.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/infra/aws/util/util_test.go
  • cmd/infra/aws/util/util.go

Comment thread cmd/infra/aws/destroy.go
@ironcladlou ironcladlou force-pushed the aws-auth-refactor branch 2 times, most recently from 0a32eab to ff40a2f Compare July 1, 2026 16:09

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
cmd/cluster/aws/destroy.go (1)

163-167: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Doc comment slightly overstates the enforced guarantee.

The comment states ValidateProductCredentialInfo "requires explicit --sts-creds and --role-arn flags rather than allowing SDK default chain fallback," but that stricter check (opts.ValidateProduct, requiring both sts-creds and role-arn) is only exercised when credentialSecretName is empty (line 171). When a secret is supplied, the shared helper (lines 177-181) only requires role-arn, not sts-creds, for both ValidateCredentialInfo and ValidateProductCredentialInfo. Functionally this is likely fine (the secret path never uses SDK default-chain fallback in GetSession), but the doc comment could mislead a future reader into assuming sts-creds is always required for product callers.

Consider clarifying the comment to note the distinction only applies to the no-secret path.
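
The distinction between the two paths can be modeled as below. This is a sketch under assumptions, not the code under review: `flagSet`, `productCheck`, and `checkCredentialInfo` are hypothetical names, and the error strings are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// flagSet is a hypothetical subset of the credential flags.
type flagSet struct {
	STSCreds string // --sts-creds
	RoleARN  string // --role-arn
}

// productCheck models the stricter product rule: both flags are required.
func productCheck(f flagSet) error {
	if f.STSCreds == "" || f.RoleARN == "" {
		return errors.New("--sts-creds and --role-arn are required")
	}
	return nil
}

// checkCredentialInfo models the shared helper: with a credential secret,
// only --role-arn is required; without one, the caller's flag check applies.
func checkCredentialInfo(secretName string, f flagSet, flagCheck func(flagSet) error) error {
	if secretName != "" {
		if f.RoleARN == "" {
			return errors.New("--role-arn is required when a credential secret is used")
		}
		return nil
	}
	return flagCheck(f)
}

func main() {
	roleOnly := flagSet{RoleARN: "arn:aws:iam::123456789012:role/example"}
	// Secret path: the product check is never consulted.
	fmt.Println(checkCredentialInfo("my-secret", roleOnly, productCheck))
	// No-secret path: the product check rejects the missing --sts-creds.
	fmt.Println(checkCredentialInfo("", roleOnly, productCheck))
}
```

With the same flags, the outcome differs only by whether a secret name is present, which is the nuance the doc comment would need to state.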

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/cluster/aws/destroy.go` around lines 163 - 167, The doc comment on
ValidateProductCredentialInfo overstates its behavior by implying --sts-creds
and --role-arn are always required. Update the comment to match the actual logic
in ValidateProductCredentialInfo and validateCredentialInfo: the stricter
ValidateProduct check only applies when credentialSecretName is empty, while the
secret-backed path only requires role-arn. Keep the wording precise so future
readers understand the distinction without changing the function behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: cdfd9336-54c7-4c2f-9966-612304cf8ef1

📥 Commits

Reviewing files that changed from the base of the PR and between 0e0d7fc and 0a32eab.

📒 Files selected for processing (9)
  • cmd/cluster/aws/destroy.go
  • cmd/cluster/aws/destroy_test.go
  • cmd/infra/aws/create_cli_role.go
  • cmd/infra/aws/create_operator_roles.go
  • cmd/infra/aws/destroy.go
  • cmd/infra/aws/util/util.go
  • cmd/infra/aws/util/util_test.go
  • product-cli/cmd/cluster/aws/create.go
  • product-cli/cmd/cluster/aws/destroy.go
💤 Files with no reviewable changes (1)
  • cmd/infra/aws/create_cli_role.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • cmd/cluster/aws/destroy_test.go
  • cmd/infra/aws/create_operator_roles.go
  • cmd/infra/aws/destroy.go
  • cmd/infra/aws/util/util.go

@ironcladlou ironcladlou force-pushed the aws-auth-refactor branch 3 times, most recently from 27e5a81 to 40d1f98 Compare July 2, 2026 18:15
…cit credentials provided

The CLI previously required either --aws-creds or --sts-creds with
--role-arn for all AWS operations. This prevented using standard SDK
authentication mechanisms like AWS_PROFILE, environment variables,
or SAML/SSO sessions. Make all credential flags optional and fall
back to the SDK default chain, with optional --role-arn for role
assumption.

The product CLI behavior remains unchanged (--sts-creds and --role-arn
are still required).
@ironcladlou ironcladlou changed the title fix(cli): Fall back to AWS SDK default credential chain when no explicit credentials provided NO-JIRA: Fall back to AWS SDK default credential chain when no explicit credentials provided Jul 2, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Jul 2, 2026
@openshift-ci-robot


@ironcladlou: This pull request explicitly references no jira issue.


@csrwng csrwng left a comment

Contributor

Clean refactor that adds SDK default credential chain fallback for the dev CLI while preserving strict validation for the product CLI. Well-tested across all validation paths. LGTM.

@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) Jul 2, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, ironcladlou


@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jul 2, 2026
@ironcladlou

Contributor Author

/verified by test coverage

@openshift-ci-robot openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) Jul 2, 2026
@openshift-ci-robot


@ironcladlou: This PR has been marked as verified by test coverage.


@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2072758148669640704 | Cost: $4.146567 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented Jul 2, 2026


Test Results

e2e-aks

e2e-aws

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD 9aeb1f3 and 2 for PR HEAD 5951a54 in total

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2072808664980262912 | Cost: $4.257782 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci


ci/prow/e2e-aks

Job: pull-ci-openshift-hypershift-main-e2e-aks/2072808664980262912
Build ID: 2072808664980262912
Result: failure
Failed Step: hypershift-azure-run-e2e (test phase, 5357s)
Failed Tests: 5 of 361 (45 skipped)

Root Cause

AKS management cluster resource exhaustion — not related to PR changes.

The e2e-aks job runs ~12 HyperShift HostedClusters in parallel on a 16-node AKS management cluster. Each HostedCluster deploys ~55 control plane pods. The combined pod and memory demand exceeded the AKS cluster's capacity, causing the Kubernetes scheduler to reject new pods.

Two HostedClusters were affected:

  1. ha-break-glass-creds-crzkl (TestCreateClusterHABreakGlassCredentials): Catastrophic failure — 12 control plane deployments had unavailable replicas. The cluster-version-operator pod was stuck in Pending phase with FailedScheduling: 0/13 nodes are available: 10 Too many pods, 3 Insufficient memory. Because CVO never started, no ClusterOperator CRDs were installed, and all operator conditions reported Unknown. The catalog-operator, cluster-image-registry-operator, cluster-network-operator, hosted-cluster-config-operator, and others were also Pending.

  2. node-pool-7cfjm (TestNodePool/HostedCluster2): Partial failure: the kube-controller-manager pod was stuck in Pending with a similar scheduling error: 0/16 nodes are available: 1 node(s) had untolerated taint(s), 12 Too many pods, 3 Insufficient memory.

This failure is unrelated to PR #8889. All PR changes are in cmd/cluster/aws/, cmd/infra/aws/, and product-cli/cmd/cluster/aws/ — pure AWS credential handling code. The e2e-aks test runs on Azure/AKS and does not execute any of the changed AWS code paths. This is an infrastructure capacity issue on the shared AKS test cluster.

Recommendations
  1. Retry the job — this is a transient infrastructure resource exhaustion issue on the AKS management cluster, not a code regression from this PR
  2. No code changes needed — the PR modifies only AWS credential handling (cmd/cluster/aws/, cmd/infra/aws/, product-cli/cmd/cluster/aws/); the Azure/AKS e2e test does not exercise any of these paths
  3. If retries continue to fail, the AKS test cluster may need capacity increases (more nodes or larger node SKUs) to handle the parallel HostedCluster workload (~12 HCPs × ~55 pods each ≈ 660+ pods)
Evidence

Scheduling Failures (root cause):

  • CVO pod (cluster-version-operator-5c47cb66bc-nbwvw) in namespace e2e-clusters-7xhwx-ha-break-glass-creds-crzkl: phase: Pending, reason: FailedScheduling: 0/13 nodes are available: 10 Too many pods, 3 Insufficient memory. no new claims to deallocate, preemption: not eligible due to preemptionPolicy=Never.
  • KCM pod (kube-controller-manager-cc4655f87-l7pbk) in namespace e2e-clusters-xl27b-node-pool-7cfjm: phase: Pending, reason: FailedScheduling: 0/16 nodes are available: 1 node(s) had untolerated taint(s), 12 Too many pods, 3 Insufficient memory.

BreakGlass cluster condition dump (12 unavailable deployments):

UnavailableReplicas([catalog-operator deployment has 1 unavailable replicas,
  cluster-image-registry-operator deployment has 1 unavailable replicas,
  cluster-network-operator deployment has 1 unavailable replicas,
  cluster-storage-operator deployment has 1 unavailable replicas,
  cluster-version-operator deployment has 1 unavailable replicas,
  csi-snapshot-controller-operator deployment has 1 unavailable replicas,
  dns-operator deployment has 1 unavailable replicas,
  hosted-cluster-config-operator deployment has 1 unavailable replicas,
  ingress-operator deployment has 1 unavailable replicas,
  olm-operator deployment has 1 unavailable replicas,
  openshift-controller-manager deployment has 2 unavailable replicas,
  packageserver deployment has 3 unavailable replicas])

Verified Pending pods in BreakGlass namespace:

  • cluster-version-operator-5c47cb66bc-nbwvw — Pending
  • catalog-operator-cb54b999b-rh4xf — Pending
  • cluster-image-registry-operator-6c448bb6-dxvf8 — Pending
  • cluster-network-operator-7c5cf4c765-m9rmr — Pending
  • hosted-cluster-config-operator-7b94ff69dc-hxsbt — Pending

API server unreachable (consequence of Pending pods):

  • breakglass-dump.log: All pod log requests failed with "the server rejected our request for an unknown reason"
  • oc adm inspect failed: "the server doesn't have a resource type 'clusteroperator'" (CVO never installed CRDs)
  • TLS handshake timeouts connecting to hosted cluster API servers

Test parallelism: 12 test directories under artifacts (TestAutoscaling, TestAzureScheduler, TestCreateCluster, TestCreateClusterCustomConfig, TestCreateClusterHABreakGlassCredentials, TestHAEtcdChaos, TestNodePool_HostedCluster0, TestNodePool_HostedCluster2, TestPullSecretUnavailable, TestUpgradeControlPlane, etc.) — each spawning a full HostedCluster with ~55 control plane pods.

PR scope confirmation: Files changed — cmd/cluster/aws/destroy.go, cmd/infra/aws/create_cli_role.go, cmd/infra/aws/create_operator_roles.go, cmd/infra/aws/destroy.go, cmd/infra/aws/util/util.go, product-cli/cmd/cluster/aws/create.go, product-cli/cmd/cluster/aws/destroy.go + test files. Zero Azure/AKS code touched.


@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD 488ef0e and 1 for PR HEAD 5951a54 in total

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Contributor

@ironcladlou: all tests passed!


@openshift-merge-bot openshift-merge-bot Bot merged commit f46050b into openshift:main Jul 3, 2026
39 checks passed

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cli: Indicates the PR includes changes for CLI
  • area/platform/aws: PR/issue for AWS (AWSPlatform) platform
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria


4 participants