OCPBUGS-86296: Propagate management cluster proxy env vars to konnectivity sidecar #8569

Open
csrwng wants to merge 1 commit into openshift:main from csrwng:konnectivity-proxy-fix

Conversation

@csrwng
Contributor

@csrwng csrwng commented May 21, 2026

Summary

  • The konnectivity sidecar's dialDirectWithProxy() reads HTTPS_PROXY from the process environment to route cloud API calls through the management cluster's outbound proxy when ConnectDirectlyToCloudAPIs is enabled. The sidecar container spec never set these env vars, so cloud API calls fail on management clusters that require a proxy for outbound access.
  • Add proxy.SetEnvVars() in buildContainer() conditioned on ConnectDirectlyToCloudAPIs — the only code path that needs the management cluster proxy env vars. This follows the same pattern used by other CPO-managed containers (KCM, CCO, CAPI provider, etc.).
  • Add network-level tests for dialDirectWithProxy() with a real TCP echo server and HTTP CONNECT proxy to verify the proxy routing behavior end-to-end.

Test plan

  • Unit tests pass: go test ./support/controlplane-component/... ./support/konnectivityproxy/...
  • Tests pass with race detector: go test -race ./support/controlplane-component/... ./support/konnectivityproxy/...
  • make verify passes
  • Deploy a private HCP cluster on a management cluster with an outbound proxy and verify ingress-operator can reach cloud APIs

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Konnectivity now conditionally propagates proxy environment variables when direct cloud-API connectivity is enabled, with safe defaults for unset or unsupported proxy modes.
  • Tests

    • Added comprehensive unit and integration tests covering HTTPS, SOCKS5, dual-mode proxy behavior, preservation of NO_PROXY entries, presence of kubeconfig, and CONNECT-proxy dialing behavior.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 21, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 21, 2026
@openshift-ci-robot

@csrwng: This pull request references Jira Issue OCPBUGS-86296, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR conditionally propagates process proxy environment variables into the konnectivity container when the component is configured to connect directly to cloud APIs. A new helper determines the setting based on proxy mode (HTTPS vs Socks5). Unit tests cover HTTPS/Socks5/dual modes and NO_PROXY handling; integration tests add a CONNECT-only proxy and a TCP echo server to verify CONNECT usage only when HTTPS_PROXY is set.

Sequence Diagram(s)

sequenceDiagram
  participant TestClient
  participant KonnectivityProxy
  participant ConnectProxy
  participant EchoServer

  TestClient->>KonnectivityProxy: dialDirectWithProxy(target)
  alt HTTPS_PROXY set
    KonnectivityProxy->>ConnectProxy: send HTTP CONNECT host:port
    ConnectProxy->>EchoServer: dial host:port
    ConnectProxy-->>KonnectivityProxy: tunnel established
    KonnectivityProxy->>EchoServer: proxied TCP stream ("hello")
    EchoServer-->>KonnectivityProxy: echo "hello"
  else no proxy env
    KonnectivityProxy->>EchoServer: direct TCP dial host:port
    EchoServer-->>KonnectivityProxy: echo "hello"
  end
  KonnectivityProxy-->>TestClient: return connection with echoed payload

Suggested reviewers

  • enxebre
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The pull request title clearly and specifically describes the main change: propagating management cluster proxy environment variables to the konnectivity sidecar container. It directly reflects the primary objective and is concise.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | PR uses standard Go testing, not Ginkgo. Check targets Ginkgo test names (It, Describe, Context). All test names are static and deterministic with no dynamic values.
Test Structure And Quality | ✅ Passed | Tests use Go standard testing.T (not Ginkgo). Custom check targets Ginkgo code, so it is not applicable here. Tests follow Go practices with proper cleanup, assertions, and single responsibility.
Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added. Tests are standard Go unit tests using testing.T that do not reference any MicroShift-incompatible APIs or features.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. All new tests are standard Go unit tests using testing.T, not Ginkgo-style e2e tests that would require SNO compatibility checks.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies Go source code for proxy env vars only; no deployment manifests or scheduling constraints (affinity, topology spread, nodeSelector) introduced.
Ote Binary Stdout Contract | ✅ Passed | No stdout writes at process level. Production code builds container specs; tests use proper test function patterns with gomega assertions.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. Tests use standard Go testing.T and run as unit/integration tests via go test, not Ginkgo e2e tests. Check does not apply.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 21, 2026
@csrwng csrwng marked this pull request as ready for review May 21, 2026 15:58
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci-robot

@csrwng: This pull request references Jira Issue OCPBUGS-86296, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@openshift-ci openshift-ci Bot requested review from enxebre and sdminonne May 21, 2026 15:59
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@support/controlplane-component/konnectivity-container_test.go`:
- Around line 102-108: Tests are nondeterministic because ambient proxy env vars
leak into subtests; in the table-driven loop inside Test... (the for _, tt :=
range tests { t.Run(tt.name, func(t *testing.T) { ... }) }), before applying
tt.proxyEnvs with t.Setenv, reset the proxy baseline by clearing or setting
empty all common proxy variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY and their
lowercase variants) using t.Setenv, then apply tt.proxyEnvs entries; this
ensures NewGomegaWithT and subsequent assertions run with a clean proxy env for
each subtest.

In `@support/konnectivityproxy/dialer_test.go`:
- Around line 394-398: In the "When HTTPS_PROXY is not set it should connect
directly" subtest, explicitly clear any ambient proxy environment to avoid
flakiness: capture the existing HTTPS_PROXY and https_proxy values with
os.Getenv, call os.Unsetenv for both keys before calling
startTCPEchoServer/startConnectProxy, and defer restoring the original values at
the top of the subtest so the environment is returned to its prior state after
the test; apply these changes in the subtest body that invokes
startTCPEchoServer and startConnectProxy.
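
Both inline comments ask for a clean proxy baseline per subtest. One way to get that is a small helper such as the one below. It is a sketch under assumptions: `resetProxyEnv` and `proxyEnvKeys` are hypothetical names not taken from the PR, and the approach relies on the code under test treating an empty proxy variable the same as an unset one.

```go
package main

import (
	"fmt"
	"os"
)

// proxyEnvKeys lists the proxy variables, in both spellings, that commonly
// influence Go networking code.
var proxyEnvKeys = []string{
	"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
	"http_proxy", "https_proxy", "no_proxy",
}

// resetProxyEnv blanks every proxy variable and then applies overrides.
// Inside a subtest, pass t.Setenv as setenv so the previous values are
// restored automatically when the subtest finishes.
func resetProxyEnv(setenv func(key, value string), overrides map[string]string) {
	for _, key := range proxyEnvKeys {
		setenv(key, "")
	}
	for key, value := range overrides {
		setenv(key, value)
	}
}

func main() {
	// Outside a test, os.Setenv can stand in for t.Setenv (without the restore).
	setenv := func(key, value string) { _ = os.Setenv(key, value) }
	resetProxyEnv(setenv, map[string]string{"HTTPS_PROXY": "http://127.0.0.1:3128"})
	fmt.Println(os.Getenv("HTTPS_PROXY"), os.Getenv("NO_PROXY") == "")
}
```

In a table-driven test the call would look like `resetProxyEnv(t.Setenv, tt.proxyEnvs)` at the top of each `t.Run` body. Note that `t.Setenv` cannot be used in tests that call `t.Parallel`.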

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a8848936-ad62-42e6-a688-5944afbf4b69

📥 Commits

Reviewing files that changed from the base of the PR and between 36dfb1b and a57a0ca.

📒 Files selected for processing (3)
  • support/controlplane-component/konnectivity-container.go
  • support/controlplane-component/konnectivity-container_test.go
  • support/konnectivityproxy/dialer_test.go

Comment thread support/controlplane-component/konnectivity-container_test.go
Comment thread support/konnectivityproxy/dialer_test.go
Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
support/konnectivityproxy/dialer_test.go (1)

364-368: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Also clear ambient HTTP_PROXY/NO_PROXY in the "HTTPS_PROXY is set" subtest.

This subtest sets HTTPS_PROXY explicitly but inherits ambient HTTP_PROXY and especially NO_PROXY from the runner. If CI sets NO_PROXY to include 127.0.0.1 or localhost (common on developer machines and some CI images), dialDirectWithProxy will bypass the test proxy and connectCount will be 0, causing the assertion at line 389 to fail nondeterministically. Mirror the baseline reset already applied to the "not set" subtest.

Suggested fix
 	t.Run("When HTTPS_PROXY is set it should route through the proxy", func(t *testing.T) {
 		echo := startTCPEchoServer(t)
 		var connectCount atomic.Int32
 		proxyLn := startConnectProxy(t, &connectCount)
+		t.Setenv("HTTP_PROXY", "")
+		t.Setenv("NO_PROXY", "")
 		t.Setenv("HTTPS_PROXY", fmt.Sprintf("http://%s", proxyLn.Addr().String()))

As per coding guidelines, "Avoid global state in tests".
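
To make the bypass concrete, here is a deliberately simplified matcher. `bypassesProxy` is a hypothetical helper written for illustration; it is not the logic used by `dialDirectWithProxy()` or by Go's proxy packages, which also handle ports, CIDR ranges, and other cases.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// bypassesProxy reports whether addr (host or host:port) matches an entry in
// a comma-separated NO_PROXY value. Simplified rules: "*" matches everything,
// ".example.com" matches subdomains only, and any other entry matches the
// host itself or any subdomain of it.
func bypassesProxy(addr, noProxy string) bool {
	host := addr
	if h, _, err := net.SplitHostPort(addr); err == nil {
		host = h
	}
	host = strings.ToLower(host)
	for _, entry := range strings.Split(noProxy, ",") {
		entry = strings.ToLower(strings.TrimSpace(entry))
		switch {
		case entry == "":
			continue
		case entry == "*":
			return true
		case strings.HasPrefix(entry, "."):
			if strings.HasSuffix(host, entry) {
				return true
			}
		case host == entry || strings.HasSuffix(host, "."+entry):
			return true
		}
	}
	return false
}

func main() {
	// An ambient NO_PROXY that lists loopback would skip a proxy on 127.0.0.1.
	fmt.Println(bypassesProxy("127.0.0.1:8443", "localhost,127.0.0.1"))
	// With NO_PROXY cleared, the same target is eligible for proxying.
	fmt.Println(bypassesProxy("127.0.0.1:8443", ""))
}
```

Under these rules an inherited `NO_PROXY=localhost,127.0.0.1` is enough to send the echo-server dial around the test proxy, which is why clearing it makes the `connectCount` assertion deterministic.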

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/konnectivityproxy/dialer_test.go` around lines 364 - 368, The "When
HTTPS_PROXY is set" subtest currently only sets HTTPS_PROXY and can inherit
ambient HTTP_PROXY/NO_PROXY causing flaky bypasses; update the subtest to
explicitly clear ambient proxy env by calling t.Setenv for "HTTP_PROXY" and
"NO_PROXY" (set to empty) so dialDirectWithProxy uses the intended proxy (the
proxy started by startConnectProxy and observed via connectCount) when you set
HTTPS_PROXY; place these t.Setenv calls alongside where HTTPS_PROXY is set in
the t.Run block that uses startTCPEchoServer and startConnectProxy.
🧹 Nitpick comments (1)
support/konnectivityproxy/dialer_test.go (1)

281-305: 💤 Low value

Helper-internal error handling could be tightened, but acceptable.

conn.SetDeadline and io.Copy errors are ignored. Given the helpers are test-only and deadlines bound goroutine lifetime, this is fine, but if a SetDeadline ever fails the loop would silently spin on a stale connection. Optionally surface the error via t.Errorf from the accept loop, or simply log via t.Log to aid debugging.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/konnectivityproxy/dialer_test.go` around lines 281 - 305, The helper
startTCPEchoServer currently ignores errors from conn.SetDeadline and io.Copy;
update the inner per-connection goroutine in startTCPEchoServer to check the
result of conn.SetDeadline and io.Copy and surface failures via the test logger
(use t.Logf or t.Errorf) so failures are visible during tests — e.g., if
conn.SetDeadline returns an error log it and close the connection, and after
io.Copy returns, log any non-nil error (except expected EOF) using
t.Logf/t.Errorf to aid debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e6adab7c-71dd-43c5-a960-db9e7fb0bdd5

📥 Commits

Reviewing files that changed from the base of the PR and between a57a0ca and 63fdc72.

📒 Files selected for processing (2)
  • support/controlplane-component/konnectivity-container_test.go
  • support/konnectivityproxy/dialer_test.go

@codecov

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.50%. Comparing base (36dfb1b) to head (c41b56c).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
...t/controlplane-component/konnectivity-container.go | 81.81% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8569      +/-   ##
==========================================
+ Coverage   40.40%   40.50%   +0.09%     
==========================================
  Files         755      755              
  Lines       93235    93246      +11     
==========================================
+ Hits        37675    37768      +93     
+ Misses      52858    52756     -102     
- Partials     2702     2722      +20     
Files with missing lines | Coverage Δ
...t/controlplane-component/konnectivity-container.go | 34.07% <81.81%> (+34.07%) ⬆️

... and 2 files with indirect coverage changes

Flag | Coverage Δ
cmd-support | 34.73% <81.81%> (+0.28%) ⬆️
cpo-hostedcontrolplane | 41.76% <ø> (ø)
cpo-other | 40.31% <ø> (ø)
hypershift-operator | 50.72% <ø> (ø)
other | 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown.

if err != nil {
	t.Fatalf("failed to start proxy server: %v", err)
}
t.Cleanup(func() { ln.Close() })
Member

nit: Could explicitly call srv.Close()

Contributor Author

Done — added t.Cleanup(func() { srv.Close() }) before the srv.Serve goroutine in bfb72e7.

	t.Fatalf("read failed: %v", err)
}
if string(buf) != string(msg) {
	t.Errorf("expected %q, got %q", msg, buf)
Member

nit: should be g.Expect

Contributor Author

Fixed — converted both subtests to use gomega g.Expect in bfb72e7.

	t.Errorf("expected %q, got %q", msg, buf)
}
if connectCount.Load() != 0 {
	t.Errorf("expected proxy to receive 0 CONNECT requests, got %d", connectCount.Load())
Member

nit: should be g.Expect

Contributor Author

Fixed in bfb72e7 along with the other g.Expect conversion above.

	t.Errorf("expected %q, got %q", msg, buf)
}
if connectCount.Load() != 1 {
	t.Errorf("expected proxy to receive 1 CONNECT request, got %d", connectCount.Load())
Member

nit: g.Expect

Contributor Author

Fixed in bfb72e7 — all assertions in both subtests now use g.Expect.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 332-333: The test currently ignores errors from core proxy
operations (calls like target.SetDeadline, io.Copy, and proxy.Serve), which can
hide failures; update the test to capture and assert or fail on these errors
instead of discarding them: check the returned error from target.SetDeadline,
propagate or t.Fatalf/t.Errorf on non-nil errors from io.Copy invocations (e.g.,
the copy between conn and target), and verify the error returned by Serve (or
use require.NoError/expectation helpers) so any tunnel/setup or runtime proxy
failures surface in the test logs. Ensure you modify the uses of
target.SetDeadline, io.Copy, and the Serve call accordingly.
- Line 376: Two occurrences call conn.SetDeadline(...) and ignore its returned
error (e.g., the lines using `_ =
conn.SetDeadline(time.Now().Add(testTimeout))`); change both to check the error
and fail the test immediately (e.g., if err := conn.SetDeadline(...); err != nil
{ t.Fatalf("SetDeadline failed: %v", err) }) so the test fails fast instead of
continuing with a bad deadline—update both instances that call conn.SetDeadline
to perform this error check using the test's fatal/assert helper.
- Around line 299-300: In the echo helper, don't ignore errors from
conn.SetDeadline and io.Copy: capture their returned errors and report them
(e.g., via t.Errorf, since t.Fatalf must only be called from the goroutine
running the test) or return the error so the caller can fail; update the code
around conn.SetDeadline(time.Now().Add(5 * time.Second)) and _, _ =
io.Copy(conn, conn) to check the error values, include contextual messages
(naming the echo helper and the failing operation, conn.SetDeadline or
io.Copy), and ensure the test does not silently succeed when those operations
fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c6ef18ac-69dd-4b5f-94d0-65c74c8359dd

📥 Commits

Reviewing files that changed from the base of the PR and between 63fdc72 and fb80af2.

📒 Files selected for processing (2)
  • support/controlplane-component/konnectivity-container_test.go
  • support/konnectivityproxy/dialer_test.go

Comment thread support/konnectivityproxy/dialer_test.go Outdated
Comment thread support/konnectivityproxy/dialer_test.go Outdated
Comment thread support/konnectivityproxy/dialer_test.go Outdated
@csrwng csrwng force-pushed the konnectivity-proxy-fix branch from 9d3cbc0 to 8b83360 Compare May 21, 2026 16:36
Member

@bryan-cox bryan-cox left a comment

/lgtm

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 375-381: The subtest "When HTTPS_PROXY is set it should route
through the proxy" can inherit ambient NO_PROXY/HTTP_PROXY and bypass the
CONNECT proxy, causing a nondeterministic connectCount; before calling
t.Setenv("HTTPS_PROXY", ...), clear any existing proxy environment by calling
t.Setenv("NO_PROXY", "") and t.Setenv("HTTP_PROXY", "") (the testing package
has no t.Unsetenv; if a truly unset variable is needed, use os.Unsetenv and
restore the prior value in t.Cleanup) so startConnectProxy's connectCount is
exercised deterministically; update the test around
NewGomegaWithT/startConnectProxy/proxyLn to set those envs first.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9b13a432-fb9e-4054-909c-d329580e44a1

📥 Commits

Reviewing files that changed from the base of the PR and between bfb72e7 and 8b83360.

📒 Files selected for processing (3)
  • support/controlplane-component/konnectivity-container.go
  • support/controlplane-component/konnectivity-container_test.go
  • support/konnectivityproxy/dialer_test.go

Comment thread support/konnectivityproxy/dialer_test.go
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

When ConnectDirectlyToCloudAPIs is enabled, the konnectivity sidecar's
dialDirectWithProxy() reads HTTPS_PROXY from the process environment to
route cloud API calls through the management cluster's outbound proxy.
The sidecar container spec was missing these env vars, so the proxy
lookup returned empty and direct connections failed on clusters that
require an outbound proxy.

Conditionally call proxy.SetEnvVars() when building the konnectivity
sidecar container spec, scoped to containers where
ConnectDirectlyToCloudAPIs is true (HTTPS or Socks5 mode).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@csrwng csrwng force-pushed the konnectivity-proxy-fix branch from 8b83360 to c41b56c Compare May 21, 2026 16:54
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 21, 2026
@bryan-cox
Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@cwbotbot

cwbotbot commented May 21, 2026

Test Results

e2e-aws

e2e-aks

@dpateriya
Contributor

/retest-required

@openshift-ci
Contributor

openshift-ci Bot commented May 22, 2026

@csrwng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws | c41b56c | link | true | /test e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

This confirms the race condition analysis. The EnsureCustomAdminKubeconfigStatusIsRemoved test PASSED — confirming the HCP status was cleaned up. But the EnsureCustomAdminKubeconfigIsRemoved test FAILED — the actual Secret resource still existed.


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

TestCreateCluster/Main/EnsureKubeAPIDNSNameCustomCert/EnsureCustomAdminKubeconfigIsRemoved

util.go:2533: Checking CustomAdminKubeconfig are removed
util.go:2536:
    KAS custom kubeconfig secret still exists in HCP namespace
    Expected an error to have occurred.  Got:
        <nil>: nil

Summary

The sole test failure is TestCreateCluster/Main/EnsureKubeAPIDNSNameCustomCert/EnsureCustomAdminKubeconfigIsRemoved, which cascades to fail TestCreateCluster/Main/EnsureKubeAPIDNSNameCustomCert, TestCreateCluster/Main, and TestCreateCluster (4 failures total, all from the same root cause). This is a pre-existing race condition in the test — completely unrelated to the PR's changes. The PR only modifies konnectivity proxy env var propagation (konnectivity-container.go) and adds unit tests; it does not touch any code in the custom kubeconfig secret lifecycle, the KAS component predicate, or the HCP status reconciliation.

Root Cause

The failure is a race condition between two independent reconciliation paths in the control-plane-operator (CPO):

  1. Status path (setKASCustomKubeconfigStatus() in hostedcontrolplane_controller.go:3223): When the test removes hcp.Spec.KubeAPIServerDNSName, the CPO main reconcile loop sets hcp.Status.CustomKubeconfig = nil (line 3242). The EventuallyObject wait at util.go:2515 observes this status change and succeeds after 20s.

  2. Resource path (KAS component generic-adapter.go:47): The KAS component's manifest adapter for custom-admin-kubeconfig.yaml has a predicate enableIfCustomKubeconfig that returns false when KubeAPIServerDNSName is empty. When the predicate is false, the framework calls DeleteIfNeeded to delete the Secret (line 70). This happens in a separate reconcile cycle from the status update.

The test at util.go:2531-2536 immediately checks that the custom-admin-kubeconfig Secret is deleted (expecting Get to return a NotFound error) right after the status EventuallyObject succeeds. However, the status update and the Secret deletion are handled by different reconciliation paths that can complete at different times. In this run, the status was cleared but the KAS component reconciler had not yet run to delete the Secret.
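
The ordering problem can be modeled with a small, deterministic sketch. Everything here is a simplified assumption for illustration: `HCPSpec`, `Store`, and `reconcileSecret` are hypothetical, and only the predicate name `enableIfCustomKubeconfig` and the Secret name are taken from the analysis above. The sketch shows the window in which the status is already cleared but the Secret still exists.

```go
package main

import "fmt"

// HCPSpec is a hypothetical, reduced model of the hosted control plane spec.
type HCPSpec struct {
	KubeAPIServerDNSName string
}

// Store models the Secrets present in the HCP namespace, keyed by name.
type Store map[string]bool

// enableIfCustomKubeconfig mirrors the predicate described above:
// the manifest is wanted only when a custom DNS name is set.
func enableIfCustomKubeconfig(spec HCPSpec) bool {
	return spec.KubeAPIServerDNSName != ""
}

// reconcileSecret applies the manifest when the predicate holds and
// deletes the Secret otherwise.
func reconcileSecret(store Store, name string, spec HCPSpec) {
	if enableIfCustomKubeconfig(spec) {
		store[name] = true
		return
	}
	delete(store, name)
}

func main() {
	const name = "custom-admin-kubeconfig"
	store := Store{name: true}
	spec := HCPSpec{} // the test removed KubeAPIServerDNSName

	// Status path: runs first and clears the status field.
	statusCleared := true
	fmt.Println("status cleared:", statusCleared, "secret exists:", store[name])

	// Resource path: runs in a later cycle and deletes the Secret.
	reconcileSecret(store, name, spec)
	fmt.Println("status cleared:", statusCleared, "secret exists:", store[name])
}
```

A check made between the two steps sees a cleared status and a Secret that is still present, which matches the failure observed in this run.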

Evidence confirming the race:

  • EnsureCustomAdminKubeconfigStatusIsRemoved (checks hcp.Status.CustomKubeconfig == nil) → PASSED
  • EnsureCustomAdminKubeconfigIsRemoved (checks Secret is deleted) → FAILED
  • The log shows: "Successfully waited for the KAS custom kubeconfig secret to be deleted from HCP Namespace in 20s" — this is checking the status field, not the actual Secret

Why this is unrelated to PR #8569:
The PR modifies only support/controlplane-component/konnectivity-container.go (adding proxy.SetEnvVars() conditional on ConnectDirectlyToCloudAPIs), konnectivity-container_test.go, and support/konnectivityproxy/dialer_test.go. None of these files touch the custom-admin-kubeconfig Secret lifecycle, the enableIfCustomKubeconfig predicate, setKASCustomKubeconfigStatus(), or any part of the KAS custom kubeconfig reconciliation. A grep for custom kubeconfig references in the changed files returns 0 matches, which confirms that the change is isolated from this code path.

Recommendations
  1. Retest the PR: /retest or /test e2e-aws should be sufficient, since this is a pre-existing flaky test unrelated to the PR changes.

  2. Fix the test race condition — The EnsureCustomAdminKubeconfigIsRemoved subtest at util.go:2531-2536 performs a point-in-time check immediately after the status-level EventuallyObject succeeds. It should instead use its own EventuallyObject or polling loop to wait for the actual Secret to be deleted, rather than assuming the Secret is deleted as soon as the status is cleared. The correct fix would replace the instant Get+Expect with a polling wait similar to the EventuallyObject pattern used for the status fields.

  3. File a tracking bug for the test flake in TestCreateCluster/Main/EnsureKubeAPIDNSNameCustomCert/EnsureCustomAdminKubeconfigIsRemoved to prevent continued CI disruption.

Evidence
| Evidence | Detail |
| --- | --- |
| Failing test | TestCreateCluster/Main/EnsureKubeAPIDNSNameCustomCert/EnsureCustomAdminKubeconfigIsRemoved |
| Failure assertion | g.Expect(err).To(HaveOccurred()) at util.go:2536; Secret custom-admin-kubeconfig was found (Get returned nil) but test expected NotFound |
| Status test result | EnsureCustomAdminKubeconfigStatusIsRemoved PASSED; HCP status was nil as expected |
| Secret test result | EnsureCustomAdminKubeconfigIsRemoved FAILED; Secret still existed |
| Status wait log | "Successfully waited for the KAS custom kubeconfig secret to be deleted from HCP Namespace in 20s" (status field, not Secret) |
| Total test results | 600 tests, 25 skipped, 4 failures (all from same root test) |
| PR changed files | konnectivity-container.go, konnectivity-container_test.go, dialer_test.go; none touch custom kubeconfig |
| Custom kubeconfig refs in PR | 0 matches across all 3 changed files |
| Race condition location | Status: hostedcontrolplane_controller.go:3241; Secret deletion: generic-adapter.go:70 via KAS component predicate |
| Other tests passing | TestCreateClusterProxy fully passed; e2e-aws-upgrade-hypershift-operator passed |


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.


5 participants