OCPBUGS-86296: Propagate management cluster proxy env vars to konnectivity sidecar#8569
OCPBUGS-86296: Propagate management cluster proxy env vars to konnectivity sidecar#8569csrwng wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@csrwng: This pull request references Jira Issue OCPBUGS-86296, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR conditionally propagates process proxy environment variables into the konnectivity container when the component is configured to connect directly to cloud APIs. A new helper determines the setting based on proxy mode (HTTPS vs Socks5). Unit tests cover HTTPS/Socks5/dual modes and NO_PROXY handling; integration tests add a CONNECT-only proxy and a TCP echo server to verify CONNECT usage only when HTTPS_PROXY is set. Sequence Diagram(s)sequenceDiagram
participant TestClient
participant KonnectivityProxy
participant ConnectProxy
participant EchoServer
TestClient->>KonnectivityProxy: dialDirectWithProxy(target)
alt HTTPS_PROXY set
KonnectivityProxy->>ConnectProxy: send HTTP CONNECT host:port
ConnectProxy->>EchoServer: dial host:port
ConnectProxy-->>KonnectivityProxy: tunnel established
KonnectivityProxy->>EchoServer: proxied TCP stream ("hello")
EchoServer-->>KonnectivityProxy: echo "hello"
else no proxy env
KonnectivityProxy->>EchoServer: direct TCP dial host:port
EchoServer-->>KonnectivityProxy: echo "hello"
end
KonnectivityProxy-->>TestClient: return connection with echoed payload
Suggested reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@csrwng: This pull request references Jira Issue OCPBUGS-86296, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@support/controlplane-component/konnectivity-container_test.go`:
- Around line 102-108: Tests are nondeterministic because ambient proxy env vars
leak into subtests; in the table-driven loop inside Test... (the for _, tt :=
range tests { t.Run(tt.name, func(t *testing.T) { ... }) }), before applying
tt.proxyEnvs with t.Setenv, reset the proxy baseline by clearing or setting
empty all common proxy variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY and their
lowercase variants) using t.Setenv, then apply tt.proxyEnvs entries; this
ensures NewGomegaWithT and subsequent assertions run with a clean proxy env for
each subtest.
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 394-398: In the "When HTTPS_PROXY is not set it should connect
directly" subtest, explicitly clear any ambient proxy environment to avoid
flakiness: capture the existing HTTPS_PROXY and https_proxy values with
os.Getenv, call os.Unsetenv for both keys before calling
startTCPEchoServer/startConnectProxy, and defer restoring the original values at
the top of the subtest so the environment is returned to its prior state after
the test; apply these changes in the subtest body that invokes
startTCPEchoServer and startConnectProxy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a8848936-ad62-42e6-a688-5944afbf4b69
📒 Files selected for processing (3)
support/controlplane-component/konnectivity-container.gosupport/controlplane-component/konnectivity-container_test.gosupport/konnectivityproxy/dialer_test.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
support/konnectivityproxy/dialer_test.go (1)
364-368:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlso clear ambient
HTTP_PROXY/NO_PROXYin the "HTTPS_PROXY is set" subtest.This subtest sets
HTTPS_PROXYexplicitly but inherits ambientHTTP_PROXYand especiallyNO_PROXYfrom the runner. If CI setsNO_PROXYto include127.0.0.1orlocalhost(common on developer machines and some CI images),dialDirectWithProxywill bypass the test proxy andconnectCountwill be0, causing the assertion at line 389 to fail nondeterministically. Mirror the baseline reset already applied to the "not set" subtest.Suggested fix
t.Run("When HTTPS_PROXY is set it should route through the proxy", func(t *testing.T) { echo := startTCPEchoServer(t) var connectCount atomic.Int32 proxyLn := startConnectProxy(t, &connectCount) + t.Setenv("HTTP_PROXY", "") + t.Setenv("NO_PROXY", "") t.Setenv("HTTPS_PROXY", fmt.Sprintf("http://%s", proxyLn.Addr().String()))As per coding guidelines, "Avoid global state in tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@support/konnectivityproxy/dialer_test.go` around lines 364 - 368, The "When HTTPS_PROXY is set" subtest currently only sets HTTPS_PROXY and can inherit ambient HTTP_PROXY/NO_PROXY causing flaky bypasses; update the subtest to explicitly clear ambient proxy env by calling t.Setenv for "HTTP_PROXY" and "NO_PROXY" (set to empty) so dialDirectWithProxy uses the intended proxy (the proxy started by startConnectProxy and observed via connectCount) when you set HTTPS_PROXY; place these t.Setenv calls alongside where HTTPS_PROXY is set in the t.Run block that uses startTCPEchoServer and startConnectProxy.
🧹 Nitpick comments (1)
support/konnectivityproxy/dialer_test.go (1)
281-305: 💤 Low valueHelper-internal error handling could be tightened, but acceptable.
conn.SetDeadlineandio.Copyerrors are ignored. Given the helpers are test-only and deadlines bound goroutine lifetime, this is fine, but if aSetDeadlineever fails the loop would silently spin on a stale connection. Optionally surface the error viat.Errorffrom the accept loop, or simply log viat.Logto aid debugging.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@support/konnectivityproxy/dialer_test.go` around lines 281 - 305, The helper startTCPEchoServer currently ignores errors from conn.SetDeadline and io.Copy; update the inner per-connection goroutine in startTCPEchoServer to check the result of conn.SetDeadline and io.Copy and surface failures via the test logger (use t.Logf or t.Errorf) so failures are visible during tests — e.g., if conn.SetDeadline returns an error log it and close the connection, and after io.Copy returns, log any non-nil error (except expected EOF) using t.Logf/t.Errorf to aid debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 364-368: The "When HTTPS_PROXY is set" subtest currently only sets
HTTPS_PROXY and can inherit ambient HTTP_PROXY/NO_PROXY causing flaky bypasses;
update the subtest to explicitly clear ambient proxy env by calling t.Setenv for
"HTTP_PROXY" and "NO_PROXY" (set to empty) so dialDirectWithProxy uses the
intended proxy (the proxy started by startConnectProxy and observed via
connectCount) when you set HTTPS_PROXY; place these t.Setenv calls alongside
where HTTPS_PROXY is set in the t.Run block that uses startTCPEchoServer and
startConnectProxy.
---
Nitpick comments:
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 281-305: The helper startTCPEchoServer currently ignores errors
from conn.SetDeadline and io.Copy; update the inner per-connection goroutine in
startTCPEchoServer to check the result of conn.SetDeadline and io.Copy and
surface failures via the test logger (use t.Logf or t.Errorf) so failures are
visible during tests — e.g., if conn.SetDeadline returns an error log it and
close the connection, and after io.Copy returns, log any non-nil error (except
expected EOF) using t.Logf/t.Errorf to aid debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e6adab7c-71dd-43c5-a960-db9e7fb0bdd5
📒 Files selected for processing (2)
support/controlplane-component/konnectivity-container_test.gosupport/konnectivityproxy/dialer_test.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8569 +/- ##
==========================================
+ Coverage 40.40% 40.50% +0.09%
==========================================
Files 755 755
Lines 93235 93246 +11
==========================================
+ Hits 37675 37768 +93
+ Misses 52858 52756 -102
- Partials 2702 2722 +20
... and 2 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
| if err != nil { | ||
| t.Fatalf("failed to start proxy server: %v", err) | ||
| } | ||
| t.Cleanup(func() { ln.Close() }) |
There was a problem hiding this comment.
nit: Could explicitly call srv.Close()
There was a problem hiding this comment.
Done — added t.Cleanup(func() { srv.Close() }) before the srv.Serve goroutine in bfb72e7.
| t.Fatalf("read failed: %v", err) | ||
| } | ||
| if string(buf) != string(msg) { | ||
| t.Errorf("expected %q, got %q", msg, buf) |
There was a problem hiding this comment.
Fixed — converted both subtests to use gomega g.Expect in bfb72e7.
| t.Errorf("expected %q, got %q", msg, buf) | ||
| } | ||
| if connectCount.Load() != 0 { | ||
| t.Errorf("expected proxy to receive 0 CONNECT requests, got %d", connectCount.Load()) |
There was a problem hiding this comment.
Fixed in bfb72e7 along with the other g.Expect conversion above.
| t.Errorf("expected %q, got %q", msg, buf) | ||
| } | ||
| if connectCount.Load() != 1 { | ||
| t.Errorf("expected proxy to receive 1 CONNECT request, got %d", connectCount.Load()) |
There was a problem hiding this comment.
Fixed in bfb72e7 — all assertions in both subtests now use g.Expect.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 332-333: The test currently ignores errors from core proxy
operations (calls like target.SetDeadline, io.Copy, and proxy.Serve), which can
hide failures; update the test to capture and assert or fail on these errors
instead of discarding them: check the returned error from target.SetDeadline,
propagate or t.Fatalf/t.Errorf on non-nil errors from io.Copy invocations (e.g.,
the copy between conn and target), and verify the error returned by Serve (or
use require.NoError/expectation helpers) so any tunnel/setup or runtime proxy
failures surface in the test logs. Ensure you modify the uses of
target.SetDeadline, io.Copy, and the Serve call accordingly.
- Line 376: Two occurrences call conn.SetDeadline(...) and ignore its returned
error (e.g., the lines using `_ =
conn.SetDeadline(time.Now().Add(testTimeout))`); change both to check the error
and fail the test immediately (e.g., if err := conn.SetDeadline(...); err != nil
{ t.Fatalf("SetDeadline failed: %v", err) }) so the test fails fast instead of
continuing with a bad deadline—update both instances that call conn.SetDeadline
to perform this error check using the test's fatal/assert helper.
- Around line 299-300: In the echo helper, don't ignore errors from
conn.SetDeadline and io.Copy: capture their returned errors and fail the test
(e.g., via t.Fatalf/t.Fatalf-like helper) or return the error so the caller can
fail; update the code around conn.SetDeadline(time.Now().Add(5 * time.Second))
and _, _ = io.Copy(conn, conn) to check the error values, include contextual
messages (which function: the echo helper, and operations: conn.SetDeadline and
io.Copy) and ensure the test does not silently succeed when those operations
fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c6ef18ac-69dd-4b5f-94d0-65c74c8359dd
📒 Files selected for processing (2)
support/controlplane-component/konnectivity-container_test.gosupport/konnectivityproxy/dialer_test.go
9d3cbc0 to
8b83360
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@support/konnectivityproxy/dialer_test.go`:
- Around line 375-381: The subtest "When HTTPS_PROXY is set it should route
through the proxy" can inherit ambient NO_PROXY/HTTP_PROXY and bypass the
CONNECT proxy causing nondeterministic connectCount; before calling
t.Setenv("HTTPS_PROXY", ...), clear any existing proxy environment by calling
t.Setenv("NO_PROXY", "") and t.Setenv("HTTP_PROXY", "") (or use t.Unsetenv if
preferred) so startConnectProxy's connectCount is exercised deterministically;
update the test around NewGomegaWithT/startConnectProxy/proxyLn to set those
envs first.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9b13a432-fb9e-4054-909c-d329580e44a1
📒 Files selected for processing (3)
support/controlplane-component/konnectivity-container.gosupport/controlplane-component/konnectivity-container_test.gosupport/konnectivityproxy/dialer_test.go
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
When ConnectDirectlyToCloudAPIs is enabled, the konnectivity sidecar's dialDirectWithProxy() reads HTTPS_PROXY from the process environment to route cloud API calls through the management cluster's outbound proxy. The sidecar container spec was missing these env vars, so the proxy lookup returned empty and direct connections failed on clusters that require an outbound proxy. Conditionally call proxy.SetEnvVars() when building the konnectivity sidecar container spec, scoped to containers where ConnectDirectlyToCloudAPIs is true (HTTPS or Socks5 mode). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8b83360 to
c41b56c
Compare
|
/lgtm |
|
Scheduling tests matching the |
Test Resultse2e-aws
e2e-aks
|
|
/retest-required |
|
@csrwng: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
This confirms the race condition analysis. The Now I have all the evidence I need. Let me produce the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe sole test failure is Root CauseThe failure is a race condition between two independent reconciliation paths in the control-plane-operator (CPO):
The test at Evidence confirming the race:
Why this is unrelated to PR #8569: Recommendations
Evidence
|
Summary
dialDirectWithProxy()readsHTTPS_PROXYfrom the process environment to route cloud API calls through the management cluster's outbound proxy whenConnectDirectlyToCloudAPIsis enabled. The sidecar container spec never set these env vars, so cloud API calls fail on management clusters that require a proxy for outbound access.proxy.SetEnvVars()inbuildContainer()conditioned onConnectDirectlyToCloudAPIs— the only code path that needs the management cluster proxy env vars. This follows the same pattern used by other CPO-managed containers (KCM, CCO, CAPI provider, etc.).dialDirectWithProxy()with a real TCP echo server and HTTP CONNECT proxy to verify the proxy routing behavior end-to-end.Test plan
go test ./support/controlplane-component/... ./support/konnectivityproxy/...go test -race ./support/controlplane-component/... ./support/konnectivityproxy/...make verifypasses🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests