CNTRLPLANE-3553: Wire usesRunc detection into RHEL stream resolution#8832
CNTRLPLANE-3553: Wire usesRunc detection into RHEL stream resolution#8832sdminonne wants to merge 2 commits into
Conversation
Wire the osImageStream API field into the NodePool controller: - Validate spec.osImageStream via GetRHELStream before config generation (fail fast on invalid stream/version/runc combinations). - Return StreamRHEL9 (not "") for implicit pre-5.0 releases so that downstream consumers like StreamForName() always receive a concrete stream name, avoiding errors when legacy StreamMetadata is removed. - Normalize rhelStream in rolloutConfig so that setting the default stream does not change the config hash (no spurious fleet-wide rollouts). - Keep resolvedRHELStream on ConfigGenerator (not rolloutConfig) for downstream consumers that need a concrete stream name (GCP, AWS AMI, token secret). - Add getRHELStream wrapper and validateOSImageStream in osstream.go, both delegating to GetRHELStream from stream.go. - Write os-stream key to the token secret for future ignition-server consumption. - Infer status.osImageStream from Machine NodeInfo.OSImage using majority consensus (rhcosStreamFromOSImage, osImageStreamFromMachines, setOSImageStreamStatus in version.go). - Pass resolved stream to defaultNodePoolAMI / setAWSConditions for consistent boot image resolution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scan NodePool spec.config ConfigMaps for ContainerRuntimeConfig resources with defaultRuntime set to "runc". When detected: - Implicit stream on OCP >= 5.0 falls back to rhel-9 (instead of rhel-10) - Explicit rhel-10 + runc returns a validation error This replaces the hardcoded usesRunc=false TODO(CNTRLPLANE-3553) in getRHELStream, validateOSImageStream, and the config hash normalization in NewConfigGenerator. JIRA: CNTRLPLANE-3553 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@sdminonne: This pull request references CNTRLPLANE-3553 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughNodePool reconciliation now infers Sequence Diagram(s)Status and validation sequenceDiagram
participant "NodePoolReconciler.reconcile" as Reconcile
participant "setOSImageStreamStatus" as SetOSImageStreamStatus
participant "osImageStreamFromMachines" as OsImageStreamFromMachines
participant "validMachineConfigCondition" as ValidMachineConfigCondition
participant "validateOSImageStream" as ValidateOSImageStream
participant "GetRHELStream" as GetRHELStream
Reconcile->>SetOSImageStreamStatus: infer status.osImageStream
SetOSImageStreamStatus->>OsImageStreamFromMachines: inspect Machine NodeInfo.OSImage
OsImageStreamFromMachines-->>SetOSImageStreamStatus: majority stream or empty
Reconcile->>ValidMachineConfigCondition: validate requested OS image stream
ValidMachineConfigCondition->>ValidateOSImageStream: check spec.osImageStream.name
ValidateOSImageStream->>GetRHELStream: resolve stream from release image and runtime
GetRHELStream-->>ValidateOSImageStream: stream or error
ValidateOSImageStream-->>ValidMachineConfigCondition: validation result
AWS and GCP image resolution sequenceDiagram
participant "awsMachineTemplate" as AwsMachineTemplate
participant "awsMachineTemplateSpec" as AwsMachineTemplateSpec
participant "resolveAWSAMI" as ResolveAWSAMI
participant "defaultNodePoolAMI" as DefaultNodePoolAMI
participant "gcpMachineTemplate" as GcpMachineTemplate
participant "gcpMachineTemplateSpec" as GcpMachineTemplateSpec
participant "resolveGCPImage" as ResolveGCPImage
participant "defaultNodePoolGCPImage" as DefaultNodePoolGCPImage
AwsMachineTemplate->>AwsMachineTemplateSpec: pass c.resolvedRHELStream
AwsMachineTemplateSpec->>ResolveAWSAMI: resolve AMI
ResolveAWSAMI->>DefaultNodePoolAMI: default Linux/RHCOS AMI
GcpMachineTemplate->>GcpMachineTemplateSpec: pass c.resolvedRHELStream
GcpMachineTemplateSpec->>ResolveGCPImage: resolve image
ResolveGCPImage->>DefaultNodePoolGCPImage: default GCP image
Token secret and AMI labels sequenceDiagram
participant "token secret reconciliation" as TokenSecretReconciliation
participant "tokenSecret" as TokenSecret
participant "setKarpenterAMILabels" as SetKarpenterAMILabels
participant "defaultNodePoolAMI" as DefaultNodePoolAMI
TokenSecretReconciliation->>TokenSecret: store TokenSecretOSStreamKey
TokenSecretReconciliation->>SetKarpenterAMILabels: pass t.resolvedRHELStream
SetKarpenterAMILabels->>DefaultNodePoolAMI: resolve AMI labels
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sdminonne The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8832 +/- ##
==========================================
+ Coverage 41.86% 42.07% +0.21%
==========================================
Files 759 770 +11
Lines 94101 97005 +2904
==========================================
+ Hits 39392 40813 +1421
- Misses 51949 53367 +1418
- Partials 2760 2825 +65
... and 48 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
hypershift-operator/controllers/nodepool/aws_test.go (1)
301-301: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCover the non-empty stream path in the AWS tests.
These call sites still only exercise
rhelStream == "". Since the production change is about selecting AWS images from the resolved stream, add a case withOSStreams["rhel-10"]and a non-empty stream argument soawsMachineTemplateSpec/resolveAWSAMIprove they read named-stream metadata instead of the legacy default path.As per coding guidelines,
**/*_test.go: Unit test code changes and additions; include e2e tests when changes impact consumer behaviour.Also applies to: 1344-1344
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/aws_test.go` at line 301, The AWS test coverage still only exercises the empty rhelStream path, so add a non-empty stream case using OSStreams["rhel-10"] to verify the AWS image selection logic. Update the relevant test call sites around awsMachineTemplateSpec and resolveAWSAMI so they assert the resolved stream metadata is used instead of the legacy default path, and keep the existing empty-stream case as a separate baseline.Source: Coding guidelines
hypershift-operator/controllers/nodepool/token_test.go (1)
1184-1184: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a named-stream Karpenter label case.
This signature update still only tests the empty-stream path. Please add a case with per-stream AWS metadata and a non-empty
rhelStreamsosetKarpenterAMILabelsis pinned toresolvedRHELStreamrather than the legacy default lookup.As per coding guidelines,
**/*_test.go: Unit test code changes and additions; include e2e tests when changes impact consumer behaviour.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/token_test.go` at line 1184, The test coverage for setKarpenterAMILabels only exercises the empty-stream path, so add a new table-driven case in token_test.go with per-stream AWS metadata and a non-empty rhelStream to verify the function uses resolvedRHELStream instead of falling back to the legacy default lookup. Extend the existing setKarpenterAMILabels call path in the test to pass the named stream scenario and assert the expected Karpenter AMI label resolution for that case.Source: Coding guidelines
hypershift-operator/controllers/nodepool/config.go (1)
107-116: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAvoid shadowing
errin the normalization block.Line 109 creates a new
errscoped to theif, which makes this already branchy path harder to scan and goes against the repo’s Go guidance. Reusing the outererrkeeps the flow clearer.As per coding guidelines,
**/!(*.pb).go: Avoid variable shadowing.Suggested cleanup
if rhelStream != "" { - version, err := semver.Parse(releaseImage.Version()) + var version semver.Version + version, err = semver.Parse(releaseImage.Version()) if err != nil { return nil, fmt.Errorf("failed to parse release image version %q: %w", releaseImage.Version(), err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/config.go` around lines 107 - 116, The normalization block in the NodePool config path is shadowing the outer err variable by redeclaring it inside the rhelStream branch, which goes against the Go guidance for avoiding shadowing. Update the logic in the config normalization flow around rhelStream, semver.Parse, and usesRuncRuntime to reuse the existing err variable instead of introducing a new scoped one, while preserving the same error handling and return behavior.Source: Coding guidelines
hypershift-operator/controllers/nodepool/config_test.go (1)
250-305: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a
usesRuncRuntimenormalization case here.These new cases only pin the release-version branch.
NewConfigGeneratornow also normalizes againstusesRuncRuntime(...), so please add a 5.x case with aContainerRuntimeConfigsettingdefaultRuntime: runcand assert that explicitrhel-9hashes like the implicit default. That is the new branch this PR introduces.As per coding guidelines,
**/*_test.go: Unit test code changes and additions; include e2e tests when changes impact consumer behaviour.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/nodepool/config_test.go` around lines 250 - 305, The hash test table in config_test.go only covers release-version-based normalization, but NewConfigGenerator also normalizes through usesRuncRuntime(...). Add a 5.x test case that includes a ContainerRuntimeConfig with defaultRuntime set to runc and verify that an explicit rhel-9 OSImageStream hashes the same as the implicit default path. Use the existing test table around the NodePool, releaseImage, and hostedCluster cases to extend coverage for this new normalization branch.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/osstream.go`:
- Around line 40-46: The ConfigMap lookup in the osstream logic is swallowing
all c.Get failures instead of only missing objects, which can hide transient
API/RBAC errors; update the error handling around the client Get call to skip
only NotFound and return all other errors so the failure propagates through
validateOSImageStream and getRHELStream. Add the needed
k8s.io/apimachinery/pkg/api/errors import and use the existing
ref.Name/nodePool.Namespace lookup path to keep the behavior limited to true
absent ConfigMaps.
In `@hypershift-operator/controllers/nodepool/version.go`:
- Around line 165-174: In setOSImageStreamStatus, the
NodePool.Status.OSImageStream field is only ever set and never cleared, which
can leave stale status behind. Update this helper to resolve the default stream
via GetRHELStream using the release version and usesRunc context, compare it
with osImageStreamFromMachines(machines), and assign an empty
OSImageStreamReference when there is no majority or the observed stream matches
the default. Keep the existing getMachinesForNodePool and
osImageStreamFromMachines flow, but ensure the status field is omitted whenever
the pool is on the release-default OS images.
---
Nitpick comments:
In `@hypershift-operator/controllers/nodepool/aws_test.go`:
- Line 301: The AWS test coverage still only exercises the empty rhelStream
path, so add a non-empty stream case using OSStreams["rhel-10"] to verify the
AWS image selection logic. Update the relevant test call sites around
awsMachineTemplateSpec and resolveAWSAMI so they assert the resolved stream
metadata is used instead of the legacy default path, and keep the existing
empty-stream case as a separate baseline.
In `@hypershift-operator/controllers/nodepool/config_test.go`:
- Around line 250-305: The hash test table in config_test.go only covers
release-version-based normalization, but NewConfigGenerator also normalizes
through usesRuncRuntime(...). Add a 5.x test case that includes a
ContainerRuntimeConfig with defaultRuntime set to runc and verify that an
explicit rhel-9 OSImageStream hashes the same as the implicit default path. Use
the existing test table around the NodePool, releaseImage, and hostedCluster
cases to extend coverage for this new normalization branch.
In `@hypershift-operator/controllers/nodepool/config.go`:
- Around line 107-116: The normalization block in the NodePool config path is
shadowing the outer err variable by redeclaring it inside the rhelStream branch,
which goes against the Go guidance for avoiding shadowing. Update the logic in
the config normalization flow around rhelStream, semver.Parse, and
usesRuncRuntime to reuse the existing err variable instead of introducing a new
scoped one, while preserving the same error handling and return behavior.
In `@hypershift-operator/controllers/nodepool/token_test.go`:
- Line 1184: The test coverage for setKarpenterAMILabels only exercises the
empty-stream path, so add a new table-driven case in token_test.go with
per-stream AWS metadata and a non-empty rhelStream to verify the function uses
resolvedRHELStream instead of falling back to the legacy default lookup. Extend
the existing setKarpenterAMILabels call path in the test to pass the named
stream scenario and assert the expected Karpenter AMI label resolution for that
case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b26364bd-aea7-431e-9260-ca7e560c480c
📒 Files selected for processing (17)
hypershift-operator/controllers/nodepool/aws.gohypershift-operator/controllers/nodepool/aws_test.gohypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/conditions.gohypershift-operator/controllers/nodepool/config.gohypershift-operator/controllers/nodepool/config_test.gohypershift-operator/controllers/nodepool/gcp.gohypershift-operator/controllers/nodepool/gcp_test.gohypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/osstream.gohypershift-operator/controllers/nodepool/osstream_test.gohypershift-operator/controllers/nodepool/stream.gohypershift-operator/controllers/nodepool/stream_test.gohypershift-operator/controllers/nodepool/token.gohypershift-operator/controllers/nodepool/token_test.gohypershift-operator/controllers/nodepool/version.gohypershift-operator/controllers/nodepool/version_test.go
| if err := c.Get(ctx, client.ObjectKey{ | ||
| Namespace: nodePool.Namespace, | ||
| Name: ref.Name, | ||
| }, cm); err != nil { | ||
| // If the ConfigMap doesn't exist, skip — validation catches this elsewhere. | ||
| continue | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Handle non-NotFound ConfigMap read failures.
This currently treats any c.Get failure as “no runc config”. A transient API error or RBAC failure will silently disable the runc check, which can make validateOSImageStream and getRHELStream resolve rhel-10 when the pool should have been rejected or retried. Only NotFound should be skipped here; other errors need to bubble up.
Suggested fix
if err := c.Get(ctx, client.ObjectKey{
Namespace: nodePool.Namespace,
Name: ref.Name,
}, cm); err != nil {
- // If the ConfigMap doesn't exist, skip — validation catches this elsewhere.
- continue
+ if apierrors.IsNotFound(err) {
+ // Validation catches missing ConfigMaps elsewhere.
+ continue
+ }
+ return false, fmt.Errorf("failed to get configmap %s/%s: %w", nodePool.Namespace, ref.Name, err)
}Add the corresponding k8s.io/apimachinery/pkg/api/errors import.
As per path instructions, **/*.go: Go security (prodsec-skills): Never ignore error returns.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if err := c.Get(ctx, client.ObjectKey{ | |
| Namespace: nodePool.Namespace, | |
| Name: ref.Name, | |
| }, cm); err != nil { | |
| // If the ConfigMap doesn't exist, skip — validation catches this elsewhere. | |
| continue | |
| } | |
| if err := c.Get(ctx, client.ObjectKey{ | |
| Namespace: nodePool.Namespace, | |
| Name: ref.Name, | |
| }, cm); err != nil { | |
| if apierrors.IsNotFound(err) { | |
| // Validation catches missing ConfigMaps elsewhere. | |
| continue | |
| } | |
| return false, fmt.Errorf("failed to get configmap %s/%s: %w", nodePool.Namespace, ref.Name, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/nodepool/osstream.go` around lines 40 - 46,
The ConfigMap lookup in the osstream logic is swallowing all c.Get failures
instead of only missing objects, which can hide transient API/RBAC errors;
update the error handling around the client Get call to skip only NotFound and
return all other errors so the failure propagates through validateOSImageStream
and getRHELStream. Add the needed k8s.io/apimachinery/pkg/api/errors import and
use the existing ref.Name/nodePool.Namespace lookup path to keep the behavior
limited to true absent ConfigMaps.
Source: Path instructions
| func (r *NodePoolReconciler) setOSImageStreamStatus(ctx context.Context, nodePool *hyperv1.NodePool) error { | ||
| machines, err := r.getMachinesForNodePool(ctx, nodePool) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get Machines for OSImageStream status: %w", err) | ||
| } | ||
|
|
||
| stream := osImageStreamFromMachines(machines) | ||
| if stream != "" { | ||
| nodePool.Status.OSImageStream = hyperv1.OSImageStreamReference{Name: stream} | ||
| } |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
status.osImageStream can become sticky and violate the API contract.
This helper only writes nodePool.Status.OSImageStream when it finds a majority and never clears it otherwise. Once set, the value survives scale-to-zero, split rollouts, and pools that are simply running the release-default stream. The NodePoolStatus.OSImageStream contract says the field should be omitted when the pool is using the release version’s default OS images, so this will publish stale/non-actionable status to clients.
This path needs enough context to compare the observed stream against the resolved default (GetRHELStream("", releaseVersion, usesRunc)) and clear the field whenever there is no majority or the observed stream matches that default.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/nodepool/version.go` around lines 165 - 174,
In setOSImageStreamStatus, the NodePool.Status.OSImageStream field is only ever
set and never cleared, which can leave stale status behind. Update this helper
to resolve the default stream via GetRHELStream using the release version and
usesRunc context, compare it with osImageStreamFromMachines(machines), and
assign an empty OSImageStreamReference when there is no majority or the observed
stream matches the default. Keep the existing getMachinesForNodePool and
osImageStreamFromMachines flow, but ensure the status field is omitted whenever
the pool is on the release-default OS images.
|
I have all the evidence. Let me verify the import structure of the new file to confirm the exact gci ordering problem: The file import (
"bufio"
"context"
coreerrors "errors"
"fmt"
"io"
"strings"
hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1"
"github.com/openshift/hypershift/support/releaseinfo"
mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
runtimeserializer "k8s.io/apimachinery/pkg/runtime/serializer"
"k8s.io/apimachinery/pkg/util/yaml"
"github.com/blang/semver" // <-- line 21, gci error here
"sigs.k8s.io/controller-runtime/pkg/client"
)The And the Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseIssue 1 — gci (import ordering) at line 21: The new file
The project's Issue 2 — unparam (unused error return) at line 29: The function func usesRuncRuntime(ctx context.Context, c client.Client, nodePool *hyperv1.NodePool) (bool, error)However, examining all code paths:
Every path returns Note: The companion file RecommendationsFix 1 — gci (import ordering): Reorganize the imports in golangci-lint run --fix --enable gci hypershift-operator/controllers/nodepool/osstream.goOr manually restructure the imports to match the ordering used in passing files (e.g., Fix 2 — unparam (always-nil error): Two options:
Option 1 is recommended since the current error-swallowing is intentional (comments say "validation catches this elsewhere"), and this aligns with the linter's finding. Evidence
|
Summary
usesRuncRuntime()to scan NodePoolspec.configConfigMaps forContainerRuntimeConfigwithdefaultRuntime: runcgetRHELStream(),validateOSImageStream(), and the config hash normalization inNewConfigGenerator()rhel-9; explicitrhel-10+ runc returns a validation error viaValidMachineConfigconditionTODO([CNTRLPLANE-3553](https://redhat.atlassian.net/browse/CNTRLPLANE-3553)): pass actual usesRuncplaceholdersDependencies
Parallel dependency (no ordering constraint with this PR):
Test plan
TestUsesRuncRuntime— 7 cases: no configs, runc, crun, empty runtime, MachineConfig, missing ConfigMap, multiple configsTest_getRHELStream— extended with 5 runc-aware cases: runc+5.0→rhel-9, runc+rhel-10→error, crun+5.0→rhel-10, runc+4.x→rhel-9, missing configTestValidateOSImageStream— extended with 2 runc-aware cases: rhel-10+runc→error, rhel-9+runc→okmake verifyafter rebase onto merged CNTRLPLANE-3553: Wire osImageStream into NodePool controller (hash, token, status, validation) #8730JIRA: CNTRLPLANE-3553
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes