Skip to content

NE-2386: IngressController API for AWS NLB security group selection#2037

Open
aswinsuryan wants to merge 4 commits into
openshift:masterfrom
aswinsuryan:NE-2386-aws-nlb-security-group-selection
Open

NE-2386: IngressController API for AWS NLB security group selection#2037
aswinsuryan wants to merge 4 commits into
openshift:masterfrom
aswinsuryan:NE-2386-aws-nlb-security-group-selection

Conversation

@aswinsuryan

Copy link
Copy Markdown

This enhancement enables administrators to specify custom (Bring Your Own) security groups for AWS Network Load Balancers on IngressControllers. A new securityGroups field is added to AWSNetworkLoadBalancerParameters, which the Cluster
Ingress Operator translates into the service.beta.kubernetes.io/aws-load-balancer-security-groups Service annotation for the Cloud Controller Manager. This targets ROSA as the primary deployment type and builds on the CCM-level BYO security
group support added in upstream cloud-provider-aws#1379.

Add enhancement proposal for specifying custom (BYO) security groups
on AWS NLB IngressControllers via a new securityGroups field on
AWSNetworkLoadBalancerParameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 11, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Jun 11, 2026

Copy link
Copy Markdown

@aswinsuryan: This pull request references NE-2386 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

This enhancement enables administrators to specify custom (Bring Your Own) security groups for AWS Network Load Balancers on IngressControllers. A new securityGroups field is added to AWSNetworkLoadBalancerParameters, which the Cluster
Ingress Operator translates into the service.beta.kubernetes.io/aws-load-balancer-security-groups Service annotation for the Cloud Controller Manager. This targets ROSA as the primary deployment type and builds on the CCM-level BYO security
group support added in upstream cloud-provider-aws#1379.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign miciah for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aswinsuryan aswinsuryan marked this pull request as ready for review June 11, 2026 17:18
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2026
@openshift-ci openshift-ci Bot requested review from Miciah and candita June 11, 2026 17:18
@mtulio

mtulio commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

cc @mfbonfigli

@mtulio mtulio left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! I have some questions and comments about CCM behavior with managed SG.

Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
[Open Questions](#open-questions)).
4. The master node IAM role must have the
`elasticloadbalancing:SetSecurityGroups` permission (added in
[openshift/installer#10512](https://github.com/openshift/installer/pull/10512)).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As well is about to be added to the ROSA managed policy at Jira card : TBD

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
security groups are not deleted (the user retains ownership). The
exact CCM behavior for transitioning back to a managed security group
after BYO removal is not documented upstream and should be verified
during implementation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I followed correctly when BYOSG annotation is removed in existing Service/NLB which was previously created with BYOSG, a managed SG will be created attaching it to the new SG, leaving SG resource unattached to the NLB:
https://github.com/mfbonfigli/cloud-provider-aws/blob/31a27a5f9ac61ad68f9b4d0a8da765ff060245d3/pkg/providers/v1/aws.go#L2276-L2304

There is also a e2e for it: https://github.com/kubernetes/cloud-provider-aws/blob/c34d66ed717aaffe3319f149c26e28163e206a3e/tests/e2e/loadbalancer.go#L548

cc @mfbonfigli

Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Additional changes based on review feedback:
- Clarify CCM behavior when removing BYO security groups
- Add guidance on Service vs IngressController deletion
- Update Prerequisites to reflect automatic CCCMO enablement

Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
@gcs278

gcs278 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/assign

Comment on lines +204 to +206
5. The CCM attempts to attach the security groups to the NLB, but AWS
does not allow adding security groups to an NLB that was created
without security group support. The CCM reports an error.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CCM won't even attempt to add the security group if it detects the NLB does not support them, it will essentially just ignore the setting.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification! Updated steps 5 and 6 to reflect that CCM silently ignores the annotation rather than reporting an error.

Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated
Comment on lines +454 to +457
On downgrade to a version that does not recognize the `securityGroups` field,
the field is ignored. Existing NLBs with attached security groups continue
functioning. No managed security groups will be created or modified by older
versions.

@mfbonfigli mfbonfigli Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. The old CCM version would error if a service had that specific annotation set:

https://github.com/openshift/cloud-provider-aws/blob/82da06f1b6e1643b06b4463abf25333060cdc816/pkg/providers/v1/aws_validations.go#L81

The reconciliation loop would fail.

Also due to a preexisting behavior, we can't really remove the annotation after the downgrade or else we risk the controller treating the BYO as a managed SG and wiping rules on it (which should never happen since it's not a cluster owned resource). Discussion ref in upstream here

The only correct downgrade way is to remove the annotation before the downgrade and then downgrade.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the downgrade section to state that administrators must remove the securityGroups field before downgrading

  - Fix CCM behavior for NLBs without SG support
  - Clarify NLB recreation needed when enabling Managed mode after NLB creation
  - Rewrite downgrade strategy to prevent reconciliation failures

Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>

@gcs278 gcs278 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! I think the big thing is using the established delete/recreate effectuation pattern as mentioned in the subnets EP.

I think we should require Service recreation on any change to securityGroups (add, update, or remove) using the established "always recreate the service" pattern. Security groups don't change often, so the disruption is tolerable, and it avoids the need for complex post-provision validation. The CCM doesn't provide a reliable signal when an invalid SG is applied to an already-provisioned NLB (just like subnets).

does not allow adding security groups to an NLB that was created
without security group support. The CCM reports an error.

6. The administrator must delete and recreate the NLB to enable security

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good info. The IngressController already has a pattern (introduced here) where if a spec field is changed and the LB needs to be recreated, it will add a Progressing=true message with instructions on how to delete the service manually (or revert).

I think we should use that pattern here. However, we don't have a pattern for "only when added for the first time, delete/recreate service". That's new. It's probably not too bad to implement, but see my other github comment below on why I think we should use the established "delete/recreate Service" pattern for this annotation rather than supporting in-place updates.

Either way, I'd recommend to document that in this workflow for clarity - mention that the CIO will add a Progressing=True message with instructions for the user. My subnets EP has a bunch of related info that could help.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted the effectuation pattern from the subnets enhancement. Added "Effectuating Security Group Updates" section with the full LoadBalancerProgressing workflow. Update, remove, and add operations all follow this pattern.

Comment on lines +222 to +223
(e.g., the cloud-controller-manager-operator has not been updated to set it,
or the cluster is running an older CCM version).

@gcs278 gcs278 Jun 24, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this isn't possible in a healthy cluster?

CCM is always shipping cloud-config with NLBSecurityGroupMode = Managed now. The FG openshift/api#2717 is now set to GA, and permanently enabled.

You can keep this if you want, but also it's describing a fundamentally broken cluster.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Removed this section

(must match `sg-` followed by 8-17 hex characters) and accepts it.

3. The CCM attempts to attach the security group and fails. The error
is surfaced as a `SyncLoadBalancerFailed` event on the Service. The

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the logic in the CIO can only detect a LB failure on first provisioning, but not for updates. The CIO only checks for SyncLoadBalancerFailed events when the Service is in isPending state. If the NLB is already provisioned and the user changes to an invalid SG, the Service still has status.LoadBalancer.Ingress populated, so the CIO reports LoadBalancerReady=True and the error is only visible in CCM logs/events.

It's an unfortunate existing blind spot with no easy fix. It's one of the reasons we use the "delete/recreate" pattern for annotations, even when the CCM supports in-place updates, the CIO has no way to detect or surface post-provision failures.

Something to consider. I'd suggest going with "delete/recreate" for any update to security groups out of safety (and it's also a lot easier to implement) and document that in the EP.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All securityGroups changes (add, update, remove) now require Service deletion and recreation via the effectuation pattern.

`eipAllocations` — it reads the `securityGroups` field from the
IngressController spec and sets the corresponding Service annotation.

When `securityGroups` is specified, the operator checks the `cloud-provider-config`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same point as above, does this actually add value if we control the cloud-provider-config configmap?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the cloud-provider-config validation logic.

Service directly (allowing automatic recreation) or delete and recreate the
entire IngressController. See the "NLB Created Before Security Group Support"
variation for detailed guidance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along with updates to describe the effectuation pattern (delete/recreate): it's worth adding a note/subsection about the annotation ingress.operator.openshift.io/auto-delete-load-balancer - we should support that and document as a workflow.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an "Automatic Service Deletion" subsection under the effectuation section, documenting the auto-delete annotation with a YAML example.


5. Once `NLBSecurityGroupMode = Managed` is set in the cloud-config, the
Cluster Ingress Operator will reconcile and add the annotation to the Service.
However, if the NLB was created before Managed mode was enabled, the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reason to always require a service recreation (delete/recreate) when changing the security group. It avoids having to write any of this logic.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the pre-existing NLB variation section. The effectuation pattern handles all cases uniformly.

Change the enhancement to require Service deletion and recreation when
the securityGroups field is updated, rather than attempting in-place
annotation updates. This addresses review feedback that in-place updates
create a validation blind spot: the Cluster Ingress Operator cannot
detect CCM failures after initial NLB provisioning because the Service
status remains populated even when the CCM fails to apply changes.

Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@aswinsuryan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/markdownlint 96c2858 link true /test markdownlint

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278 gcs278 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick updates.

Nearly LGTM - I agree with the approach, just have minor EP-specific points.

Comment on lines +458 to +459
Existing IngressControllers without the `securityGroups` field continue
working unchanged. The new field is optional and has no default value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upgrade situation is a bit more nuanced than what's described here. We have to handle the case where a user has manually added the service.beta.kubernetes.io/aws-load-balancer-security-groups annotation to the Service before this feature existed. When the CIO upgrades to a version that manages this annotation, the spec field will be empty (no SGs configured), but the annotation exists on the Service. The CIO sees a mismatch — the annotation is present but the spec says it shouldn't be — and we need to avoid automatically stomping on the user's manually-set annotation.

The good news is the effectuation pattern handles this nicely. On upgrade, the CIO will detect that the current annotation doesn't match the desired state (no annotation), and fire LoadBalancerProgressing=True without removing the annotation. The user can then either set spec...securityGroups to match their existing annotation (resolving the mismatch without disruption) or delete the Service to let the CIO recreate it without the annotation.

It's also important that we don't treat an empty spec as "unmanaged" — we've had implementations in the past where empty spec meant "don't touch the annotation," which leads to stale annotations persisting silently when a user adds then removes the field. Empty spec should mean "no SGs desired."

I'd recommend adding a workflow for this upgrade/migration scenario, similar to the Unmanaged Subnet Annotation Migration Workflow in the subnets EP, then I would mention that we support upgrades via the effectuation pattern and link to the workflow in this Upgrades section.


Changing the `securityGroups` field on an existing IngressController requires
deleting and recreating the Service.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a "why" is pretty valuable, here's just a suggestion, feel free to modify:

Suggested change
Although the CCM supports updating security groups in-place, the delete/recreate pattern is used for the following reasons:
1. **The CIO cannot detect post-provision CCM failures.** Once the NLB is provisioned, `LoadBalancerReady` stays `True` even if a subsequent SG update fails. Requiring Service recreation ensures that invalid security groups are caught during initial provisioning via `LoadBalancerReady=False`.
2. **Upgrade compatibility with previously-unmanaged annotations.** If a user manually set the `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation before this feature existed, the effectuation pattern prevents the CIO from automatically removing it on upgrade. Instead, the CIO sets `LoadBalancerProgressing=True` and waits for the user to act.
See the subnets EP's [Effectuating Subnet Updates](https://github.com/openshift/enhancements/blob/master/enhancements/ingress/lb-subnet-selection-aws.md#effectuating-subnet-updates) for prior art.

Comment on lines +209 to +214
The IngressController securityGroups were changed from ["sg-old123"] to
["sg-new456"]. To effectuate this change, you must delete the service:
`oc -n openshift-ingress delete svc/router-default`; the service
load-balancer will then be deprovisioned and a new one created. This will
cause the new load-balancer to have a different host name and IP address
and will cause disruption.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit just for clarify, this message will be slightly different and also include the patch revert command since you will likely use https://github.com/openshift/cluster-ingress-operator/blob/7b7406ee0d4bf03de360d53c9cc4a83ee14332cc/pkg/operator/controller/ingress/load_balancer_service.go#L862

But not critical to update in the EP, more for your context. I get the gist of the design.

- Adds a new field to the IngressController API, increasing API surface area.
- Users must manage security group rules outside of OpenShift, which requires
AWS knowledge.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delayed effectuation pattern does have a bit of a downside. It solves a lot of problems, but isn't very user friendly. Ultimately, it's the lesser of two evils.

Suggested change
- Changing security groups requires Service recreation, causing ingress downtime while the new NLB is provisioned and DNS is updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants