NE-2386: IngressController API for AWS NLB security group selection by aswinsuryan · Pull Request #2037 · openshift/enhancements

aswinsuryan · 2026-06-11T03:11:36Z

This enhancement enables administrators to specify custom (Bring Your Own) security groups for AWS Network Load Balancers on IngressControllers. A new securityGroups field is added to AWSNetworkLoadBalancerParameters, which the Cluster Ingress Operator translates into the service.beta.kubernetes.io/aws-load-balancer-security-groups Service annotation for the Cloud Controller Manager. This targets ROSA as the primary deployment type and builds on the CCM-level BYO security group support added in upstream cloud-provider-aws#1379.
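A minimal sketch of the translation described above, for readers who want to see its shape. The helper name `desiredAnnotations` is hypothetical, and joining the IDs with commas is an assumption about the annotation's value format; only the annotation key comes from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key named in the PR description.
const securityGroupsAnnotation = "service.beta.kubernetes.io/aws-load-balancer-security-groups"

// desiredAnnotations is a hypothetical helper: it maps the securityGroups
// list from the IngressController spec to Service annotations.
// Assumption: the annotation value is a comma-separated list of IDs.
func desiredAnnotations(securityGroups []string) map[string]string {
	annotations := map[string]string{}
	if len(securityGroups) == 0 {
		// No security groups requested: the annotation is not set at all.
		return annotations
	}
	annotations[securityGroupsAnnotation] = strings.Join(securityGroups, ",")
	return annotations
}

func main() {
	fmt.Println(desiredAnnotations([]string{"sg-0123456789abcdef0", "sg-0fedcba9876543210"}))
}
```

An empty or nil list yields no annotation, so the Service carries the key only when security groups are actually configured.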

Add enhancement proposal for specifying custom (BYO) security groups on AWS NLB IngressControllers via a new securityGroups field on AWSNetworkLoadBalancerParameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>

openshift-ci · 2026-06-11T03:11:40Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-06-11T03:11:40Z

@aswinsuryan: This pull request references NE-2386 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-11T03:11:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign miciah for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

enhancements/ingress/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mtulio · 2026-06-12T14:47:06Z

cc @mfbonfigli

mtulio

Overall looks good! I have some questions and comments about CCM behavior with managed SG.

mtulio · 2026-06-12T14:58:12Z

+   [Open Questions](#open-questions)).
+4. The master node IAM role must have the
+   `elasticloadbalancing:SetSecurityGroups` permission (added in
+   [openshift/installer#10512](https://github.com/openshift/installer/pull/10512)).


This permission is also about to be added to the ROSA managed policy; Jira card: TBD.

Sorry, the "TBD" referred to https://redhat.atlassian.net/browse/SPLAT-2742 and https://redhat.atlassian.net/browse/ROSAENG-14285 :)

mtulio · 2026-06-12T15:39:38Z

+   security groups are not deleted (the user retains ownership). The
+   exact CCM behavior for transitioning back to a managed security group
+   after BYO removal is not documented upstream and should be verified
+   during implementation.


If I followed correctly, when the BYOSG annotation is removed from an existing Service/NLB that was created with BYOSG, the CCM creates a managed SG and attaches it to the NLB, leaving the BYO SG resource unattached from the NLB:
https://github.com/mfbonfigli/cloud-provider-aws/blob/31a27a5f9ac61ad68f9b4d0a8da765ff060245d3/pkg/providers/v1/aws.go#L2276-L2304

There is also an e2e test for it: https://github.com/kubernetes/cloud-provider-aws/blob/c34d66ed717aaffe3319f149c26e28163e206a3e/tests/e2e/loadbalancer.go#L548

cc @mfbonfigli

Additional changes based on review feedback:
- Clarify CCM behavior when removing BYO security groups
- Add guidance on Service vs IngressController deletion
- Update Prerequisites to reflect automatic CCCMO enablement

Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>

gcs278 · 2026-06-23T21:01:08Z

/assign

mfbonfigli · 2026-06-24T10:24:17Z

+5. The CCM attempts to attach the security groups to the NLB, but AWS
+   does not allow adding security groups to an NLB that was created
+   without security group support. The CCM reports an error.


The CCM won't even attempt to add the security group if it detects that the NLB does not support them; it will essentially just ignore the setting.

Thanks for the clarification! Updated steps 5 and 6 to reflect that CCM silently ignores the annotation rather than reporting an error.

mfbonfigli · 2026-06-24T11:21:08Z

+On downgrade to a version that does not recognize the `securityGroups` field,
+the field is ignored. Existing NLBs with attached security groups continue
+functioning. No managed security groups will be created or modified by older
+versions.


I don't think this is correct. The old CCM version would error if a service had that specific annotation set:

https://github.com/openshift/cloud-provider-aws/blob/82da06f1b6e1643b06b4463abf25333060cdc816/pkg/providers/v1/aws_validations.go#L81

The reconciliation loop would fail.

Also, due to a pre-existing behavior, we can't safely remove the annotation after the downgrade: we would risk the controller treating the BYO SG as a managed SG and wiping the rules on it (which should never happen, since it's not a cluster-owned resource). Discussion ref in upstream here.

The only correct way to downgrade is to remove the annotation first and then downgrade.

Rewrote the downgrade section to state that administrators must remove the securityGroups field before downgrading.

- Fix CCM behavior for NLBs without SG support
- Clarify NLB recreation needed when enabling Managed mode after NLB creation
- Rewrite downgrade strategy to prevent reconciliation failures

Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>

gcs278

Overall looks good! I think the big thing is using the established delete/recreate effectuation pattern as mentioned in the subnets EP.

I think we should require Service recreation on any change to securityGroups (add, update, or remove) using the established "always recreate the service" pattern. Security groups don't change often, so the disruption is tolerable, and it avoids the need for complex post-provision validation. The CCM doesn't provide a reliable signal when an invalid SG is applied to an already-provisioned NLB (just like subnets).

gcs278 · 2026-06-24T01:53:02Z

+   does not allow adding security groups to an NLB that was created
+   without security group support. The CCM reports an error.
+
+6. The administrator must delete and recreate the NLB to enable security


Good info. The IngressController already has a pattern (introduced here) where if a spec field is changed and the LB needs to be recreated, it will add a Progressing=true message with instructions on how to delete the service manually (or revert).

I think we should use that pattern here. However, we don't have a pattern for "delete/recreate the Service only when the field is added for the first time". That's new. It's probably not too bad to implement, but see my other GitHub comment below on why I think we should use the established "delete/recreate Service" pattern for this annotation rather than supporting in-place updates.

Either way, I'd recommend documenting that in this workflow for clarity: mention that the CIO will add a Progressing=True message with instructions for the user. My subnets EP has a bunch of related info that could help.

Adopted the effectuation pattern from the subnets enhancement. Added "Effectuating Security Group Updates" section with the full LoadBalancerProgressing workflow. Update, remove, and add operations all follow this pattern.

gcs278 · 2026-06-24T01:59:57Z

+   (e.g., the cloud-controller-manager-operator has not been updated to set it,
+   or the cluster is running an older CCM version).


To be clear, this isn't possible in a healthy cluster?

CCM is always shipping cloud-config with NLBSecurityGroupMode = Managed now. The FG openshift/api#2717 is now set to GA, and permanently enabled.

You can keep this if you want, but also it's describing a fundamentally broken cluster.

Agreed. Removed this section

gcs278 · 2026-06-24T02:09:42Z

+   (must match `sg-` followed by 8-17 hex characters) and accepts it.
+
+3. The CCM attempts to attach the security group and fails. The error
+   is surfaced as a `SyncLoadBalancerFailed` event on the Service. The


Right now, the logic in the CIO can only detect a LB failure on first provisioning, but not for updates. The CIO only checks for SyncLoadBalancerFailed events when the Service is in isPending state. If the NLB is already provisioned and the user changes to an invalid SG, the Service still has status.LoadBalancer.Ingress populated, so the CIO reports LoadBalancerReady=True and the error is only visible in CCM logs/events.

It's an unfortunate existing blind spot with no easy fix. It's one of the reasons we use the "delete/recreate" pattern for annotations even when the CCM supports in-place updates: the CIO has no way to detect or surface post-provision failures.

Something to consider. I'd suggest going with "delete/recreate" for any update to security groups out of safety (and it's also a lot easier to implement) and document that in the EP.
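The blind spot described in this comment can be sketched roughly as follows. The `serviceState` type and the reason strings are simplified stand-ins chosen for illustration, not the CIO's actual types or code; the only behavior taken from the comment is that failure events are consulted only while the Service is pending.

```go
package main

import "fmt"

// serviceState is a simplified, hypothetical view of the LoadBalancer Service.
type serviceState struct {
	HasIngress      bool // status.loadBalancer.ingress is populated
	SyncFailedEvent bool // a SyncLoadBalancerFailed event exists for the Service
}

// loadBalancerReady mirrors the behavior described in the comment:
// failure events are only consulted while the Service is still pending.
func loadBalancerReady(s serviceState) (bool, string) {
	pending := !s.HasIngress
	switch {
	case pending && s.SyncFailedEvent:
		return false, "SyncLoadBalancerFailed"
	case pending:
		return false, "LoadBalancerPending"
	default:
		// Already provisioned: a later failed SG update is not noticed here.
		return true, "LoadBalancerProvisioned"
	}
}

func main() {
	// An already-provisioned NLB whose latest SG update failed in the CCM.
	ready, reason := loadBalancerReady(serviceState{HasIngress: true, SyncFailedEvent: true})
	fmt.Println(ready, reason)
}
```

Because the provisioned case never looks at events, recreating the Service is what forces an invalid security group back through the pending path where the failure is visible.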

All securityGroups changes (add, update, remove) now require Service deletion and recreation via the effectuation pattern.
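For reference, the format check quoted at the top of this thread (`sg-` followed by 8-17 hex characters) could look like the sketch below. The function name is hypothetical, and accepting upper-case hex digits is an assumption. A format-valid ID can still refer to a security group that does not exist, which is why the provisioning-time failure signal still matters.

```go
package main

import (
	"fmt"
	"regexp"
)

// securityGroupIDPattern encodes the rule quoted in the thread:
// "sg-" followed by 8 to 17 hexadecimal characters.
// Assumption: both lower-case and upper-case hex digits are accepted.
var securityGroupIDPattern = regexp.MustCompile(`^sg-[0-9a-fA-F]{8,17}$`)

// validSecurityGroupID is a hypothetical helper that checks format only;
// it cannot tell whether the security group exists in AWS.
func validSecurityGroupID(id string) bool {
	return securityGroupIDPattern.MatchString(id)
}

func main() {
	for _, id := range []string{"sg-0123456789abcdef0", "sg-12345678", "sg-1234567", "sg-old123"} {
		fmt.Println(id, validSecurityGroupID(id))
	}
}
```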

gcs278 · 2026-06-24T23:55:21Z

+`eipAllocations` — it reads the `securityGroups` field from the
+IngressController spec and sets the corresponding Service annotation.
+
+When `securityGroups` is specified, the operator checks the `cloud-provider-config`


Same point as above, does this actually add value if we control the cloud-provider-config configmap?

Removed the cloud-provider-config validation logic.

gcs278 · 2026-06-25T00:42:31Z

+Service directly (allowing automatic recreation) or delete and recreate the
+entire IngressController. See the "NLB Created Before Security Group Support"
+variation for detailed guidance.
+


Along with updates to describe the effectuation pattern (delete/recreate): it's worth adding a note/subsection about the annotation ingress.operator.openshift.io/auto-delete-load-balancer - we should support that and document as a workflow.

Added an "Automatic Service Deletion" subsection under the effectuation section, documenting the auto-delete annotation with a YAML example.
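A rough sketch of how the effectuation decision might combine with the auto-delete annotation. The `action` type and the `effectuate` function are hypothetical, and treating the mere presence of the annotation as the opt-in is an assumption; only the annotation key and the delete/recreate pattern come from this thread.

```go
package main

import "fmt"

// Annotation key named in the review comment.
const autoDeleteLBAnnotation = "ingress.operator.openshift.io/auto-delete-load-balancer"

// action is a hypothetical enumeration of what the operator does next.
type action string

const (
	actionNone           action = "None"
	actionSetProgressing action = "SetProgressing" // ask the user to delete the Service
	actionDeleteService  action = "DeleteService"  // operator deletes the Service itself
)

// effectuate decides how a securityGroups change takes effect.
// current and desired are the annotation values on the Service and from the spec.
// Assumption: presence of the auto-delete annotation on the IngressController
// opts in, regardless of its value.
func effectuate(icAnnotations map[string]string, current, desired string) action {
	if current == desired {
		return actionNone
	}
	if _, ok := icAnnotations[autoDeleteLBAnnotation]; ok {
		return actionDeleteService
	}
	return actionSetProgressing
}

func main() {
	optedIn := map[string]string{autoDeleteLBAnnotation: ""}
	fmt.Println(effectuate(nil, "sg-aaaaaaaa", "sg-bbbbbbbb"))
	fmt.Println(effectuate(optedIn, "sg-aaaaaaaa", "sg-bbbbbbbb"))
}
```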

gcs278 · 2026-06-25T00:48:32Z

+
+5. Once `NLBSecurityGroupMode = Managed` is set in the cloud-config, the
+   Cluster Ingress Operator will reconcile and add the annotation to the Service.
+   However, if the NLB was created before Managed mode was enabled, the


This is another reason to always require a service recreation (delete/recreate) when changing the security group. It avoids having to write any of this logic.

Removed the pre-existing NLB variation section. The effectuation pattern handles all cases uniformly.

Change the enhancement to require Service deletion and recreation when the securityGroups field is updated, rather than attempting in-place annotation updates.

This addresses review feedback that in-place updates create a validation blind spot: the Cluster Ingress Operator cannot detect CCM failures after initial NLB provisioning because the Service status remains populated even when the CCM fails to apply changes.

Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-06-29T20:19:09Z

@aswinsuryan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | `96c2858` | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

gcs278

Thanks for the quick updates.

Nearly LGTM - I agree with the approach, just have minor EP-specific points.

gcs278 · 2026-07-01T21:14:02Z

+Existing IngressControllers without the `securityGroups` field continue
+working unchanged. The new field is optional and has no default value.


The upgrade situation is a bit more nuanced than what's described here. We have to handle the case where a user has manually added the service.beta.kubernetes.io/aws-load-balancer-security-groups annotation to the Service before this feature existed. When the CIO upgrades to a version that manages this annotation, the spec field will be empty (no SGs configured), but the annotation exists on the Service. The CIO sees a mismatch — the annotation is present but the spec says it shouldn't be — and we need to avoid automatically stomping on the user's manually-set annotation.

The good news is the effectuation pattern handles this nicely. On upgrade, the CIO will detect that the current annotation doesn't match the desired state (no annotation), and fire LoadBalancerProgressing=True without removing the annotation. The user can then either set spec...securityGroups to match their existing annotation (resolving the mismatch without disruption) or delete the Service to let the CIO recreate it without the annotation.

It's also important that we don't treat an empty spec as "unmanaged" — we've had implementations in the past where empty spec meant "don't touch the annotation," which leads to stale annotations persisting silently when a user adds then removes the field. Empty spec should mean "no SGs desired."

I'd recommend adding a workflow for this upgrade/migration scenario, similar to the Unmanaged Subnet Annotation Migration Workflow in the subnets EP, then I would mention that we support upgrades via the effectuation pattern and link to the workflow in this Upgrades section.
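The mismatch check behind this upgrade scenario might look like the sketch below. The helper names are hypothetical, and comparing the IDs without regard to order (and splitting the annotation on commas) are assumptions. The part taken from the comment is that an empty spec means "no SGs desired", so a manually set annotation counts as a mismatch rather than being ignored.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Annotation key named in the review comment.
const securityGroupsAnnotation = "service.beta.kubernetes.io/aws-load-balancer-security-groups"

// normalize trims, drops empty entries, and sorts a copy of the IDs.
func normalize(ids []string) []string {
	out := []string{}
	for _, id := range ids {
		if id = strings.TrimSpace(id); id != "" {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

// securityGroupsMatch reports whether the Service annotation agrees with the spec.
// An empty spec means "no SGs desired", never "unmanaged".
// Assumptions: comma-separated annotation value, order-insensitive comparison.
func securityGroupsMatch(spec []string, serviceAnnotations map[string]string) bool {
	desired := normalize(spec)
	current := normalize(strings.Split(serviceAnnotations[securityGroupsAnnotation], ","))
	if len(desired) != len(current) {
		return false
	}
	for i := range desired {
		if desired[i] != current[i] {
			return false
		}
	}
	return true
}

func main() {
	manual := map[string]string{securityGroupsAnnotation: "sg-aaaaaaaa"}
	// After upgrade: spec is empty but the user's annotation exists, so this is a mismatch.
	fmt.Println(securityGroupsMatch(nil, manual))
	// The user resolves it by setting the spec to match the annotation.
	fmt.Println(securityGroupsMatch([]string{"sg-aaaaaaaa"}, manual))
}
```

On a mismatch the operator would set `LoadBalancerProgressing=True` and leave the annotation untouched, as the comment describes.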

gcs278 · 2026-07-01T21:21:00Z

+
+Changing the `securityGroups` field on an existing IngressController requires
+deleting and recreating the Service.
+


I think having a "why" is pretty valuable, here's just a suggestion, feel free to modify:

Suggested change

Although the CCM supports updating security groups in-place, the delete/recreate pattern is used for the following reasons:

1. **The CIO cannot detect post-provision CCM failures.** Once the NLB is provisioned, `LoadBalancerReady` stays `True` even if a subsequent SG update fails. Requiring Service recreation ensures that invalid security groups are caught during initial provisioning via `LoadBalancerReady=False`.

2. **Upgrade compatibility with previously-unmanaged annotations.** If a user manually set the `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation before this feature existed, the effectuation pattern prevents the CIO from automatically removing it on upgrade. Instead, the CIO sets `LoadBalancerProgressing=True` and waits for the user to act.

See the subnets EP's [Effectuating Subnet Updates](https://github.com/openshift/enhancements/blob/master/enhancements/ingress/lb-subnet-selection-aws.md#effectuating-subnet-updates) for prior art.

gcs278 · 2026-07-01T21:26:54Z

+      The IngressController securityGroups were changed from ["sg-old123"] to
+      ["sg-new456"]. To effectuate this change, you must delete the service:
+      `oc -n openshift-ingress delete svc/router-default`; the service
+      load-balancer will then be deprovisioned and a new one created. This will
+      cause the new load-balancer to have a different host name and IP address
+      and will cause disruption.


Nit, just for clarity: this message will be slightly different and will also include the patch revert command, since you will likely use https://github.com/openshift/cluster-ingress-operator/blob/7b7406ee0d4bf03de360d53c9cc4a83ee14332cc/pkg/operator/controller/ingress/load_balancer_service.go#L862

But not critical to update in the EP, more for your context. I get the gist of the design.

gcs278 · 2026-07-01T21:28:45Z

+- Adds a new field to the IngressController API, increasing API surface area.
+- Users must manage security group rules outside of OpenShift, which requires
+  AWS knowledge.
+


The delayed effectuation pattern does have a bit of a downside. It solves a lot of problems, but isn't very user friendly. Ultimately, it's the lesser of two evils.

Suggested change

- Changing security groups requires Service recreation, causing ingress downtime while the new NLB is provisioned and DNS is updated.

openshift-ci-robot added the jira/valid-reference label Jun 11, 2026

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 11, 2026

aswinsuryan marked this pull request as ready for review June 11, 2026 17:18

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 11, 2026

openshift-ci Bot requested review from Miciah and candita June 11, 2026 17:18

mtulio reviewed Jun 12, 2026

openshift-ci Bot assigned gcs278 Jun 23, 2026

mfbonfigli reviewed Jun 24, 2026

Comment thread enhancements/ingress/aws-nlb-security-group-selection.md Outdated

mfbonfigli reviewed Jun 24, 2026

gcs278 reviewed Jun 25, 2026

gcs278 reviewed Jul 1, 2026