NE-2386: IngressController API for AWS NLB security group selection#2037
NE-2386: IngressController API for AWS NLB security group selection#2037aswinsuryan wants to merge 4 commits into
Conversation
Add enhancement proposal for specifying custom (BYO) security groups on AWS NLB IngressControllers via a new securityGroups field on AWSNetworkLoadBalancerParameters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
|
Skipping CI for Draft Pull Request. |
|
@aswinsuryan: This pull request references NE-2386 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
cc @mfbonfigli |
mtulio
left a comment
There was a problem hiding this comment.
Overall looks good! I have some questions and comments about CCM behavior with managed SG.
| [Open Questions](#open-questions)). | ||
| 4. The master node IAM role must have the | ||
| `elasticloadbalancing:SetSecurityGroups` permission (added in | ||
| [openshift/installer#10512](https://github.com/openshift/installer/pull/10512)). |
There was a problem hiding this comment.
As well is about to be added to the ROSA managed policy at Jira card : TBD
There was a problem hiding this comment.
sorry, the "TBD" was referring to https://redhat.atlassian.net/browse/SPLAT-2742 https://redhat.atlassian.net/browse/ROSAENG-14285 :)
| security groups are not deleted (the user retains ownership). The | ||
| exact CCM behavior for transitioning back to a managed security group | ||
| after BYO removal is not documented upstream and should be verified | ||
| during implementation. |
There was a problem hiding this comment.
If I followed correctly when BYOSG annotation is removed in existing Service/NLB which was previously created with BYOSG, a managed SG will be created attaching it to the new SG, leaving SG resource unattached to the NLB:
https://github.com/mfbonfigli/cloud-provider-aws/blob/31a27a5f9ac61ad68f9b4d0a8da765ff060245d3/pkg/providers/v1/aws.go#L2276-L2304
There is also a e2e for it: https://github.com/kubernetes/cloud-provider-aws/blob/c34d66ed717aaffe3319f149c26e28163e206a3e/tests/e2e/loadbalancer.go#L548
cc @mfbonfigli
Additional changes based on review feedback: - Clarify CCM behavior when removing BYO security groups - Add guidance on Service vs IngressController deletion - Update Prerequisites to reflect automatic CCCMO enablement Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
|
/assign |
| 5. The CCM attempts to attach the security groups to the NLB, but AWS | ||
| does not allow adding security groups to an NLB that was created | ||
| without security group support. The CCM reports an error. |
There was a problem hiding this comment.
CCM won't even attempt to add the security group if it detects the NLB does not support them, it will essentially just ignore the setting.
There was a problem hiding this comment.
Thanks for the clarification! Updated steps 5 and 6 to reflect that CCM silently ignores the annotation rather than reporting an error.
| On downgrade to a version that does not recognize the `securityGroups` field, | ||
| the field is ignored. Existing NLBs with attached security groups continue | ||
| functioning. No managed security groups will be created or modified by older | ||
| versions. |
There was a problem hiding this comment.
I don't think this is correct. The old CCM version would error if a service had that specific annotation set:
The reconciliation loop would fail.
Also due to a preexisting behavior, we can't really remove the annotation after the downgrade or else we risk the controller treating the BYO as a managed SG and wiping rules on it (which should never happen since it's not a cluster owned resource). Discussion ref in upstream here
The only correct downgrade way is to remove the annotation before the downgrade and then downgrade.
There was a problem hiding this comment.
Rewrote the downgrade section to state that administrators must remove the securityGroups field before downgrading
- Fix CCM behavior for NLBs without SG support - Clarify NLB recreation needed when enabling Managed mode after NLB creation - Rewrite downgrade strategy to prevent reconciliation failures Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com>
gcs278
left a comment
There was a problem hiding this comment.
Overall looks good! I think the big thing is using the established delete/recreate effectuation pattern as mentioned in the subnets EP.
I think we should require Service recreation on any change to securityGroups (add, update, or remove) using the established "always recreate the service" pattern. Security groups don't change often, so the disruption is tolerable, and it avoids the need for complex post-provision validation. The CCM doesn't provide a reliable signal when an invalid SG is applied to an already-provisioned NLB (just like subnets).
| does not allow adding security groups to an NLB that was created | ||
| without security group support. The CCM reports an error. | ||
|
|
||
| 6. The administrator must delete and recreate the NLB to enable security |
There was a problem hiding this comment.
Good info. The IngressController already has a pattern (introduced here) where if a spec field is changed and the LB needs to be recreated, it will add a Progressing=true message with instructions on how to delete the service manually (or revert).
I think we should use that pattern here. However, we don't have a pattern for "only when added for the first time, delete/recreate service". That's new. It's probably not too bad to implement, but see my other github comment below on why I think we should use the established "delete/recreate Service" pattern for this annotation rather than supporting in-place updates.
Either way, I'd recommend to document that in this workflow for clarity - mention that the CIO will add a Progressing=True message with instructions for the user. My subnets EP has a bunch of related info that could help.
There was a problem hiding this comment.
Adopted the effectuation pattern from the subnets enhancement. Added "Effectuating Security Group Updates" section with the full LoadBalancerProgressing workflow. Update, remove, and add operations all follow this pattern.
| (e.g., the cloud-controller-manager-operator has not been updated to set it, | ||
| or the cluster is running an older CCM version). |
There was a problem hiding this comment.
To be clear, this isn't possible in a healthy cluster?
CCM is always shipping cloud-config with NLBSecurityGroupMode = Managed now. The FG openshift/api#2717 is now set to GA, and permanently enabled.
You can keep this if you want, but also it's describing a fundamentally broken cluster.
There was a problem hiding this comment.
Agreed. Removed this section
| (must match `sg-` followed by 8-17 hex characters) and accepts it. | ||
|
|
||
| 3. The CCM attempts to attach the security group and fails. The error | ||
| is surfaced as a `SyncLoadBalancerFailed` event on the Service. The |
There was a problem hiding this comment.
Right now, the logic in the CIO can only detect a LB failure on first provisioning, but not for updates. The CIO only checks for SyncLoadBalancerFailed events when the Service is in isPending state. If the NLB is already provisioned and the user changes to an invalid SG, the Service still has status.LoadBalancer.Ingress populated, so the CIO reports LoadBalancerReady=True and the error is only visible in CCM logs/events.
It's an unfortunate existing blind spot with no easy fix. It's one of the reasons we use the "delete/recreate" pattern for annotations, even when the CCM supports in-place updates, the CIO has no way to detect or surface post-provision failures.
Something to consider. I'd suggest going with "delete/recreate" for any update to security groups out of safety (and it's also a lot easier to implement) and document that in the EP.
There was a problem hiding this comment.
All securityGroups changes (add, update, remove) now require Service deletion and recreation via the effectuation pattern.
| `eipAllocations` — it reads the `securityGroups` field from the | ||
| IngressController spec and sets the corresponding Service annotation. | ||
|
|
||
| When `securityGroups` is specified, the operator checks the `cloud-provider-config` |
There was a problem hiding this comment.
Same point as above, does this actually add value if we control the cloud-provider-config configmap?
There was a problem hiding this comment.
Removed the cloud-provider-config validation logic.
| Service directly (allowing automatic recreation) or delete and recreate the | ||
| entire IngressController. See the "NLB Created Before Security Group Support" | ||
| variation for detailed guidance. | ||
|
|
There was a problem hiding this comment.
Along with updates to describe the effectuation pattern (delete/recreate): it's worth adding a note/subsection about the annotation ingress.operator.openshift.io/auto-delete-load-balancer - we should support that and document as a workflow.
There was a problem hiding this comment.
Added an "Automatic Service Deletion" subsection under the effectuation section, documenting the auto-delete annotation with a YAML example.
|
|
||
| 5. Once `NLBSecurityGroupMode = Managed` is set in the cloud-config, the | ||
| Cluster Ingress Operator will reconcile and add the annotation to the Service. | ||
| However, if the NLB was created before Managed mode was enabled, the |
There was a problem hiding this comment.
This is another reason to always require a service recreation (delete/recreate) when changing the security group. It avoids having to write any of this logic.
There was a problem hiding this comment.
Removed the pre-existing NLB variation section. The effectuation pattern handles all cases uniformly.
Change the enhancement to require Service deletion and recreation when the securityGroups field is updated, rather than attempting in-place annotation updates. This addresses review feedback that in-place updates create a validation blind spot: the Cluster Ingress Operator cannot detect CCM failures after initial NLB provisioning because the Service status remains populated even when the CCM fails to apply changes. Signed-off-by: Aswin Suryanarayanan <asuryana@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@aswinsuryan: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
gcs278
left a comment
There was a problem hiding this comment.
Thanks for the quick updates.
Nearly LGTM - I agree with the approach, just have minor EP-specific points.
| Existing IngressControllers without the `securityGroups` field continue | ||
| working unchanged. The new field is optional and has no default value. |
There was a problem hiding this comment.
The upgrade situation is a bit more nuanced than what's described here. We have to handle the case where a user has manually added the service.beta.kubernetes.io/aws-load-balancer-security-groups annotation to the Service before this feature existed. When the CIO upgrades to a version that manages this annotation, the spec field will be empty (no SGs configured), but the annotation exists on the Service. The CIO sees a mismatch — the annotation is present but the spec says it shouldn't be — and we need to avoid automatically stomping on the user's manually-set annotation.
The good news is the effectuation pattern handles this nicely. On upgrade, the CIO will detect that the current annotation doesn't match the desired state (no annotation), and fire LoadBalancerProgressing=True without removing the annotation. The user can then either set spec...securityGroups to match their existing annotation (resolving the mismatch without disruption) or delete the Service to let the CIO recreate it without the annotation.
It's also important that we don't treat an empty spec as "unmanaged" — we've had implementations in the past where empty spec meant "don't touch the annotation," which leads to stale annotations persisting silently when a user adds then removes the field. Empty spec should mean "no SGs desired."
I'd recommend adding a workflow for this upgrade/migration scenario, similar to the Unmanaged Subnet Annotation Migration Workflow in the subnets EP, then I would mention that we support upgrades via the effectuation pattern and link to the workflow in this Upgrades section.
|
|
||
| Changing the `securityGroups` field on an existing IngressController requires | ||
| deleting and recreating the Service. | ||
|
|
There was a problem hiding this comment.
I think having a "why" is pretty valuable, here's just a suggestion, feel free to modify:
| Although the CCM supports updating security groups in-place, the delete/recreate pattern is used for the following reasons: | |
| 1. **The CIO cannot detect post-provision CCM failures.** Once the NLB is provisioned, `LoadBalancerReady` stays `True` even if a subsequent SG update fails. Requiring Service recreation ensures that invalid security groups are caught during initial provisioning via `LoadBalancerReady=False`. | |
| 2. **Upgrade compatibility with previously-unmanaged annotations.** If a user manually set the `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation before this feature existed, the effectuation pattern prevents the CIO from automatically removing it on upgrade. Instead, the CIO sets `LoadBalancerProgressing=True` and waits for the user to act. | |
| See the subnets EP's [Effectuating Subnet Updates](https://github.com/openshift/enhancements/blob/master/enhancements/ingress/lb-subnet-selection-aws.md#effectuating-subnet-updates) for prior art. |
| The IngressController securityGroups were changed from ["sg-old123"] to | ||
| ["sg-new456"]. To effectuate this change, you must delete the service: | ||
| `oc -n openshift-ingress delete svc/router-default`; the service | ||
| load-balancer will then be deprovisioned and a new one created. This will | ||
| cause the new load-balancer to have a different host name and IP address | ||
| and will cause disruption. |
There was a problem hiding this comment.
nit just for clarify, this message will be slightly different and also include the patch revert command since you will likely use https://github.com/openshift/cluster-ingress-operator/blob/7b7406ee0d4bf03de360d53c9cc4a83ee14332cc/pkg/operator/controller/ingress/load_balancer_service.go#L862
But not critical to update in the EP, more for your context. I get the gist of the design.
| - Adds a new field to the IngressController API, increasing API surface area. | ||
| - Users must manage security group rules outside of OpenShift, which requires | ||
| AWS knowledge. | ||
|
|
There was a problem hiding this comment.
The delayed effectuation pattern does have a bit of a downside. It solves a lot of problems, but isn't very user friendly. Ultimately, it's the lesser of two evils.
| - Changing security groups requires Service recreation, causing ingress downtime while the new NLB is provisioned and DNS is updated. |
This enhancement enables administrators to specify custom (Bring Your Own) security groups for AWS Network Load Balancers on IngressControllers. A new securityGroups field is added to AWSNetworkLoadBalancerParameters, which the Cluster
Ingress Operator translates into the service.beta.kubernetes.io/aws-load-balancer-security-groups Service annotation for the Cloud Controller Manager. This targets ROSA as the primary deployment type and builds on the CCM-level BYO security
group support added in upstream cloud-provider-aws#1379.