CFP-45965: Use Service LoadBalancer IP as backend pod egress source#98
Ayush-Rathor wants to merge 1 commit into
Signed-off-by: Ayush-Rathor <ayushrathor104@gmail.com>
55f98cd to 99c7b11
cc @cilium/sig-datapath @cilium/sig-lb @cilium/sig-bgp
YutaroHayakawa left a comment
Made some comments. I'll let others comment on this, but I feel this is more of a high-level application that can be built on Cilium's Service and Egress Gateway features and doesn't necessarily need to be implemented in the tree.
This feature can be described as a special case of the Egress Gateway in which:
- The EGW IP must be aligned with the Service VIP.
- The Service VIP announcer selection and the EGW node selection must be aligned.
You can implement this feature with a combination of existing Cilium features. Say you have a custom operator that consumes this kind of resource:
apiVersion: example.com/v1
kind: ServiceEgress
metadata:
  name: foo
spec:
  serviceRef:
    name: service0

Say the service looks like this:
apiVersion: v1
kind: Service
metadata:
  name: service0
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: MyApp
  ...
status:
  loadBalancer:
    ingress:
    - ip: 10.0.0.1

Then the operator will generate a CiliumEgressGatewayPolicy like this:
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: foo
spec:
  selectors:
  - podSelector:
      matchLabels:
        app.kubernetes.io/name: MyApp
  destinationCIDRs:
  - "0.0.0.0/0"

One problem here is how to sync the EGW node with the announcement. At least for the L2 announcement, you can figure out the owner node using the lease.
Another problem is how to align the gateway selection on the EGW side. My suggestion is to use a per-ServiceEgress singleton node label that your operator manages. Assuming the label is always placed on the node that holds the L2 announcement lease, the final shape of the CiliumEgressGatewayPolicy looks like:
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: foo
spec:
  selectors:
  - podSelector:
      matchLabels:
        app.kubernetes.io/name: MyApp
  destinationCIDRs:
  - "0.0.0.0/0"
  egressGateway:
    nodeSelector:
      matchLabels:
        example.com/service-egress-singleton-foo: "true"

With this approach, I believe you can achieve the same goal without modifying Cilium. Some additional work we MAY need on the Cilium side:
- The naming convention of the lease is not a stable API, as far as I understand, so you may need to negotiate with the maintainers to make it stable.
- The Service and EgressGateway datapaths may assume that the Service VIP and the EGW IP do not overlap. If that's the case, you may need to fix it.
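To make the operator idea above more concrete, here is a minimal sketch of its pure logic, with no API calls. `ServiceEgress`, the `example.com/...` label, and all function names are hypothetical. The lease name format `cilium-l2announce-<namespace>-<service>` is an assumption based on what current Cilium releases create, and as noted above it is not a stable API.

```python
from typing import Dict, List, Optional

# Assumption: lease naming used by Cilium L2 announcements today (not a stable API).
LEASE_PREFIX = "cilium-l2announce"


def lease_name(namespace: str, service: str) -> str:
    """Name of the lease the operator would watch to learn the VIP owner node."""
    return f"{LEASE_PREFIX}-{namespace}-{service}"


def singleton_label(service_egress: str) -> str:
    """Hypothetical per-ServiceEgress node label managed by the operator."""
    return f"example.com/service-egress-singleton-{service_egress}"


def build_policy(service_egress: str, service: dict) -> dict:
    """Render the CiliumEgressGatewayPolicy shape shown above from a Service object."""
    label = singleton_label(service_egress)
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumEgressGatewayPolicy",
        "metadata": {"name": service_egress},
        "spec": {
            # Reuse the Service selector so the policy matches the backend pods.
            "selectors": [
                {"podSelector": {"matchLabels": dict(service["spec"]["selector"])}}
            ],
            "destinationCIDRs": ["0.0.0.0/0"],
            "egressGateway": {"nodeSelector": {"matchLabels": {label: "true"}}},
        },
    }


def plan_label_changes(labelled_nodes: List[str], holder: Optional[str]) -> Dict[str, List[str]]:
    """Compute label changes so that only the current lease holder carries the label."""
    add = [holder] if holder and holder not in labelled_nodes else []
    remove = sorted(n for n in labelled_nodes if n != holder)
    return {"add": add, "remove": remove}
```

A real operator would call `plan_label_changes` whenever the lease's holder identity changes and apply the result with node patches; removing the stale label before adding the new one keeps the gateway selection a singleton.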
## Motivation

Some environments use Service LoadBalancer IPs as stable tenant-facing or workload-facing identities. For those environments, inbound and outbound identity should be symmetric from an external system's point of view.
Is there an example of a public environment that has this kind of constraint?
One of our offerings is dedicated AI model and agent sandboxes. The platform exposes that workload through a LoadBalancer Service and assigns a stable public IP (a controller creates an ippool whenever a customer requests a static IP).
For the customer, that Service LoadBalancer IP becomes the public identity of the workload.
These workloads often need to call customer-controlled or third-party systems, such as private APIs, webhooks, SaaS platforms, databases, CRM/ticketing systems, partner APIs, or retrieval/vector database endpoints. In enterprise environments, those systems commonly use IP allowlisting.
So customers naturally expect:
- inbound traffic to the workload: Service LoadBalancer IP A
- outbound traffic from the workload: Service LoadBalancer IP A
But today the outbound path uses a shared NAT IP. That breaks the static-IP expectation for tenant-facing dedicated workloads.
#### Option 1: Single-owner BGP advertisement

Advertise the LoadBalancer IP only from the selected egress owner node when this feature is enabled.
We can achieve this behavior by setting CiliumBGPClusterConfig.spec.nodeSelector appropriately. One caveat is that, unlike the L2 announcement, this doesn't have any automatic failover mechanism.
If we want automatic failover, we need to implement an L2-announcement-like leader election mechanism in BGP, which I personally want to avoid.
That makes sense.
I agree that implementing a new L2-announcement-like leader election mechanism specifically for BGP would add complexity, and I am not proposing that as part of the initial scope.
So I will update the CFP to avoid implying that Cilium needs to add automatic BGP failover for this feature.
The initial proposal can focus on the L2/single-owner model, and the BGP section can be documented as:
- possible with singleton node selection,
- no automatic failover from Cilium initially,
- external/operator-managed failover if required,
- ECMP active-active explicitly out of scope for now.
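For the "external/operator-managed failover" item, a minimal sketch of how an external controller might pick the singleton BGP advertiser is shown below. `select_owner` and its inputs are hypothetical; nothing here is part of Cilium, and how node readiness is determined is left to the operator.

```python
from typing import Iterable, Optional


def select_owner(current: Optional[str], ready_candidates: Iterable[str]) -> Optional[str]:
    """Pick the single node that should match the BGP nodeSelector.

    The choice is sticky: the current owner is kept while it is still ready,
    so existing NAT state is not moved around unnecessarily. Otherwise the
    lexicographically first ready node is chosen, which makes the result
    deterministic across operator restarts.
    """
    candidates = sorted(set(ready_candidates))
    if current in candidates:
        return current
    return candidates[0] if candidates else None
```

The operator would then move the singleton node label to the returned node, which changes which node matches `CiliumBGPClusterConfig.spec.nodeSelector`. Established connections would still break on failover, because NAT state does not follow the label.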
@YutaroHayakawa Thanks for the suggestion. Outbound SNAT happened correctly. On the gateway node:

But the TCP handshake did not complete. The connection timed out. I also tested regular EgressGateway with a non-Service egress IP in the same setup, and that works. So I think the operator + EgressGateway approach is not achievable purely outside Cilium today. Either path still needs Cilium-side changes:
Operational concern with assigning Service VIPs to nodes in a multi-tenant k8s cluster:

In our platform, there can be thousands of tenants and thousands of Service LoadBalancer IPs. The current Kubernetes/Cilium LB model keeps those Service VIPs abstracted as virtual Service IPs; they are not normally configured as regular Linux addresses on the worker node interfaces. If using EgressGateway requires assigning every tenant/customer Service VIP directly to bare-metal gateway nodes, that introduces concerns:
For us, the desired model is that the Service LoadBalancer IP remains a Cilium/Kubernetes-managed virtual IP for ingress, and the same abstraction is reused for egress source identity. The public IP should identify the tenant Service, not become a normal host IP on the bare-metal server. Based on the test above, it does not seem possible to implement this completely as an external operator using existing Cilium behavior today.

Prior art

There is already OVN-based prior art for this behavior. OVN-Kubernetes has an EgressService feature where pods backing a LoadBalancer Service can use that Service's ingress IP as their egress source IP. Its design handles the same state/ownership problem by selecting a single node to handle both ingress and egress for that Service, steering the selected pods' egress traffic to that node, and applying SNAT to the Service ingress IP there. This is close to the behavior we need, and it is handled as part of the OVN-Kubernetes service/egress datapath design rather than as a generic external Egress Gateway workaround.
Yep, then you proved my second point.
I see. In that case, I recommend figuring out why Egress Gateway requires assigning the IP to an interface and how we can avoid that. If it's required for a good reason, then you would need to do the same in your custom implementation anyway.
I agree, the next useful step is to understand why EgressGateway currently requires the egress IP to be assigned to a node interface, and whether that requirement can be avoided for Cilium-managed Service LoadBalancer VIPs. So I will look into both parts:

If the datapath can support this cleanly, then the API/control-plane surface can be discussed separately. One possible API shape could be a higher-level Service-egress object that references a LoadBalancer Service and internally programs EGW-like behavior, while keeping the Service VIP as a Cilium-managed virtual IP rather than requiring it to be assigned as a normal Linux address on the node. But I agree that before discussing the final API shape, the first question is whether the existing EgressGateway assumptions can be relaxed safely for:

Service LoadBalancer VIP == EgressGateway egressIP
Just to clarify, the idea is to use the VIP assigned to a LoadBalancer Service, advertise that VIP via BGP or L2 announcement, and then specify the same VIP as

My understanding is that one of the reasons EGW currently requires the egress IP to be assigned to a device is so that the gateway node can reply to ARP for that egress IP. If

That said, I do not think we should relax this requirement unconditionally, because it may break existing EGW assumptions and use cases. Also, the EGW control-plane code currently assumes that the egress IP is assigned to an interface, derives the device name from it, and uses that device information for things like relaxing

Another thing we need to consider is that if the same VIP is advertised by multiple gateway nodes, return traffic may be distributed by ECMP on routers along the path. In that case, we would need some mechanism to ensure that replies return to the same gateway node that performed SNAT, or that the node receiving the reply can perform the correct reverse NAT. This should not be necessary for the single-owner model though.
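The ECMP concern can be illustrated with a small simulation. This is only a toy model: `ecmp_pick` stands in for whatever flow hash an upstream router uses, and the addresses and node names are made up. It shows that with one advertiser every reply lands on the node that performed SNAT, while with several advertisers some replies land on a node that has no NAT state for the flow.

```python
import hashlib
from typing import List, Tuple

# (src_ip, src_port, dst_ip, dst_port) of a reply packet addressed to the VIP.
Flow = Tuple[str, int, str, int]


def ecmp_pick(flow: Flow, advertisers: List[str]) -> str:
    """Choose a next hop from a stable hash of the flow tuple (router stand-in)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return advertisers[int.from_bytes(digest[:4], "big") % len(advertisers)]


def reply_reaches_nat_node(reply: Flow, snat_node: str, advertisers: List[str]) -> bool:
    """True if the router delivers the reply to the node that holds the NAT state."""
    return ecmp_pick(reply, advertisers) == snat_node


def count_delivered(snat_node: str, advertisers: List[str], flows: int = 256) -> int:
    """Count replies (one per SNAT source port) that reach the SNAT node."""
    return sum(
        reply_reaches_nat_node(("203.0.113.10", 443, "10.0.0.1", 40000 + i), snat_node, advertisers)
        for i in range(flows)
    )
```

Calling `count_delivered("node-a", ["node-a"])` covers the single-owner case, and `count_delivered("node-a", ["node-a", "node-b"])` shows the multi-advertiser case where only a fraction of replies return to the SNAT node.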
@ysksuzuki Yes, that is correct. For the initial scope, I am only considering the single-owner model.

So for L2 announcement:

And for BGP: singleton advertisement only

I agree that ECMP / multi-node advertisement is out of scope because that becomes the distributed NAT / return-path symmetry problem.

Your understanding matches what I saw in testing. When the Service VIP was not assigned to a node interface, EGW selected the pod and gateway node, but the egress IP resolved to 0.0.0.0.

When I assigned the Service VIP to the gateway node, EGW accepted it and outbound SNAT worked, but return traffic failed because packets to: ServiceVIP: were dropped as Service LB traffic:

I also tested regular EGW with a non-Service egress IP in the same setup, and that works. So the failure is specific to:

Service VIP == EGW egressIP

I agree we should not relax the “egress IP must be assigned to an interface” requirement unconditionally. The safe direction seems to be an explicit/special case only when:
I will investigate the current EGW code paths around:
If the interface requirement is mainly for ARP/reachability and device/rp_filter setup, then for the single-owner Service VIP case it may be possible to avoid assigning the VIP to the host interface, because L2/BGP already ensures return traffic reaches the selected owner node. If we find that the datapath genuinely depends on a real Linux device address, then the same limitation would apply to a custom implementation too, and we should document that clearly.
@ysksuzuki @YutaroHayakawa The patch only bypassed the EGW control-plane requirement that the configured

Test setup:

The EGW map contained the Service VIP as the egress IP:

Outbound SNAT worked even though the VIP was not assigned to a host interface:

After aligning L2 ownership and EGW node, return traffic also reached the same node:

The ARP path was also handled by L2 announcement:

I also checked the CT/NAT maps after the failed curl. The NAT state exists for both directions:

So the remaining issue is not missing SNAT/reverse-NAT state. The SYN/ACK reaches the EGW/L2 owner node, but is dropped as Service LB traffic:

So the current finding is: Service LoadBalancer VIP == EGW egressIP can work for outbound IPv4 SNAT without assigning the VIP to Linux, provided a real egress interface is still known for device/rp_filter handling and the Service VIP ownership is single-owner/aligned with the EGW node.

The remaining Cilium-side issue appears to be return-path ordering/handling: return traffic to ServiceVIP: has matching reverse-NAT state, but is still classified as Service LB traffic and dropped as DROP_NO_SERVICE. For this overlap case, EGW/NAT reverse handling likely needs to win before Service frontend lookup/drop.
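The ordering problem described in this comment can be shown with a toy model. This is not Cilium's datapath: the function, the verdict strings other than `DROP_NO_SERVICE`, the simplified reverse-NAT key, and all addresses are assumptions made for illustration. It only demonstrates why a reply to the VIP on an SNAT port is dropped when the Service frontend lookup runs first, and why it survives when reverse NAT is consulted first.

```python
from typing import Dict, Optional, Set, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


def classify_return(
    dst: Endpoint,
    service_vips: Set[str],
    service_frontends: Set[Endpoint],
    reverse_nat: Dict[Endpoint, Endpoint],
    nat_first: bool,
) -> str:
    """Return a verdict for a packet addressed to dst on the owner node."""

    def service_stage() -> Optional[str]:
        # Any packet to a Service VIP is claimed here; unknown ports are dropped.
        if dst[0] in service_vips:
            return "FORWARD_TO_BACKEND" if dst in service_frontends else "DROP_NO_SERVICE"
        return None

    def nat_stage() -> Optional[str]:
        # Replies to SNATed connections are translated back to the pod address.
        if dst in reverse_nat:
            pod_ip, pod_port = reverse_nat[dst]
            return f"REVERSE_NAT_TO {pod_ip}:{pod_port}"
        return None

    stages = (nat_stage, service_stage) if nat_first else (service_stage, nat_stage)
    for stage in stages:
        verdict = stage()
        if verdict is not None:
            return verdict
    return "PASS_TO_STACK"
```

In this model, inbound traffic to a real frontend port is still forwarded to a backend when `nat_first=True`, because it has no reverse-NAT entry. That is the property any real fix would need to preserve.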
@ysksuzuki @YutaroHayakawa The missing pieces appear to be:
Initially we should support this feature only for ARP; we may bring BGP support later.
Summary
This PR adds a CFP for supporting Kubernetes Service LoadBalancer IPs as the egress source IP for selected backend pod outbound traffic.
The proposal is tracked in Cilium issue:
Motivation
Today, a Kubernetes Service LoadBalancer IP is mainly used for inbound traffic into a Service. For some environments, especially bare-metal, private cloud, and tenant-facing platforms, there is also a need for backend pod outbound traffic to use the same Service LoadBalancer IP as its source IP.
This enables:
Proposal
The CFP proposes an opt-in feature where backend pods selected by an annotated LoadBalancer Service can egress using the Service LoadBalancer IP as the source IP.
The initial implementation model described in the CFP uses a single LB VIP owner:
pod_ip -> service_loadbalancer_ip.
Current POC status
A working IPv4 POC exists for the L2 / single-owner model.
Validated behavior:
pod_ip -> Service LoadBalancer IP
Main design questions
This CFP is primarily seeking maintainer feedback on the following:
Non-goals for the initial proposal
The initial CFP does not attempt to solve:
Those are documented as future milestones or open design areas.
Review requested
Feedback is requested from maintainers familiar with:
The key question is whether the proposed single-owner model is the right foundation before turning the POC into implementation PRs.