CFP-45965: Use Service LoadBalancer IP as backend pod egress source#98

Open
Ayush-Rathor wants to merge 1 commit into
cilium:mainfrom
Ayush-Rathor:cfp-45965-service-lb-egress-source-ip

Conversation

@Ayush-Rathor

Summary

This PR adds a CFP for supporting Kubernetes Service LoadBalancer IPs as the egress source IP for selected backend pod outbound traffic.

The proposal is tracked in Cilium issue:

Motivation

Today, a Kubernetes Service LoadBalancer IP is mainly used for inbound traffic into a Service. For some environments, especially bare-metal, private cloud, and tenant-facing platforms, there is also a need for backend pod outbound traffic to use the same Service LoadBalancer IP as its source IP.

This enables:

  • stable allowlisting by external systems
  • Service-level egress identity
  • tenant attribution and audit logs
  • consistent inbound and outbound Service identity
  • easier integration with customer firewalls and external systems

Proposal

The CFP proposes an opt-in feature where backend pods selected by an annotated LoadBalancer Service can egress using the Service LoadBalancer IP as the source IP.

The initial implementation model described in the CFP uses a single LB VIP owner:

  1. Backend pod sends traffic to an external target.
  2. Source/backend node checks whether the pod should use a Service LB IP for egress.
  3. If the LB VIP owner is remote, traffic is steered/tunneled to the owner node.
  4. The owner node SNATs pod_ip -> service_loadbalancer_ip.
  5. External target replies to the Service LoadBalancer IP.
  6. Owner node reverse-DNATs the reply back to the original pod.
  7. If the backend pod is remote, the owner tunnels the reply back to the backend pod's node.
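The SNAT and reverse-DNAT steps (4 and 6) can be sketched as a toy model. This is only an illustration under stated assumptions, not Cilium datapath code: the `OwnerNode` class, its port allocation, and the example addresses and ports are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class OwnerNode:
    """Toy model of the LB VIP owner node (hypothetical, not Cilium code)."""

    def __init__(self, vip: str, first_port: int = 32768):
        self.vip = vip
        self._next_port = first_port
        # (vip_port, remote_ip, remote_port) -> (pod_ip, pod_port)
        self._rev_nat = {}

    def snat(self, pkt: Packet) -> Packet:
        """Step 4: rewrite pod_ip -> service_loadbalancer_ip and record state."""
        vip_port = self._next_port
        self._next_port += 1
        self._rev_nat[(vip_port, pkt.dst_ip, pkt.dst_port)] = (pkt.src_ip, pkt.src_port)
        return Packet(self.vip, vip_port, pkt.dst_ip, pkt.dst_port)

    def reverse_dnat(self, reply: Packet):
        """Step 6: map a reply addressed to the VIP back to the original pod."""
        if reply.dst_ip != self.vip:
            return None
        orig = self._rev_nat.get((reply.dst_port, reply.src_ip, reply.src_port))
        if orig is None:
            return None  # no state: this node did not SNAT the connection
        pod_ip, pod_port = orig
        return Packet(reply.src_ip, reply.src_port, pod_ip, pod_port)


# Illustrative addresses only.
owner = OwnerNode(vip="10.20.32.240")
out = owner.snat(Packet("10.233.83.26", 40000, "10.20.32.9", 80))
reply = Packet(out.dst_ip, out.dst_port, out.src_ip, out.src_port)
back = owner.reverse_dnat(reply)
```

Because the reverse mapping lives only on the node that performed SNAT, replies must return to that same node, which is what the single-owner model guarantees.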

Current POC status

A working IPv4 POC exists for the L2 / single-owner model.

Validated behavior:

  • remote backend pod traffic is steered to the LB VIP owner node
  • owner node SNATs pod_ip -> Service LoadBalancer IP
  • external replies return to the Service LoadBalancer IP
  • owner reverse-DNATs replies back to the original pod
  • owner tunnels replies back to the backend pod if the pod is remote

Main design questions

This CFP is primarily seeking maintainer feedback on the following:

  1. Is the single LB VIP owner model acceptable for the initial implementation?
  2. For BGP-advertised LoadBalancer IPs, should this feature preserve a single VIP egress owner, or support multi-node advertisement / ECMP?
  3. Should this integrate more deeply with existing Cilium CT/NAT/EgressGateway logic?
  4. Is a Service annotation acceptable as the initial user-facing API, or should this be modeled as a CRD/policy?
  5. What ownership model should be used when LB VIP ownership changes?

Non-goals for the initial proposal

The initial CFP does not attempt to solve:

  • IPv6 support
  • distributed reverse NAT state
  • ECMP-safe multi-node SNAT
  • full EgressGateway replacement
  • arbitrary pod-to-egress-IP policy
  • production-grade implementation details such as final GC/timeout behavior

Those are documented as future milestones or open design areas.

Review requested

Feedback is requested from maintainers familiar with:

  • Service LoadBalancer datapath
  • L2 announcement behavior
  • BGP LoadBalancer IP advertisement
  • CT/NAT integration
  • EgressGateway architecture
  • Cilium BPF datapath ownership

The key question is whether the proposed single-owner model is the right foundation before turning the POC into implementation PRs.

Signed-off-by: Ayush-Rathor <ayushrathor104@gmail.com>
@Ayush-Rathor Ayush-Rathor force-pushed the cfp-45965-service-lb-egress-source-ip branch from 55f98cd to 99c7b11 Compare May 31, 2026 07:40
@joestringer

Member

cc @cilium/sig-datapath @cilium/sig-lb @cilium/sig-bgp

@YutaroHayakawa YutaroHayakawa left a comment

Member

Made some comments. I'll let others comment on this, but I feel this is more of a high-level application that can be built on Cilium's Service and Egress Gateway features and doesn't necessarily need to be implemented in-tree.

This feature can be described as a special case of Egress Gateway in which:

  1. The EGW IP must be aligned with the Service VIP.
  2. The Service VIP announcer selection and EGW node selection must be aligned.

You can implement this feature by combining existing Cilium features. Say you have a custom operator that consumes a resource like this:

apiVersion: example.com/v1
kind: ServiceEgress
metadata:
  name: foo
spec:
  serviceRef:
    name: service0

Say the service looks like this:

apiVersion: v1
kind: Service
metadata:
  name: service0
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: MyApp
...
status:
  loadBalancer:
    ingress:
    - ip: 10.0.0.1

Then the operator will generate a CiliumEgressGatewayPolicy like this:

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: foo
spec:
  selectors:
  - podSelector:
      matchLabels:
        app.kubernetes.io/name: MyApp
  destinationCIDRs:
  - "0.0.0.0/0"

One problem here is how to sync the EGW node with the announcement. At least for L2 announcement, you can figure out the owner node using the lease.

Another problem is how to align the gateway selection on the EGW side. My suggestion is to use a per-ServiceEgress singleton node label that your operator manages. Assuming the label is always placed on the node that holds the L2 announcement lease, the final shape of the CiliumEgressGatewayPolicy is:

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: foo
spec:
  selectors:
  - podSelector:
      matchLabels:
        app.kubernetes.io/name: MyApp
  destinationCIDRs:
  - "0.0.0.0/0"
  egressGateway:
    nodeSelector:
      matchLabels:
        example.com/service-egress-singleton-foo: "true"

With this approach, I believe you can achieve the same goal without modifying Cilium. Some additional work we MAY need on the Cilium side:

  1. The naming convention of the lease is not a stable API in my understanding, so you may need to negotiate with the maintainers to make it stable.
  2. The Service and EgressGateway datapath may have an assumption that Service VIP and EGW IP are non-overlapping. If that's the case, you may need to fix it.
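The label management such an operator would perform can be sketched as a pure function. This is a minimal sketch with assumptions: `plan_label_changes` and its inputs are hypothetical, and how the lease holder is discovered is left out because the lease naming convention is not a stable API.

```python
# Label key taken from the policy example above.
LABEL_KEY = "example.com/service-egress-singleton-foo"


def plan_label_changes(lease_holder, node_labels):
    """Return (nodes_to_label, nodes_to_unlabel) so only the lease holder has the label.

    lease_holder: name of the node holding the L2 announcement lease, or None.
    node_labels: dict mapping node name -> dict of that node's current labels.
    """
    to_label, to_unlabel = [], []
    for name, labels in sorted(node_labels.items()):
        has_label = labels.get(LABEL_KEY) == "true"
        if name == lease_holder and not has_label:
            to_label.append(name)
        elif name != lease_holder and has_label:
            to_unlabel.append(name)
    return to_label, to_unlabel


# Hypothetical cluster state: the lease moved from worker-2 to worker-1.
nodes = {
    "worker-1": {},
    "worker-2": {LABEL_KEY: "true"},
    "worker-3": {},
}
add, remove = plan_label_changes("worker-1", nodes)
```

Applying the removals before the additions would avoid two nodes carrying the singleton label at once, at the cost of a short window in which no node matches the policy's `nodeSelector`.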


## Motivation

Some environments use Service LoadBalancer IPs as stable tenant-facing or workload-facing identities. For those environments, inbound and outbound identity should be symmetric from an external system's point of view.

Member

Is there an example of a public environment that has this kind of constraint?

Author

One of our offerings is dedicated AI model and agent sandboxes. The platform exposes each workload through a LoadBalancer Service and assigns a stable public IP (a controller creates an IP pool whenever a customer requests a static IP).

For the customer, that Service LoadBalancer IP becomes the public identity of the workload.

These workloads often need to call customer-controlled or third-party systems, such as private APIs, webhooks, SaaS platforms, databases, CRM/ticketing systems, partner APIs, or retrieval/vector database endpoints. In enterprise environments, those systems commonly use IP allowlisting.

So customers naturally expect:

inbound traffic to the workload:  Service LoadBalancer IP A
outbound traffic from workload:   Service LoadBalancer IP A

But today the outbound path uses a shared NAT IP. That breaks the static-IP expectation for tenant-facing dedicated workloads.


#### Option 1: Single-owner BGP advertisement

Advertise the LoadBalancer IP only from the selected egress owner node when this feature is enabled.

Member

We can achieve this behavior by setting CiliumBGPClusterConfig.spec.nodeSelector appropriately. One caveat is that, unlike L2 announcement, this doesn't have any automatic failover mechanism.

If we want automatic failover, we need to implement an L2-announcement-like leader election mechanism in BGP, which I personally want to avoid.

Author

That makes sense.

I agree that implementing a new L2-announcement-like leader election mechanism specifically for BGP would add complexity, and I am not proposing that as part of the initial scope.
So I will update the CFP to avoid implying that Cilium needs to add automatic BGP failover for this feature.

The initial proposal can focus on the L2/single-owner model, and the BGP section can be documented as:

  • possible with singleton node selection,
  • no automatic failover from Cilium initially,
  • external/operator-managed failover if required,
  • ECMP active-active explicitly out of scope for now.

Member

Sounds good to me!

Comment thread cilium/CFP-45965-service-lb-egress-source-ip.md
@Ayush-Rathor

Author

@YutaroHayakawa Thanks for the suggestion.
I tested the suggested composition path. It gets close, but it does not work as-is because Service VIP and EgressGateway egressIP currently conflict in the return path. So either the original feature or the EgressGateway-based approach still requires Cilium-side changes.

Outbound SNAT happened correctly. On the gateway node:

10.20.32.240:<ephemeral-port> -> 10.20.32.9:80 SYN
10.20.32.9:80 -> 10.20.32.240:<ephemeral-port> SYN,ACK

But the TCP handshake did not complete and the connection timed out.
Drop reason:

drop (Service backend not found)
10.20.32.9:80 -> 10.20.32.240:<ephemeral-port>

I also tested regular EgressGateway with a non-Service egress IP in the same setup, and that works.
My current understanding is that the return packet to ServiceVIP:<ephemeral-port> is handled as Service LB traffic and dropped because there is no Service backend for that ephemeral port, instead of being matched against EGW reverse-NAT state.

So I think the operator + EgressGateway approach is not achievable purely outside Cilium today. Either path still needs Cilium-side changes:

  • A Service-LB-egress feature; or
  • An EgressGateway enhancement that explicitly supports Service VIP == egressIP.
    Changing Egress Gateway reverse-traffic handling only makes sense, however, if a custom controller keeps the LB VIP owner and the egress gateway node aligned.

Also, there is already OVN-based prior art for this behavior. OVN-Kubernetes has an EgressService feature where pods backing a LoadBalancer Service can use that Service's ingress IP as their egress source IP. Its design explicitly handles the same conntrack/state problem by selecting a single node to handle both ingress and egress for that Service, steering the selected pods' egress traffic to that node, and applying SNAT to the Service ingress IP there. It is handled as part of the OVN-Kubernetes service/egress datapath design.

Operational concern with assigning Service VIPs to nodes in a multi-tenant Kubernetes cluster:

In our platform, there can be thousands of tenants and thousands of Service LoadBalancer IPs. The current Kubernetes/Cilium LB model keeps those Service VIPs abstracted as virtual Service IPs; they are not normally configured as regular Linux addresses on the worker node interfaces.

If using EgressGateway requires assigning every tenant/customer Service VIP directly to bare-metal gateway nodes, that introduces concerns:

  • many customer public IPs become configured directly on worker nodes
  • host services listening on wildcard addresses may become reachable unless separately firewalled
  • ARP ownership may become ambiguous unless carefully controlled
  • the clean abstraction of “this public IP belongs to a Kubernetes Service, not to the host” becomes weaker

For us, the desired model is that the Service LoadBalancer IP remains a Cilium/Kubernetes-managed virtual IP for ingress, and the same abstraction is reused for egress source identity. The public IP should identify the tenant Service, not become a normal host IP on the bare-metal server.

Based on the test above, it does not seem possible to implement this completely as an external operator using existing Cilium behavior today.

Prior art

There is already OVN-based prior art for this behavior.

OVN-Kubernetes has an EgressService feature where pods backing a LoadBalancer Service can use that Service's ingress IP as their egress source IP. Its design handles the same state/ownership problem by selecting a single node to handle both ingress and egress for that Service, steering the selected pods' egress traffic to that node, and applying SNAT to the Service ingress IP there.

This is close to the behavior we need, and it is handled as part of the OVN-Kubernetes service/egress datapath design rather than as a generic external Egress Gateway workaround.

@YutaroHayakawa

YutaroHayakawa commented Jun 5, 2026

Member

> So I think the operator + EgressGateway approach is not achievable purely outside Cilium today. Either path still needs Cilium-side changes:

Yep, then you proved my second point ("The Service and EgressGateway datapath may have an assumption that Service VIP and EGW IP are non-overlapping"), which is nice. With this, we can start the conversation around how to solve this problem.

> If using EgressGateway requires assigning every tenant/customer Service VIP directly to bare-metal gateway nodes, that introduces concerns:

I see. What I recommend now is figuring out why Egress Gateway requires assigning the IP to an interface and how we can avoid that. If it's necessary for a good reason, then I think you would need to do the same in your custom implementation anyway.

@Ayush-Rathor

Author

I agree, the next useful step is to understand why EgressGateway currently requires the egress IP to be assigned to a node interface, and whether that requirement can be avoided for Cilium-managed Service LoadBalancer VIPs.

So I will look into the following:

  • where EGW validates/selects the egress IP from node interface addresses;
  • whether a Cilium-managed Service LB VIP can be treated as a valid virtual egress IP;
  • whether EGW reverse-NAT handling can happen before Service LB frontend lookup/drop for this overlap case.

If the datapath can support this cleanly, then the API/control-plane surface can be discussed separately.

One possible API shape could be a higher-level Service-egress object that references a LoadBalancer Service and internally programs EGW-like behavior, while keeping the Service VIP as a Cilium-managed virtual IP rather than requiring it to be assigned as a normal Linux address on the node.

But I agree that before discussing the final API shape, the first question is whether the existing EgressGateway assumptions can be relaxed safely for:

Service LoadBalancer VIP == EgressGateway egressIP

@ysksuzuki

ysksuzuki commented Jun 5, 2026

Member

Just to clarify, the idea is to use the VIP assigned to a LoadBalancer Service, advertise that VIP via BGP or L2 announcement, and then specify the same VIP as egressGateway.egressIP in a CiliumEgressGatewayPolicy. Is that correct?

My understanding is that one of the reasons EGW currently requires the egress IP to be assigned to a device is so that the gateway node can reply to ARP for that egress IP. If egressIP == LB VIP, and that VIP is already advertised via BGP or L2 announcement, then attaching the VIP to a device on the gateway node may not be strictly necessary, as long as return traffic for that VIP is guaranteed to reach the correct gateway node.

That said, I do not think we should relax this requirement unconditionally, because it may break existing EGW assumptions and use cases. Also, the EGW control-plane code currently assumes that the egress IP is assigned to an interface, derives the device name from it, and uses that device information for things like relaxing rp_filter. So if we want to support an egress IP that is not assigned to a device, we should also investigate the impact on those code paths.

Another thing we need to consider is that if the same VIP is advertised by multiple gateway nodes, return traffic may be distributed by ECMP on routers along the path. In that case, we would need some mechanism to ensure that replies return to the same gateway node that performed SNAT, or that the node receiving the reply can perform the correct reverse NAT. This should not be necessary for the single-owner model though.

@Ayush-Rathor

Author

@ysksuzuki Yes, that is correct.

For the initial scope, I am only considering the single-owner model.

So for L2 announcement:

Service VIP owner == EgressGateway node

And for BGP:

singleton advertisement only

I agree that ECMP / multi-node advertisement is out of scope because that becomes the distributed NAT / return-path symmetry problem.

Your understanding matches what I saw in testing.

When the Service VIP was not assigned to a node interface, EGW selected the pod and gateway node, but the egress IP resolved to 0.0.0.0.

When I assigned the Service VIP to the gateway node, EGW accepted it and outbound SNAT worked, but return traffic failed because packets to:

ServiceVIP:<ephemeral-port>

were dropped as Service LB traffic:

drop (Service backend not found)
10.20.32.9:80 -> 10.20.32.240:<ephemeral-port>

I also tested regular EGW with a non-Service egress IP in the same setup, and that works. So the failure is specific to:

Service VIP == EGW egressIP

I agree we should not relax the “egress IP must be assigned to an interface” requirement unconditionally.

The safe direction seems to be an explicit/special case only when:

  1. the egress IP is a Cilium-managed LoadBalancer Service VIP;
  2. the Service VIP owner/advertising node and EGW gateway node are the same node;
  3. the mode is single-owner only;
  4. ECMP / active-active advertisement is not enabled;
  5. the policy explicitly opts into using the Service VIP as egress source IP.
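The conditions above can be expressed as a single admission check. This is a sketch under assumptions: the `EgressCandidate` fields and the `may_use_service_vip` function are hypothetical names, not existing Cilium types. Conditions 3 and 4 are folded into one "exactly one owner node" check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressCandidate:
    egress_ip: str               # IP requested as the egress source
    service_vips: frozenset      # VIPs of Cilium-managed LoadBalancer Services
    vip_owner_nodes: frozenset   # nodes currently announcing/advertising the VIP
    gateway_node: str            # node selected as the EGW gateway
    policy_opt_in: bool          # policy explicitly asks for Service VIP egress


def may_use_service_vip(c: EgressCandidate):
    """Return (allowed, reasons); reasons lists each violated condition."""
    reasons = []
    if c.egress_ip not in c.service_vips:
        reasons.append("egress IP is not a Cilium-managed LoadBalancer Service VIP")
    if len(c.vip_owner_nodes) != 1:
        reasons.append("VIP is not single-owner (no owner, or ECMP / active-active)")
    elif c.gateway_node not in c.vip_owner_nodes:
        reasons.append("VIP owner node and EGW gateway node differ")
    if not c.policy_opt_in:
        reasons.append("policy does not opt into Service VIP egress")
    return (not reasons, reasons)


vips = frozenset({"10.20.32.240"})
aligned = EgressCandidate("10.20.32.240", vips, frozenset({"worker-1"}), "worker-1", True)
ecmp = EgressCandidate("10.20.32.240", vips, frozenset({"worker-1", "worker-2"}), "worker-1", True)
allowed, _ = may_use_service_vip(aligned)
ecmp_allowed, ecmp_reasons = may_use_service_vip(ecmp)
```

Returning every violated condition, rather than failing on the first, would let a control plane surface the complete reason in a policy status or log.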

I will investigate the current EGW code paths around:

  • where EGW validates/selects egress IPs from node interface addresses;
  • where it derives the device name;
  • how that device is used for rp_filter relaxation and other setup;
  • whether outbound SNAT itself actually requires the IP to be assigned to Linux;
  • and whether EGW reverse-NAT can be checked before Service LB frontend lookup/drop when ServiceVIP == egressIP.

If the interface requirement is mainly for ARP/reachability and device/rp_filter setup, then for the single-owner Service VIP case it may be possible to avoid assigning the VIP to the host interface, because L2/BGP already ensures return traffic reaches the selected owner node.

If we find that the datapath genuinely depends on a real Linux device address, then the same limitation would apply to a custom implementation too, and we should document that clearly.

@Ayush-Rathor

Author

@ysksuzuki @YutaroHayakawa
I validated this further with a dirty local patch to separate the control-plane/interface requirement from the datapath behavior.

The patch only bypassed the EGW control-plane requirement that the configured egressIP must be assigned to a Linux interface. I still provided a real egress interface name for device/rp_filter handling.

Test setup:

Service VIP / EGW egressIP: 10.20.32.240
Backend pod IP:             10.233.83.26
External target:            10.20.32.9:80
EGW gateway node:           worker-1
L2 announcement owner:      worker-1
VIP assigned to Linux:      no

The EGW map contained the Service VIP as the egress IP:

Source IP      Destination CIDR   Egress IP      Gateway IP
10.233.83.26   10.20.32.9/32      10.20.32.240   10.20.36.184

Outbound SNAT worked even though the VIP was not assigned to a host interface:

10.20.32.240:<port> -> 10.20.32.9:80 SYN

After aligning L2 ownership and EGW node, return traffic also reached the same node:

10.20.32.9:80 -> 10.20.32.240:<port> SYN/ACK

The ARP path was also handled by L2 announcement:

ARP who-has 10.20.32.240 tell 10.20.32.9
ARP reply 10.20.32.240 is-at <worker-1-mac>

I also checked the CT/NAT maps after the failed curl. The NAT state exists for both directions:

TCP OUT 10.233.83.26:<port> -> 10.20.32.9:80
XLATE_SRC 10.20.32.240:<port>
TCP IN 10.20.32.9:80 -> 10.20.32.240:<port>
XLATE_DST 10.233.83.26:<port>

So the remaining issue is not missing SNAT/reverse-NAT state.

The SYN/ACK reaches the EGW/L2 owner node, but is dropped as Service LB traffic:

drop (Service backend not found)
10.20.32.9:80 -> 10.20.32.240:<ephemeral-port> tcp SYN, ACK

So the current finding is:

Service LoadBalancer VIP == EGW egressIP

can work for outbound IPv4 SNAT without assigning the VIP to Linux, provided a real egress interface is still known for device/rp_filter handling and the Service VIP ownership is single-owner/aligned with the EGW node.

The remaining Cilium-side issue appears to be return-path ordering/handling: return traffic to ServiceVIP:<ephemeral-port> has matching reverse-NAT state, but is still classified as Service LB traffic and dropped as DROP_NO_SERVICE. For this overlap case, EGW/NAT reverse handling likely needs to win before Service frontend lookup/drop.
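The ordering problem can be illustrated with a toy classifier. This is a simplified model of the observed behavior, not the actual Cilium BPF code path: the `classify` function, the `nat_first` switch, the Service frontend port 443, and the ephemeral port 40000 are assumptions made for illustration.

```python
def classify(pkt, frontends, rev_nat, nat_first):
    """Classify an inbound packet on the gateway node.

    pkt: {"src": (ip, port), "dst": (ip, port)}
    frontends: set of (vip, port) Service frontends
    rev_nat: {(src, dst): (pod_ip, pod_port)} reverse-NAT state
    nat_first: if True, consult reverse-NAT state before the Service lookup
    """
    key = (pkt["src"], pkt["dst"])
    if nat_first and key in rev_nat:
        return ("REVERSE_NAT", rev_nat[key])
    vip_ips = {ip for ip, _port in frontends}
    if pkt["dst"][0] in vip_ips:
        if pkt["dst"] in frontends:
            return ("SERVICE_LB", None)
        # Destination IP is a Service VIP but no frontend exists for this port.
        return ("DROP_NO_SERVICE", None)
    if key in rev_nat:
        return ("REVERSE_NAT", rev_nat[key])
    return ("FORWARD", None)


frontends = {("10.20.32.240", 443)}  # assumed Service port
rev_nat = {(("10.20.32.9", 80), ("10.20.32.240", 40000)): ("10.233.83.26", 40000)}
syn_ack = {"src": ("10.20.32.9", 80), "dst": ("10.20.32.240", 40000)}

observed = classify(syn_ack, frontends, rev_nat, nat_first=False)
proposed = classify(syn_ack, frontends, rev_nat, nat_first=True)
```

In this model, consulting reverse-NAT state first only affects packets that match an existing connection, so ordinary inbound traffic to the Service frontend is still classified as Service LB traffic.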

@Ayush-Rathor

Author

@ysksuzuki @YutaroHayakawa
Based on the testing above, it seems this use case may fit naturally as an extension of the existing Egress Gateway implementation.

The missing pieces appear to be:

  • Allowing an egress IP that is advertised via Service LB mechanisms rather than assigned to a Linux interface.
  • Automatic alignment between the Service VIP owner and the egress node.
  • Correct reverse-NAT handling when ServiceVIP == egressIP.

Initially, we should support this feature only for L2 announcement (ARP); BGP support can come later.
Whether that is ultimately exposed through EGW APIs, Service APIs, or a dedicated higher-level resource is probably a separate discussion.
