CFP-44188: VTEP Improvements with CRD #92
Replace the static CLI flag-based VTEP configuration with a cluster-scoped CiliumVTEPConfig CRD that supports dynamic updates, per-node assignment via nodeSelector, and per-endpoint status reporting.
Signed-off-by: Murat Parlakisik <parlakisik@gmail.com>
e1f2862 to d0c2ee8
cc: @cilium/sig-datapath
> If a user doesn't want to manage BGP or L2 announcements to send traffic to
> some network via an external gateway, the VTEP approach offers a
> fundamentally simpler model:
>
> Pods send traffic via the existing VXLAN overlay directly to an external
> VTEP endpoint. No BGP sessions to configure and maintain. No L2 announcement
> policies. No route redistribution. The Cilium agent simply encapsulates
> traffic destined for external CIDRs and sends it to a known VTEP endpoint.
The BGP and L2 announcement features are about advertising addresses on connected networks so that remote clusters know how to transmit towards Cilium. This feature instead seems to be about configuring Cilium to route traffic from Cilium. Are they really equivalent?
They're not equivalent, but they're all connectivity mechanisms. VTEP is a bidirectional tunnel.
It fits environments where the advertise-based approaches don't apply: no BGP-speaking upstream to peer with and no L2 adjacency to the destination (L2 announcement is ARP-based and single-segment). In those cases VTEP gives pods a simple two-way overlay path to an external gateway over a routed underlay.
I will reframe "The Case for Overlay-to-Gateway Simplicity" around connectivity alternatives with different mechanisms, rather than implying equivalence.
> 6. It updates Linux routing table entries for VTEP CIDRs.
>
> 7. It writes per-endpoint status back to the CRD's `.status` subresource.
As a general statement, we avoid putting logic into cilium-agent to update .status because as you scale up, this causes significant load and conflicts on kube-apiserver due to competing agents attempting to make similar updates, often at the same time.
This is probably a case for understanding the tradeoff with @cilium/sig-scalability, especially the size of the targeted environments, and then considering how we might gain the desired operational visibility without introducing scalability concerns.
That's a good catch. I will update this by adding a per-node CiliumVTEPNodeConfig: one object per node, whose status holds the resolved VTEP endpoints plus per-endpoint sync state. The agent reconciles only after a local BPF map sync completes.
I am trying to align with the pattern defined for CiliumBGPNodeConfig.
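To make the per-node status idea concrete, here is a minimal sketch of what a node-local status and a "write only when something changed" check could look like. Every name in it (`SyncState`, `VTEPEndpointStatus`, `NodeConfigStatus`, `needsStatusWrite`) is hypothetical and not taken from the CFP or from Cilium; it also assumes VTEP endpoint names are unique within a node.

```go
package main

// SyncState is a hypothetical per-VTEP-endpoint sync state.
type SyncState string

const (
	SyncPending SyncState = "Pending"
	SyncSynced  SyncState = "Synced"
	SyncFailed  SyncState = "Failed"
)

// VTEPEndpointStatus is a hypothetical status entry for one VTEP endpoint.
type VTEPEndpointStatus struct {
	Name  string
	State SyncState
}

// NodeConfigStatus is a hypothetical status block for one node's object.
type NodeConfigStatus struct {
	VTEPEndpoints []VTEPEndpointStatus
}

// needsStatusWrite reports whether next differs from prev, so that the
// agent can skip the API call when a BPF map sync changed nothing.
// Assumption: endpoint names are unique within one status.
func needsStatusWrite(prev, next NodeConfigStatus) bool {
	if len(prev.VTEPEndpoints) != len(next.VTEPEndpoints) {
		return true
	}
	seen := make(map[string]SyncState, len(prev.VTEPEndpoints))
	for _, e := range prev.VTEPEndpoints {
		seen[e.Name] = e.State
	}
	for _, e := range next.VTEPEndpoints {
		if s, ok := seen[e.Name]; !ok || s != e.State {
			return true
		}
	}
	return false
}
```

Because each agent would only ever touch its own object, and only on a real state change, writes from different nodes would not contend on a single shared resource.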
I have two high level thoughts, related, which I wonder about for the overall architecture. I don't have a strong opinion on these approaches, but they seem within the possible design space, so they're worth considering:
- Could Cilium integrate natively with the Linux stack to delegate routing of this traffic to the Linux routing table, then have another component sync the desired state into the kernel routing table?
- Alternatively, if this is difficult due to VNI selection, could datapath plugins perhaps provide an alternative integration mode?
> - Changed endpoints → `UpdateEntry()` (overwrite)
> - Removed CIDRs → `DeleteByCIDR()`
>
> 6. It updates Linux routing table entries for VTEP CIDRs.
This actually seems like the crux of the functionality proposed here (+ the encap/decap config). I wonder whether we really want to have a dedicated VTEP CRD for this or whether it's time to consider a more generic "routing" CRD (even if initially focused just on the VTEP use case). I had a draft for such an idea about three years ago, but it never got traction so I didn't end up posting it publicly. But if this is interesting, maybe I can dust it off.
I would love to see the routing CRD proposal; it may fit well with these changes.
My suggestion is to ship the VTEP work as CiliumVTEPConfig in v2alpha1 now and treat the generic routing direction as the next step.
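For readers following the excerpt at the top of this thread, the diff-based map update it describes (overwrite changed entries, delete removed CIDRs) can be sketched roughly as below. Only the method names `UpdateEntry` and `DeleteByCIDR` come from the CFP text; their signatures, the `Entry` type, the `vtepMap` interface, the in-memory `fakeMap`, and the `parseTable` helper are assumptions made for illustration.

```go
package main

import "net/netip"

// Entry is a hypothetical map value: the tunnel endpoint for a CIDR.
type Entry struct {
	TunnelEndpoint netip.Addr
}

// vtepMap is an assumed interface; only the method names appear in the CFP.
type vtepMap interface {
	UpdateEntry(cidr netip.Prefix, e Entry) error
	DeleteByCIDR(cidr netip.Prefix) error
}

// fakeMap is an in-memory stand-in for the BPF map.
type fakeMap struct{ entries map[netip.Prefix]Entry }

func (f *fakeMap) UpdateEntry(cidr netip.Prefix, e Entry) error {
	f.entries[cidr] = e
	return nil
}

func (f *fakeMap) DeleteByCIDR(cidr netip.Prefix) error {
	delete(f.entries, cidr)
	return nil
}

// parseTable builds a CIDR -> Entry table from "cidr": "endpoint IP" strings.
func parseTable(rows map[string]string) map[netip.Prefix]Entry {
	out := make(map[netip.Prefix]Entry, len(rows))
	for cidr, ep := range rows {
		out[netip.MustParsePrefix(cidr).Masked()] = Entry{netip.MustParseAddr(ep)}
	}
	return out
}

// reconcile writes new or changed entries and deletes CIDRs that are no
// longer desired; unchanged entries are left alone.
func reconcile(m vtepMap, current, desired map[netip.Prefix]Entry) error {
	for cidr, want := range desired {
		if have, ok := current[cidr]; ok && have == want {
			continue // already in sync
		}
		if err := m.UpdateEntry(cidr, want); err != nil {
			return err
		}
	}
	for cidr := range current {
		if _, ok := desired[cidr]; !ok {
			if err := m.DeleteByCIDR(cidr); err != nil {
				return err
			}
		}
	}
	return nil
}
```

Keeping the diff logic behind a small interface like this would also leave room for a more generic routing backend later, since `reconcile` does not depend on how entries are stored.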
> The LPM Trie is strictly more capable. Existing configurations with uniform
> prefix lengths produce identical routing behavior.
Key question: How will this integrate with the broader architecture? Ideally whatever we come up with, it has a path to eventually integrate with all other GA features where applicable.
Specifically, consider how this interfaces with masquerading, egress gateway, and encryption.
In the initial release I focused mostly on the CRD model, but I will improve this in the next releases.
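As a side note on the LPM claim in the excerpt at the top of this thread, the sketch below illustrates why uniform prefix lengths would give identical results. It assumes the earlier lookup was an exact match on the destination masked to one fixed prefix length, which the excerpt implies but does not state. All names (`route`, `lpmLookup`, `exactLookup`, `mustRoute`, `mustAddr`) are hypothetical, and a linear scan stands in for the kernel's LPM trie.

```go
package main

import "net/netip"

// route maps a CIDR to a VTEP endpoint label (hypothetical type).
type route struct {
	prefix netip.Prefix
	vtep   string
}

func mustRoute(cidr, vtep string) route {
	return route{netip.MustParsePrefix(cidr).Masked(), vtep}
}

func mustAddr(s string) netip.Addr { return netip.MustParseAddr(s) }

// lpmLookup returns the VTEP of the longest prefix containing ip.
func lpmLookup(routes []route, ip netip.Addr) (string, bool) {
	best, vtep := -1, ""
	for _, r := range routes {
		if r.prefix.Contains(ip) && r.prefix.Bits() > best {
			best, vtep = r.prefix.Bits(), r.vtep
		}
	}
	return vtep, best >= 0
}

// exactLookup masks ip to a single fixed length and requires an exact
// match, which is the assumed behavior of the earlier lookup.
func exactLookup(routes []route, ip netip.Addr, bits int) (string, bool) {
	key, err := ip.Prefix(bits)
	if err != nil {
		return "", false
	}
	for _, r := range routes {
		if r.prefix == key {
			return r.vtep, true
		}
	}
	return "", false
}
```

When every prefix has the same length, at most one prefix can contain a given address, so both functions return the same answer. Only mixed lengths (for example a /16 alongside a /24) exercise the longest-match rule.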
> Replace the static CLI flag-based VTEP configuration with a cluster-scoped
> `CiliumVTEPConfig` CRD that supports dynamic updates, per-node assignment via
> `nodeSelector`, and per-endpoint status reporting. This enables production use
Is "endpoint" here a "VTEP endpoint"? I would suggest not using shorthand, because "endpoint" is already overloaded (twice: k8s and Cilium have established meanings for this word which are not fully aligned).
Agreed. The term is "VTEP endpoint" (VXLAN tunnel endpoint); I'll use that consistently and drop the bare "endpoint".
> ### CRD API
>
> ```yaml
I won't review the API now since there's plenty of other open questions to consider first, but the API will need to be reviewed.
I will move the API to v2alpha1, since this is a brand-new, unreviewed surface.
Converting to draft due to inactivity.
I will work on that. I am at another conference.
Change the API to v2alpha1 and rename endpoints to vtependpoints.
Remove status from the cluster-scoped config.
Introduce a per-node CiliumVTEPNodeConfig (operator-resolved spec,
agent-owned status), mirroring the CiliumBGPNodeConfig pattern; split
reconciliation into operator (nodeSelector resolution + conflict
detection) and agent (BPF/route reconcile + per-node status) roles.
Signed-off-by: Murat Parlakisik <parlakisik@gmail.com>
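The operator role described in this commit message (nodeSelector resolution plus conflict detection) could look roughly like the sketch below. It is only an illustration under stated assumptions: the selector is reduced to matchLabels-style equality, a "conflict" is taken to mean two selected configs mapping the same CIDR to different VTEP endpoints, and every name (`VTEPConfig`, `selects`, `resolveForNode`) is hypothetical rather than part of the proposed API.

```go
package main

import "sort"

// VTEPConfig is a hypothetical, simplified view of a cluster-scoped config.
type VTEPConfig struct {
	Name           string
	NodeSelector   map[string]string // matchLabels-style equality only
	CIDRToEndpoint map[string]string // CIDR -> VTEP endpoint IP
}

// selects reports whether every selector label is present on the node.
// An empty selector selects every node in this sketch.
func selects(sel, labels map[string]string) bool {
	for k, v := range sel {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// resolveForNode merges all configs that select the node. CIDRs claimed
// with different endpoints are reported as conflicts and left out of the
// resolved set (an assumption; the CFP does not define conflict handling).
func resolveForNode(nodeLabels map[string]string, configs []VTEPConfig) (map[string]string, []string) {
	resolved := map[string]string{}
	conflictSet := map[string]bool{}
	for _, c := range configs {
		if !selects(c.NodeSelector, nodeLabels) {
			continue
		}
		for cidr, ep := range c.CIDRToEndpoint {
			if prev, ok := resolved[cidr]; ok && prev != ep {
				conflictSet[cidr] = true
				continue
			}
			resolved[cidr] = ep
		}
	}
	conflicts := make([]string, 0, len(conflictSet))
	for cidr := range conflictSet {
		delete(resolved, cidr)
		conflicts = append(conflicts, cidr)
	}
	sort.Strings(conflicts)
	return resolved, conflicts
}
```

The resolved map would become the spec of that node's CiliumVTEPNodeConfig, while the conflict list could be surfaced by the operator instead of being discovered independently by every agent.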