feat(translator): rebalance traffic to healthy backendRefs when peers have no endpoints #9075
Open
suvarineko wants to merge 1 commit into
Signed-off-by: Tkachev Sergei <suvarineko@gmail.com>
Summary
When an HTTPRoute (or GRPCRoute) references multiple backendRefs and one of them temporarily has no Ready endpoints, this change excludes that backend from the route's destination settings so the remaining healthy backends absorb 100% of the traffic for that rule. Today the gateway returns 503 to the share of traffic proportional to the unavailable backend's weight (via the NoEndpoints bucket in BackendWeights and the weighted-cluster machinery).
The change is intentionally scoped to the "service exists but has zero Ready endpoints" case. Misconfigured backends (BackendNotFound, invalid kind, invalid TLS, etc.) are still kept in the destination settings and still return 500 proportionally to their weight; those errors signal a real configuration problem and should not be silently masked.
Motivation / User story
Concrete examples this addresses:
- prod-svc (weight 90) and canary-svc (weight 10). During a canary rollout, every pod of canary-svc is briefly NotReady (image pull, readiness probe warm-up, PodDisruptionBudget rollover). Today, 10% of users see 503 for the duration of that window. After this change, prod-svc absorbs 100% until canary-svc is back.
- svc-region-a and svc-region-b. When region B is drained for maintenance and its Service has zero Ready endpoints, region A picks up all traffic without ever returning a partial 503.
In all of these cases, the user has already expressed the intent "distribute across these backends" by listing them in backendRefs; when one of them has no live endpoints there is no reason to deliberately route a share of traffic into a black hole.
Why existing rebalancing mechanisms don't cover this
Envoy Gateway already supports several rebalancing/resilience mechanisms, but all of them operate inside a single Envoy cluster:
PreferLocal
A backend service in a multi-backend HTTPRoute rule is materialized as its own cluster (or as a weighted cluster entry). When that cluster has zero endpoints, none of the mechanisms above can move traffic to a different cluster; by design, they're cluster-local. There is no mechanism today that operates one level up, at the "set of backendRefs in a route rule" granularity. This PR adds exactly that: a route-level rebalance across backendRefs, which is complementary to the existing cluster-level mechanisms.
Behavior matrix
Given a single HTTPRoute rule with N backendRefs:
... Invalid=true)
The BackendsAvailable=False route status condition is still set when any backend has no endpoints, so operators have full observability of the degraded state; only the data-plane behavior changes.
Implementation
In internal/gatewayapi/route.go, inside both processHTTPRouteRules and processGRPCRouteRules, after the existing "skip backendRefs with weight 0" guard, an additional guard excludes backends with no healthy endpoints from allDs.
The downstream failedNoReadyEndpoints && ToBackendWeights().Valid == 0 case still fires when every backend in the rule is empty, so the existing all-empty → 503 path is preserved.
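The guard and the preserved all-empty path described above can be sketched roughly as follows. This is a simplified, self-contained model and not the code from the PR: the destinationSetting type, its fields, and the filterDestinations and directResponse helpers are hypothetical stand-ins for the translator's real types. Only the names allDs and Invalid are taken from the description.

```go
package main

import "fmt"

// destinationSetting is a hypothetical, simplified stand-in for the
// translator's per-backendRef destination setting.
type destinationSetting struct {
	Name           string
	Weight         uint32
	ReadyEndpoints int  // number of Ready endpoints resolved for the backend
	Invalid        bool // set for misconfigured backends (e.g. BackendNotFound)
}

// filterDestinations builds allDs for one route rule. It returns the kept
// destinations and whether any backend was skipped for having no endpoints.
func filterDestinations(refs []destinationSetting) (allDs []destinationSetting, noReadyEndpoints bool) {
	for _, ds := range refs {
		// Existing guard: backendRefs with weight 0 receive no traffic.
		if ds.Weight == 0 {
			continue
		}
		// New guard: a correctly configured backend with zero Ready
		// endpoints is left out so its peers absorb its share.
		if !ds.Invalid && ds.ReadyEndpoints == 0 {
			noReadyEndpoints = true
			continue
		}
		// Healthy and misconfigured (Invalid) backends are both kept.
		allDs = append(allDs, ds)
	}
	return allDs, noReadyEndpoints
}

// directResponse models the all-empty path: if no valid destination is left
// and at least one backend was skipped for lack of endpoints, the rule
// answers 503. It returns 0 when traffic is forwarded to allDs instead.
func directResponse(allDs []destinationSetting, noReadyEndpoints bool) int {
	var validWeight uint32
	for _, ds := range allDs {
		if !ds.Invalid {
			validWeight += ds.Weight
		}
	}
	if noReadyEndpoints && validWeight == 0 {
		return 503
	}
	return 0
}

func main() {
	refs := []destinationSetting{
		{Name: "prod-svc", Weight: 90, ReadyEndpoints: 3},
		{Name: "canary-svc", Weight: 10, ReadyEndpoints: 0},
	}
	allDs, skipped := filterDestinations(refs)
	for _, ds := range allDs {
		fmt.Printf("%s weight=%d\n", ds.Name, ds.Weight)
	}
	fmt.Println("direct response:", directResponse(allDs, skipped))
}
```

In this model the skipped backend is still recorded through the returned flag, which is what lets the all-empty case be told apart from a rule that simply has no backendRefs.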
Drive-by consistency fix for GRPC
In processGRPCRouteRules, the non-EndpointsNotFound error branch did not mark the destination with ds.Invalid = true, unlike the equivalent branch in processHTTPRouteRules. As a result, a misconfigured GRPC backend was counted in the NoEndpoints bucket of BackendWeights (returning 503) rather than the Invalid bucket (returning 500). This PR aligns GRPC with HTTP by setting ds.Invalid = true in that branch. Without this alignment, the new rebalance logic would inadvertently drop misconfigured GRPC backends from the route and silently mask configuration errors.
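To illustrate why the Invalid flag matters here, the sketch below models the three weight buckets named in this description. It is a hypothetical simplification: the backendWeights and destination types and the toBackendWeights helper are illustrative, and the assumption that the Invalid check takes precedence over the endpoint check is mine, not confirmed by the source.

```go
package main

import "fmt"

// backendWeights mirrors the three buckets named in the description.
// The struct itself is a hypothetical simplification.
type backendWeights struct {
	Valid       uint32 // share forwarded to healthy backends
	Invalid     uint32 // share answered with 500
	NoEndpoints uint32 // share answered with 503
}

// destination is a hypothetical, minimal view of one backendRef.
type destination struct {
	Weight       uint32
	HasEndpoints bool
	Invalid      bool
}

// toBackendWeights sums each destination's weight into one bucket.
// Assumption: Invalid is checked before the endpoint count.
func toBackendWeights(dests []destination) backendWeights {
	var w backendWeights
	for _, d := range dests {
		switch {
		case d.Invalid:
			w.Invalid += d.Weight
		case !d.HasEndpoints:
			w.NoEndpoints += d.Weight
		default:
			w.Valid += d.Weight
		}
	}
	return w
}

func main() {
	// A misconfigured backend resolves no endpoints. Without the flag it
	// is indistinguishable from a backend that is merely empty.
	misconfigured := destination{Weight: 10}
	before := toBackendWeights([]destination{misconfigured})

	// With the flag set, the same weight is counted as Invalid instead.
	misconfigured.Invalid = true
	after := toBackendWeights([]destination{misconfigured})

	fmt.Printf("before: %+v\n", before)
	fmt.Printf("after:  %+v\n", after)
}
```

Under these assumptions, an unflagged misconfigured backend lands in the same bucket as a temporarily empty one, which is the situation in which a no-endpoints guard would also drop it.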
Scope of the change
The same situation can in principle arise for TLSRoute, TCPRoute and UDPRoute, but those paths already continue on any processDestination error (including EndpointsNotFound), so empty-endpoint backendRefs are effectively skipped today. No change needed there.
Backward compatibility
This is a data-plane behavior change for HTTPRoute/GRPCRoute rules with multiple backendRefs where at least one (but not all) backends have zero Ready endpoints. Today such routes return a proportional share of 503s; after this change they serve 100% from the healthy backends.
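As a worked illustration of the difference, the following sketch computes the percentage of requests that reach a Ready backend before and after the change. It assumes every backend is correctly configured and that traffic is split strictly by weight; the backend type and the servedPercent helper are hypothetical.

```go
package main

import "fmt"

// backend is a hypothetical, minimal view of one backendRef.
type backend struct {
	Name   string
	Weight int
	Ready  bool // true if the Service has at least one Ready endpoint
}

// servedPercent returns the percentage of requests that reach a Ready
// backend under the current behavior (before) and the proposed one (after).
func servedPercent(backends []backend) (before, after int) {
	total, ready := 0, 0
	for _, b := range backends {
		total += b.Weight
		if b.Ready {
			ready += b.Weight
		}
	}
	if total == 0 {
		return 0, 0
	}
	// Today: the empty backends' share of the weight is answered with 503.
	before = ready * 100 / total
	// Proposed: healthy backends absorb everything if at least one exists.
	if ready > 0 {
		after = 100
	}
	return before, after
}

func main() {
	before, after := servedPercent([]backend{
		{Name: "prod-svc", Weight: 90, Ready: true},
		{Name: "canary-svc", Weight: 10, Ready: false},
	})
	fmt.Printf("served before: %d%%, after: %d%%\n", before, after)
}
```

When every backend is empty the two values coincide at zero, matching the statement that only the partially degraded case changes.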
Gateway API conformance is preserved: ResolvedRefs semantics and the BackendsAvailable condition are unchanged.
If maintainers prefer to gate this behavior, I think a gating option should be straightforward to layer on top while respecting the current API design.
Happy to follow up with whichever route the maintainers prefer; I left this PR as a behavior change to keep the diff minimal and to surface the question explicitly.