feat(translator): rebalance traffic to healthy backendRefs when peers have no endpoints #9075

Open
suvarineko wants to merge 1 commit into envoyproxy:main from suvarineko:feature/rebalance-traffic-on-no-endpoints
@suvarineko

Summary

When an HTTPRoute (or GRPCRoute) references multiple backendRefs and one of them temporarily has no Ready endpoints, this change excludes that backend from the route's destination settings so the remaining healthy backends absorb 100% of the traffic for that rule. Today the gateway returns 503 for the share of traffic proportional to the unavailable backend's weight (via the NoEndpoints bucket in BackendWeights and the weighted-cluster machinery).

The change is intentionally scoped to the "service exists but has zero Ready endpoints" case. Misconfigured backends (BackendNotFound, invalid kind, invalid TLS, etc.) are still kept in the destination settings and still return 500 proportionally to their weight. Those errors signal a real configuration problem and should not be silently masked.

Motivation / User story

As a platform operator running a multi-backend HTTPRoute (e.g. a blue/green or active/standby setup, or a route fronting multiple regional Services), I want a backend that temporarily has no Ready endpoints to be drained from the route, so that the rest of my backends serve 100% of the requests while the affected backend recovers — instead of my users seeing a fraction of 503s.

Concrete examples this addresses:

  1. Blue/green or canary rollout. A route has prod-svc (weight 90) and canary-svc (weight 10). During a canary rollout, every pod of canary-svc is briefly NotReady (image pull, readiness probe warm-up, PodDisruptionBudget rollover). Today, 10% of users see 503 for the duration of that window. After this change, prod-svc absorbs 100% until canary-svc is back.
  2. Multi-region / failover service set. A route fans out to svc-region-a and svc-region-b. When region B is drained for maintenance and its Service has zero Ready endpoints, region A picks up all traffic without ever returning a partial 503.
  3. Scale-to-zero during off-hours. A backend in a multi-backend route is scaled to 0 replicas (e.g. by KEDA) outside of business hours. The route continues serving via its remaining backends instead of returning 503 for the scaled-down backend's slice of traffic.

In all of these cases, the user has already expressed the intent "distribute across these backends" by listing them in backendRefs; when one of them has no live endpoints there is no reason to deliberately route a share of traffic into a black hole.

Why existing rebalancing mechanisms don't cover this

Envoy Gateway already supports several rebalancing/resilience mechanisms, but all of them operate inside a single Envoy cluster:

| Mechanism | Scope |
| --- | --- |
| Load balancing policy (Round Robin, Least Request, Random, Maglev) | Selects between endpoints within a cluster |
| Active / passive health checking, outlier detection | Ejects endpoints within a cluster |
| Zone-aware routing / PreferLocal | Prefers localities within a cluster |
| Priority levels and locality fallback | Falls back between endpoint sets within a cluster |
| Retries | Retries to another endpoint of the same cluster (and across priorities) |

A backend service in a multi-backend HTTPRoute rule is materialized as its own cluster (or as a weighted cluster entry). When that cluster has zero endpoints, none of the mechanisms above can move traffic to a different cluster — by design, they're cluster-local. There is no mechanism today that operates one level up, at the "set of backendRefs in a route rule" granularity. This PR adds exactly that: a route-level rebalance across backendRefs, which is complementary to the existing cluster-level mechanisms.

Behavior matrix

Given a single HTTPRoute rule with N backendRefs:

| Scenario | Before | After |
| --- | --- | --- |
| All backends have endpoints | Traffic split per weights | Unchanged |
| All backends have zero endpoints | 503 direct response | Unchanged (still 503) |
| Some backends have endpoints, some have zero endpoints | Weighted split; unhealthy share returns 503 | Unhealthy backends excluded; healthy backends serve 100% per their relative weights |
| Some backends are misconfigured (Invalid=true) | Weighted split; invalid share returns 500 | Unchanged (still kept, still 500) |
| Dynamic-resolver / custom backend without endpoints | Kept | Unchanged (kept) |

The BackendsAvailable=False route status condition is still set when any backend has no endpoints, so operators have full observability of the degraded state — only the data-plane behavior changes.
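The "some healthy, some empty" row can be illustrated with a small stand-alone model. This is only a sketch of the arithmetic, not the translator code: the `backend` type and the `sharesBefore` / `sharesAfter` helpers are hypothetical names introduced for illustration, and weight-0 backendRefs are assumed to have been filtered out already.

```go
package main

import "fmt"

// backend is a hypothetical, simplified stand-in for one backendRef in a rule.
// Weight-0 backendRefs are assumed to have been filtered out already.
type backend struct {
	name      string
	weight    uint32
	endpoints int // number of Ready endpoints
}

// sharesBefore models the current behavior: every backend keeps its weight,
// and the share owned by backends without endpoints is answered with 503.
func sharesBefore(bs []backend) (map[string]float64, float64) {
	served := map[string]float64{}
	var total uint32
	for _, b := range bs {
		total += b.weight
	}
	if total == 0 {
		return served, 0
	}
	var unavailable float64
	for _, b := range bs {
		share := float64(b.weight) / float64(total)
		if b.endpoints == 0 {
			unavailable += share
			continue
		}
		served[b.name] = share
	}
	return served, unavailable
}

// sharesAfter models the proposed behavior: backends without endpoints are
// dropped first, so the remaining weights are renormalized among themselves.
// If nothing is left, the result is the same as before (everything is 503).
func sharesAfter(bs []backend) (map[string]float64, float64) {
	var healthy []backend
	for _, b := range bs {
		if b.endpoints > 0 {
			healthy = append(healthy, b)
		}
	}
	if len(healthy) == 0 {
		return sharesBefore(bs)
	}
	return sharesBefore(healthy)
}

func main() {
	rule := []backend{
		{name: "prod-svc", weight: 90, endpoints: 3},
		{name: "canary-svc", weight: 10, endpoints: 0},
	}
	served, unavailable := sharesBefore(rule)
	fmt.Printf("before: served=%v 503 share=%.2f\n", served, unavailable)
	served, unavailable = sharesAfter(rule)
	fmt.Printf("after:  served=%v 503 share=%.2f\n", served, unavailable)
}
```

With the prod-svc (90) / canary-svc (10) example from the motivation section, the model gives a 10% 503 share before the change and none after it, with prod-svc taking the whole rule.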

Implementation

In internal/gatewayapi/route.go, inside both processHTTPRouteRules and processGRPCRouteRules, after the existing "skip backendRefs with weight 0" guard, an additional guard leaves backends with no endpoints out of allDs:

```go
// Skip backendRefs with no healthy endpoints so the remaining healthy
// backends in the same rule absorb 100% of the traffic instead of getting
// 503s in proportion to the unavailable backend's weight. Dynamic resolver,
// custom backend and invalid destinations are kept to preserve their own
// handling.
if len(ds.Endpoints) == 0 && !ds.IsDynamicResolver && !ds.IsCustomBackend && !ds.Invalid {
	continue
}
```

The downstream failedNoReadyEndpoints && ToBackendWeights().Valid == 0 case still fires when every backend in the rule is empty, so the existing all-empty → 503 path is preserved.
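To make the interaction between the new guard and the all-empty fallback concrete, here is a minimal self-contained sketch. The `destination` type is a hypothetical, trimmed-down stand-in that only carries the fields the guard reads, and `selectDestinations` is an illustrative helper, not a function in the repository. The fallback is modelled simply as "nothing was kept and at least one backend was skipped for lacking endpoints", which is a simplification of the real downstream check.

```go
package main

import "fmt"

// destination is a hypothetical stand-in for the translator's destination
// setting; only the fields read by the guard are modelled here.
type destination struct {
	Name              string
	Weight            uint32
	Endpoints         []string
	IsDynamicResolver bool
	IsCustomBackend   bool
	Invalid           bool
}

// skipForNoEndpoints mirrors the condition of the guard added in this PR.
func skipForNoEndpoints(ds destination) bool {
	return len(ds.Endpoints) == 0 && !ds.IsDynamicResolver && !ds.IsCustomBackend && !ds.Invalid
}

// selectDestinations applies the weight-0 guard and the no-endpoints guard.
// allEmpty reports the case where every candidate was dropped for lacking
// endpoints, which is where the existing 503 direct response would apply.
func selectDestinations(all []destination) (kept []destination, allEmpty bool) {
	skipped := 0
	for _, ds := range all {
		if ds.Weight == 0 {
			continue // existing guard: weight 0 receives no traffic
		}
		if skipForNoEndpoints(ds) {
			skipped++
			continue
		}
		kept = append(kept, ds)
	}
	return kept, skipped > 0 && len(kept) == 0
}

func main() {
	rule := []destination{
		{Name: "healthy", Weight: 1, Endpoints: []string{"10.0.0.1:8080"}},
		{Name: "empty", Weight: 1},
		{Name: "misconfigured", Weight: 1, Invalid: true},
	}
	kept, allEmpty := selectDestinations(rule)
	for _, ds := range kept {
		fmt.Println("kept:", ds.Name)
	}
	fmt.Println("fall back to 503:", allEmpty)
}
```

In this model the empty backend is dropped, the misconfigured one is kept so it can still surface its 500s, and the 503 fallback only triggers when no destination survives.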

Drive-by consistency fix for GRPC

In processGRPCRouteRules, the non-EndpointsNotFound error branch did not mark the destination with ds.Invalid = true, unlike the equivalent branch in processHTTPRouteRules. As a result, a misconfigured GRPC backend was counted in the NoEndpoints bucket of BackendWeights (returning 503) rather than the Invalid bucket (returning 500). This PR aligns GRPC with HTTP by setting ds.Invalid = true in that branch. Without this alignment, the new rebalance logic would inadvertently drop misconfigured GRPC backends from the route and silently mask configuration errors.
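The reason this alignment matters for the new guard can be shown with a tiny model. The `destination` type, the `bucket` names, and both helpers below are hypothetical simplifications (dynamic-resolver and custom backends are left out); the precedence of Invalid over NoEndpoints follows the description above rather than the actual BackendWeights code.

```go
package main

import "fmt"

// bucket is a hypothetical label for how a destination's weight is accounted.
type bucket string

const (
	bucketValid       bucket = "valid"
	bucketInvalid     bucket = "invalid (500)"
	bucketNoEndpoints bucket = "no endpoints (503)"
)

// destination is a simplified stand-in carrying only what this sketch needs.
type destination struct {
	Endpoints int
	Invalid   bool
}

// classify models the bucket a destination lands in: an explicit Invalid
// flag wins, otherwise a destination without endpoints counts as NoEndpoints.
func classify(ds destination) bucket {
	switch {
	case ds.Invalid:
		return bucketInvalid
	case ds.Endpoints == 0:
		return bucketNoEndpoints
	default:
		return bucketValid
	}
}

// droppedByGuard models the new rebalance guard for regular backends.
func droppedByGuard(ds destination) bool {
	return ds.Endpoints == 0 && !ds.Invalid
}

func main() {
	// A misconfigured GRPC backend resolves to no endpoints.
	beforeFix := destination{Endpoints: 0, Invalid: false} // flag not set
	afterFix := destination{Endpoints: 0, Invalid: true}   // flag set, as for HTTP

	fmt.Println("before fix:", classify(beforeFix), "dropped:", droppedByGuard(beforeFix))
	fmt.Println("after fix: ", classify(afterFix), "dropped:", droppedByGuard(afterFix))
}
```

Without the flag, the misconfigured backend looks exactly like an empty one and would be removed from the rule; with the flag set it stays in place and keeps returning 500 for its share.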

Scope of the change

The same situation can in principle arise for TLSRoute, TCPRoute and UDPRoute, but those paths already continue on any processDestination error (including EndpointsNotFound), so empty-endpoint backendRefs are effectively skipped today. No change needed there.

Backward compatibility

This is a data-plane behavior change for HTTPRoute/GRPCRoute rules with multiple backendRefs where at least one (but not all) backends have zero Ready endpoints. Today such routes return a proportional share of 503s; after this change they serve 100% from the healthy backends.

Gateway API conformance is preserved: ResolvedRefs semantics and the BackendsAvailable condition are unchanged.

If maintainers prefer to gate this behavior, I think the following option is straightforward to layer on top and respects the current API design:

  • An opt-in/opt-out via a BackendTrafficPolicy field (e.g. rebalanceOnNoEndpoints: true|false), with the standard policy-target hierarchy (Gateway → xRoute).

Happy to follow up with whichever route the maintainers prefer; I left this PR as a behavior change to keep the diff minimal and to surface the question explicitly.

Signed-off-by: Tkachev Sergei <suvarineko@gmail.com>
@suvarineko suvarineko requested a review from a team as a code owner May 25, 2026 11:58