feat(translator): rebalance traffic to healthy backendRefs when peers have no endpoints by suvarineko · Pull Request #9075 · envoyproxy/gateway

suvarineko · 2026-05-25T11:58:16Z

Summary

When an HTTPRoute (or GRPCRoute) references multiple backendRefs and one of them temporarily has no Ready endpoints, this change excludes that backend from the route's destination settings so the remaining healthy backends absorb 100% of the traffic for that rule. Today the gateway returns 503 to the share of traffic proportional to the unavailable backend's weight (via the NoEndpoints bucket in BackendWeights and the weighted-cluster machinery). The change is intentionally scoped to the "service exists but has zero Ready endpoints" case. Misconfigured backends (BackendNotFound, invalid kind, invalid TLS, etc.) are still kept in the destination settings and still return 500 proportionally to their weight — those errors signal a real configuration problem and should not be silently masked.

Motivation / User story

As a platform operator running a multi-backend HTTPRoute (e.g. a blue/green or active/standby setup, or a route fronting multiple regional Services), I want a backend that temporarily has no Ready endpoints to be drained from the route, so that the rest of my backends serve 100% of the requests while the affected backend recovers — instead of my users seeing a fraction of 503s.

Concrete examples this addresses:

Blue/green or canary rollout. A route has prod-svc (weight 90) and canary-svc (weight 10). During a canary rollout, every pod of canary-svc is briefly NotReady (image pull, readiness probe warm-up, PodDisruptionBudget rollover). Today, 10% of users see 503 for the duration of that window. After this change, prod-svc absorbs 100% until canary-svc is back.
Multi-region / failover service set. A route fans out to svc-region-a and svc-region-b. When region B is drained for maintenance and its Service has zero Ready endpoints, region A picks up all traffic without ever returning a partial 503.
Scale-to-zero during off-hours. A backend in a multi-backend route is scaled to 0 replicas (e.g. by KEDA) outside of business hours. The route continues serving via its remaining backends instead of returning 503 for the scaled-down backend's slice of traffic.

In all of these cases, the user has already expressed the intent "distribute across these backends" by listing them in backendRefs; when one of them has no live endpoints there is no reason to deliberately route a share of traffic into a black hole.

Why existing rebalancing mechanisms don't cover this

Envoy Gateway already supports several rebalancing/resilience mechanisms, but all of them operate inside a single Envoy cluster:

Mechanism	Scope
Load balancing policy (Round Robin, Least Request, Random, Maglev)	Selects between endpoints within a cluster
Active / passive health checking, outlier detection	Ejects endpoints within a cluster
Zone-aware routing / `PreferLocal`	Prefers localities within a cluster
Priority levels and locality fallback	Falls back between endpoint sets within a cluster
Retries	Retries to another endpoint of the same cluster (and across priorities)

A backend service in a multi-backend HTTPRoute rule is materialized as its own cluster (or as a weighted cluster entry). When that cluster has zero endpoints, none of the mechanisms above can move traffic to a different cluster — by design, they're cluster-local. There is no mechanism today that operates one level up, at the "set of backendRefs in a route rule" granularity. This PR adds exactly that: a route-level rebalance across backendRefs, which is complementary to the existing cluster-level mechanisms.

Behavior matrix

Given a single HTTPRoute rule with N backendRefs:

Scenario	Before	After
All backends have endpoints	Traffic split per weights	Unchanged
All backends have zero endpoints	503 direct response	Unchanged (still 503)
Some backends have endpoints, some have zero endpoints	Weighted split; unhealthy share returns 503	Unhealthy backends excluded; healthy backends serve 100% per their relative weights
Some backends are misconfigured (`Invalid=true`)	Weighted split; invalid share returns 500	Unchanged (still kept, still 500)
Dynamic-resolver / custom backend without endpoints	Kept	Unchanged (kept)

The BackendsAvailable=False route status condition is still set when any backend has no endpoints, so operators have full observability of the degraded state — only the data-plane behavior changes.
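The matrix above can be checked with a little arithmetic. The following is a minimal sketch, not the translator's code: the `backend` and `outcome` types and the `split` function are hypothetical names introduced only for illustration. It assumes that rebalancing applies only when at least one healthy backend remains, and it makes no claim about rules that mix only invalid and empty backends, which the matrix does not cover.

```go
package main

import "fmt"

// backend is a hypothetical, simplified view of one backendRef in a rule.
type backend struct {
	name      string
	weight    uint32
	endpoints int  // number of Ready endpoints
	invalid   bool // misconfigured, e.g. BackendNotFound
}

// outcome holds the fraction of requests that ends in each result.
type outcome struct {
	served map[string]float64 // share served by each healthy backend
	err500 float64            // share answered with 500 (invalid backends)
	err503 float64            // share answered with 503 (no endpoints)
}

// split models one rule. rebalance=false is the "Before" column,
// rebalance=true the "After" column.
func split(backends []backend, rebalance bool) outcome {
	anyHealthy := false
	for _, b := range backends {
		if b.weight > 0 && !b.invalid && b.endpoints > 0 {
			anyHealthy = true
		}
	}

	out := outcome{served: map[string]float64{}}
	var kept []backend
	var total float64
	for _, b := range backends {
		empty := !b.invalid && b.endpoints == 0
		// Weight-0 backends are always skipped; empty backends are skipped
		// only when rebalancing and a healthy peer can take over.
		if b.weight == 0 || (rebalance && anyHealthy && empty) {
			continue
		}
		kept = append(kept, b)
		total += float64(b.weight)
	}

	for _, b := range kept {
		share := float64(b.weight) / total
		switch {
		case b.invalid:
			out.err500 += share
		case b.endpoints == 0:
			out.err503 += share
		default:
			out.served[b.name] += share
		}
	}
	return out
}

func main() {
	// The canary example from the motivation section.
	rule := []backend{
		{name: "prod-svc", weight: 90, endpoints: 3},
		{name: "canary-svc", weight: 10, endpoints: 0},
	}
	before, after := split(rule, false), split(rule, true)
	fmt.Printf("before: prod=%.2f 503=%.2f\n", before.served["prod-svc"], before.err503)
	fmt.Printf("after:  prod=%.2f 503=%.2f\n", after.served["prod-svc"], after.err503)
	// before: prod=0.90 503=0.10
	// after:  prod=1.00 503=0.00
}
```

With three backends weighted 60/30/10 where the last one is empty, the same function yields 60/90 and 30/90 for the two healthy backends, which is what "per their relative weights" means in the third row.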

Implementation

In internal/gatewayapi/route.go, inside both processHTTPRouteRules and processGRPCRouteRules, right after the existing "skip backendRefs with weight 0" guard, an additional guard excludes backends with no healthy endpoints from allDs:

// Skip backendRefs with no healthy endpoints so the remaining healthy
// backends in the same rule absorb 100% of the traffic instead of getting
// 503s in proportion to the unavailable backend's weight. Dynamic resolver,
// custom backend and invalid destinations are kept to preserve their own
// handling.
if len(ds.Endpoints) == 0 && !ds.IsDynamicResolver && !ds.IsCustomBackend && !ds.Invalid {
	continue
}

The downstream failedNoReadyEndpoints && ToBackendWeights().Valid == 0 case still fires when every backend in the rule is empty, so the existing all-empty → 503 path is preserved.
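To see how the guard and the all-empty fallback interact, here is a self-contained sketch. It is an illustration under assumptions, not the code in route.go: `destinationSetting` mirrors only the fields the guard reads, and `filterDestinations` and `needsDirect503` are hypothetical helpers that stand in for the surrounding loop and for the `failedNoReadyEndpoints && ToBackendWeights().Valid == 0` condition.

```go
package main

import "fmt"

// destinationSetting mirrors only the fields the guard reads; the real
// type carries more. Name and Weight are simplified for this sketch.
type destinationSetting struct {
	Name              string
	Weight            uint32
	Endpoints         []string
	IsDynamicResolver bool
	IsCustomBackend   bool
	Invalid           bool
}

// filterDestinations returns the settings that stay in the route and reports
// whether any backend was skipped because it had no endpoints.
func filterDestinations(all []destinationSetting) (kept []destinationSetting, noReadyEndpoints bool) {
	for _, ds := range all {
		if ds.Weight == 0 { // existing guard
			continue
		}
		// New guard: ordinary backends without endpoints are skipped.
		if len(ds.Endpoints) == 0 && !ds.IsDynamicResolver && !ds.IsCustomBackend && !ds.Invalid {
			noReadyEndpoints = true
			continue
		}
		kept = append(kept, ds)
	}
	return kept, noReadyEndpoints
}

// needsDirect503 mirrors the quoted downstream condition: something lacked
// endpoints and no valid weight remains in the rule.
func needsDirect503(kept []destinationSetting, noReadyEndpoints bool) bool {
	var valid uint32
	for _, ds := range kept {
		if !ds.Invalid {
			valid += ds.Weight
		}
	}
	return noReadyEndpoints && valid == 0
}

func main() {
	partial := []destinationSetting{
		{Name: "prod-svc", Weight: 90, Endpoints: []string{"10.0.0.1:8080"}},
		{Name: "canary-svc", Weight: 10},
	}
	kept, failed := filterDestinations(partial)
	fmt.Println(len(kept), failed, needsDirect503(kept, failed)) // 1 true false

	allEmpty := []destinationSetting{{Name: "a", Weight: 1}, {Name: "b", Weight: 1}}
	kept, failed = filterDestinations(allEmpty)
	fmt.Println(len(kept), failed, needsDirect503(kept, failed)) // 0 true true
}
```

The point of the second case is that skipping every empty backend leaves no valid weight, so the rule still resolves to the direct 503 response rather than to an empty destination list.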

Drive-by consistency fix for GRPC

In processGRPCRouteRules, the non-EndpointsNotFound error branch did not mark the destination with ds.Invalid = true, unlike the equivalent branch in processHTTPRouteRules. As a result, a misconfigured GRPC backend was counted in the NoEndpoints bucket of BackendWeights (returning 503) rather than the Invalid bucket (returning 500). This PR aligns GRPC with HTTP by setting ds.Invalid = true in that branch. Without this alignment, the new rebalance logic would inadvertently drop misconfigured GRPC backends from the route and silently mask configuration errors.
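The effect of the missing flag can be illustrated with a small sketch. The `dest` and `backendWeights` types below are simplified stand-ins whose bucket names follow the description above, and `toBackendWeights` and `droppedByGuard` are hypothetical helpers. The sketch also assumes that a backend whose processing failed ends up with no resolved endpoints.

```go
package main

import "fmt"

// dest is a simplified destination: weight, endpoint count, Invalid flag.
type dest struct {
	weight    uint32
	endpoints int
	invalid   bool
}

// backendWeights is a stand-in using the bucket names from the description.
type backendWeights struct {
	Valid, Invalid, NoEndpoints uint32
}

// toBackendWeights sorts each destination's weight into one bucket.
// The Invalid flag is checked first, so it decides between 500 and 503.
func toBackendWeights(ds []dest) backendWeights {
	var w backendWeights
	for _, d := range ds {
		switch {
		case d.invalid:
			w.Invalid += d.weight
		case d.endpoints == 0:
			w.NoEndpoints += d.weight
		default:
			w.Valid += d.weight
		}
	}
	return w
}

// droppedByGuard mirrors the new guard for ordinary backends
// (not dynamic resolver, not custom backend).
func droppedByGuard(d dest) bool {
	return d.endpoints == 0 && !d.invalid
}

func main() {
	healthy := dest{weight: 80, endpoints: 2}
	unflagged := dest{weight: 20}              // misconfigured GRPC backend before the fix
	flagged := dest{weight: 20, invalid: true} // same backend after the fix

	fmt.Printf("%+v\n", toBackendWeights([]dest{healthy, unflagged}))
	// {Valid:80 Invalid:0 NoEndpoints:20}
	fmt.Printf("%+v\n", toBackendWeights([]dest{healthy, flagged}))
	// {Valid:80 Invalid:20 NoEndpoints:0}

	fmt.Println(droppedByGuard(unflagged), droppedByGuard(flagged)) // true false
}
```

Without the flag, the misconfigured backend is indistinguishable from an empty one, so the new guard would remove it and the configuration error would no longer surface as 500s.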

Scope of the change

The same situation can in principle arise for TLSRoute, TCPRoute and UDPRoute, but those paths already continue on any processDestination error (including EndpointsNotFound), so empty-endpoint backendRefs are effectively skipped today. No change needed there.

Backward compatibility

This is a data-plane behavior change for HTTPRoute/GRPCRoute rules with multiple backendRefs where at least one (but not all) backends have zero Ready endpoints. Today such routes return a proportional share of 503s; after this change they serve 100% from the healthy backends.

Gateway API conformance is preserved: ResolvedRefs semantics and the BackendsAvailable condition are unchanged.

If maintainers prefer to gate this behavior, one option looks straightforward to layer on top while respecting the current API design: an opt-in/opt-out via a BackendTrafficPolicy field (e.g. rebalanceOnNoEndpoints: true|false), following the standard policy-target hierarchy (Gateway → xRoute).
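As a rough idea of what such a gate could look like, here is a hypothetical sketch. Nothing in it exists in the API today: the `policy` type, the `RebalanceOnNoEndpoints` field and the `effectiveRebalance` helper are assumptions for illustration, and the field-level fallback from route to Gateway is just one possible reading of the hierarchy.

```go
package main

import "fmt"

// policy is a hypothetical slice of a BackendTrafficPolicy spec. A nil
// pointer means "not set at this level".
type policy struct {
	RebalanceOnNoEndpoints *bool
}

// effectiveRebalance resolves the setting for one route: a value on the
// route-level policy wins, then the Gateway-level policy, then the default.
func effectiveRebalance(gateway, route *policy, def bool) bool {
	if route != nil && route.RebalanceOnNoEndpoints != nil {
		return *route.RebalanceOnNoEndpoints
	}
	if gateway != nil && gateway.RebalanceOnNoEndpoints != nil {
		return *gateway.RebalanceOnNoEndpoints
	}
	return def
}

func main() {
	ptr := func(b bool) *bool { return &b }

	gw := &policy{RebalanceOnNoEndpoints: ptr(true)}
	rt := &policy{RebalanceOnNoEndpoints: ptr(false)}

	fmt.Println(effectiveRebalance(gw, nil, false)) // true: inherited from the Gateway
	fmt.Println(effectiveRebalance(gw, rt, false))  // false: the route opts out
	fmt.Println(effectiveRebalance(nil, nil, true)) // true: falls back to the default
}
```

Whether the default should be on or off is exactly the open question for maintainers; the sketch takes it as a parameter for that reason.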

Happy to follow up with whichever route the maintainers prefer; I left this PR as a behavior change to keep the diff minimal and to surface the question explicitly.

Signed-off-by: Tkachev Sergei <suvarineko@gmail.com>

netlify · 2026-05-25T11:58:21Z

✅ Deploy Preview for cerulean-figolla-1f9435 ready!

Name	Link
🔨 Latest commit	`4969bbb`
🔍 Latest deploy log	https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/6a14395bf9f2e50008f21b92
😎 Deploy Preview	https://deploy-preview-9075--cerulean-figolla-1f9435.netlify.app

rebalance traffic on no endpoints

4969bbb

Signed-off-by: Tkachev Sergei <suvarineko@gmail.com>

suvarineko requested a review from a team as a code owner May 25, 2026 11:58

suvarineko mentioned this pull request May 25, 2026

Rebalancing traffic from failed backend to healthy one in httproute #8031

Open

suvarineko wants to merge 1 commit into envoyproxy:main from suvarineko:feature/rebalance-traffic-on-no-endpoints
