Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing by LZD-PratyushBhatt · Pull Request #180 · linkedin/helix

LZD-PratyushBhatt · 2026-05-19T06:26:24Z

Classify WAGED rebalance failures by FailureCategory and HardConstraint type for cluster-level alert routing

Add a customer-vs-Helix ownership dimension to every WAGED HelixRebalanceException, plus a per-HardConstraint sub-breakdown for "no candidate node" failures, so cluster operators can route alerts to the right team without grepping controller logs. Today RebalanceFailureCounter is dimensionless: a capacity deficit (customer config), a constraint mismatch (customer config), a metadata-store IO failure (Helix infra), and an algorithm crash (Helix infra) all collapse into the same counter, paging the Helix team for problems they cannot fix.

Changes:

New HelixRebalanceException.FailureCategory enum with 8 values + a customerActionable flag. Legacy constructors preserved byte-for-byte; new constructors append "Category: X" to the message only when category != UNKNOWN.
All 17 WAGED throw sites declare a category (capacity deficit, no candidate node, invalid resource config, invalid cluster config, metadata store IO, algorithm internal, async execution).
Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the original exception via AtomicReference and re-throw with the original Type + Category preserved. Previously these flattened every failure into a generic FAILED_TO_CALCULATE, losing all attribution.
WagedRebalancer's catch block reports the category to ClusterStatusMonitor on every failure. Also wraps validateInput so config-validation failures are counted (previously silent at the cluster level).
WagedRebalancerMetricCollector adds 8 per-category counters in the Rebalancer JMX domain. ClusterStatusMonitor mirrors them onto the ClusterStatus JMX domain along with two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) and a WagedFallbackInUseGauge that flips when WAGED returns the last-known-good assignment instead of a freshly computed one.
GenericHelixController wires the cluster monitor into the rebalancer at construction; the rebalancer propagates the reference to both async runners.
New HardConstraint.Type enum (7 values) and getType() override on each of the 6 known HardConstraint subclasses. When a partition fails to find any eligible node, ConstraintBasedAlgorithm reports each distinct constraint type that contributed -- partition-level set-union semantics, not per-node-rejection counts. WagedRebalancerMetricCollector and ClusterStatusMonitor each gain 7 per-HardConstraint counters (FaultZone, NodeCapacity, NodeMaxPartitionLimit, ReplicaActivate, SamePartitionOnInstance, ValidGroupTag, Unknown) so operators can distinguish fault-zone failures from tag failures within the broader "no candidate node" bucket.
WagedRebalancer now constructs its own ConstraintBasedAlgorithm instance per WagedRebalancer (previously a single shared static singleton) so the per-cluster failure reporter does not race across WagedRebalancer instances in the same JVM. The shared ForkJoinPool inside ConstraintBasedAlgorithmFactory is still reused.

Tests:

New TestHelixRebalanceException covers legacy/new constructors, message format, and customer-actionable flagging.
TestClusterStatusMonitor gets seven new tests: four for per-category counters (rollup math, null-category handling, the fallback gauge) and three for per-HardConstraint counters (start-at-zero, routing, null handling).
TestConstraintBasedAlgorithm gets two new tests: end-to-end reporter wiring (uses real NodeCapacityConstraint forced to fail; asserts the reporter fires exactly once per failed partition per constraint type), and a no-op test verifying nothing throws when the reporter is unset.
Three existing assertions in TestWagedRebalancer and TestConstraintBasedAlgorithm were updated to reflect the improved attribution (they previously asserted the lossy generic FAILED_TO_CALCULATE that the async wrappers used to emit).
TestWagedRebalancerMetrics.testMetricValuePropagation had a fragile cast (assumed all metric values were Long) that surfaced after the metric-map iteration order changed; switched to Number.longValue() which handles both Long and Double.

Verified: 163 tests across waged unit, controller stages, monitoring mbeans, and integration suites pass. No allocations on the success path; ~10-20ns added per successful pipeline event (one volatile boolean write). The constraint scoring hot loop is untouched; the per-HardConstraint reporter runs only inside the existing failure branch when no candidate node is found. Backward-compatible public API: HelixRebalanceException legacy constructors and message format preserved when no category is supplied.

Issues

My PR addresses the following Helix issues and references them in the PR description:

No linked GitHub issue. This PR closes a long-standing observability gap in WAGED failure attribution surfaced during on-call triage.

Description

Here are some details about my PR, including screenshots of any UI changes:

What: Adds a FailureCategory enum to HelixRebalanceException, a HardConstraint.Type enum, and exposes per-category + per-HardConstraint counters plus a fallback-in-use gauge on the cluster-level ClusterStatusMonitor MBean, alongside the existing WagedRebalancerMetricCollector.

Why: Today the cluster-level rebalance metrics cannot distinguish between failures caused by customer-controlled configuration (insufficient capacity, unsatisfiable hard constraints, invalid resource/cluster config) and failures owned by the Helix team (metadata store IO errors, algorithm internals, async runner failures). Every failure pages the Helix team, including ones the Helix team cannot fix. Customers, in turn, have no cluster-level signal that their cluster is misconfigured -- the existing RebalanceFailureCounter is monotonic and dimensionless, the RebalanceFailureGauge is edge-triggered and silenced by the WAGED fallback mechanism, and the per-category detail exists only inside controller log lines. Furthermore, even when an operator knows a "no candidate node" failure occurred, the metric does not say which hard constraint blocked placement -- the most actionable WAGED failure mode (fault-zone exhaustion, tag mismatch, capacity-per-node) is invisible at the dashboard level.

How:

Each of the 17 HelixRebalanceException throw sites inside WAGED is updated to declare a FailureCategory. The enum carries an isCustomerActionable() flag so consumers can route by ownership without hard-coding the taxonomy.
The async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the original exception in an AtomicReference and, when the synchronous caller is waiting on the future, re-throw with the original Type and Category preserved instead of collapsing every async failure to a generic FAILED_TO_CALCULATE.
WagedRebalancerMetricCollector exposes 8 per-category counters in the existing Rebalancer:cluster=X,entity=WagedRebalancer JMX MBean. ClusterStatusMonitor mirrors the same dimensions plus two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) on the existing ClusterStatus:cluster=X MBean, so dashboards that already scrape ClusterStatus pick up the new signal without any topology change.
A new WagedFallbackInUseGauge flips to 1 when WAGED returns the last-known-good assignment from the metadata store instead of a freshly computed one. This closes the "WAGED is silently serving stale data" gap that previously had no metric at all.
WagedRebalancer.computeNewIdealStates now wraps validateInput so configuration-validation failures (previously silent at the cluster level) are counted alongside compute failures.
GenericHelixController injects the cluster status monitor into the rebalancer at construction. The rebalancer forwards the reference to both async runners. A defensive null check covers ReadOnlyWagedRebalancer and any future caller that does not have a cluster monitor (e.g., the REST partitionAssignment API path).
HardConstraint gains a typed Type enum (FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT, REPLICA_ACTIVATE, SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG, UNKNOWN) and an overridable getType() method; each of the 6 concrete subclasses returns its stable identifier. ConstraintBasedAlgorithm exposes setHardConstraintFailureReporter(Consumer<HardConstraint.Type>); when a partition fails to find any eligible node, every distinct constraint type that contributed gets reported exactly once (partition-level set-union, not per-node-rejection). WagedRebalancer.setClusterStatusMonitor installs the reporter on its algorithm so the cluster-level per-HardConstraint counters tick from production rebalance attempts.
WagedRebalancer drops the previously shared DEFAULT_REBALANCE_ALGORITHM static singleton in favor of constructing its own ConstraintBasedAlgorithm via ConstraintBasedAlgorithmFactory.getInstance(...) so the per-cluster failure reporter does not race when multiple WagedRebalancer instances run in the same JVM. The shared ForkJoinPool inside the factory continues to be reused.

Tests

The following tests are written for this issue:

New test class: helix-core/src/test/java/org/apache/helix/TestHelixRebalanceException.java

testLegacyTwoArgConstructorPreservesMessageAndDefaultsToUnknownCategory
testLegacyThreeArgConstructorWithCausePreservesMessageAndDefaults
testNewThreeArgConstructorAppendsCategoryToMessage
testNewFourArgConstructorWithCauseAppendsCategory
testCustomerActionableCategoriesAreTaggedCorrectly
testInternalCategoriesAreTaggedCorrectly

Added to existing class: helix-core/src/test/java/org/apache/helix/monitoring/mbeans/TestClusterStatusMonitor.java

testWagedFailureCategoryCountersStartAtZero
testWagedFailureCategoryReportRoutesToCorrectCountersAndRollup
testReportWagedFailureByCategoryHandlesNullAsUnknown
testWagedFallbackInUseGaugeReflectsLatestSetter
testWagedHardConstraintCountersStartAtZero
testReportWagedHardConstraintFailureIncrementsCorrectCounter
testReportWagedHardConstraintFailureHandlesNullAsUnknown

Added to existing class: helix-core/src/test/java/org/apache/helix/controller/rebalancer/waged/constraints/TestConstraintBasedAlgorithm.java

testHardConstraintFailureReporterFiresOncePerPartitionAndConstraintType -- exercises the real NodeCapacityConstraint forced to reject every candidate, asserts the installed reporter fires once per failed partition with the NODE_CAPACITY type (set-union semantics, not per-node).
testHardConstraintFailureReporterIsNoOpWhenUnset -- confirms the algorithm fails gracefully when no reporter is installed.

Updated assertions (to reflect the improved attribution -- async runners now preserve original Type/Category instead of collapsing to generic FAILED_TO_CALCULATE):

TestConstraintBasedAlgorithm.testSortingEarlyQuitLackCapacity
TestWagedRebalancer.testNonCompatibleConfiguration
TestWagedRebalancer.testInvalidClusterStatus
TestWagedRebalancer.testInvalidRebalancerStatus

Hardened a fragile pre-existing assertion:

TestWagedRebalancerMetrics.testMetricValuePropagation previously cast all metric values to long; it worked only because HashMap iteration order happened to surface a Long-typed metric with a non-zero value before BaselineDivergenceGauge (which returns Double). The new metrics changed the iteration order and exposed the latent bug. Switched the cast to Number.longValue() which handles both types.
The following is the result of the "mvn test" command on the appropriate module:

Suite	Tests	Result
`controller/rebalancer/waged/**` (WAGED unit tests + constraints package, includes new per-HardConstraint tests)	94	passed
`controller/stages/` + `monitoring/mbeans/` + `TestHelixRebalanceException`	55	passed
`integration/rebalancer/TestAbnormalStatesResolver` (integration)	2	passed
`integration/rebalancer/WagedRebalancer/TestWagedRebalance` (real-ZK end-to-end)	12	passed
Total	163	all passed

Example invocation:

$ mvn -pl helix-core test -Dtest=TestWagedRebalancer
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.508 s
[INFO] BUILD SUCCESS

Integration log line (real cluster triggering a capacity deficit) confirming end-to-end attribution through the async runner:

HelixRebalanceException: Failed to calculate for the new Baseline.
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT
Caused by: HelixRebalanceException: The cluster 'CLUSTER_TestWagedRebalance' does not have
  enough key capacity for all partitions. Total capacity: 9, Required: 360, Deficit: 351
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT

Changes that Break Backward Compatibility (Optional)

My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

No breaking changes to public API. Specifically:

All existing HelixRebalanceException constructors are preserved verbatim. getMessage() returns byte-for-byte the same string for any caller that does not pass the new FailureCategory argument (the message helper special-cases FailureCategory.UNKNOWN to omit the new suffix).
getFailureType() behavior is unchanged.
StatefulRebalancer.computeNewIdealStates signature is unchanged.
ClusterStatusMonitorMBean only gains new methods; nothing is renamed or removed. Existing JMX scrapers continue to see the original attributes.
WagedRebalancer.setClusterStatusMonitor is additive; subclasses (ReadOnlyWagedRebalancer) and external callers that do not call it continue to operate -- all metric-publication code paths null-check the reference.
HardConstraint is widened from package-private to public abstract so the typed Type enum can be referenced from the metric-reporting layer. The class remains abstract and all 6 subclasses stay package-private, so no new operational surface is exposed (only the type identifier).
ConstraintBasedAlgorithm is widened from package-private to public for the same reason -- so WagedRebalancer can install the failure reporter via an instanceof check. The class is not intended for direct external instantiation; the factory remains the entry point.

Behavior changes that are not API breaks but worth noting:

validateInput failures now tick RebalanceFailureCounter and the matching per-category counter (previously they were not counted at the cluster level). The counter is monotonic so the step-up is observable but does not regress any contract.
Async-runner re-throws now preserve the original Type and Category (previously every async failure was collapsed to FAILED_TO_CALCULATE). This is a strict improvement in attribution but tests that asserted the old lossy Type were updated.
Exception messages constructed via the new 3/4-arg constructors include Category: X after Failure Type: X. Existing callers using the 2/3-arg constructors see no change.
WagedRebalancer constructs a fresh ConstraintBasedAlgorithm per instance instead of sharing the previously static DEFAULT_REBALANCE_ALGORITHM field. The static field was an optimization that prevented per-cluster algorithm state; the shared ForkJoinPool inside the factory is still reused across instances, so the per-instance cost is one allocation at controller startup.

Documentation (Optional)

In case of new functionality, my PR adds documentation in the following wiki page:

N/A. New MBean attributes are self-documenting via the Javadoc on ClusterStatusMonitorMBean and the existing WagedRebalancerMetricCollector.WagedRebalancerMetricNames enum.

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference) -- the descriptive subject runs slightly longer to capture the full intent; happy to shorten if requested
3. Subject does not end with a period
4. Subject uses the imperative mood ("Classify", not "Classifying")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

Code style follows the surrounding conventions in each touched file (existing import order, indentation, brace placement, Javadoc style). No formatter run was applied to unmodified lines.

Performance impact (additional context)

Hot path (ConstraintBasedAlgorithm.calculate() scoring loop): untouched. The new code only runs in catch blocks or the existing "no eligible candidate" branch.
Success path overhead per pipeline event: ~10-20 nanoseconds. The added work is one volatile read of _clusterStatusMonitor, one null check, and one volatile boolean write to flip the fallback gauge to false. Zero new allocations on the success path.
Async runner overhead per submit: one AtomicReference.set(null) call (~2 nanoseconds). The capture inside the catch only fires on exception.
Per-HardConstraint reporter overhead: runs only inside the existing if (candidateNodes.isEmpty()) branch -- i.e., only when a partition has already failed to find any eligible node. The work is a stream().distinct().forEach() over the constraints that contributed (typically < 10 elements). Zero overhead on the success path; sub-microsecond on the failure path.
Per-WagedRebalancer algorithm instance: one extra ConstraintBasedAlgorithm allocation at controller startup vs. sharing the static singleton. The shared ForkJoinPool is still reused. Negligible.
Memory: ~400 bytes added per ClusterStatusMonitor instance (one per cluster) for the per-category and per-HardConstraint counter maps plus rollup atomics.
JMX surface: 18 new attributes on ClusterStatusMonitor (8 per-category + 2 rollup + 1 gauge + 7 per-HardConstraint), 15 new attributes on WagedRebalancerMetricCollector (8 per-category + 7 per-HardConstraint). Each getter is AtomicLong.get() or ConcurrentHashMap.get(...).get() -- microsecond-scale and only polled by external scrapers at typical 30-60s intervals.

…l alert routing Add a customer-vs-Helix ownership dimension to every WAGED HelixRebalanceException so cluster operators can route alerts to the right team without grepping controller logs. Today RebalanceFailureCounter is dimensionless: a capacity deficit (customer config), a constraint mismatch (customer config), a metadata-store IO failure (Helix infra), and an algorithm crash (Helix infra) all collapse into the same counter, paging the Helix team for problems they cannot fix. Changes: - New HelixRebalanceException.FailureCategory enum with 8 values + a customerActionable flag. Legacy constructors preserved byte-for-byte; new constructors append "Category: X" to the message only when category != UNKNOWN. - All 17 WAGED throw sites declare a category (capacity deficit, no candidate node, invalid resource config, invalid cluster config, metadata store IO, algorithm internal, async execution). - Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the original exception via AtomicReference and re-throw with the original Type + Category preserved. Previously these flattened every failure into a generic FAILED_TO_CALCULATE, losing all attribution. - WagedRebalancer's catch block reports the category to ClusterStatusMonitor on every failure. Also wraps validateInput so config-validation failures are counted (previously silent at the cluster level). - WagedRebalancerMetricCollector adds 8 per-category counters in the Rebalancer JMX domain. ClusterStatusMonitor mirrors them onto the ClusterStatus JMX domain along with two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) and a WagedFallbackInUseGauge that flips when WAGED returns the last-known-good assignment instead of a freshly computed one. - GenericHelixController wires the cluster monitor into the rebalancer at construction; the rebalancer propagates the reference to both async runners. Tests: - New TestHelixRebalanceException covers legacy/new constructors, message format, and customer-actionable flagging. - TestClusterStatusMonitor gets four new tests for per-category counters, rollup math, null-category handling, and the fallback gauge. - Three existing assertions in TestWagedRebalancer and TestConstraintBasedAlgorithm were updated to reflect the improved attribution (they previously asserted the lossy generic FAILED_TO_CALCULATE that the async wrappers used to emit). Verified: 152 tests across waged unit, controller stages, monitoring mbeans, and integration suites pass. No allocations on the success path; ~10-20ns added per successful pipeline event (one volatile boolean write). The constraint scoring hot loop is untouched. Backward-compatible public API: HelixRebalanceException legacy constructors and message format preserved when no category is supplied.

…ttribution Builds on the FailureCategory work in the previous commit. The WagedFailureNoCandidateNodeCounter bucket conflated all hard-constraint failures (fault zone, tag mismatch, capacity-per-node, max-partitions, replica activation, same-partition placement). Operators investigating a "no candidate node" failure still had to grep controller logs to learn which constraint blocked placement -- the most common WAGED on-call question. Changes: - HardConstraint widened from package-private to public abstract so the metric layer can reference its types. New nested Type enum with 7 values (FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT, REPLICA_ACTIVATE, SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG, UNKNOWN). getType() overridden on each of the 6 concrete subclasses. - ConstraintBasedAlgorithm now public; exposes a setHardConstraintFailureReporter(Consumer<HardConstraint.Type>) hook. When a partition fails to find any eligible node, every distinct constraint type that contributed gets reported exactly once (partition-level set-union, not per-node-rejection count). - WagedRebalancerMetricCollector adds 7 per-HardConstraint counters in the Rebalancer JMX domain. ClusterStatusMonitor mirrors them on the ClusterStatus JMX domain, plus a reportWagedHardConstraintFailure(Type) increment method. - WagedRebalancer.setClusterStatusMonitor installs the reporter onto ConstraintBasedAlgorithm so production rebalance attempts populate the cluster-level counters. updateRebalancePreference re-installs the reporter when the algorithm is swapped. - WagedRebalancer drops the previously shared DEFAULT_REBALANCE_ALGORITHM static singleton in favor of a fresh ConstraintBasedAlgorithmFactory call per WagedRebalancer instance. The shared ForkJoinPool inside the factory is still reused; the change ensures the per-cluster failure reporter does not race across WagedRebalancer instances co-located in the same JVM. Tests: - TestClusterStatusMonitor gets three new tests for per-HardConstraint counters (start-at-zero, routing to the right counter, null handling). - TestConstraintBasedAlgorithm gets two new tests: end-to-end reporter wiring with a real NodeCapacityConstraint forced to reject every candidate (asserts partition-level set-union semantics), and a no-op test verifying the algorithm fails gracefully when the reporter is unset. - TestWagedRebalancerMetrics.testMetricValuePropagation had a fragile cast that assumed all metric values were Long; the new metrics changed HashMap iteration order and surfaced the latent bug. Switched the cast to Number.longValue() which handles both Long and Double. Verified: 163 tests across waged unit, controller stages, monitoring mbeans, and the real-ZK integration suite pass. The per-HardConstraint reporter only fires inside the existing "no eligible candidate" branch, so zero overhead on the success path. The integration log line End-to-end confirms attribution survives async-runner re-throws: "Failed to calculate for the new Baseline. Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT".

…eporting Self-review and an independent reviewer caught three bugs in the prior two commits on this branch. This commit fixes them and tightens a few style nits. Fixes: - The 15 per-FailureCategory and per-HardConstraint CountMetric entries registered on WagedRebalancerMetricCollector were never incremented: the failure paths only ticked the ClusterStatusMonitor mirror. The Rebalancer:cluster=X,entity=WagedRebalancer MBean exposed 15 stuck-at-zero attributes. WagedRebalancer now pre-resolves an EnumMap per dimension at construction and reportFailureCategory / reportHardConstraintFailure tick both the Rebalancer-domain counter and the ClusterStatusMonitor counter. - ClusterStatusMonitor.reset() previously zeroed the legacy rebalance counters on leadership change but not the new WAGED counters, carrying stale numbers across a leader-loss/regain cycle. The new per-FailureCategory, per-HardConstraint, two rollup counters, and the fallback gauge are now reset alongside the legacy ones. - Async runner -> ClusterStatusMonitor wiring was a duplicated setClusterStatusMonitor + reportFailureToClusterMonitor pair in both PartialRebalanceRunner and GlobalRebalanceRunner. Replaced by a single Consumer<HelixRebalanceException> injected from WagedRebalancer pointing at the new WagedRebalancer.reportAsyncFailure() which ticks the aggregate RebalanceFailureCounter plus both per-FailureCategory MBeans. Removes the duplicate cluster-monitor plumbing on each runner. Cleanups: - Drop the unused DEFAULT_REBALANCE_PREFERENCE alias constant; inline ClusterConfig.DEFAULT_GLOBAL_REBALANCE_PREFERENCE at the single call site. - Convert the failure-reporter lambda in installHardConstraintFailureReporter to a method reference now that the destination is a method on WagedRebalancer itself. - Capture _clusterStatusMonitor into a local before the null check + call in computeNewIdealStates' fallback-gauge update path so the pattern is uniform with the runner-side reporters. - Fix import order in ClusterStatusMonitor and ConstraintBasedAlgorithm. Tests: - TestWagedRebalancer.testAlgorithmException now also asserts the FailureCategoryUnknownCounter (the Rebalancer-domain per-category counter) ticks. This locks in the wiring fix above; previously the counter was registered but never incremented, and no test caught it. Verified: 149 unit + stage + monitoring tests pass after the refactor, plus the 14 real-ZK integration tests in TestWagedRebalance and TestAbnormalStatesResolver. Behavior change worth noting: INVALID_REBALANCER_STATUS errors that originate inside the async runners no longer get silently collapsed to FAILED_TO_CALCULATE and dropped into the fallback path; they now propagate to the controller as the FAILURE_TYPES_TO_PROPAGATE list intends.

…sync re-throw The previous commits in this branch had the async-runner sync re-throw preserve the ORIGINAL Type from the captured exception. That was an unintended functional change in rebalancing behavior, because the catch in WagedRebalancer.computeNewIdealStates keys its fallback-vs-propagate decision on Type: failureTypesToPropagate().contains(failureType) Pre-PR behavior collapsed every runner-internal failure to FAILED_TO_CALCULATE in the sync re-throw, which is NOT in the propagate list, so the controller always went through the last-known-good fallback path. With Type preservation, a runner that threw INVALID_REBALANCER_STATUS (e.g., transient metadata-store IO inside doGlobalRebalance / doPartialRebalance) would propagate instead of falling back. For a transient ZK glitch that resolves between calls, the old behavior gracefully recovered using a stale assignment; the new behavior would surface it as a placement failure. This commit restores the pre-PR fallback semantics by collapsing the re-throw Type back to FAILED_TO_CALCULATE while still preserving the original FailureCategory so metric attribution remains accurate. Category drives metrics; Type drives control flow; they are now kept on separate tracks consistent with the design intent. Net effect: rebalancing logic is now bitwise identical to pre-PR for all WAGED failure paths. The only differences across the branch as a whole are observability: new MBean counters, the fallback gauge, and the Category suffix on exception messages for throws that explicitly declare a category. No control-flow or placement-decision change. Test update: TestWagedRebalancer.testInvalidClusterStatus assertions flipped back from Type=INVALID_CLUSTER_STATUS to FAILED_TO_CALCULATE (matching pre-PR), with Category=INVALID_CLUSTER_CONFIG preserved from the original. Verified: 97 unit + stage + monitoring tests pass, plus the 14 real-ZK integration tests in TestWagedRebalance and TestAbnormalStatesResolver.

thestreak101

Can we split this into 2 PRs at least? First one can have the failure categorisation and second PR to add metric on top of it.

LZD-PratyushBhatt requested review from PranaviAncha, arkmish, kabragaurav, laxman-ch, ngngwr, proud-parselmouth and thestreak101 as code owners May 19, 2026 06:26

LZD-PratyushBhatt changed the title ~~Classify WAGED rebalance failures by FailureCategory for cluster-leve…~~ Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing May 19, 2026

prabhatt added 2 commits May 19, 2026 12:53

thestreak101 reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing#180

Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing#180
LZD-PratyushBhatt wants to merge 4 commits into
devfrom
lzd/improveErrorWaged

LZD-PratyushBhatt commented May 19, 2026 •

edited

Loading

Uh oh!

thestreak101 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LZD-PratyushBhatt commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

Performance impact (additional context)

Uh oh!

thestreak101 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LZD-PratyushBhatt commented May 19, 2026 •

edited

Loading