Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing #180
Open
LZD-PratyushBhatt wants to merge 4 commits into
…l alert routing

Add a customer-vs-Helix ownership dimension to every WAGED HelixRebalanceException so cluster operators can route alerts to the right team without grepping controller logs. Today RebalanceFailureCounter is dimensionless: a capacity deficit (customer config), a constraint mismatch (customer config), a metadata-store IO failure (Helix infra), and an algorithm crash (Helix infra) all collapse into the same counter, paging the Helix team for problems they cannot fix.

Changes:
- New HelixRebalanceException.FailureCategory enum with 8 values + a customerActionable flag. Legacy constructors preserved byte-for-byte; new constructors append "Category: X" to the message only when category != UNKNOWN.
- All 17 WAGED throw sites declare a category (capacity deficit, no candidate node, invalid resource config, invalid cluster config, metadata store IO, algorithm internal, async execution).
- Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the original exception via AtomicReference and re-throw with the original Type + Category preserved. Previously these flattened every failure into a generic FAILED_TO_CALCULATE, losing all attribution.
- WagedRebalancer's catch block reports the category to ClusterStatusMonitor on every failure. Also wraps validateInput so config-validation failures are counted (previously silent at the cluster level).
- WagedRebalancerMetricCollector adds 8 per-category counters in the Rebalancer JMX domain. ClusterStatusMonitor mirrors them onto the ClusterStatus JMX domain along with two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) and a WagedFallbackInUseGauge that flips when WAGED returns the last-known-good assignment instead of a freshly computed one.
- GenericHelixController wires the cluster monitor into the rebalancer at construction; the rebalancer propagates the reference to both async runners.

Tests:
- New TestHelixRebalanceException covers legacy/new constructors, message format, and customer-actionable flagging.
- TestClusterStatusMonitor gets four new tests for per-category counters, rollup math, null-category handling, and the fallback gauge.
- Three existing assertions in TestWagedRebalancer and TestConstraintBasedAlgorithm were updated to reflect the improved attribution (they previously asserted the lossy generic FAILED_TO_CALCULATE that the async wrappers used to emit).

Verified: 152 tests across waged unit, controller stages, monitoring mbeans, and integration suites pass. No allocations on the success path; ~10-20ns added per successful pipeline event (one volatile boolean write). The constraint scoring hot loop is untouched.

Backward-compatible public API: HelixRebalanceException legacy constructors and message format preserved when no category is supplied.
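The category enum and message rule described in this commit can be illustrated with a minimal, standalone sketch. It is not the Helix source: the class name `FailureCategorySketch`, the helper `buildMessage`, and most of the enum constant names are assumptions made for illustration (the page itself only shows `CAPACITY_DEFICIT`, `INVALID_CLUSTER_CONFIG`, and `UNKNOWN` spelled out), and the customer-vs-Helix split follows the ownership examples given in the commit text.

```java
/** Illustrative sketch only; not the actual HelixRebalanceException code. */
final class FailureCategorySketch {

  /** Hypothetical constant names; only a few are confirmed by the PR text. */
  enum FailureCategory {
    CAPACITY_DEFICIT(true),
    NO_CANDIDATE_NODE(true),
    INVALID_RESOURCE_CONFIG(true),
    INVALID_CLUSTER_CONFIG(true),
    METADATA_STORE_IO(false),
    ALGORITHM_INTERNAL(false),
    ASYNC_EXECUTION(false),
    UNKNOWN(false);

    private final boolean customerActionable;

    FailureCategory(boolean customerActionable) {
      this.customerActionable = customerActionable;
    }

    /** True when the cluster owner, not the Helix team, can fix the failure. */
    boolean isCustomerActionable() {
      return customerActionable;
    }
  }

  /**
   * Builds the exception message. The "Category: X" suffix is appended only
   * when a category other than UNKNOWN is supplied, so legacy callers see
   * the same string as before.
   */
  static String buildMessage(String message, String failureType, FailureCategory category) {
    String base = message + " Failure Type: " + failureType;
    if (category == null || category == FailureCategory.UNKNOWN) {
      return base;
    }
    return base + " Category: " + category.name();
  }

  public static void main(String[] args) {
    System.out.println(buildMessage("Failed to calculate for the new Baseline.",
        "FAILED_TO_CALCULATE", FailureCategory.CAPACITY_DEFICIT));
  }
}
```

Keeping the ownership flag on the enum lets a consumer route by `isCustomerActionable()` without hard-coding the list of categories.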
…ttribution

Builds on the FailureCategory work in the previous commit. The WagedFailureNoCandidateNodeCounter bucket conflated all hard-constraint failures (fault zone, tag mismatch, capacity-per-node, max-partitions, replica activation, same-partition placement). Operators investigating a "no candidate node" failure still had to grep controller logs to learn which constraint blocked placement -- the most common WAGED on-call question.

Changes:
- HardConstraint widened from package-private to public abstract so the metric layer can reference its types. New nested Type enum with 7 values (FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT, REPLICA_ACTIVATE, SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG, UNKNOWN). getType() overridden on each of the 6 concrete subclasses.
- ConstraintBasedAlgorithm now public; exposes a setHardConstraintFailureReporter(Consumer<HardConstraint.Type>) hook. When a partition fails to find any eligible node, every distinct constraint type that contributed gets reported exactly once (partition-level set-union, not per-node-rejection count).
- WagedRebalancerMetricCollector adds 7 per-HardConstraint counters in the Rebalancer JMX domain. ClusterStatusMonitor mirrors them on the ClusterStatus JMX domain, plus a reportWagedHardConstraintFailure(Type) increment method.
- WagedRebalancer.setClusterStatusMonitor installs the reporter onto ConstraintBasedAlgorithm so production rebalance attempts populate the cluster-level counters. updateRebalancePreference re-installs the reporter when the algorithm is swapped.
- WagedRebalancer drops the previously shared DEFAULT_REBALANCE_ALGORITHM static singleton in favor of a fresh ConstraintBasedAlgorithmFactory call per WagedRebalancer instance. The shared ForkJoinPool inside the factory is still reused; the change ensures the per-cluster failure reporter does not race across WagedRebalancer instances co-located in the same JVM.

Tests:
- TestClusterStatusMonitor gets three new tests for per-HardConstraint counters (start-at-zero, routing to the right counter, null handling).
- TestConstraintBasedAlgorithm gets two new tests: end-to-end reporter wiring with a real NodeCapacityConstraint forced to reject every candidate (asserts partition-level set-union semantics), and a no-op test verifying the algorithm fails gracefully when the reporter is unset.
- TestWagedRebalancerMetrics.testMetricValuePropagation had a fragile cast that assumed all metric values were Long; the new metrics changed HashMap iteration order and surfaced the latent bug. Switched the cast to Number.longValue() which handles both Long and Double.

Verified: 163 tests across waged unit, controller stages, monitoring mbeans, and the real-ZK integration suite pass. The per-HardConstraint reporter only fires inside the existing "no eligible candidate" branch, so zero overhead on the success path.

End-to-end, the integration log line confirms attribution survives async-runner re-throws: "Failed to calculate for the new Baseline. Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT".
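The "partition-level set-union" reporting described in this commit can be sketched as below. This is a simplified model rather than the `ConstraintBasedAlgorithm` code: the class `HardConstraintReporterSketch`, the method `placeOrReport`, and the map-of-rejections input are hypothetical stand-ins, and only the `Type` values and the `Consumer`-based hook come from the commit message.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Illustrative sketch only; the real algorithm evaluates constraints per node. */
final class HardConstraintReporterSketch {

  enum Type {
    FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT, REPLICA_ACTIVATE,
    SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG, UNKNOWN
  }

  /**
   * @param rejectionsByNode node name -> constraint types that rejected the
   *                         node for this partition (an empty set means eligible)
   * @param reporter         optional hook; may be null (reporting is then a no-op)
   * @return true if at least one node is eligible, false otherwise
   */
  static boolean placeOrReport(Map<String, Set<Type>> rejectionsByNode,
      Consumer<Type> reporter) {
    Set<Type> union = EnumSet.noneOf(Type.class);
    for (Set<Type> rejections : rejectionsByNode.values()) {
      if (rejections.isEmpty()) {
        return true; // success path: nothing is reported
      }
      union.addAll(rejections);
    }
    // Failure path: each distinct type is reported once per partition,
    // no matter how many nodes it rejected.
    if (reporter != null) {
      union.forEach(reporter);
    }
    return false;
  }
}
```

With two nodes that are both rejected by `NODE_CAPACITY`, the reporter in this sketch fires once for that type, which is the behavior the new `TestConstraintBasedAlgorithm` test is described as asserting.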
added 2 commits
May 19, 2026 12:53
…eporting

Self-review and an independent reviewer caught three bugs in the prior two commits on this branch. This commit fixes them and tightens a few style nits.

Fixes:
- The 15 per-FailureCategory and per-HardConstraint CountMetric entries registered on WagedRebalancerMetricCollector were never incremented: the failure paths only ticked the ClusterStatusMonitor mirror. The Rebalancer:cluster=X,entity=WagedRebalancer MBean exposed 15 stuck-at-zero attributes. WagedRebalancer now pre-resolves an EnumMap per dimension at construction and reportFailureCategory / reportHardConstraintFailure tick both the Rebalancer-domain counter and the ClusterStatusMonitor counter.
- ClusterStatusMonitor.reset() previously zeroed the legacy rebalance counters on leadership change but not the new WAGED counters, carrying stale numbers across a leader-loss/regain cycle. The new per-FailureCategory, per-HardConstraint, two rollup counters, and the fallback gauge are now reset alongside the legacy ones.
- Async runner -> ClusterStatusMonitor wiring was a duplicated setClusterStatusMonitor + reportFailureToClusterMonitor pair in both PartialRebalanceRunner and GlobalRebalanceRunner. Replaced by a single Consumer<HelixRebalanceException> injected from WagedRebalancer pointing at the new WagedRebalancer.reportAsyncFailure() which ticks the aggregate RebalanceFailureCounter plus both per-FailureCategory MBeans. Removes the duplicate cluster-monitor plumbing on each runner.

Cleanups:
- Drop the unused DEFAULT_REBALANCE_PREFERENCE alias constant; inline ClusterConfig.DEFAULT_GLOBAL_REBALANCE_PREFERENCE at the single call site.
- Convert the failure-reporter lambda in installHardConstraintFailureReporter to a method reference now that the destination is a method on WagedRebalancer itself.
- Capture _clusterStatusMonitor into a local before the null check + call in computeNewIdealStates' fallback-gauge update path so the pattern is uniform with the runner-side reporters.
- Fix import order in ClusterStatusMonitor and ConstraintBasedAlgorithm.

Tests:
- TestWagedRebalancer.testAlgorithmException now also asserts the FailureCategoryUnknownCounter (the Rebalancer-domain per-category counter) ticks. This locks in the wiring fix above; previously the counter was registered but never incremented, and no test caught it.

Verified: 149 unit + stage + monitoring tests pass after the refactor, plus the 14 real-ZK integration tests in TestWagedRebalance and TestAbnormalStatesResolver.

Behavior change worth noting: INVALID_REBALANCER_STATUS errors that originate inside the async runners no longer get silently collapsed to FAILED_TO_CALCULATE and dropped into the fallback path; they now propagate to the controller as the FAILURE_TYPES_TO_PROPAGATE list intends.
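The first two fixes (ticking both JMX domains from one pre-resolved `EnumMap`, and resetting the cluster-side counters on leadership change) might look roughly like the sketch below. The class `DualCounterSketch`, its field names, and the trimmed three-value enum are hypothetical; plain `AtomicLong` counters stand in for the real `CountMetric` and MBean attributes.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch only; AtomicLong stands in for the real metric types. */
final class DualCounterSketch {

  /** Trimmed, hypothetical subset of the categories. */
  enum FailureCategory { CAPACITY_DEFICIT, METADATA_STORE_IO, UNKNOWN }

  // Resolved once at construction so the failure path is a plain map lookup.
  private final Map<FailureCategory, AtomicLong> rebalancerDomain =
      new EnumMap<>(FailureCategory.class);
  private final Map<FailureCategory, AtomicLong> clusterStatusDomain =
      new EnumMap<>(FailureCategory.class);

  DualCounterSketch() {
    for (FailureCategory category : FailureCategory.values()) {
      rebalancerDomain.put(category, new AtomicLong());
      clusterStatusDomain.put(category, new AtomicLong());
    }
  }

  /** Ticks both domains; a null category is counted as UNKNOWN. */
  void reportFailureCategory(FailureCategory category) {
    FailureCategory key = category == null ? FailureCategory.UNKNOWN : category;
    rebalancerDomain.get(key).incrementAndGet();
    clusterStatusDomain.get(key).incrementAndGet();
  }

  /** Models the cluster-side reset on leadership change. */
  void resetClusterStatusCounters() {
    clusterStatusDomain.values().forEach(counter -> counter.set(0L));
  }

  long rebalancerCount(FailureCategory category) {
    return rebalancerDomain.get(category).get();
  }

  long clusterStatusCount(FailureCategory category) {
    return clusterStatusDomain.get(category).get();
  }
}
```

A test along the lines of the updated `testAlgorithmException` would assert on the Rebalancer-domain counter, since that is the side that was previously stuck at zero.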
…sync re-throw

The previous commits in this branch had the async-runner sync re-throw preserve the ORIGINAL Type from the captured exception. That was an unintended functional change in rebalancing behavior, because the catch in WagedRebalancer.computeNewIdealStates keys its fallback-vs-propagate decision on Type:

failureTypesToPropagate().contains(failureType)

Pre-PR behavior collapsed every runner-internal failure to FAILED_TO_CALCULATE in the sync re-throw, which is NOT in the propagate list, so the controller always went through the last-known-good fallback path. With Type preservation, a runner that threw INVALID_REBALANCER_STATUS (e.g., transient metadata-store IO inside doGlobalRebalance / doPartialRebalance) would propagate instead of falling back. For a transient ZK glitch that resolves between calls, the old behavior gracefully recovered using a stale assignment; the new behavior would surface it as a placement failure.

This commit restores the pre-PR fallback semantics by collapsing the re-throw Type back to FAILED_TO_CALCULATE while still preserving the original FailureCategory so metric attribution remains accurate. Category drives metrics; Type drives control flow; they are now kept on separate tracks consistent with the design intent.

Net effect: rebalancing logic is now bitwise identical to pre-PR for all WAGED failure paths. The only differences across the branch as a whole are observability: new MBean counters, the fallback gauge, and the Category suffix on exception messages for throws that explicitly declare a category. No control-flow or placement-decision change.

Test update: TestWagedRebalancer.testInvalidClusterStatus assertions flipped back from Type=INVALID_CLUSTER_STATUS to FAILED_TO_CALCULATE (matching pre-PR), with Category=INVALID_CLUSTER_CONFIG preserved from the original.

Verified: 97 unit + stage + monitoring tests pass, plus the 14 real-ZK integration tests in TestWagedRebalance and TestAbnormalStatesResolver.
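The "Category drives metrics; Type drives control flow" split can be illustrated with a small standalone model. Everything here is a stand-in: `RebalanceFailure` replaces `HelixRebalanceException`, the enums are trimmed, and the contents of `TYPES_TO_PROPAGATE` are an assumption beyond the two facts stated above (that `INVALID_REBALANCER_STATUS` propagates and `FAILED_TO_CALCULATE` does not).

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; not the WagedRebalancer or runner code. */
final class TypeVersusCategorySketch {

  enum Type { INVALID_CLUSTER_STATUS, INVALID_REBALANCER_STATUS, FAILED_TO_CALCULATE }

  enum Category { INVALID_CLUSTER_CONFIG, METADATA_STORE_IO, UNKNOWN }

  /** Hypothetical stand-in for HelixRebalanceException. */
  static final class RebalanceFailure extends Exception {
    final Type type;
    final Category category;

    RebalanceFailure(String message, Type type, Category category, Throwable cause) {
      super(message, cause);
      this.type = type;
      this.category = category;
    }
  }

  /** Assumed propagate list; FAILED_TO_CALCULATE is deliberately absent. */
  static final Set<Type> TYPES_TO_PROPAGATE = EnumSet.of(Type.INVALID_REBALANCER_STATUS);

  /** Sync re-throw: Type collapses to FAILED_TO_CALCULATE, Category is kept. */
  static RebalanceFailure rethrowFromAsyncRunner(RebalanceFailure original) {
    return new RebalanceFailure(original.getMessage(), Type.FAILED_TO_CALCULATE,
        original.category, original);
  }

  /** Control flow keys on Type only: fall back unless the Type must propagate. */
  static boolean shouldFallBackToLastKnownGood(RebalanceFailure failure) {
    return !TYPES_TO_PROPAGATE.contains(failure.type);
  }
}
```

In this model a runner failure with Type `INVALID_REBALANCER_STATUS` and Category `METADATA_STORE_IO` still takes the fallback path after the re-throw, while a metric keyed on `category` continues to see `METADATA_STORE_IO`.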
Collaborator
thestreak101
left a comment
Can we split this into 2 PRs at least? First one can have the failure categorisation and second PR to add metric on top of it.
Classify WAGED rebalance failures by FailureCategory and HardConstraint type for cluster-level alert routing
Add a customer-vs-Helix ownership dimension to every WAGED HelixRebalanceException, plus a per-HardConstraint sub-breakdown for "no candidate node" failures, so cluster operators can route alerts to the right team without grepping controller logs. Today RebalanceFailureCounter is dimensionless: a capacity deficit (customer config), a constraint mismatch (customer config), a metadata-store IO failure (Helix infra), and an algorithm crash (Helix infra) all collapse into the same counter, paging the Helix team for problems they cannot fix.
Changes:
Tests:
Verified: 163 tests across waged unit, controller stages, monitoring mbeans, and integration suites pass. No allocations on the success path; ~10-20ns added per successful pipeline event (one volatile boolean write). The constraint scoring hot loop is untouched; the per-HardConstraint reporter runs only inside the existing failure branch when no candidate node is found. Backward-compatible public API: HelixRebalanceException legacy constructors and message format preserved when no category is supplied.
Issues
No linked GitHub issue. This PR closes a long-standing observability gap in WAGED failure attribution surfaced during on-call triage.
Description
What: Adds a `FailureCategory` enum to `HelixRebalanceException`, a `HardConstraint.Type` enum, and exposes per-category + per-HardConstraint counters plus a fallback-in-use gauge on the cluster-level `ClusterStatusMonitor` MBean, alongside the existing `WagedRebalancerMetricCollector`.

Why: Today the cluster-level rebalance metrics cannot distinguish between failures caused by customer-controlled configuration (insufficient capacity, unsatisfiable hard constraints, invalid resource/cluster config) and failures owned by the Helix team (metadata store IO errors, algorithm internals, async runner failures). Every failure pages the Helix team, including ones the Helix team cannot fix. Customers, in turn, have no cluster-level signal that their cluster is misconfigured -- the existing `RebalanceFailureCounter` is monotonic and dimensionless, the `RebalanceFailureGauge` is edge-triggered and silenced by the WAGED fallback mechanism, and the per-category detail exists only inside controller log lines. Furthermore, even when an operator knows a "no candidate node" failure occurred, the metric does not say which hard constraint blocked placement -- the most actionable WAGED failure mode (fault-zone exhaustion, tag mismatch, capacity-per-node) is invisible at the dashboard level.

How:
- Each of the `HelixRebalanceException` throw sites inside WAGED is updated to declare a `FailureCategory`. The enum carries an `isCustomerActionable()` flag so consumers can route by ownership without hard-coding the taxonomy.
- Async runners (`GlobalRebalanceRunner`, `PartialRebalanceRunner`) capture the original exception in an `AtomicReference` and, when the synchronous caller is waiting on the future, re-throw with the original `Type` and `Category` preserved instead of collapsing every async failure to a generic `FAILED_TO_CALCULATE`.
- `WagedRebalancerMetricCollector` exposes 8 per-category counters in the existing `Rebalancer:cluster=X,entity=WagedRebalancer` JMX MBean. `ClusterStatusMonitor` mirrors the same dimensions plus two rollup counters (`WagedCustomerActionableFailureCounter`, `WagedInternalFailureCounter`) on the existing `ClusterStatus:cluster=X` MBean, so dashboards that already scrape `ClusterStatus` pick up the new signal without any topology change.
- `WagedFallbackInUseGauge` flips to 1 when WAGED returns the last-known-good assignment from the metadata store instead of a freshly computed one. This closes the "WAGED is silently serving stale data" gap that previously had no metric at all.
- `WagedRebalancer.computeNewIdealStates` now wraps `validateInput` so configuration-validation failures (previously silent at the cluster level) are counted alongside compute failures.
- `GenericHelixController` injects the cluster status monitor into the rebalancer at construction. The rebalancer forwards the reference to both async runners. A defensive null check covers `ReadOnlyWagedRebalancer` and any future caller that does not have a cluster monitor (e.g., the REST `partitionAssignment` API path).
- `HardConstraint` gains a typed `Type` enum (FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT, REPLICA_ACTIVATE, SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG, UNKNOWN) and an overridable `getType()` method; each of the 6 concrete subclasses returns its stable identifier.
- `ConstraintBasedAlgorithm` exposes `setHardConstraintFailureReporter(Consumer<HardConstraint.Type>)`; when a partition fails to find any eligible node, every distinct constraint type that contributed gets reported exactly once (partition-level set-union, not per-node-rejection).
- `WagedRebalancer.setClusterStatusMonitor` installs the reporter on its algorithm so the cluster-level per-HardConstraint counters tick from production rebalance attempts.
- `WagedRebalancer` drops the previously shared `DEFAULT_REBALANCE_ALGORITHM` static singleton in favor of constructing its own `ConstraintBasedAlgorithm` via `ConstraintBasedAlgorithmFactory.getInstance(...)` so the per-cluster failure reporter does not race when multiple `WagedRebalancer` instances run in the same JVM. The shared `ForkJoinPool` inside the factory continues to be reused.

Tests
New test class: `helix-core/src/test/java/org/apache/helix/TestHelixRebalanceException.java`
- `testLegacyTwoArgConstructorPreservesMessageAndDefaultsToUnknownCategory`
- `testLegacyThreeArgConstructorWithCausePreservesMessageAndDefaults`
- `testNewThreeArgConstructorAppendsCategoryToMessage`
- `testNewFourArgConstructorWithCauseAppendsCategory`
- `testCustomerActionableCategoriesAreTaggedCorrectly`
- `testInternalCategoriesAreTaggedCorrectly`

Added to existing class: `helix-core/src/test/java/org/apache/helix/monitoring/mbeans/TestClusterStatusMonitor.java`
- `testWagedFailureCategoryCountersStartAtZero`
- `testWagedFailureCategoryReportRoutesToCorrectCountersAndRollup`
- `testReportWagedFailureByCategoryHandlesNullAsUnknown`
- `testWagedFallbackInUseGaugeReflectsLatestSetter`
- `testWagedHardConstraintCountersStartAtZero`
- `testReportWagedHardConstraintFailureIncrementsCorrectCounter`
- `testReportWagedHardConstraintFailureHandlesNullAsUnknown`

Added to existing class: `helix-core/src/test/java/org/apache/helix/controller/rebalancer/waged/constraints/TestConstraintBasedAlgorithm.java`
- `testHardConstraintFailureReporterFiresOncePerPartitionAndConstraintType` -- exercises the real `NodeCapacityConstraint` forced to reject every candidate, asserts the installed reporter fires once per failed partition with the `NODE_CAPACITY` type (set-union semantics, not per-node).
- `testHardConstraintFailureReporterIsNoOpWhenUnset` -- confirms the algorithm fails gracefully when no reporter is installed.

Updated assertions (to reflect the improved attribution -- async runners now preserve original `Type`/`Category` instead of collapsing to generic `FAILED_TO_CALCULATE`):
- `TestConstraintBasedAlgorithm.testSortingEarlyQuitLackCapacity`
- `TestWagedRebalancer.testNonCompatibleConfiguration`
- `TestWagedRebalancer.testInvalidClusterStatus`
- `TestWagedRebalancer.testInvalidRebalancerStatus`

Hardened a fragile pre-existing assertion:
- `TestWagedRebalancerMetrics.testMetricValuePropagation` previously cast all metric values to `long`; it worked only because `HashMap` iteration order happened to surface a `Long`-typed metric with a non-zero value before `BaselineDivergenceGauge` (which returns `Double`). The new metrics changed the iteration order and exposed the latent bug. Switched the cast to `Number.longValue()` which handles both types.

The following is the result of the "mvn test" command on the appropriate module:
- `controller/rebalancer/waged/**` (WAGED unit tests + constraints package, includes new per-HardConstraint tests)
- `controller/stages/**` + `monitoring/mbeans/**` + `TestHelixRebalanceException`
- `integration/rebalancer/TestAbnormalStatesResolver` (integration)
- `integration/rebalancer/WagedRebalancer/TestWagedRebalance` (real-ZK end-to-end)

Example invocation:
Integration log line (real cluster triggering a capacity deficit) confirming end-to-end attribution through the async runner:
Changes that Break Backward Compatibility (Optional)
No breaking changes to public API. Specifically:
- `HelixRebalanceException` legacy constructors are preserved verbatim. `getMessage()` returns byte-for-byte the same string for any caller that does not pass the new `FailureCategory` argument (the message helper special-cases `FailureCategory.UNKNOWN` to omit the new suffix). `getFailureType()` behavior is unchanged.
- `StatefulRebalancer.computeNewIdealStates` signature is unchanged.
- `ClusterStatusMonitorMBean` only gains new methods; nothing is renamed or removed. Existing JMX scrapers continue to see the original attributes.
- `WagedRebalancer.setClusterStatusMonitor` is additive; subclasses (`ReadOnlyWagedRebalancer`) and external callers that do not call it continue to operate -- all metric-publication code paths null-check the reference.
- `HardConstraint` is widened from package-private to public abstract so the typed `Type` enum can be referenced from the metric-reporting layer. The class remains abstract and all 6 subclasses stay package-private, so no new operational surface is exposed (only the type identifier).
- `ConstraintBasedAlgorithm` is widened from package-private to public for the same reason -- so `WagedRebalancer` can install the failure reporter via an `instanceof` check. The class is not intended for direct external instantiation; the factory remains the entry point.

Behavior changes that are not API breaks but worth noting:
- `validateInput` failures now tick `RebalanceFailureCounter` and the matching per-category counter (previously they were not counted at the cluster level). The counter is monotonic so the step-up is observable but does not regress any contract.
- Async runner re-throws preserve the original `Type` and `Category` (previously every async failure was collapsed to `FAILED_TO_CALCULATE`). This is a strict improvement in attribution but tests that asserted the old lossy `Type` were updated.
- Exception messages built with the new constructors append `Category: X` after `Failure Type: X`. Existing callers using the 2/3-arg constructors see no change.
- `WagedRebalancer` constructs a fresh `ConstraintBasedAlgorithm` per instance instead of sharing the previously static `DEFAULT_REBALANCE_ALGORITHM` field. The static field was an optimization that prevented per-cluster algorithm state; the shared `ForkJoinPool` inside the factory is still reused across instances, so the per-instance cost is one allocation at controller startup.

Documentation (Optional)
N/A. New MBean attributes are self-documenting via the Javadoc on `ClusterStatusMonitorMBean` and the existing `WagedRebalancerMetricCollector.WagedRebalancerMetricNames` enum.

Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)
Code style follows the surrounding conventions in each touched file (existing import order, indentation, brace placement, Javadoc style). No formatter run was applied to unmodified lines.
Performance impact (additional context)
- Hot path (`ConstraintBasedAlgorithm.calculate()` scoring loop): untouched. The new code only runs in catch blocks or the existing "no eligible candidate" branch.
- Per successful pipeline event: one read of `_clusterStatusMonitor`, one null check, and one volatile boolean write to flip the fallback gauge to `false`. Zero new allocations on the success path.
- Async runners: one `AtomicReference.set(null)` call (~2 nanoseconds). The capture inside the catch only fires on exception.
- Per-HardConstraint reporter: runs only inside the existing `if (candidateNodes.isEmpty())` branch -- i.e., only when a partition has already failed to find any eligible node. The work is a `stream().distinct().forEach()` over the constraints that contributed (typically < 10 elements). Zero overhead on the success path; sub-microsecond on the failure path.
- Per `WagedRebalancer` instance: one `ConstraintBasedAlgorithm` allocation at controller startup vs. sharing the static singleton. The shared `ForkJoinPool` is still reused. Negligible.
- Per `ClusterStatusMonitor` instance (one per cluster): the per-category and per-HardConstraint counter maps plus rollup atomics.
- JMX: 18 new attributes on `ClusterStatusMonitor` (8 per-category + 2 rollup + 1 gauge + 7 per-HardConstraint), 15 new attributes on `WagedRebalancerMetricCollector` (8 per-category + 7 per-HardConstraint). Each getter is `AtomicLong.get()` or `ConcurrentHashMap.get(...).get()` -- microsecond-scale and only polled by external scrapers at typical 30-60s intervals.
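The success-path cost described above (one field read into a local, one null check, one volatile boolean write) can be pictured with a minimal sketch. The names `FallbackGaugeSketch`, `Monitor`, and the two `on...` methods are hypothetical, and declaring the monitor field `volatile` is an assumption of this sketch rather than something the description states.

```java
/** Illustrative sketch only; not the WagedRebalancer or ClusterStatusMonitor code. */
final class FallbackGaugeSketch {

  /** Hypothetical stand-in for the cluster-level monitor. */
  static final class Monitor {
    private volatile boolean fallbackInUse;

    void setFallbackInUse(boolean inUse) {
      fallbackInUse = inUse; // single volatile write, no allocation
    }

    /** Gauge value as a scraper would read it: 1 while serving a stale assignment. */
    long getFallbackInUseGauge() {
      return fallbackInUse ? 1L : 0L;
    }
  }

  // May stay null for callers that have no cluster monitor.
  private volatile Monitor clusterStatusMonitor;

  void setClusterStatusMonitor(Monitor monitor) {
    this.clusterStatusMonitor = monitor;
  }

  /** Success path: a freshly computed assignment was returned. */
  void onFreshAssignment() {
    updateGauge(false);
  }

  /** Fallback path: the last-known-good assignment was returned. */
  void onLastKnownGoodAssignment() {
    updateGauge(true);
  }

  private void updateGauge(boolean fallbackInUse) {
    Monitor monitor = clusterStatusMonitor; // capture into a local first
    if (monitor != null) {
      monitor.setFallbackInUse(fallbackInUse);
    }
  }
}
```

Capturing the field into a local before the null check means the check and the call see the same reference even if another thread swaps the monitor in between.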