Skip to content

Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing#180

Open
LZD-PratyushBhatt wants to merge 4 commits into
devfrom
lzd/improveErrorWaged
Open

Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing#180
LZD-PratyushBhatt wants to merge 4 commits into
devfrom
lzd/improveErrorWaged

Conversation

@LZD-PratyushBhatt
Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt commented May 19, 2026

Classify WAGED rebalance failures by FailureCategory and HardConstraint type for cluster-level alert routing

Add a customer-vs-Helix ownership dimension to every WAGED HelixRebalanceException, plus a per-HardConstraint sub-breakdown for "no candidate node" failures, so cluster operators can route alerts to the right team without grepping controller logs. Today RebalanceFailureCounter is dimensionless: a capacity deficit (customer config), a constraint mismatch (customer config), a metadata-store IO failure (Helix infra), and an algorithm crash (Helix infra) all collapse into the same counter, paging the Helix team for problems they cannot fix.

Changes:

  • New HelixRebalanceException.FailureCategory enum with 8 values + a customerActionable flag. Legacy constructors preserved byte-for-byte; new constructors append "Category: X" to the message only when category != UNKNOWN.
  • All 17 WAGED throw sites declare a category (capacity deficit, no candidate node, invalid resource config, invalid cluster config, metadata store IO, algorithm internal, async execution).
  • Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the original exception via AtomicReference and re-throw with the original Type + Category preserved. Previously these flattened every failure into a generic FAILED_TO_CALCULATE, losing all attribution.
  • WagedRebalancer's catch block reports the category to ClusterStatusMonitor on every failure. Also wraps validateInput so config-validation failures are counted (previously silent at the cluster level).
  • WagedRebalancerMetricCollector adds 8 per-category counters in the Rebalancer JMX domain. ClusterStatusMonitor mirrors them onto the ClusterStatus JMX domain along with two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) and a WagedFallbackInUseGauge that flips when WAGED returns the last-known-good assignment instead of a freshly computed one.
  • GenericHelixController wires the cluster monitor into the rebalancer at construction; the rebalancer propagates the reference to both async runners.
  • New HardConstraint.Type enum (7 values) and getType() override on each of the 6 known HardConstraint subclasses. When a partition fails to find any eligible node, ConstraintBasedAlgorithm reports each distinct constraint type that contributed -- partition-level set-union semantics, not per-node-rejection counts. WagedRebalancerMetricCollector and ClusterStatusMonitor each gain 7 per-HardConstraint counters (FaultZone, NodeCapacity, NodeMaxPartitionLimit, ReplicaActivate, SamePartitionOnInstance, ValidGroupTag, Unknown) so operators can distinguish fault-zone failures from tag failures within the broader "no candidate node" bucket.
  • WagedRebalancer now constructs its own ConstraintBasedAlgorithm instance per WagedRebalancer (previously a single shared static singleton) so the per-cluster failure reporter does not race across WagedRebalancer instances in the same JVM. The shared ForkJoinPool inside ConstraintBasedAlgorithmFactory is still reused.

Tests:

  • New TestHelixRebalanceException covers legacy/new constructors, message format, and customer-actionable flagging.
  • TestClusterStatusMonitor gets seven new tests: four for per-category counters (rollup math, null-category handling, the fallback gauge) and three for per-HardConstraint counters (start-at-zero, routing, null handling).
  • TestConstraintBasedAlgorithm gets two new tests: end-to-end reporter wiring (uses real NodeCapacityConstraint forced to fail; asserts the reporter fires exactly once per failed partition per constraint type), and a no-op test verifying nothing throws when the reporter is unset.
  • Three existing assertions in TestWagedRebalancer and TestConstraintBasedAlgorithm were updated to reflect the improved attribution (they previously asserted the lossy generic FAILED_TO_CALCULATE that the async wrappers used to emit).
  • TestWagedRebalancerMetrics.testMetricValuePropagation had a fragile cast (assumed all metric values were Long) that surfaced after the metric-map iteration order changed; switched to Number.longValue() which handles both Long and Double.

Verified: 163 tests across waged unit, controller stages, monitoring mbeans, and integration suites pass. No allocations on the success path; ~10-20ns added per successful pipeline event (one volatile boolean write). The constraint scoring hot loop is untouched; the per-HardConstraint reporter runs only inside the existing failure branch when no candidate node is found. Backward-compatible public API: HelixRebalanceException legacy constructors and message format preserved when no category is supplied.

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

No linked GitHub issue. This PR closes a long-standing observability gap in WAGED failure attribution surfaced during on-call triage.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

What: Adds a FailureCategory enum to HelixRebalanceException, a HardConstraint.Type enum, and exposes per-category + per-HardConstraint counters plus a fallback-in-use gauge on the cluster-level ClusterStatusMonitor MBean, alongside the existing WagedRebalancerMetricCollector.

Why: Today the cluster-level rebalance metrics cannot distinguish between failures caused by customer-controlled configuration (insufficient capacity, unsatisfiable hard constraints, invalid resource/cluster config) and failures owned by the Helix team (metadata store IO errors, algorithm internals, async runner failures). Every failure pages the Helix team, including ones the Helix team cannot fix. Customers, in turn, have no cluster-level signal that their cluster is misconfigured -- the existing RebalanceFailureCounter is monotonic and dimensionless, the RebalanceFailureGauge is edge-triggered and silenced by the WAGED fallback mechanism, and the per-category detail exists only inside controller log lines. Furthermore, even when an operator knows a "no candidate node" failure occurred, the metric does not say which hard constraint blocked placement -- the most actionable WAGED failure mode (fault-zone exhaustion, tag mismatch, capacity-per-node) is invisible at the dashboard level.

How:

  1. Each of the 17 HelixRebalanceException throw sites inside WAGED is updated to declare a FailureCategory. The enum carries an isCustomerActionable() flag so consumers can route by ownership without hard-coding the taxonomy.
  2. The async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the original exception in an AtomicReference and, when the synchronous caller is waiting on the future, re-throw with the original Type and Category preserved instead of collapsing every async failure to a generic FAILED_TO_CALCULATE.
  3. WagedRebalancerMetricCollector exposes 8 per-category counters in the existing Rebalancer:cluster=X,entity=WagedRebalancer JMX MBean. ClusterStatusMonitor mirrors the same dimensions plus two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) on the existing ClusterStatus:cluster=X MBean, so dashboards that already scrape ClusterStatus pick up the new signal without any topology change.
  4. A new WagedFallbackInUseGauge flips to 1 when WAGED returns the last-known-good assignment from the metadata store instead of a freshly computed one. This closes the "WAGED is silently serving stale data" gap that previously had no metric at all.
  5. WagedRebalancer.computeNewIdealStates now wraps validateInput so configuration-validation failures (previously silent at the cluster level) are counted alongside compute failures.
  6. GenericHelixController injects the cluster status monitor into the rebalancer at construction. The rebalancer forwards the reference to both async runners. A defensive null check covers ReadOnlyWagedRebalancer and any future caller that does not have a cluster monitor (e.g., the REST partitionAssignment API path).
  7. HardConstraint gains a typed Type enum (FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT, REPLICA_ACTIVATE, SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG, UNKNOWN) and an overridable getType() method; each of the 6 concrete subclasses returns its stable identifier. ConstraintBasedAlgorithm exposes setHardConstraintFailureReporter(Consumer<HardConstraint.Type>); when a partition fails to find any eligible node, every distinct constraint type that contributed gets reported exactly once (partition-level set-union, not per-node-rejection). WagedRebalancer.setClusterStatusMonitor installs the reporter on its algorithm so the cluster-level per-HardConstraint counters tick from production rebalance attempts.
  8. WagedRebalancer drops the previously shared DEFAULT_REBALANCE_ALGORITHM static singleton in favor of constructing its own ConstraintBasedAlgorithm via ConstraintBasedAlgorithmFactory.getInstance(...) so the per-cluster failure reporter does not race when multiple WagedRebalancer instances run in the same JVM. The shared ForkJoinPool inside the factory continues to be reused.

Tests

  • The following tests are written for this issue:

New test class: helix-core/src/test/java/org/apache/helix/TestHelixRebalanceException.java

  • testLegacyTwoArgConstructorPreservesMessageAndDefaultsToUnknownCategory
  • testLegacyThreeArgConstructorWithCausePreservesMessageAndDefaults
  • testNewThreeArgConstructorAppendsCategoryToMessage
  • testNewFourArgConstructorWithCauseAppendsCategory
  • testCustomerActionableCategoriesAreTaggedCorrectly
  • testInternalCategoriesAreTaggedCorrectly

Added to existing class: helix-core/src/test/java/org/apache/helix/monitoring/mbeans/TestClusterStatusMonitor.java

  • testWagedFailureCategoryCountersStartAtZero
  • testWagedFailureCategoryReportRoutesToCorrectCountersAndRollup
  • testReportWagedFailureByCategoryHandlesNullAsUnknown
  • testWagedFallbackInUseGaugeReflectsLatestSetter
  • testWagedHardConstraintCountersStartAtZero
  • testReportWagedHardConstraintFailureIncrementsCorrectCounter
  • testReportWagedHardConstraintFailureHandlesNullAsUnknown

Added to existing class: helix-core/src/test/java/org/apache/helix/controller/rebalancer/waged/constraints/TestConstraintBasedAlgorithm.java

  • testHardConstraintFailureReporterFiresOncePerPartitionAndConstraintType -- exercises the real NodeCapacityConstraint forced to reject every candidate, asserts the installed reporter fires once per failed partition with the NODE_CAPACITY type (set-union semantics, not per-node).
  • testHardConstraintFailureReporterIsNoOpWhenUnset -- confirms the algorithm fails gracefully when no reporter is installed.

Updated assertions (to reflect the improved attribution -- async runners now preserve original Type/Category instead of collapsing to generic FAILED_TO_CALCULATE):

  • TestConstraintBasedAlgorithm.testSortingEarlyQuitLackCapacity
  • TestWagedRebalancer.testNonCompatibleConfiguration
  • TestWagedRebalancer.testInvalidClusterStatus
  • TestWagedRebalancer.testInvalidRebalancerStatus

Hardened a fragile pre-existing assertion:

  • TestWagedRebalancerMetrics.testMetricValuePropagation previously cast all metric values to long; it worked only because HashMap iteration order happened to surface a Long-typed metric with a non-zero value before BaselineDivergenceGauge (which returns Double). The new metrics changed the iteration order and exposed the latent bug. Switched the cast to Number.longValue() which handles both types.

  • The following is the result of the "mvn test" command on the appropriate module:

Suite Tests Result
controller/rebalancer/waged/** (WAGED unit tests + constraints package, includes new per-HardConstraint tests) 94 passed
controller/stages/** + monitoring/mbeans/** + TestHelixRebalanceException 55 passed
integration/rebalancer/TestAbnormalStatesResolver (integration) 2 passed
integration/rebalancer/WagedRebalancer/TestWagedRebalance (real-ZK end-to-end) 12 passed
Total 163 all passed

Example invocation:

$ mvn -pl helix-core test -Dtest=TestWagedRebalancer
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.508 s
[INFO] BUILD SUCCESS

Integration log line (real cluster triggering a capacity deficit) confirming end-to-end attribution through the async runner:

HelixRebalanceException: Failed to calculate for the new Baseline.
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT
Caused by: HelixRebalanceException: The cluster 'CLUSTER_TestWagedRebalance' does not have
  enough key capacity for all partitions. Total capacity: 9, Required: 360, Deficit: 351
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

No breaking changes to public API. Specifically:

  • All existing HelixRebalanceException constructors are preserved verbatim. getMessage() returns byte-for-byte the same string for any caller that does not pass the new FailureCategory argument (the message helper special-cases FailureCategory.UNKNOWN to omit the new suffix).
  • getFailureType() behavior is unchanged.
  • StatefulRebalancer.computeNewIdealStates signature is unchanged.
  • ClusterStatusMonitorMBean only gains new methods; nothing is renamed or removed. Existing JMX scrapers continue to see the original attributes.
  • WagedRebalancer.setClusterStatusMonitor is additive; subclasses (ReadOnlyWagedRebalancer) and external callers that do not call it continue to operate -- all metric-publication code paths null-check the reference.
  • HardConstraint is widened from package-private to public abstract so the typed Type enum can be referenced from the metric-reporting layer. The class remains abstract and all 6 subclasses stay package-private, so no new operational surface is exposed (only the type identifier).
  • ConstraintBasedAlgorithm is widened from package-private to public for the same reason -- so WagedRebalancer can install the failure reporter via an instanceof check. The class is not intended for direct external instantiation; the factory remains the entry point.

Behavior changes that are not API breaks but worth noting:

  • validateInput failures now tick RebalanceFailureCounter and the matching per-category counter (previously they were not counted at the cluster level). The counter is monotonic so the step-up is observable but does not regress any contract.
  • Async-runner re-throws now preserve the original Type and Category (previously every async failure was collapsed to FAILED_TO_CALCULATE). This is a strict improvement in attribution but tests that asserted the old lossy Type were updated.
  • Exception messages constructed via the new 3/4-arg constructors include Category: X after Failure Type: X. Existing callers using the 2/3-arg constructors see no change.
  • WagedRebalancer constructs a fresh ConstraintBasedAlgorithm per instance instead of sharing the previously static DEFAULT_REBALANCE_ALGORITHM field. The static field was an optimization that prevented per-cluster algorithm state; the shared ForkJoinPool inside the factory is still reused across instances, so the per-instance cost is one allocation at controller startup.

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

N/A. New MBean attributes are self-documenting via the Javadoc on ClusterStatusMonitorMBean and the existing WagedRebalancerMetricCollector.WagedRebalancerMetricNames enum.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference) -- the descriptive subject runs slightly longer to capture the full intent; happy to shorten if requested
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("Classify", not "Classifying")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Code style follows the surrounding conventions in each touched file (existing import order, indentation, brace placement, Javadoc style). No formatter run was applied to unmodified lines.


Performance impact (additional context)

  • Hot path (ConstraintBasedAlgorithm.calculate() scoring loop): untouched. The new code only runs in catch blocks or the existing "no eligible candidate" branch.
  • Success path overhead per pipeline event: ~10-20 nanoseconds. The added work is one volatile read of _clusterStatusMonitor, one null check, and one volatile boolean write to flip the fallback gauge to false. Zero new allocations on the success path.
  • Async runner overhead per submit: one AtomicReference.set(null) call (~2 nanoseconds). The capture inside the catch only fires on exception.
  • Per-HardConstraint reporter overhead: runs only inside the existing if (candidateNodes.isEmpty()) branch -- i.e., only when a partition has already failed to find any eligible node. The work is a stream().distinct().forEach() over the constraints that contributed (typically < 10 elements). Zero overhead on the success path; sub-microsecond on the failure path.
  • Per-WagedRebalancer algorithm instance: one extra ConstraintBasedAlgorithm allocation at controller startup vs. sharing the static singleton. The shared ForkJoinPool is still reused. Negligible.
  • Memory: ~400 bytes added per ClusterStatusMonitor instance (one per cluster) for the per-category and per-HardConstraint counter maps plus rollup atomics.
  • JMX surface: 18 new attributes on ClusterStatusMonitor (8 per-category + 2 rollup + 1 gauge + 7 per-HardConstraint), 15 new attributes on WagedRebalancerMetricCollector (8 per-category + 7 per-HardConstraint). Each getter is AtomicLong.get() or ConcurrentHashMap.get(...).get() -- microsecond-scale and only polled by external scrapers at typical 30-60s intervals.

…l alert routing

Add a customer-vs-Helix ownership dimension to every WAGED HelixRebalanceException
so cluster operators can route alerts to the right team without grepping controller
logs. Today RebalanceFailureCounter is dimensionless: a capacity deficit (customer
config), a constraint mismatch (customer config), a metadata-store IO failure
(Helix infra), and an algorithm crash (Helix infra) all collapse into the same
counter, paging the Helix team for problems they cannot fix.

Changes:
- New HelixRebalanceException.FailureCategory enum with 8 values + a
  customerActionable flag. Legacy constructors preserved byte-for-byte; new
  constructors append "Category: X" to the message only when category != UNKNOWN.
- All 17 WAGED throw sites declare a category (capacity deficit, no candidate
  node, invalid resource config, invalid cluster config, metadata store IO,
  algorithm internal, async execution).
- Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) capture the
  original exception via AtomicReference and re-throw with the original Type +
  Category preserved. Previously these flattened every failure into a generic
  FAILED_TO_CALCULATE, losing all attribution.
- WagedRebalancer's catch block reports the category to ClusterStatusMonitor on
  every failure. Also wraps validateInput so config-validation failures are
  counted (previously silent at the cluster level).
- WagedRebalancerMetricCollector adds 8 per-category counters in the
  Rebalancer JMX domain. ClusterStatusMonitor mirrors them onto the
  ClusterStatus JMX domain along with two rollup counters
  (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) and a
  WagedFallbackInUseGauge that flips when WAGED returns the last-known-good
  assignment instead of a freshly computed one.
- GenericHelixController wires the cluster monitor into the rebalancer at
  construction; the rebalancer propagates the reference to both async runners.

Tests:
- New TestHelixRebalanceException covers legacy/new constructors, message
  format, and customer-actionable flagging.
- TestClusterStatusMonitor gets four new tests for per-category counters,
  rollup math, null-category handling, and the fallback gauge.
- Three existing assertions in TestWagedRebalancer and TestConstraintBasedAlgorithm
  were updated to reflect the improved attribution (they previously asserted the
  lossy generic FAILED_TO_CALCULATE that the async wrappers used to emit).

Verified: 152 tests across waged unit, controller stages, monitoring mbeans, and
integration suites pass. No allocations on the success path; ~10-20ns added per
successful pipeline event (one volatile boolean write). The constraint scoring
hot loop is untouched. Backward-compatible public API: HelixRebalanceException
legacy constructors and message format preserved when no category is supplied.
…ttribution

Builds on the FailureCategory work in the previous commit. The
WagedFailureNoCandidateNodeCounter bucket conflated all hard-constraint
failures (fault zone, tag mismatch, capacity-per-node, max-partitions,
replica activation, same-partition placement). Operators investigating a
"no candidate node" failure still had to grep controller logs to learn
which constraint blocked placement -- the most common WAGED on-call
question.

Changes:
- HardConstraint widened from package-private to public abstract so the
  metric layer can reference its types. New nested Type enum with 7
  values (FAULT_ZONE, NODE_CAPACITY, NODE_MAX_PARTITION_LIMIT,
  REPLICA_ACTIVATE, SAME_PARTITION_ON_INSTANCE, VALID_GROUP_TAG,
  UNKNOWN). getType() overridden on each of the 6 concrete subclasses.
- ConstraintBasedAlgorithm now public; exposes a
  setHardConstraintFailureReporter(Consumer<HardConstraint.Type>) hook.
  When a partition fails to find any eligible node, every distinct
  constraint type that contributed gets reported exactly once
  (partition-level set-union, not per-node-rejection count).
- WagedRebalancerMetricCollector adds 7 per-HardConstraint counters in
  the Rebalancer JMX domain. ClusterStatusMonitor mirrors them on the
  ClusterStatus JMX domain, plus a reportWagedHardConstraintFailure(Type)
  increment method.
- WagedRebalancer.setClusterStatusMonitor installs the reporter onto
  ConstraintBasedAlgorithm so production rebalance attempts populate the
  cluster-level counters. updateRebalancePreference re-installs the
  reporter when the algorithm is swapped.
- WagedRebalancer drops the previously shared DEFAULT_REBALANCE_ALGORITHM
  static singleton in favor of a fresh ConstraintBasedAlgorithmFactory
  call per WagedRebalancer instance. The shared ForkJoinPool inside the
  factory is still reused; the change ensures the per-cluster failure
  reporter does not race across WagedRebalancer instances co-located in
  the same JVM.

Tests:
- TestClusterStatusMonitor gets three new tests for per-HardConstraint
  counters (start-at-zero, routing to the right counter, null handling).
- TestConstraintBasedAlgorithm gets two new tests: end-to-end reporter
  wiring with a real NodeCapacityConstraint forced to reject every
  candidate (asserts partition-level set-union semantics), and a no-op
  test verifying the algorithm fails gracefully when the reporter is
  unset.
- TestWagedRebalancerMetrics.testMetricValuePropagation had a fragile
  cast that assumed all metric values were Long; the new metrics changed
  HashMap iteration order and surfaced the latent bug. Switched the
  cast to Number.longValue() which handles both Long and Double.

Verified: 163 tests across waged unit, controller stages, monitoring
mbeans, and the real-ZK integration suite pass. The per-HardConstraint
reporter only fires inside the existing "no eligible candidate" branch,
so zero overhead on the success path. The integration log line
End-to-end confirms attribution survives async-runner re-throws:
"Failed to calculate for the new Baseline. Failure Type:
FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT".
@LZD-PratyushBhatt LZD-PratyushBhatt changed the title Classify WAGED rebalance failures by FailureCategory for cluster-leve… Classify WAGED rebalance failures by FailureCategory for cluster-level alert routing May 19, 2026
prabhatt added 2 commits May 19, 2026 12:53
…eporting

Self-review and an independent reviewer caught three bugs in the prior two
commits on this branch. This commit fixes them and tightens a few style
nits.

Fixes:
- The 15 per-FailureCategory and per-HardConstraint CountMetric entries
  registered on WagedRebalancerMetricCollector were never incremented:
  the failure paths only ticked the ClusterStatusMonitor mirror. The
  Rebalancer:cluster=X,entity=WagedRebalancer MBean exposed 15
  stuck-at-zero attributes. WagedRebalancer now pre-resolves an EnumMap
  per dimension at construction and reportFailureCategory /
  reportHardConstraintFailure tick both the Rebalancer-domain counter
  and the ClusterStatusMonitor counter.
- ClusterStatusMonitor.reset() previously zeroed the legacy rebalance
  counters on leadership change but not the new WAGED counters,
  carrying stale numbers across a leader-loss/regain cycle. The new
  per-FailureCategory, per-HardConstraint, two rollup counters, and the
  fallback gauge are now reset alongside the legacy ones.
- Async runner -> ClusterStatusMonitor wiring was a duplicated
  setClusterStatusMonitor + reportFailureToClusterMonitor pair in both
  PartialRebalanceRunner and GlobalRebalanceRunner. Replaced by a
  single Consumer<HelixRebalanceException> injected from WagedRebalancer
  pointing at the new WagedRebalancer.reportAsyncFailure() which ticks
  the aggregate RebalanceFailureCounter plus both per-FailureCategory
  MBeans. Removes the duplicate cluster-monitor plumbing on each runner.

Cleanups:
- Drop the unused DEFAULT_REBALANCE_PREFERENCE alias constant; inline
  ClusterConfig.DEFAULT_GLOBAL_REBALANCE_PREFERENCE at the single call
  site.
- Convert the failure-reporter lambda in installHardConstraintFailureReporter
  to a method reference now that the destination is a method on
  WagedRebalancer itself.
- Capture _clusterStatusMonitor into a local before the null check +
  call in computeNewIdealStates' fallback-gauge update path so the
  pattern is uniform with the runner-side reporters.
- Fix import order in ClusterStatusMonitor and ConstraintBasedAlgorithm.

Tests:
- TestWagedRebalancer.testAlgorithmException now also asserts the
  FailureCategoryUnknownCounter (the Rebalancer-domain per-category
  counter) ticks. This locks in the wiring fix above; previously the
  counter was registered but never incremented, and no test caught it.

Verified: 149 unit + stage + monitoring tests pass after the refactor,
plus the 14 real-ZK integration tests in TestWagedRebalance and
TestAbnormalStatesResolver. Behavior change worth noting:
INVALID_REBALANCER_STATUS errors that originate inside the async runners
no longer get silently collapsed to FAILED_TO_CALCULATE and dropped into
the fallback path; they now propagate to the controller as the
FAILURE_TYPES_TO_PROPAGATE list intends.
…sync re-throw

The previous commits in this branch had the async-runner sync re-throw
preserve the ORIGINAL Type from the captured exception. That was an
unintended functional change in rebalancing behavior, because the catch
in WagedRebalancer.computeNewIdealStates keys its fallback-vs-propagate
decision on Type:

  failureTypesToPropagate().contains(failureType)

Pre-PR behavior collapsed every runner-internal failure to
FAILED_TO_CALCULATE in the sync re-throw, which is NOT in the
propagate list, so the controller always went through the
last-known-good fallback path. With Type preservation, a runner that
threw INVALID_REBALANCER_STATUS (e.g., transient metadata-store IO
inside doGlobalRebalance / doPartialRebalance) would propagate instead
of falling back. For a transient ZK glitch that resolves between
calls, the old behavior gracefully recovered using a stale assignment;
the new behavior would surface it as a placement failure.

This commit restores the pre-PR fallback semantics by collapsing the
re-throw Type back to FAILED_TO_CALCULATE while still preserving the
original FailureCategory so metric attribution remains accurate.
Category drives metrics; Type drives control flow; they are now kept
on separate tracks consistent with the design intent.

Net effect: rebalancing logic is now bitwise identical to pre-PR for
all WAGED failure paths. The only differences across the branch as a
whole are observability: new MBean counters, the fallback gauge, and
the Category suffix on exception messages for throws that explicitly
declare a category. No control-flow or placement-decision change.

Test update: TestWagedRebalancer.testInvalidClusterStatus assertions
flipped back from Type=INVALID_CLUSTER_STATUS to FAILED_TO_CALCULATE
(matching pre-PR), with Category=INVALID_CLUSTER_CONFIG preserved
from the original.

Verified: 97 unit + stage + monitoring tests pass, plus the 14 real-ZK
integration tests in TestWagedRebalance and TestAbnormalStatesResolver.
Copy link
Copy Markdown
Collaborator

@thestreak101 thestreak101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split this into 2 PRs at least? First one can have the failure categorisation and second PR to add metric on top of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants