Expose WAGED FailureCategory and HardConstraint counters as cluster-level MBean attributes by LZD-PratyushBhatt · Pull Request #188 · linkedin/helix

LZD-PratyushBhatt · 2026-05-27T05:44:16Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

Follow-up to #187. No separate GitHub issue.

Description

Here are some details about my PR, including screenshots of any UI changes:

What: Exposes the FailureCategory and HardConstraint.Type dimensions introduced by #187 as JMX MBean attributes on both the existing Rebalancer:cluster=X,entity=WagedRebalancer MBean (via WagedRebalancerMetricCollector) and the existing ClusterStatus:cluster=X MBean (via ClusterStatusMonitor). Adds a new WagedFallbackInUseGauge that flips to 1 when WAGED returns the last-known-good assignment instead of a freshly computed one. Plumbs the cluster monitor reference through GenericHelixController -> WagedRebalancer -> the async runners.

Why: #187 added the categorization data to HelixRebalanceException and surfaced it in logs, but cluster-level dashboards still see only the dimensionless RebalanceFailureCounter. Customers cannot author alerts that route on category from the existing JMX surface; on-call cannot distinguish customer-actionable failures from Helix-team-owned failures without scraping logs. This PR closes that gap by mirroring the typed counters onto both MBean domains. The two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) are the recommended alert-routing signals.

The new WagedFallbackInUseGauge closes a separate silent-failure gap: today, when a WAGED failure with Type in the non-propagating set occurs, the rebalancer falls back to the last-known-good assignment and the controller pipeline succeeds. No cluster-level signal indicates that WAGED is serving stale data. This PR introduces a binary gauge that flips to 1 on the next pipeline run that uses the fallback, and back to 0 when a fresh calc succeeds.

How:

WagedRebalancerMetricCollector -- adds 15 new CountMetric entries to the existing WagedRebalancerMetricNames enum:
- 8 per-FailureCategory counters: FailureCategoryCapacityDeficitCounter, FailureCategoryNoCandidateNodeCounter, FailureCategoryInvalidResourceConfigCounter, FailureCategoryInvalidClusterConfigCounter, FailureCategoryMetadataStoreIoCounter, FailureCategoryAlgorithmInternalCounter, FailureCategoryAsyncExecutionCounter, FailureCategoryUnknownCounter.
- 7 per-HardConstraint.Type counters: HardConstraintFaultZoneFailureCounter, HardConstraintNodeCapacityFailureCounter, HardConstraintNodeMaxPartitionLimitFailureCounter, HardConstraintReplicaActivateFailureCounter, HardConstraintSamePartitionOnInstanceFailureCounter, HardConstraintValidGroupTagFailureCounter, HardConstraintUnknownFailureCounter.
ClusterStatusMonitorMBean + ClusterStatusMonitor -- 18 new MBean attributes: the same 15 per-category and per-HardConstraint counters, plus two rollup counters and the fallback gauge:
- Per-category and per-HardConstraint counters use Map<Enum, AtomicLong> pre-populated for every enum value in the constructor, so dashboards see a stable 0 for unused dimensions.
- WagedCustomerActionableFailureCounter / WagedInternalFailureCounter -- rollup counters; reportWagedFailureByCategory increments both the per-category counter and the appropriate rollup based on FailureCategory.isCustomerActionable().
- WagedFallbackInUseGauge -- 0/1 flag set by setWagedFallbackInUseGauge(boolean) from WagedRebalancer.
- reset() extended to zero all new counters/gauge on leadership change.
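The counter layout described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class name `FailureCounterSketch`, the trimmed-down `Category` enum, and the choice of which categories are customer-actionable are all assumptions made for the example.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names and the enum contents are hypothetical.
class FailureCounterSketch {
  // Hypothetical subset of categories. Which ones count as customer-actionable
  // is an assumption made for this example.
  enum Category {
    CAPACITY_DEFICIT(true),
    INVALID_RESOURCE_CONFIG(true),
    ALGORITHM_INTERNAL(false),
    UNKNOWN(false);

    private final boolean customerActionable;

    Category(boolean customerActionable) {
      this.customerActionable = customerActionable;
    }

    boolean isCustomerActionable() {
      return customerActionable;
    }
  }

  private final Map<Category, AtomicLong> perCategory = new EnumMap<>(Category.class);
  private final AtomicLong customerActionableTotal = new AtomicLong();
  private final AtomicLong internalTotal = new AtomicLong();

  FailureCounterSketch() {
    // Pre-populate every enum value so getters return 0 before any failure.
    for (Category c : Category.values()) {
      perCategory.put(c, new AtomicLong());
    }
  }

  // Increments the per-category counter and exactly one rollup counter.
  void report(Category category) {
    Category c = (category == null) ? Category.UNKNOWN : category;
    perCategory.get(c).incrementAndGet();
    (c.isCustomerActionable() ? customerActionableTotal : internalTotal).incrementAndGet();
  }

  long get(Category c) {
    return perCategory.get(c).get();
  }

  long getCustomerActionableTotal() {
    return customerActionableTotal.get();
  }

  long getInternalTotal() {
    return internalTotal.get();
  }

  // Zeroes every counter, mirroring what reset() is described as doing.
  void reset() {
    perCategory.values().forEach(counter -> counter.set(0L));
    customerActionableTotal.set(0L);
    internalTotal.set(0L);
  }

  public static void main(String[] args) {
    FailureCounterSketch monitor = new FailureCounterSketch();
    monitor.report(Category.CAPACITY_DEFICIT);
    monitor.report(null);
    System.out.println(monitor.get(Category.CAPACITY_DEFICIT) + " " + monitor.get(Category.UNKNOWN));
  }
}
```

Because the map is fully populated in the constructor and never structurally modified afterwards, getters can read it without a null check, which is the property the "stable 0" bullet relies on.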
HardConstraint and ConstraintBasedAlgorithm -- widened from package-private to public so the metric-reporting layer in monitoring.mbeans can reference HardConstraint.Type and install a Consumer<HardConstraint.Type> reporter on ConstraintBasedAlgorithm. Concrete HardConstraint subclasses stay package-private; no new operational surface beyond the type identifier.
ConstraintBasedAlgorithm -- adds setHardConstraintFailureReporter(Consumer<HardConstraint.Type>). Inside the existing "no eligible candidate" branch (the same branch that already builds the per-node-per-constraint failure map), the reporter is invoked once per distinct constraint type that contributed to the partition's failure (set union across nodes, not summed). This ensures partition-level attribution: a constraint that rejected 50 nodes for a single partition fires the counter +1, not +50.
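The set-union semantics can be illustrated with a small standalone sketch. The names below (`HardConstraintReporterSketch`, `ConstraintType`, `reportPartitionFailure`) and the shape of the per-node failure map are assumptions for illustration; only the "distinct types per partition, invoked through a `Consumer`" idea comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative sketch only; not the Helix implementation.
class HardConstraintReporterSketch {
  // Hypothetical stand-in for HardConstraint.Type.
  enum ConstraintType { FAULT_ZONE, NODE_CAPACITY, VALID_GROUP_TAG }

  // failuresByNode maps a node name to the constraint types that rejected the
  // partition on that node. The reporter fires once per distinct type.
  static void reportPartitionFailure(Map<String, List<ConstraintType>> failuresByNode,
      Consumer<ConstraintType> reporter) {
    if (reporter == null) {
      return; // no reporter installed: nothing to do
    }
    failuresByNode.values().stream()
        .flatMap(List::stream)
        .distinct()
        .forEach(reporter);
  }

  // Builds a partition rejected by 50 nodes on NODE_CAPACITY (one of them also
  // on FAULT_ZONE) and returns how many times each type was reported.
  static Map<ConstraintType, Integer> demo() {
    Map<String, List<ConstraintType>> failuresByNode = new HashMap<>();
    for (int i = 0; i < 50; i++) {
      failuresByNode.put("node" + i, new ArrayList<>(Arrays.asList(ConstraintType.NODE_CAPACITY)));
    }
    failuresByNode.get("node0").add(ConstraintType.FAULT_ZONE);

    Map<ConstraintType, Integer> counts = new EnumMap<>(ConstraintType.class);
    reportPartitionFailure(failuresByNode, type -> counts.merge(type, 1, Integer::sum));
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In this sketch the 50 NODE_CAPACITY rejections collapse into a single report, so a counter driven by the reporter tracks failed partitions rather than rejected nodes.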
WagedRebalancer -- substantial wiring:
- Pre-resolves EnumMap<FailureCategory, CountMetric> and EnumMap<HardConstraint.Type, CountMetric> at construction; subsequent reporting paths increment by _failureCategoryMetrics.get(category).increment(1L) instead of looking up by name on every failure.
- New methods reportFailureCategory(HelixRebalanceException), reportHardConstraintFailure(HardConstraint.Type), reportAsyncFailure(HelixRebalanceException) -- each ticks both the Rebalancer-domain and ClusterStatus-domain counters via a single call. reportAsyncFailure additionally ticks the aggregate RebalanceFailureCounter.
- setClusterStatusMonitor(ClusterStatusMonitor) installs the monitor and wires the per-HardConstraint reporter onto the algorithm via installHardConstraintFailureReporter. updateRebalancePreference re-installs the reporter on the new algorithm instance.
- computeNewIdealStates now wraps validateInput in a try-catch that increments the aggregate + per-category counters before rethrowing, so INVALID_INPUT / INVALID_RESOURCE_CONFIG failures are no longer silent at the cluster level. A usedFallback boolean tracks whether the catch block returned the last-known-good fallback; setWagedFallbackInUseGauge is called at the end of the method to reflect the outcome.
- Per-instance algorithm refactor: drops the static DEFAULT_REBALANCE_ALGORITHM singleton in favor of ConstraintBasedAlgorithmFactory.getInstance(DEFAULT_GLOBAL_REBALANCE_PREFERENCE) in the constructor. The previous singleton was shared across all WagedRebalancer instances in the JVM, which would have caused the per-cluster failure reporter to race. The shared ForkJoinPool inside the factory is still reused; the only cost is one extra ConstraintBasedAlgorithm allocation at controller startup.
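The usedFallback bookkeeping described above follows a common pattern: track the outcome in a local flag and publish it once at the end of the method. A minimal sketch, assuming a hypothetical `FallbackGaugeSketch` class in which the assignment is a plain `String` and every `RuntimeException` is treated as a non-propagating failure (the real method distinguishes failure types):

```java
import java.util.function.Supplier;

// Illustrative sketch only; not the Helix implementation.
class FallbackGaugeSketch {
  // Written by the compute path, read by a metrics scraper thread.
  private volatile boolean fallbackInUse;
  private String lastKnownGood;

  FallbackGaugeSketch(String initialAssignment) {
    this.lastKnownGood = initialAssignment;
  }

  // Tries a fresh calculation; on failure serves the last known good result.
  String compute(Supplier<String> freshCalculation) {
    boolean usedFallback = false;
    String result;
    try {
      result = freshCalculation.get();
      lastKnownGood = result;
    } catch (RuntimeException e) {
      // Simplification: every failure is treated as non-propagating here.
      result = lastKnownGood;
      usedFallback = true;
    }
    // Single write at the end so the gauge reflects this run's outcome.
    fallbackInUse = usedFallback;
    return result;
  }

  // Gauge value as a scraper would see it: 1 while serving stale data.
  long getFallbackInUseGauge() {
    return fallbackInUse ? 1L : 0L;
  }

  public static void main(String[] args) {
    FallbackGaugeSketch rebalancer = new FallbackGaugeSketch("assignment-v1");
    rebalancer.compute(() -> { throw new IllegalStateException("calc failed"); });
    System.out.println(rebalancer.getFallbackInUseGauge());
  }
}
```

Writing the gauge unconditionally on every run is what lets it return to 0 on the first successful calculation without any separate "clear" path.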
Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) -- constructor signatures change from CountMetric rebalanceFailureCount to Consumer<HelixRebalanceException> asyncFailureReporter. The injected Consumer is WagedRebalancer::reportAsyncFailure and handles all three counter increments uniformly. Removes the duplicate cluster-monitor plumbing that #187's intermediate design had on each runner.
GenericHelixController.createRebalancer() -- after constructing the WagedRebalancer, calls setClusterStatusMonitor(_clusterStatusMonitor) so the rebalancer can attribute failures to the cluster's monitor.
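The async-runner change (a Consumer injected in place of a raw counter) can be sketched in isolation. Everything below is a hypothetical stand-in: `RebalanceFailure` plays the role of HelixRebalanceException, `Runner` the role of the async runners, and the three `AtomicLong` fields the aggregate, Rebalancer-domain, and ClusterStatus-domain counters.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Illustrative sketch only; not the Helix implementation.
class AsyncReporterSketch {
  // Hypothetical stand-in for HelixRebalanceException.
  static class RebalanceFailure extends Exception {
    RebalanceFailure(String message) {
      super(message);
    }
  }

  // The runner knows nothing about counters; it only forwards the exception.
  static class Runner {
    private final Consumer<RebalanceFailure> asyncFailureReporter;

    Runner(Consumer<RebalanceFailure> asyncFailureReporter) {
      this.asyncFailureReporter = asyncFailureReporter;
    }

    void run(boolean shouldFail) {
      try {
        if (shouldFail) {
          throw new RebalanceFailure("simulated async failure");
        }
      } catch (RebalanceFailure e) {
        asyncFailureReporter.accept(e);
      }
    }
  }

  // The rebalancer owns all counters and exposes one reporting method.
  static class Rebalancer {
    final AtomicLong aggregateFailures = new AtomicLong();
    final AtomicLong rebalancerDomainFailures = new AtomicLong();
    final AtomicLong clusterStatusDomainFailures = new AtomicLong();
    final Runner globalRunner = new Runner(this::reportAsyncFailure);
    final Runner partialRunner = new Runner(this::reportAsyncFailure);

    void reportAsyncFailure(RebalanceFailure e) {
      aggregateFailures.incrementAndGet();
      rebalancerDomainFailures.incrementAndGet();
      clusterStatusDomainFailures.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    Rebalancer rebalancer = new Rebalancer();
    rebalancer.globalRunner.run(true);
    System.out.println(rebalancer.aggregateFailures.get());
  }
}
```

The design point is that adding or removing a counter only touches `reportAsyncFailure`; the runner constructors stay unchanged.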

Tests

The following tests are written for this issue:

Added to TestClusterStatusMonitor (7 new tests):

testWagedFailureCategoryCountersStartAtZero
testWagedFailureCategoryReportRoutesToCorrectCountersAndRollup
testReportWagedFailureByCategoryHandlesNullAsUnknown
testWagedFallbackInUseGaugeReflectsLatestSetter
testWagedHardConstraintCountersStartAtZero
testReportWagedHardConstraintFailureIncrementsCorrectCounter
testReportWagedHardConstraintFailureHandlesNullAsUnknown

Added to TestConstraintBasedAlgorithm (2 new tests):

testHardConstraintFailureReporterFiresOncePerPartitionAndConstraintType -- exercises a real NodeCapacityConstraint forced to reject every candidate, asserts the installed reporter fires exactly once per failed partition with the NODE_CAPACITY type (set-union semantics).
testHardConstraintFailureReporterIsNoOpWhenUnset -- confirms the algorithm fails gracefully when no reporter is installed (no NPE on a null reporter).

Updated assertions:

TestWagedRebalancer.testAlgorithmException -- now also asserts the Rebalancer-domain FailureCategoryUnknownCounter ticks alongside the existing aggregate counter assertion. Locks in that the new per-category counters are wired (without this assertion, a regression that registers but does not increment them would go undetected).
TestWagedRebalancerMetrics.testMetricValuePropagation -- cast hardened from (long) metric.getLastEmittedMetricValue() to ((Number) ...).longValue(). The new metrics changed HashMap iteration order and surfaced a latent bug where a Double-typed metric (BaselineDivergenceGauge) could now appear before the first Long counter and trigger ClassCastException.
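The cast issue is easy to reproduce outside Helix. In the sketch below (the class `MetricCastSketch` and its helper names are hypothetical), a metric value held as `Object` is read both ways: the direct `(long)` cast first casts to `Long` and therefore fails for a `Double`, while going through `Number` works for either boxed type.

```java
// Illustrative sketch only; not the Helix test code.
class MetricCastSketch {
  // Hardened form: works for Long, Double, Integer, and any other Number.
  static long toLong(Object metricValue) {
    return ((Number) metricValue).longValue();
  }

  // Original form: (long) on an Object is a cast to Long followed by unboxing,
  // so it throws ClassCastException when the runtime type is Double.
  static boolean directCastFails(Object metricValue) {
    try {
      long ignored = (long) metricValue;
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Object gauge = Double.valueOf(3.0); // e.g. a Double-typed gauge
    Object counter = Long.valueOf(7L);  // e.g. a Long-typed counter
    System.out.println(toLong(gauge) + " " + toLong(counter));
    System.out.println(directCastFails(gauge) + " " + directCastFails(counter));
  }
}
```

Note that `longValue()` truncates any fractional part, which is acceptable for an assertion that only checks propagation but would lose precision for real gauge reads.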
The following is the result of the mvn test command on the appropriate module:

| Suite | Tests | Result |
| --- | --- | --- |
| WAGED unit + constraint package | 65 | passed |
| Controller stages (BestPossibleStateCalcStage, RebalancePipeline) | 13 | passed |
| `TestClusterStatusMonitor` (now includes 7 new tests) | 14 | passed |
| `TestHelixRebalanceException` | 6 | passed |
| `TestWagedRebalance` (real-ZK integration) | 12 | passed |
| `TestAbnormalStatesResolver` (integration) | 2 | passed |
| Total | 112 | all passed |

Sample integration log line confirming end-to-end attribution through both the async runner and the new metric layer:

HelixRebalanceException: Failed to calculate for the new Baseline.
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT
Caused by: HelixRebalanceException: The cluster 'CLUSTER_TestWagedRebalance' does not have
  enough key capacity for all partitions. Total capacity: 9, Required: 360, Deficit: 351
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT

Changes that Break Backward Compatibility (Optional)

My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

No breaking changes to public API. Specifically:

ClusterStatusMonitorMBean only gains new methods; nothing is renamed or removed. Existing JMX scrapers see the original attributes plus 18 new ones.
WagedRebalancerMetricCollector only gains new entries in its WagedRebalancerMetricNames enum (which is already public). New metrics are additive.
HardConstraint was widened from package-private to public. The class is still abstract and all 6 concrete subclasses keep their original visibility (package-private), so no new operational surface is exposed beyond the type identifier needed by the metric layer.
ConstraintBasedAlgorithm was widened from package-private to public for the same reason -- so WagedRebalancer can install the failure reporter via an instanceof check. The class is not intended for direct external instantiation; ConstraintBasedAlgorithmFactory remains the entry point.
WagedRebalancer.setClusterStatusMonitor(ClusterStatusMonitor) is additive. ReadOnlyWagedRebalancer and any external caller that does not call it continue to operate -- all metric-publication code paths null-check the reference.

Behavior guarantee on rebalancing logic:
Rebalancing decisions are bitwise identical to pre-PR. The per-instance ConstraintBasedAlgorithm instance is a fresh allocation of the same code, sharing the same ForkJoinPool inside the factory. The FAILURE_TYPES_TO_PROPAGATE decision in WagedRebalancer.computeNewIdealStates is unchanged; the fallback path runs in the same scenarios as before; the algorithm computation is untouched.

Behavior changes that are not API breaks but worth noting:

validateInput failures now tick RebalanceFailureCounter and the matching per-category counter (previously they were not counted at the cluster level). The counter is monotonic so the step-up is observable but does not regress any contract.
The WagedRebalancer constructor allocates a fresh ConstraintBasedAlgorithm instead of sharing the previously static DEFAULT_REBALANCE_ALGORITHM field. This is a one-time cost at controller startup; the shared ForkJoinPool inside the factory is still reused, so steady-state behavior is identical.

Documentation (Optional)

In case of new functionality, my PR adds documentation in the following wiki page:

N/A. New MBean attributes are self-documenting via Javadoc on ClusterStatusMonitorMBean and the WagedRebalancerMetricCollector.WagedRebalancerMetricNames enum.

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines.

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

Code style follows the surrounding conventions in each touched file (existing import order, indentation, brace placement, Javadoc style).

Performance impact

Hot path (ConstraintBasedAlgorithm.calculate() scoring loop): untouched. The new code only runs in catch blocks or the existing "no eligible candidate" branch.
Success path overhead per pipeline event: ~10-20 nanoseconds. The added work is one volatile read of _clusterStatusMonitor, one null check, and one volatile boolean write to flip the fallback gauge to false. Zero new allocations on the success path.
Per-HardConstraint reporter overhead: runs only inside the existing if (candidateNodes.isEmpty()) branch -- i.e., only when a partition has already failed to find any eligible node. The work is a stream().distinct().forEach() over the constraints that contributed (typically < 10 elements). Zero overhead on the success path; sub-microsecond on the failure path.
Per-WagedRebalancer algorithm instance: one extra ConstraintBasedAlgorithm allocation at controller startup vs. sharing the static singleton. The shared ForkJoinPool is still reused. Negligible.
Memory: ~400 bytes added per ClusterStatusMonitor instance (one per cluster) for the per-category and per-HardConstraint counter maps plus rollup atomics.
JMX surface: 18 new attributes on ClusterStatusMonitor (8 per-category + 2 rollup + 1 gauge + 7 per-HardConstraint), 15 new attributes on WagedRebalancerMetricCollector (8 per-category + 7 per-HardConstraint). Each getter is AtomicLong.get() or Map.get(...).get() -- microsecond-scale and only polled by external scrapers at typical 30-60s intervals.

…evel MBean attributes

Builds on #187. PR-A landed the FailureCategory enum, HardConstraint.Type, throw-site updates, log enrichment, and async-runner Category preservation. This PR exposes those dimensions as JMX MBean attributes on both WagedRebalancerMetricCollector (Rebalancer domain) and ClusterStatusMonitor (ClusterStatus domain) so cluster-level dashboards and alerts can route on customer-vs-Helix ownership without scraping controller logs.

Changes:
- WagedRebalancerMetricCollector adds 15 new CountMetric entries: 8 per-FailureCategory counters and 7 per-HardConstraint.Type counters. Each WAGED failure increments exactly one per-category counter; each partition that fails to find any eligible node increments each distinct hard-constraint counter that contributed (set-union per partition, not per node-rejection).
- ClusterStatusMonitor + MBean interface gain 18 attributes mirroring the same dimensions: 8 per-category counters, 2 rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter), 7 per-HardConstraint counters, and a WagedFallbackInUseGauge that flips to 1 when WAGED returns the last-known-good assignment instead of a freshly computed one. reset() zeros all new fields on leadership change so a new leader does not inherit stale counts.
- HardConstraint and ConstraintBasedAlgorithm widened from package-private to public so the metric layer (in monitoring.mbeans) can reference HardConstraint.Type and install a failure reporter on ConstraintBasedAlgorithm via instanceof. Concrete HardConstraint subclasses stay package-private; no new operational surface beyond the type identifier.
- ConstraintBasedAlgorithm exposes setHardConstraintFailureReporter(Consumer<HardConstraint.Type>) and invokes it inside the existing "no eligible candidate" branch, exactly once per distinct constraint type that contributed to the partition's failure.
- WagedRebalancer pre-resolves EnumMap<FailureCategory, CountMetric> and EnumMap<HardConstraint.Type, CountMetric> at construction. New reportFailureCategory / reportHardConstraintFailure / reportAsyncFailure methods tick both the Rebalancer-domain and ClusterStatus-domain counters via a single call. validateInput is now wrapped in a try-catch so config-validation failures also tick the per-category counters (previously silent at the cluster level). A usedFallback boolean tracks whether the catch block returned the last-known-good fallback; the WagedFallbackInUseGauge is updated at the end of computeNewIdealStates to reflect either outcome.
- WagedRebalancer drops the static DEFAULT_REBALANCE_ALGORITHM singleton in favor of a per-instance algorithm built via ConstraintBasedAlgorithmFactory. This prevents the per-cluster failure reporter from racing across WagedRebalancer instances in the same JVM. The shared ForkJoinPool inside the factory is still reused, so the per-instance cost is one extra allocation at controller startup.
- Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) constructor signatures change from CountMetric rebalanceFailureCount to Consumer<HelixRebalanceException> asyncFailureReporter. The Consumer is this::reportAsyncFailure on WagedRebalancer and handles the aggregate counter, per-category Rebalancer-domain counter, and ClusterStatus-domain mirror with a single call. The previous duplicated cluster-monitor plumbing in each runner is removed.
- GenericHelixController.createRebalancer() now calls wagedRebalancer.setClusterStatusMonitor(_clusterStatusMonitor) after construction, wiring the cluster monitor into the rebalancer.

Tests:
- TestClusterStatusMonitor gains 7 new tests: per-category counters start at zero, routing increments the correct counter and rollup, null category is treated as UNKNOWN, fallback gauge reflects the latest setter, per-HardConstraint counters start at zero, routing works, and null Type is treated as UNKNOWN.
- TestConstraintBasedAlgorithm gains 2 new tests: end-to-end reporter wiring with a real NodeCapacityConstraint forced to reject every candidate (asserts partition-level set-union semantics: NODE_CAPACITY fires exactly once per failed partition), and a no-op test verifying the algorithm fails gracefully when the reporter is unset.
- TestWagedRebalancer.testAlgorithmException now also asserts the Rebalancer-domain FailureCategoryUnknownCounter ticks, locking in the wiring fix.
- TestWagedRebalancerMetrics.testMetricValuePropagation cast hardened from (long) to ((Number) ...).longValue(); the new metrics changed HashMap iteration order and surfaced a latent bug where a Double metric would now appear before the first Long counter.

Behavior guarantees:
- Rebalancing logic is bitwise identical to pre-PR. Every Type value at every throw site is preserved (PR-A); FAILURE_TYPES_TO_PROPAGATE decision in WagedRebalancer.computeNewIdealStates is unchanged; the fallback path runs in the same scenarios; algorithm computation is untouched. The per-instance algorithm change is a fresh instance of the same code, sharing the same ForkJoinPool.
- HelixRebalanceException, StatefulRebalancer, and existing JMX MBeans are all backward-compatible. New MBean attributes are additive; nothing is renamed or removed.
- No new MBean attributes appear until a rebalance failure occurs; pre-populated counter maps return 0 from the start so dashboards see stable 0 instead of NPE for unused dimensions.

Verified: 92 tests pass across WAGED unit, controller stages, monitoring MBeans, and the real-ZK integration suite (TestWagedRebalance, TestAbnormalStatesResolver).

LZD-PratyushBhatt requested review from PranaviAncha, arkmish, kabragaurav, laxman-ch, ngngwr, proud-parselmouth and thestreak101 as code owners May 27, 2026 05:44

LZD-PratyushBhatt changed the title ~~Expose WAGED FailureCategory and HardConstraint counters as cluster-l…~~ Expose WAGED FailureCategory and HardConstraint counters as cluster-level MBean attributes May 27, 2026

LZD-PratyushBhatt wants to merge 1 commit into dev from lzd/wagedFailureMetrics
