
Expose WAGED FailureCategory and HardConstraint counters as cluster-level MBean attributes #188

Open
LZD-PratyushBhatt wants to merge 1 commit into dev from lzd/wagedFailureMetrics

Conversation

@LZD-PratyushBhatt
Collaborator

LZD-PratyushBhatt commented May 27, 2026

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Follow-up to #187. No separate GitHub issue.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

What: Exposes the FailureCategory and HardConstraint.Type dimensions introduced by #187 as JMX MBean attributes on both the existing Rebalancer:cluster=X,entity=WagedRebalancer MBean (via WagedRebalancerMetricCollector) and the existing ClusterStatus:cluster=X MBean (via ClusterStatusMonitor). Adds a new WagedFallbackInUseGauge that flips to 1 when WAGED returns the last-known-good assignment instead of a freshly computed one. Plumbs the cluster monitor reference through GenericHelixController -> WagedRebalancer -> the async runners.

Why: #187 added the categorization data to HelixRebalanceException and surfaced it in logs, but cluster-level dashboards still see only the dimensionless RebalanceFailureCounter. Customers cannot author alerts that route on category from the existing JMX surface; on-call cannot distinguish customer-actionable failures from Helix-team-owned failures without scraping logs. This PR closes that gap by mirroring the typed counters onto both MBean domains. The two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) are the recommended alert-routing signals.

The new WagedFallbackInUseGauge closes a separate silent-failure gap: today, when a WAGED failure with Type in the non-propagating set occurs, the rebalancer falls back to the last-known-good assignment and the controller pipeline succeeds. No cluster-level signal indicates that WAGED is serving stale data. This PR introduces a binary gauge that flips to 1 on the next pipeline run that uses the fallback, and back to 0 when a fresh calc succeeds.
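The gauge semantics described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `FallbackGaugeSketch`, the `compute` method, and the string-valued assignments are hypothetical stand-ins, and the sketch assumes every failure falls in the non-propagating set.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for the rebalancer plus monitor; only the gauge is modeled. */
final class FallbackGaugeSketch {
  private final AtomicInteger fallbackInUse = new AtomicInteger(0);
  private String lastKnownGood = "assignment-v0";

  /** Mirrors the 0/1 gauge described above. */
  int getWagedFallbackInUseGauge() {
    return fallbackInUse.get();
  }

  /** Runs one pipeline pass and falls back to the last-known-good result on failure. */
  String compute(Supplier<String> freshCalculation) {
    boolean usedFallback = false;
    String result;
    try {
      result = freshCalculation.get();
      lastKnownGood = result;
    } catch (RuntimeException e) {
      // Assumption: the failure type is in the non-propagating set, so we do not rethrow.
      result = lastKnownGood;
      usedFallback = true;
    }
    // Updated once at the end of the run so the gauge reflects this run's outcome.
    fallbackInUse.set(usedFallback ? 1 : 0);
    return result;
  }
}
```

With this shape, a run that fails flips the gauge to 1 while still returning the previous assignment, and the next successful run flips it back to 0.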

How:

  1. WagedRebalancerMetricCollector -- adds 15 new CountMetric entries to the existing WagedRebalancerMetricNames enum:

    • 8 per-FailureCategory counters: FailureCategoryCapacityDeficitCounter, FailureCategoryNoCandidateNodeCounter, FailureCategoryInvalidResourceConfigCounter, FailureCategoryInvalidClusterConfigCounter, FailureCategoryMetadataStoreIoCounter, FailureCategoryAlgorithmInternalCounter, FailureCategoryAsyncExecutionCounter, FailureCategoryUnknownCounter.
    • 7 per-HardConstraint.Type counters: HardConstraintFaultZoneFailureCounter, HardConstraintNodeCapacityFailureCounter, HardConstraintNodeMaxPartitionLimitFailureCounter, HardConstraintReplicaActivateFailureCounter, HardConstraintSamePartitionOnInstanceFailureCounter, HardConstraintValidGroupTagFailureCounter, HardConstraintUnknownFailureCounter.
  2. ClusterStatusMonitorMBean + ClusterStatusMonitor -- 18 new MBean attributes: 15 mirroring the same dimensions, plus two rollup counters and the fallback gauge:

    • Per-category and per-HardConstraint counters use Map<Enum, AtomicLong> pre-populated for every enum value in the constructor, so dashboards see a stable 0 for unused dimensions.
    • WagedCustomerActionableFailureCounter / WagedInternalFailureCounter -- rollup counters; reportWagedFailureByCategory increments both the per-category counter and the appropriate rollup based on FailureCategory.isCustomerActionable().
    • WagedFallbackInUseGauge -- 0/1 flag set by setWagedFallbackInUseGauge(boolean) from WagedRebalancer.
    • reset() extended to zero all new counters/gauge on leadership change.
  3. HardConstraint and ConstraintBasedAlgorithm -- widened from package-private to public so the metric-reporting layer in monitoring.mbeans can reference HardConstraint.Type and install a Consumer<HardConstraint.Type> reporter on ConstraintBasedAlgorithm. Concrete HardConstraint subclasses stay package-private; no new operational surface beyond the type identifier.

  4. ConstraintBasedAlgorithm -- adds setHardConstraintFailureReporter(Consumer<HardConstraint.Type>). Inside the existing "no eligible candidate" branch (the same branch that already builds the per-node-per-constraint failure map), the reporter is invoked once per distinct constraint type that contributed to the partition's failure (set union across nodes, not summed). This ensures partition-level attribution: a constraint that rejected 50 nodes for a single partition fires the counter +1, not +50.

  5. WagedRebalancer -- substantial wiring:

    • Pre-resolves EnumMap<FailureCategory, CountMetric> and EnumMap<HardConstraint.Type, CountMetric> at construction; subsequent reporting paths increment by _failureCategoryMetrics.get(category).increment(1L) instead of looking up by name on every failure.
    • New methods reportFailureCategory(HelixRebalanceException), reportHardConstraintFailure(HardConstraint.Type), reportAsyncFailure(HelixRebalanceException) -- each ticks both the Rebalancer-domain and ClusterStatus-domain counters via a single call. reportAsyncFailure additionally ticks the aggregate RebalanceFailureCounter.
    • setClusterStatusMonitor(ClusterStatusMonitor) installs the monitor and wires the per-HardConstraint reporter onto the algorithm via installHardConstraintFailureReporter. updateRebalancePreference re-installs the reporter on the new algorithm instance.
    • computeNewIdealStates now wraps validateInput in a try-catch that increments the aggregate + per-category counters before rethrowing, so INVALID_INPUT / INVALID_RESOURCE_CONFIG failures are no longer silent at the cluster level. A usedFallback boolean tracks whether the catch block returned the last-known-good fallback; setWagedFallbackInUseGauge is called at the end of the method to reflect the outcome.
    • Per-instance algorithm refactor: drops the static DEFAULT_REBALANCE_ALGORITHM singleton in favor of ConstraintBasedAlgorithmFactory.getInstance(DEFAULT_GLOBAL_REBALANCE_PREFERENCE) in the constructor. The previous singleton was shared across all WagedRebalancer instances in the JVM, which would have caused the per-cluster failure reporter to race. The shared ForkJoinPool inside the factory is still reused; the only cost is one extra ConstraintBasedAlgorithm allocation at controller startup.
  6. Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) -- constructor signatures change from CountMetric rebalanceFailureCount to Consumer<HelixRebalanceException> asyncFailureReporter. The injected Consumer is WagedRebalancer::reportAsyncFailure and handles all three counter increments uniformly. Removes the duplicate cluster-monitor plumbing that #187's intermediate design had on each runner.

  7. GenericHelixController.createRebalancer() -- after constructing the WagedRebalancer, calls setClusterStatusMonitor(_clusterStatusMonitor) so the rebalancer can attribute failures to the cluster's monitor.
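As a rough illustration of step 2, the pre-populated counter map and rollup routing might look like the sketch below. It is a simplified model under stated assumptions: `CategoryCounterSketch` is a hypothetical class, the enum is an illustrative subset of the categories, and which categories are marked customer-actionable here is an assumption rather than the mapping used in Helix.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

final class CategoryCounterSketch {
  /** Illustrative subset; the customer-actionable flags are assumptions. */
  enum FailureCategory {
    CAPACITY_DEFICIT(true),
    INVALID_RESOURCE_CONFIG(true),
    ALGORITHM_INTERNAL(false),
    UNKNOWN(false);

    private final boolean customerActionable;

    FailureCategory(boolean customerActionable) {
      this.customerActionable = customerActionable;
    }

    boolean isCustomerActionable() {
      return customerActionable;
    }
  }

  private final Map<FailureCategory, AtomicLong> perCategory =
      new EnumMap<>(FailureCategory.class);
  private final AtomicLong customerActionable = new AtomicLong();
  private final AtomicLong internal = new AtomicLong();

  CategoryCounterSketch() {
    // Pre-populate every enum value so getters return a stable 0 before any failure.
    for (FailureCategory c : FailureCategory.values()) {
      perCategory.put(c, new AtomicLong());
    }
  }

  /** Ticks the per-category counter and exactly one rollup; null is treated as UNKNOWN. */
  void reportWagedFailureByCategory(FailureCategory category) {
    FailureCategory c = category == null ? FailureCategory.UNKNOWN : category;
    perCategory.get(c).incrementAndGet();
    (c.isCustomerActionable() ? customerActionable : internal).incrementAndGet();
  }

  long get(FailureCategory c) {
    return perCategory.get(c).get();
  }

  long getCustomerActionable() {
    return customerActionable.get();
  }

  long getInternal() {
    return internal.get();
  }

  /** Zeros everything, as reset() does on leadership change. */
  void reset() {
    perCategory.values().forEach(counter -> counter.set(0));
    customerActionable.set(0);
    internal.set(0);
  }
}
```

Because the map is filled in the constructor, a getter never has to handle a missing key, which is what lets dashboards read 0 for dimensions that have not failed yet.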

Tests

  • The following tests are written for this issue:

Added to TestClusterStatusMonitor (7 new tests):

  • testWagedFailureCategoryCountersStartAtZero
  • testWagedFailureCategoryReportRoutesToCorrectCountersAndRollup
  • testReportWagedFailureByCategoryHandlesNullAsUnknown
  • testWagedFallbackInUseGaugeReflectsLatestSetter
  • testWagedHardConstraintCountersStartAtZero
  • testReportWagedHardConstraintFailureIncrementsCorrectCounter
  • testReportWagedHardConstraintFailureHandlesNullAsUnknown

Added to TestConstraintBasedAlgorithm (2 new tests):

  • testHardConstraintFailureReporterFiresOncePerPartitionAndConstraintType -- exercises a real NodeCapacityConstraint forced to reject every candidate, asserts the installed reporter fires exactly once per failed partition with the NODE_CAPACITY type (set-union semantics).
  • testHardConstraintFailureReporterIsNoOpWhenUnset -- confirms the algorithm fails gracefully when no reporter is installed (no NPE on a null reporter).
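The set-union semantics these two tests pin down can be sketched as below. This is not the ConstraintBasedAlgorithm code: `SetUnionReporterSketch`, `ConstraintType`, and the node-to-constraints map are hypothetical, and the sketch collects distinct types with an EnumSet rather than the stream().distinct() pipeline the PR describes.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

final class SetUnionReporterSketch {
  /** Illustrative subset of constraint types. */
  enum ConstraintType { NODE_CAPACITY, FAULT_ZONE, VALID_GROUP_TAG }

  /**
   * Reports one failed partition.
   *
   * @param failuresByNode node name mapped to the constraint types that rejected that node
   * @param reporter       may be null, in which case nothing is reported
   */
  static void reportPartitionFailure(
      Map<String, List<ConstraintType>> failuresByNode, Consumer<ConstraintType> reporter) {
    if (reporter == null) {
      return; // no reporter installed: stay a no-op instead of throwing an NPE
    }
    // Union across nodes: a type that rejected 50 nodes still appears once.
    Set<ConstraintType> distinct = EnumSet.noneOf(ConstraintType.class);
    for (List<ConstraintType> rejectedBy : failuresByNode.values()) {
      distinct.addAll(rejectedBy);
    }
    distinct.forEach(reporter);
  }
}
```

Fifty nodes all rejected by NODE_CAPACITY therefore produce a single NODE_CAPACITY report for that partition.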

Updated assertions:

  • TestWagedRebalancer.testAlgorithmException -- now also asserts that the Rebalancer-domain FailureCategoryUnknownCounter ticks alongside the existing aggregate counter. This locks in that the new per-category counters are wired: without the assertion, a regression that registers but does not increment them would go undetected.

  • TestWagedRebalancerMetrics.testMetricValuePropagation -- cast hardened from (long) metric.getLastEmittedMetricValue() to ((Number) ...).longValue(). The new metrics changed HashMap iteration order and surfaced a latent bug where a Double-typed metric (BaselineDivergenceGauge) could now appear before the first Long counter and trigger ClassCastException.
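The cast issue is easy to reproduce in isolation. The sketch below is a standalone illustration, assuming metric values arrive as boxed Objects; `MetricCastSketch` and `sumAsLong` are hypothetical names, not part of the test class.

```java
import java.util.Map;

final class MetricCastSketch {
  /** Sums metric values that may be boxed as either Long or Double. */
  static long sumAsLong(Map<String, Object> lastEmittedValues) {
    long total = 0;
    for (Object value : lastEmittedValues.values()) {
      // (long) value casts through Long and throws ClassCastException for a Double;
      // going through Number works for both boxed types.
      total += ((Number) value).longValue();
    }
    return total;
  }
}
```

Going through Number truncates a Double gauge toward zero, which is acceptable for a test that only checks propagation of counter values.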

  • The following is the result of the mvn test command on the appropriate module:

| Suite | Tests | Result |
|---|---|---|
| WAGED unit + constraint package | 65 | passed |
| Controller stages (BestPossibleStateCalcStage, RebalancePipeline) | 13 | passed |
| TestClusterStatusMonitor (now includes 7 new tests) | 14 | passed |
| TestHelixRebalanceException | 6 | passed |
| TestWagedRebalance (real-ZK integration) | 12 | passed |
| TestAbnormalStatesResolver (integration) | 2 | passed |
| Total | 112 | all passed |

Sample integration log line confirming end-to-end attribution through both the async runner and the new metric layer:

HelixRebalanceException: Failed to calculate for the new Baseline.
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT
Caused by: HelixRebalanceException: The cluster 'CLUSTER_TestWagedRebalance' does not have
  enough key capacity for all partitions. Total capacity: 9, Required: 360, Deficit: 351
  Failure Type: FAILED_TO_CALCULATE Category: CAPACITY_DEFICIT

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

No breaking changes to public API. Specifically:

  • ClusterStatusMonitorMBean only gains new methods; nothing is renamed or removed. Existing JMX scrapers see the original attributes plus 18 new ones.
  • WagedRebalancerMetricCollector only gains new entries in its WagedRebalancerMetricNames enum (which is already public). New metrics are additive.
  • HardConstraint was widened from package-private to public. The class is still abstract and all 6 concrete subclasses keep their original visibility (package-private), so no new operational surface is exposed beyond the type identifier needed by the metric layer.
  • ConstraintBasedAlgorithm was widened from package-private to public for the same reason -- so WagedRebalancer can install the failure reporter via an instanceof check. The class is not intended for direct external instantiation; ConstraintBasedAlgorithmFactory remains the entry point.
  • WagedRebalancer.setClusterStatusMonitor(ClusterStatusMonitor) is additive. ReadOnlyWagedRebalancer and any external caller that does not call it continue to operate -- all metric-publication code paths null-check the reference.

Behavior guarantee on rebalancing logic:
Rebalancing decisions are bitwise identical to pre-PR. The per-instance ConstraintBasedAlgorithm instance is a fresh allocation of the same code, sharing the same ForkJoinPool inside the factory. The FAILURE_TYPES_TO_PROPAGATE decision in WagedRebalancer.computeNewIdealStates is unchanged; the fallback path runs in the same scenarios as before; the algorithm computation is untouched.

Behavior changes that are not API breaks but worth noting:

  • validateInput failures now tick RebalanceFailureCounter and the matching per-category counter (previously they were not counted at the cluster level). The counter is monotonic so the step-up is observable but does not regress any contract.
  • The WagedRebalancer constructor allocates a fresh ConstraintBasedAlgorithm instead of sharing the previously static DEFAULT_REBALANCE_ALGORITHM field. This is a one-time cost at controller startup; the shared ForkJoinPool inside the factory is still reused, so steady-state behavior is identical.
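To make the motivation for the per-instance algorithm concrete, here is a deliberately simplified, single-threaded sketch of the hazard. The `Algorithm` and `Rebalancer` classes are hypothetical stand-ins, and the sketch shows the deterministic last-writer-wins form of the problem rather than an actual data race.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

final class PerInstanceAlgorithmSketch {
  /** Hypothetical algorithm that holds a single failure reporter. */
  static final class Algorithm {
    private volatile Consumer<String> failureReporter;

    void setFailureReporter(Consumer<String> reporter) {
      failureReporter = reporter;
    }

    void failPartition(String constraintType) {
      Consumer<String> reporter = failureReporter;
      if (reporter != null) {
        reporter.accept(constraintType);
      }
    }
  }

  /** Hypothetical per-cluster rebalancer that installs its own reporter on the algorithm. */
  static final class Rebalancer {
    final AtomicLong failures = new AtomicLong();
    final Algorithm algorithm;

    Rebalancer(Algorithm algorithm) {
      this.algorithm = algorithm;
      // If the algorithm is shared, this overwrites any reporter installed earlier.
      algorithm.setFailureReporter(type -> failures.incrementAndGet());
    }
  }
}
```

When two rebalancers share one algorithm, a failure computed for the first cluster is counted against whichever cluster installed its reporter last; giving each rebalancer its own algorithm keeps attribution per cluster.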

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

N/A. New MBean attributes are self-documenting via Javadoc on ClusterStatusMonitorMBean and the WagedRebalancerMetricCollector.WagedRebalancerMetricNames enum.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines.

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Code style follows the surrounding conventions in each touched file (existing import order, indentation, brace placement, Javadoc style).


Performance impact

  • Hot path (ConstraintBasedAlgorithm.calculate() scoring loop): untouched. The new code only runs in catch blocks or the existing "no eligible candidate" branch.
  • Success path overhead per pipeline event: ~10-20 nanoseconds. The added work is one volatile read of _clusterStatusMonitor, one null check, and one volatile boolean write to flip the fallback gauge to false. Zero new allocations on the success path.
  • Per-HardConstraint reporter overhead: runs only inside the existing if (candidateNodes.isEmpty()) branch -- i.e., only when a partition has already failed to find any eligible node. The work is a stream().distinct().forEach() over the constraints that contributed (typically < 10 elements). Zero overhead on the success path; sub-microsecond on the failure path.
  • Per-WagedRebalancer algorithm instance: one extra ConstraintBasedAlgorithm allocation at controller startup vs. sharing the static singleton. The shared ForkJoinPool is still reused. Negligible.
  • Memory: ~400 bytes added per ClusterStatusMonitor instance (one per cluster) for the per-category and per-HardConstraint counter maps plus rollup atomics.
  • JMX surface: 18 new attributes on ClusterStatusMonitor (8 per-category + 2 rollup + 1 gauge + 7 per-HardConstraint), 15 new attributes on WagedRebalancerMetricCollector (8 per-category + 7 per-HardConstraint). Each getter is AtomicLong.get() or Map.get(...).get() -- microsecond-scale and only polled by external scrapers at typical 30-60s intervals.

…evel MBean attributes

Builds on #187 (PR-A), which landed the FailureCategory enum, HardConstraint.Type,
throw-site updates, log enrichment, and async-runner Category preservation.
This PR exposes those dimensions as JMX MBean attributes on both
WagedRebalancerMetricCollector (Rebalancer domain) and ClusterStatusMonitor
(ClusterStatus domain) so cluster-level dashboards and alerts can route
on customer-vs-Helix ownership without scraping controller logs.

Changes:
- WagedRebalancerMetricCollector adds 15 new CountMetric entries: 8
  per-FailureCategory counters and 7 per-HardConstraint.Type counters.
  Each WAGED failure increments exactly one per-category counter; each
  partition that fails to find any eligible node increments each
  distinct hard-constraint counter that contributed (set-union per
  partition, not per node-rejection).
- ClusterStatusMonitor + MBean interface gain 18 attributes mirroring
  the same dimensions: 8 per-category counters, 2 rollup counters
  (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter),
  7 per-HardConstraint counters, and a WagedFallbackInUseGauge that
  flips to 1 when WAGED returns the last-known-good assignment instead
  of a freshly computed one. reset() zeros all new fields on
  leadership change so a new leader does not inherit stale counts.
- HardConstraint and ConstraintBasedAlgorithm widened from
  package-private to public so the metric layer (in monitoring.mbeans)
  can reference HardConstraint.Type and install a failure reporter on
  ConstraintBasedAlgorithm via instanceof. Concrete HardConstraint
  subclasses stay package-private; no new operational surface beyond
  the type identifier.
- ConstraintBasedAlgorithm exposes
  setHardConstraintFailureReporter(Consumer<HardConstraint.Type>) and
  invokes it inside the existing "no eligible candidate" branch,
  exactly once per distinct constraint type that contributed to the
  partition's failure.
- WagedRebalancer pre-resolves EnumMap<FailureCategory, CountMetric>
  and EnumMap<HardConstraint.Type, CountMetric> at construction. New
  reportFailureCategory / reportHardConstraintFailure /
  reportAsyncFailure methods tick both the Rebalancer-domain and
  ClusterStatus-domain counters via a single call. validateInput is
  now wrapped in a try-catch so config-validation failures also tick
  the per-category counters (previously silent at the cluster level).
  A usedFallback boolean tracks whether the catch block returned the
  last-known-good fallback; the WagedFallbackInUseGauge is updated at
  the end of computeNewIdealStates to reflect either outcome.
- WagedRebalancer drops the static DEFAULT_REBALANCE_ALGORITHM
  singleton in favor of a per-instance algorithm built via
  ConstraintBasedAlgorithmFactory. This prevents the per-cluster
  failure reporter from racing across WagedRebalancer instances in the
  same JVM. The shared ForkJoinPool inside the factory is still
  reused, so the per-instance cost is one extra allocation at
  controller startup.
- Async runners (GlobalRebalanceRunner, PartialRebalanceRunner)
  constructor signatures change from CountMetric rebalanceFailureCount
  to Consumer<HelixRebalanceException> asyncFailureReporter. The
  Consumer is this::reportAsyncFailure on WagedRebalancer and handles
  the aggregate counter, per-category Rebalancer-domain counter, and
  ClusterStatus-domain mirror with a single call. The previous
  duplicated cluster-monitor plumbing in each runner is removed.
- GenericHelixController.createRebalancer() now calls
  wagedRebalancer.setClusterStatusMonitor(_clusterStatusMonitor) after
  construction, wiring the cluster monitor into the rebalancer.

Tests:
- TestClusterStatusMonitor gains 7 new tests: per-category counters
  start at zero, routing increments the correct counter and rollup,
  null category is treated as UNKNOWN, fallback gauge reflects the
  latest setter, per-HardConstraint counters start at zero, routing
  works, and null Type is treated as UNKNOWN.
- TestConstraintBasedAlgorithm gains 2 new tests: end-to-end reporter
  wiring with a real NodeCapacityConstraint forced to reject every
  candidate (asserts partition-level set-union semantics: NODE_CAPACITY
  fires exactly once per failed partition), and a no-op test verifying
  the algorithm fails gracefully when the reporter is unset.
- TestWagedRebalancer.testAlgorithmException now also asserts the
  Rebalancer-domain FailureCategoryUnknownCounter ticks, locking in
  the wiring fix.
- TestWagedRebalancerMetrics.testMetricValuePropagation cast hardened
  from (long) to ((Number) ...).longValue(); the new metrics changed
  HashMap iteration order and surfaced a latent bug where a Double
  metric would now appear before the first Long counter.

Behavior guarantees:
- Rebalancing logic is bitwise identical to pre-PR. Every Type value
  at every throw site is preserved (PR-A); FAILURE_TYPES_TO_PROPAGATE
  decision in WagedRebalancer.computeNewIdealStates is unchanged; the
  fallback path runs in the same scenarios; algorithm computation is
  untouched. The per-instance algorithm change is a fresh instance of
  the same code, sharing the same ForkJoinPool.
- HelixRebalanceException, StatefulRebalancer, and existing JMX
  MBeans are all backward-compatible. New MBean attributes are
  additive; nothing is renamed or removed.
- New MBean attributes do not wait for a rebalance failure to appear;
  pre-populated counter maps return 0 from the start so dashboards
  see a stable 0 instead of an NPE for unused dimensions.

Verified: 92 tests pass across WAGED unit, controller stages,
monitoring MBeans, and the real-ZK integration suite
(TestWagedRebalance, TestAbnormalStatesResolver).
LZD-PratyushBhatt changed the title from "Expose WAGED FailureCategory and HardConstraint counters as cluster-l…" to "Expose WAGED FailureCategory and HardConstraint counters as cluster-level MBean attributes" on May 27, 2026