Expose WAGED FailureCategory and HardConstraint counters as cluster-level MBean attributes #188
Open
LZD-PratyushBhatt wants to merge 1 commit into
…evel MBean attributes

Builds on #187. PR-A landed the FailureCategory enum, HardConstraint.Type, throw-site updates, log enrichment, and async-runner Category preservation. This PR exposes those dimensions as JMX MBean attributes on both WagedRebalancerMetricCollector (Rebalancer domain) and ClusterStatusMonitor (ClusterStatus domain) so cluster-level dashboards and alerts can route on customer-vs-Helix ownership without scraping controller logs.

Changes:
- WagedRebalancerMetricCollector adds 15 new CountMetric entries: 8 per-FailureCategory counters and 7 per-HardConstraint.Type counters. Each WAGED failure increments exactly one per-category counter; each partition that fails to find any eligible node increments each distinct hard-constraint counter that contributed (set-union per partition, not per node-rejection).
- ClusterStatusMonitor + MBean interface gain 18 attributes mirroring the same dimensions: 8 per-category counters, 2 rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter), 7 per-HardConstraint counters, and a WagedFallbackInUseGauge that flips to 1 when WAGED returns the last-known-good assignment instead of a freshly computed one. reset() zeros all new fields on leadership change so a new leader does not inherit stale counts.
- HardConstraint and ConstraintBasedAlgorithm widened from package-private to public so the metric layer (in monitoring.mbeans) can reference HardConstraint.Type and install a failure reporter on ConstraintBasedAlgorithm via instanceof. Concrete HardConstraint subclasses stay package-private; no new operational surface beyond the type identifier.
- ConstraintBasedAlgorithm exposes setHardConstraintFailureReporter(Consumer<HardConstraint.Type>) and invokes it inside the existing "no eligible candidate" branch, exactly once per distinct constraint type that contributed to the partition's failure.
- WagedRebalancer pre-resolves EnumMap<FailureCategory, CountMetric> and EnumMap<HardConstraint.Type, CountMetric> at construction. New reportFailureCategory / reportHardConstraintFailure / reportAsyncFailure methods tick both the Rebalancer-domain and ClusterStatus-domain counters via a single call. validateInput is now wrapped in a try-catch so config-validation failures also tick the per-category counters (previously silent at the cluster level). A usedFallback boolean tracks whether the catch block returned the last-known-good fallback; the WagedFallbackInUseGauge is updated at the end of computeNewIdealStates to reflect either outcome.
- WagedRebalancer drops the static DEFAULT_REBALANCE_ALGORITHM singleton in favor of a per-instance algorithm built via ConstraintBasedAlgorithmFactory. This prevents the per-cluster failure reporter from racing across WagedRebalancer instances in the same JVM. The shared ForkJoinPool inside the factory is still reused, so the per-instance cost is one extra allocation at controller startup.
- Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) constructor signatures change from CountMetric rebalanceFailureCount to Consumer<HelixRebalanceException> asyncFailureReporter. The Consumer is this::reportAsyncFailure on WagedRebalancer and handles the aggregate counter, per-category Rebalancer-domain counter, and ClusterStatus-domain mirror with a single call. The previous duplicated cluster-monitor plumbing in each runner is removed.
- GenericHelixController.createRebalancer() now calls wagedRebalancer.setClusterStatusMonitor(_clusterStatusMonitor) after construction, wiring the cluster monitor into the rebalancer.

Tests:
- TestClusterStatusMonitor gains 7 new tests: per-category counters start at zero, routing increments the correct counter and rollup, null category is treated as UNKNOWN, fallback gauge reflects the latest setter, per-HardConstraint counters start at zero, routing works, and null Type is treated as UNKNOWN.
- TestConstraintBasedAlgorithm gains 2 new tests: end-to-end reporter wiring with a real NodeCapacityConstraint forced to reject every candidate (asserts partition-level set-union semantics: NODE_CAPACITY fires exactly once per failed partition), and a no-op test verifying the algorithm fails gracefully when the reporter is unset.
- TestWagedRebalancer.testAlgorithmException now also asserts the Rebalancer-domain FailureCategoryUnknownCounter ticks, locking in the wiring fix.
- TestWagedRebalancerMetrics.testMetricValuePropagation cast hardened from (long) to ((Number) ...).longValue(); the new metrics changed HashMap iteration order and surfaced a latent bug where a Double metric would now appear before the first Long counter.

Behavior guarantees:
- Rebalancing logic is bitwise identical to pre-PR. Every Type value at every throw site is preserved (PR-A); FAILURE_TYPES_TO_PROPAGATE decision in WagedRebalancer.computeNewIdealStates is unchanged; the fallback path runs in the same scenarios; algorithm computation is untouched. The per-instance algorithm change is a fresh instance of the same code, sharing the same ForkJoinPool.
- HelixRebalanceException, StatefulRebalancer, and existing JMX MBeans are all backward-compatible. New MBean attributes are additive; nothing is renamed or removed.
- No new MBean attributes appear until a rebalance failure occurs; pre-populated counter maps return 0 from the start so dashboards see stable 0 instead of NPE for unused dimensions.

Verified: 92 tests pass across WAGED unit, controller stages, monitoring MBeans, and the real-ZK integration suite (TestWagedRebalance, TestAbnormalStatesResolver).
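The pre-populated counter maps and rollup routing described in this commit message can be sketched roughly as follows. This is a minimal stand-in rather than the Helix implementation: the class name FailureCounterSketch, the reduced Category enum, and the choice of which categories count as customer-actionable are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the monitor's counter bookkeeping; not the Helix class. */
final class FailureCounterSketch {

  /** Reduced stand-in enum; the customer-actionable flags are assumed, not from the PR. */
  enum Category {
    CAPACITY_DEFICIT(true),
    INVALID_RESOURCE_CONFIG(true),
    ALGORITHM_INTERNAL(false),
    UNKNOWN(false);

    private final boolean customerActionable;

    Category(boolean customerActionable) {
      this.customerActionable = customerActionable;
    }

    boolean isCustomerActionable() {
      return customerActionable;
    }
  }

  private final Map<Category, AtomicLong> perCategory = new EnumMap<>(Category.class);
  private final AtomicLong customerActionable = new AtomicLong();
  private final AtomicLong internal = new AtomicLong();

  FailureCounterSketch() {
    // Pre-populate every enum value so a getter never sees a missing key.
    for (Category c : Category.values()) {
      perCategory.put(c, new AtomicLong());
    }
  }

  /** Ticks exactly one per-category counter and exactly one rollup counter. */
  void report(Category category) {
    // A null category is routed to UNKNOWN instead of throwing.
    Category effective = (category == null) ? Category.UNKNOWN : category;
    perCategory.get(effective).incrementAndGet();
    (effective.isCustomerActionable() ? customerActionable : internal).incrementAndGet();
  }

  long count(Category category) {
    return perCategory.get(category).get();
  }

  long customerActionableCount() {
    return customerActionable.get();
  }

  long internalCount() {
    return internal.get();
  }

  /** Zeroes every counter, as a reset on leadership change would. */
  void reset() {
    perCategory.values().forEach(counter -> counter.set(0L));
    customerActionable.set(0L);
    internal.set(0L);
  }

  public static void main(String[] args) {
    FailureCounterSketch counters = new FailureCounterSketch();
    counters.report(Category.CAPACITY_DEFICIT);
    counters.report(null);
    System.out.println("capacityDeficit=" + counters.count(Category.CAPACITY_DEFICIT));
    System.out.println("unknown=" + counters.count(Category.UNKNOWN));
    System.out.println("customerActionable=" + counters.customerActionableCount());
    System.out.println("internal=" + counters.internalCount());
  }
}
```

Because the map is fully populated in the constructor and never structurally modified afterwards, readers only ever touch the AtomicLong values, which is one plausible way to get the "stable 0" behavior the commit message describes.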
Issues
Follow-up to #187. No separate GitHub issue.
Description
What: Exposes the FailureCategory and HardConstraint.Type dimensions introduced by #187 as JMX MBean attributes on both the existing Rebalancer:cluster=X,entity=WagedRebalancer MBean (via WagedRebalancerMetricCollector) and the existing ClusterStatus:cluster=X MBean (via ClusterStatusMonitor). Adds a new WagedFallbackInUseGauge that flips to 1 when WAGED returns the last-known-good assignment instead of a freshly computed one. Plumbs the cluster monitor reference through GenericHelixController -> WagedRebalancer -> the async runners.

Why: #187 added the categorization data to HelixRebalanceException and surfaced it in logs, but cluster-level dashboards still see only the dimensionless RebalanceFailureCounter. Customers cannot author alerts that route on category from the existing JMX surface; on-call cannot distinguish customer-actionable failures from Helix-team-owned failures without scraping logs. This PR closes that gap by mirroring the typed counters onto both MBean domains. The two rollup counters (WagedCustomerActionableFailureCounter, WagedInternalFailureCounter) are the recommended alert-routing signals.

The new WagedFallbackInUseGauge closes a separate silent-failure gap: today, when a WAGED failure with Type in the non-propagating set occurs, the rebalancer falls back to the last-known-good assignment and the controller pipeline succeeds. No cluster-level signal indicates that WAGED is serving stale data. This PR introduces a binary gauge that flips to 1 on the next pipeline run that uses the fallback, and back to 0 when a fresh calc succeeds.

How:
- WagedRebalancerMetricCollector -- adds 15 new CountMetric entries to the existing WagedRebalancerMetricNames enum:
  - FailureCategory counters: FailureCategoryCapacityDeficitCounter, FailureCategoryNoCandidateNodeCounter, FailureCategoryInvalidResourceConfigCounter, FailureCategoryInvalidClusterConfigCounter, FailureCategoryMetadataStoreIoCounter, FailureCategoryAlgorithmInternalCounter, FailureCategoryAsyncExecutionCounter, FailureCategoryUnknownCounter.
  - HardConstraint.Type counters: HardConstraintFaultZoneFailureCounter, HardConstraintNodeCapacityFailureCounter, HardConstraintNodeMaxPartitionLimitFailureCounter, HardConstraintReplicaActivateFailureCounter, HardConstraintSamePartitionOnInstanceFailureCounter, HardConstraintValidGroupTagFailureCounter, HardConstraintUnknownFailureCounter.
- ClusterStatusMonitorMBean + ClusterStatusMonitor -- 18 new MBean attributes mirroring the same dimensions, including two rollup counters and the fallback gauge:
  - Per-category and per-HardConstraint counters are held in Map<Enum, AtomicLong> pre-populated for every enum value in the constructor, so dashboards see a stable 0 for unused dimensions.
  - WagedCustomerActionableFailureCounter / WagedInternalFailureCounter -- rollup counters; reportWagedFailureByCategory increments both the per-category counter and the appropriate rollup based on FailureCategory.isCustomerActionable().
  - WagedFallbackInUseGauge -- 0/1 flag set by setWagedFallbackInUseGauge(boolean) from WagedRebalancer.
  - reset() extended to zero all new counters/gauge on leadership change.
- HardConstraint and ConstraintBasedAlgorithm -- widened from package-private to public so the metric-reporting layer in monitoring.mbeans can reference HardConstraint.Type and install a Consumer<HardConstraint.Type> reporter on ConstraintBasedAlgorithm. Concrete HardConstraint subclasses stay package-private; no new operational surface beyond the type identifier.
- ConstraintBasedAlgorithm -- adds setHardConstraintFailureReporter(Consumer<HardConstraint.Type>). Inside the existing "no eligible candidate" branch (the same branch that already builds the per-node-per-constraint failure map), the reporter is invoked once per distinct constraint type that contributed to the partition's failure (set union across nodes, not summed). This ensures partition-level attribution: a constraint that rejected 50 nodes for a single partition fires the counter +1, not +50.
- WagedRebalancer -- substantial wiring:
  - Pre-resolves EnumMap<FailureCategory, CountMetric> and EnumMap<HardConstraint.Type, CountMetric> at construction; subsequent reporting paths increment by _failureCategoryMetrics.get(category).increment(1L) instead of looking up by name on every failure.
  - reportFailureCategory(HelixRebalanceException), reportHardConstraintFailure(HardConstraint.Type), reportAsyncFailure(HelixRebalanceException) -- each ticks both the Rebalancer-domain and ClusterStatus-domain counters via a single call. reportAsyncFailure additionally ticks the aggregate RebalanceFailureCounter.
  - setClusterStatusMonitor(ClusterStatusMonitor) installs the monitor and wires the per-HardConstraint reporter onto the algorithm via installHardConstraintFailureReporter. updateRebalancePreference re-installs the reporter on the new algorithm instance.
  - computeNewIdealStates now wraps validateInput in a try-catch that increments the aggregate + per-category counters before rethrowing, so INVALID_INPUT / INVALID_RESOURCE_CONFIG failures are no longer silent at the cluster level. A usedFallback boolean tracks whether the catch block returned the last-known-good fallback; setWagedFallbackInUseGauge is called at the end of the method to reflect the outcome.
  - Drops the static DEFAULT_REBALANCE_ALGORITHM singleton in favor of ConstraintBasedAlgorithmFactory.getInstance(DEFAULT_GLOBAL_REBALANCE_PREFERENCE) in the constructor. The previous singleton was shared across all WagedRebalancer instances in the JVM, which would have caused the per-cluster failure reporter to race. The shared ForkJoinPool inside the factory is still reused; the only cost is one extra ConstraintBasedAlgorithm allocation at controller startup.
- Async runners (GlobalRebalanceRunner, PartialRebalanceRunner) -- constructor signatures change from CountMetric rebalanceFailureCount to Consumer<HelixRebalanceException> asyncFailureReporter. The injected Consumer is WagedRebalancer::reportAsyncFailure and handles all three counter increments uniformly. Removes the duplicate cluster-monitor plumbing that the intermediate design in #187 (Classify WAGED rebalance failures by FailureCategory; enrich logs and exception messages) had on each runner.
- GenericHelixController.createRebalancer() -- after constructing the WagedRebalancer, calls setClusterStatusMonitor(_clusterStatusMonitor) so the rebalancer can attribute failures to the cluster's monitor.

Tests
Added to TestClusterStatusMonitor (7 new tests):
- testWagedFailureCategoryCountersStartAtZero
- testWagedFailureCategoryReportRoutesToCorrectCountersAndRollup
- testReportWagedFailureByCategoryHandlesNullAsUnknown
- testWagedFallbackInUseGaugeReflectsLatestSetter
- testWagedHardConstraintCountersStartAtZero
- testReportWagedHardConstraintFailureIncrementsCorrectCounter
- testReportWagedHardConstraintFailureHandlesNullAsUnknown

Added to TestConstraintBasedAlgorithm (2 new tests):
- testHardConstraintFailureReporterFiresOncePerPartitionAndConstraintType -- exercises a real NodeCapacityConstraint forced to reject every candidate, asserts the installed reporter fires exactly once per failed partition with the NODE_CAPACITY type (set-union semantics).
- testHardConstraintFailureReporterIsNoOpWhenUnset -- confirms the algorithm fails gracefully when no reporter is installed (no NPE on a null reporter).

Updated assertions:
- TestWagedRebalancer.testAlgorithmException -- now also asserts the Rebalancer-domain FailureCategoryUnknownCounter ticks alongside the existing aggregate counter assertion. Locks in that the new per-category counters are wired (without this assertion, a regression that registers but does not increment them would go undetected).
- TestWagedRebalancerMetrics.testMetricValuePropagation -- cast hardened from (long) metric.getLastEmittedMetricValue() to ((Number) ...).longValue(). The new metrics changed HashMap iteration order and surfaced a latent bug where a Double-typed metric (BaselineDivergenceGauge) could now appear before the first Long counter and trigger ClassCastException.

The following is the result of the mvn test command on the appropriate module:
- TestClusterStatusMonitor (now includes 7 new tests)
- TestHelixRebalanceException
- TestWagedRebalance (real-ZK integration)
- TestAbnormalStatesResolver (integration)

Sample integration log line confirming end-to-end attribution through both the async runner and the new metric layer:
Changes that Break Backward Compatibility (Optional)
No breaking changes to public API. Specifically:
- ClusterStatusMonitorMBean only gains new methods; nothing is renamed or removed. Existing JMX scrapers see the original attributes plus 18 new ones.
- WagedRebalancerMetricCollector only gains new entries in its WagedRebalancerMetricNames enum (which is already public). New metrics are additive.
- HardConstraint was widened from package-private to public. The class is still abstract and all 6 concrete subclasses keep their original visibility (package-private), so no new operational surface is exposed beyond the type identifier needed by the metric layer.
- ConstraintBasedAlgorithm was widened from package-private to public for the same reason -- so WagedRebalancer can install the failure reporter via an instanceof check. The class is not intended for direct external instantiation; ConstraintBasedAlgorithmFactory remains the entry point.
- WagedRebalancer.setClusterStatusMonitor(ClusterStatusMonitor) is additive. ReadOnlyWagedRebalancer and any external caller that does not call it continue to operate -- all metric-publication code paths null-check the reference.

Behavior guarantee on rebalancing logic:
Rebalancing decisions are bitwise identical to pre-PR. The per-instance ConstraintBasedAlgorithm instance is a fresh allocation of the same code, sharing the same ForkJoinPool inside the factory. The FAILURE_TYPES_TO_PROPAGATE decision in WagedRebalancer.computeNewIdealStates is unchanged; the fallback path runs in the same scenarios as before; the algorithm computation is untouched.

Behavior changes that are not API breaks but worth noting:
- validateInput failures now tick RebalanceFailureCounter and the matching per-category counter (previously they were not counted at the cluster level). The counter is monotonic so the step-up is observable but does not regress any contract.
- WagedRebalancer constructor allocates a fresh ConstraintBasedAlgorithm instead of sharing the previously static DEFAULT_REBALANCE_ALGORITHM field. This is a one-time cost at controller startup; the shared ForkJoinPool inside the factory is still reused, so steady-state behavior is identical.

Documentation (Optional)
N/A. New MBean attributes are self-documenting via Javadoc on ClusterStatusMonitorMBean and the WagedRebalancerMetricCollector.WagedRebalancerMetricNames enum.
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)
Code style follows the surrounding conventions in each touched file (existing import order, indentation, brace placement, Javadoc style).
Performance impact
- Scoring loop in ConstraintBasedAlgorithm.calculate(): untouched. The new code only runs in catch blocks or the existing "no eligible candidate" branch.
- Success path: a read of _clusterStatusMonitor, one null check, and one volatile boolean write to flip the fallback gauge to false. Zero new allocations on the success path.
- Hard-constraint reporter: runs only inside the existing if (candidateNodes.isEmpty()) branch -- i.e., only when a partition has already failed to find any eligible node. The work is a stream().distinct().forEach() over the constraints that contributed (typically < 10 elements). Zero overhead on the success path; sub-microsecond on the failure path.
- Startup: one extra ConstraintBasedAlgorithm allocation at controller startup vs. sharing the static singleton. The shared ForkJoinPool is still reused. Negligible.
- Memory: added per ClusterStatusMonitor instance (one per cluster) for the per-category and per-HardConstraint counter maps plus rollup atomics.
- JMX surface: 18 new attributes on ClusterStatusMonitor (8 per-category + 2 rollup + 1 gauge + 7 per-HardConstraint), 15 new attributes on WagedRebalancerMetricCollector (8 per-category + 7 per-HardConstraint). Each getter is AtomicLong.get() or Map.get(...).get() -- microsecond-scale and only polled by external scrapers at typical 30-60s intervals.
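To make the failure-path work concrete, the set-union reporting over contributing constraints might look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: HardConstraintReportSketch, reportDistinct, the per-node map shape, and the reduced Type enum are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of partition-level (set-union) attribution; not Helix code. */
final class HardConstraintReportSketch {

  /** Stand-in for HardConstraint.Type; only a few assumed values are listed. */
  enum Type { FAULT_ZONE, NODE_CAPACITY, VALID_GROUP_TAG }

  /**
   * Invokes the reporter once per distinct constraint type that rejected at least
   * one node for a single failed partition. A null reporter is treated as unset.
   */
  static void reportDistinct(Map<String, List<Type>> failedConstraintsByNode,
      Consumer<Type> reporter) {
    if (reporter == null) {
      return; // no reporter installed: nothing to do, and no NPE
    }
    failedConstraintsByNode.values().stream()
        .flatMap(List::stream) // flatten per-node rejections into one stream
        .distinct()            // set union: each type counted once per partition
        .forEach(reporter);
  }

  public static void main(String[] args) {
    // Three nodes reject the same partition; NODE_CAPACITY rejects all three.
    Map<String, List<Type>> failures = new LinkedHashMap<>();
    failures.put("node1", Arrays.asList(Type.NODE_CAPACITY));
    failures.put("node2", Arrays.asList(Type.NODE_CAPACITY, Type.FAULT_ZONE));
    failures.put("node3", Arrays.asList(Type.NODE_CAPACITY));

    List<Type> reported = new ArrayList<>();
    reportDistinct(failures, reported::add);
    System.out.println(reported);
  }
}
```

In this example NODE_CAPACITY is reported once even though it rejected three nodes, and FAULT_ZONE is reported once, which matches the "+1, not +50" attribution described in the How section.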