Add per-event-type topology change metrics for controller by thestreak101 · Pull Request #185 · linkedin/helix

thestreak101 · 2026-05-25T06:19:55Z

Issues

This PR is tracked by ACTIONITEM-18683 (LinkedIn-internal). It addresses the PRR-30 reviewer ask for a metric that surfaces total topology change event count and the count of events being processed by the Helix controller.

Description

Adds per-ClusterEventType ReceivedCounter / ProcessedCounter MBean attributes under ClusterStatus:eventName=TopologyChangeEvent for the five event types the rebalancer treats as topology-affecting:

IdealStateChange
InstanceConfigChange
ResourceConfigChange
LiveInstanceChange
ClusterConfigChange

The set mirrors the topology classification already used by ResourceChangeDetector.determinePropertyMapByType, so metric semantics line up with how the rebalancer itself defines a topology change.

Received is incremented in GenericHelixController.pushToEventQueues before the cluster event queue's dedup. Processed is incremented at the end of handleEvent on successful non-task pipeline runs. The gap between the two reflects coalescing under load -- the signal needed to correlate controller load with rebalance throughput.

Non-topology event types (MessageChange, CurrentStateChange, etc.) are filtered at the ClusterStatusMonitor boundary so callers stay simple. All five MBeans are registered eagerly in active() so dashboards and OTel scrapers see a stable schema before the first event of a given type arrives.

MBean shape

domain : ClusterStatus
keys   : cluster=<cluster>, eventName=TopologyChangeEvent, eventType=<one of 5>
attrs  : ReceivedCounter (Long), ProcessedCounter (Long), SensorName (String)

Files changed

controller/stages/ClusterEventType.java -- adds isTopologyChange() predicate and topologyChangeEventTypes() set as the single source of truth for the filter.
monitoring/mbeans/TopologyChangeEventMonitor.java (new) -- per-event-type MBean built on the existing DynamicMBeanProvider pattern (sibling of ClusterEventMonitor).
monitoring/mbeans/ClusterStatusMonitor.java -- map of monitors, eager registration in active(), unregister in reset(), public incrementTopologyChangeEvent{Received,Processed} methods that no-op on non-topology types.
controller/GenericHelixController.java -- two hook lines (Received in pushToEventQueues, Processed at end of handleEvent).

Tests

The following tests are written for this issue:

TestTopologyChangeEventMonitor (unit, 3 tests):
- testEagerRegistrationAndCounters -- all 5 MBeans registered at active(), counters start at zero, increments route to the matching event type, non-topology types are silently dropped, MBeans unregistered on reset().
- testConstructorRejectsNonTopologyType -- monitor construction rejects non-topology ClusterEventType values.
- testTopologyChangeEventTypesCoverage -- locks the topology set in tests so future enum additions are a conscious choice.
TestTopologyChangeEventMetric (integration, 1 test): drives a real controller + ZK + 2 participants and asserts that:
- All 5 topology MBeans are registered as soon as the controller becomes leader.
- LiveInstanceChange and IdealStateChange Received/Processed counters move on cluster startup.
- IdealStateChange counters advance past baseline when a new resource is added.
- InstanceConfigChange counters advance past baseline when an instance is disabled.
- MessageChange and other non-topology types never get a topology MBean registered.

The following is the result of running mvn -pl helix-core test against the new and adjacent monitor tests:

Tests run: 12, Failures: 0, Errors: 0, Skipped: 0

Tests included: TestTopologyChangeEventMonitor (3), TestTopologyChangeEventMetric (1), TestClusterStatusMonitor (7), TestClusterEventStatusMonitor (1). No regressions in adjacent monitor tests.

Changes that Break Backward Compatibility (Optional)

None. Purely additive: a new MBean class, new public methods on ClusterStatusMonitor, two new lines in GenericHelixController. No existing attributes, methods, or behaviors are changed.

Documentation (Optional)

None for this PR. Internal follow-up will land an OtelJmxAdaptor.SENSOR_DIMENSION_NAMES entry so OTel labels read Cluster.Name / Event.Name / Event.Type instead of the raw lowercase ObjectName keys; metric works without it but with inconsistent label casing.

Commits

Single commit, follows the standard guidelines (subject in imperative mood under 50 chars, body wraps at 72, explains what and why).

Code Quality

Diff follows the existing helix-style.xml conventions. New code matches the established _fieldName underscore convention used throughout the surrounding ClusterStatusMonitor / ClusterEventMonitor / monitor MBean classes.

Surfaces a per-ClusterEventType ReceivedCounter / ProcessedCounter MBean under ClusterStatus:eventName=TopologyChangeEvent for the five event types the rebalancer treats as topology-affecting (IdealStateChange, InstanceConfigChange, ResourceConfigChange, LiveInstanceChange, ClusterConfigChange). Mirrors the topology classification already used by ResourceChangeDetector so metric semantics line up with rebalancer behavior. Received is incremented in pushToEventQueues before the cluster event queue's dedup, processed at the end of handleEvent on successful non-task pipeline runs. The gap between the two reflects coalescing under load -- the signal PRR-30 reviewers asked for to correlate controller load with downstream rebalance throughput. Non-topology event types are filtered at the ClusterStatusMonitor boundary so callers stay simple. All five MBeans are registered eagerly in active() so dashboards (and OTel scrapers) see a stable schema before the first event of a given type arrives. Tests: - TestTopologyChangeEventMonitor: MBean registration, increment, filter, enum coverage lock. - TestTopologyChangeEventMetric: real controller + ZK integration test asserting counters advance on participant join, resource add, and instance disable through the full ZK -> CallbackHandler -> pipeline path; also verifies non-topology MBeans are never registered. ACTIONITEM-18683

LZD-PratyushBhatt

Added few comments

LZD-PratyushBhatt · 2026-05-26T14:01:33Z

+  public void incrementReceived() {
+    _receivedCounter.updateValue(_receivedCounter.getValue() + 1);
+  }
+
+  public void incrementProcessed() {
+    _processedCounter.updateValue(_processedCounter.getValue() + 1);
+  }


pushToEventQueues runs on independent ZK callback threads (one per listener:
onLiveInstanceChange, onIdealStateChange, onInstanceConfigChange,
onResourceConfigChange, onClusterConfigChange, plus forceRebalance from the
scheduled rebalance task). Concurrent increments for the same event type
lose updates.

Back both counters with AtomicLong, expose via a SimpleDynamicMetric
subclass whose getValue() reads from the atomic.

LZD-PratyushBhatt · 2026-05-26T14:02:55Z

+        "ReceivedCounter", icReceivedBefore + 1);
+    awaitCounterAtLeast(server, clusterName, ClusterEventType.InstanceConfigChange,
+        "ProcessedCounter", icProcessedBefore + 1);
+


can we add test for a ResourceConfig write for ResourceConfigChange, a cluster-config toggle for ClusterConfigChange. each followed by awaitCounterAtLeast on both counters.

LZD-PratyushBhatt · 2026-05-26T14:03:38Z

+    // Count topology-change events received from ZK before they get coalesced in the
+    // cluster event queue. Counted once per logical event (not per pipeline fan-out into
+    // DEFAULT/TASK queues) so the metric reflects ZK-driven churn, not internal queueing.
+    if (_isMonitoring && _clusterStatusMonitor != null) {


nit. _clusterStatusMonitor is final, initialized in the ctor at line 697. null check is dead. line 977 omits it. drop && _clusterStatusMonitor != null.

LZD-PratyushBhatt · 2026-05-26T14:04:06Z

        _clusterStatusMonitor
            .updateClusterEventDuration(ClusterEventMonitor.PhaseName.TotalProcessed.name(),
                _lastPipelineEndTimestamp - startTime);
+        // Count topology-change events that completed the resource pipeline successfully.


nit: "completed the resource pipeline" is inaccurate as this fires for the management pipeline too

LZD-PratyushBhatt · 2026-05-26T14:04:31Z

+  public static final String EVENT_NAME_KEY = "eventName";
+  public static final String EVENT_TYPE_KEY = "eventType";
+
+  private static final String SENSOR_DN_KEY = "TopologyChangeEvent";


nit: EVENT_NAME and SENSOR_DN_KEY are both "TopologyChangeEvent". one constant.

LZD-PratyushBhatt · 2026-05-26T14:05:17Z

+  long getReceivedCounter() {
+    return _receivedCounter.getValue();
+  }
+
+  long getProcessedCounter() {
+    return _processedCounter.getValue();
+  }


LZD-PratyushBhatt · 2026-05-26T14:06:24Z

+        // Paired with the received counter incremented in pushToEventQueues; the gap
+        // between the two reflects coalescing in the cluster event queue under load.
+        if (!rebalanceFail) {
+          _clusterStatusMonitor.incrementTopologyChangeEventProcessed(event.getEventType());


processed is also incremented for management pipeline events.

When _inManagementMode=true, topology events only go to
_managementModeEventQueue (line 1317). Management registry (lines 667-673)
only has a pipeline for LiveInstanceChange. For the other 4 types,
getPipelinesForEvent returns emptyList, the for-loop is a no-op,
rebalanceFail stays false, and we still bump processed here.

this will have controller stuck in management mode shows received == processed,
looks healthy on dashboards. Exactly what PRR-30 was trying to detect.

Fix:

if (!rebalanceFail && dataProvider instanceof ResourceControllerDataProvider) { _clusterStatusMonitor.incrementTopologyChangeEventProcessed(event.getEventType()); }

Also add an integration case: force management mode, fire an
IdealStateChange, assert processed does NOT move while received does.

Concurrency fix: pushToEventQueues runs on independent ZK callback threads (one per controller listener), so the Long-backed SimpleDynamicMetric increments were racy and lost updates. Switch both counters to an AtomicLong-backed SimpleDynamicMetric subclass so incrementAndGet() makes them safe under concurrent writers. Management-mode gate: in management mode, topology events go to the management mode event queue. The management registry only has a pipeline for LiveInstanceChange; the other four topology types get an empty pipeline list, the for-loop in handleEvent runs zero times, and rebalanceFail stays false -- so the previous gate would falsely credit those events as "processed". Restrict the increment to ResourceControllerDataProvider runs (extracted as a package-private shouldCountTopologyEventAsProcessed predicate so the gate logic is directly unit-testable). Without this fix a controller stuck in management mode would show received==processed -- exactly the false-healthy signal the metric was meant to surface. Test coverage expanded: TestTopologyChangeEventMetric now exercises ResourceConfigChange (resource-config write) and ClusterConfigChange (cluster-config update) end-to-end through the controller, in addition to the existing IdealStateChange / LiveInstanceChange / InstanceConfigChange cases. New TestGenericHelixControllerTopologyMetricGate unit-tests the predicate across resource success, resource failure, management, task-framework, and unknown-data-provider paths. Nits: drop dead getReceivedCounter / getProcessedCounter accessors, collapse duplicated EVENT_NAME / SENSOR_DN_KEY string into one constant, remove dead "_clusterStatusMonitor != null" check in pushToEventQueues (field is final, initialized in the controller constructor). ACTIONITEM-18683

thestreak101 · 2026-05-27T06:25:21Z

Thanks @LZD-PratyushBhatt — addressed all 7 comments in a091201.

Substantive

Concurrency — both counters now back onto AtomicLong via a small AtomicLongMetric subclass of SimpleDynamicMetric<Long>, so increments from independent ZK callback threads use incrementAndGet(). JMX attribute type stays Long so the LinkedIn OTel adaptor / dashboards consume it identically.
Management-mode gate — restricted the processed-increment to ResourceControllerDataProvider runs. Extracted the predicate as a package-private static shouldCountTopologyEventAsProcessed(rebalanceFail, dataProvider) so the gate logic is directly testable. New TestGenericHelixControllerTopologyMetricGate (5 cases) pins the four scenarios that should NOT count: resource-pipeline failure, management pipeline, task-framework pipeline, and unknown data provider. The reasoning behind the management-mode case is captured in the test javadoc.
ResourceConfig / ClusterConfig test coverage — TestTopologyChangeEventMetric now exercises ResourceConfigChange (resource-config write) and ClusterConfigChange (cluster-config update) end-to-end alongside the existing IdealStateChange / LiveInstanceChange / InstanceConfigChange cases.

Nits

Dropped dead getReceivedCounter / getProcessedCounter accessors.
Collapsed EVENT_NAME / SENSOR_DN_KEY (both "TopologyChangeEvent") into one constant.
Dropped the dead _clusterStatusMonitor != null check in pushToEventQueues (field is final, initialized in ctor).

Test results: 17/17 pass across the 5 affected test classes (TestTopologyChangeEventMonitor 3, TestTopologyChangeEventMetric 1 with 6 phases, TestGenericHelixControllerTopologyMetricGate 5, TestClusterStatusMonitor 7, TestClusterEventStatusMonitor 1). No regressions.

On the management-mode integration test

I went with a focused unit test on the extracted predicate rather than a full ZK-freeze integration scenario for two reasons:

The gate is now an explicit named function with clear branch coverage; the test pins the exact (rebalanceFail, dataProvider) matrix.
Forcing management mode end-to-end requires writing freeze-signal ZK plumbing into a test that wouldn't otherwise need it.

Happy to add the heavier ZK-freeze integration case if you'd still prefer it — let me know.

On CI: the PR_CI failure is a pre-existing codecov upload-coverage token-on-protected-branch issue plus 9 flaky/unrelated tests in helix-rest (TestClusterAccessor, TestCustomActiveStatesStoppableCheck, TestMaintenanceManagementService, TestInstancesAccessor, TestCanSwapCompleteAPI) and helix-view-aggregator (TestViewClusterRefresher). My changes touch none of those modules; helix-core ran 1808 tests with 0 failures / 0 errors.

LZD-PratyushBhatt

LGTM

thestreak101 requested review from LZD-PratyushBhatt, PranaviAncha, arkmish, kabragaurav, laxman-ch, ngngwr and proud-parselmouth as code owners May 25, 2026 06:19

LZD-PratyushBhatt reviewed May 26, 2026

View reviewed changes

LZD-PratyushBhatt approved these changes May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-event-type topology change metrics for controller#185

Add per-event-type topology change metrics for controller#185
thestreak101 wants to merge 2 commits into
devfrom
sacchoud/topology-change-event-count-metric

thestreak101 commented May 25, 2026

Uh oh!

LZD-PratyushBhatt left a comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

LZD-PratyushBhatt May 26, 2026

Uh oh!

thestreak101 commented May 27, 2026

Uh oh!

LZD-PratyushBhatt left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thestreak101 commented May 25, 2026

Issues

Description

MBean shape

Files changed

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

Uh oh!

LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LZD-PratyushBhatt May 26, 2026

Choose a reason for hiding this comment

Uh oh!

thestreak101 commented May 27, 2026

Uh oh!

LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants