Skip to content

Add per-event-type topology change metrics for controller#185

Open
thestreak101 wants to merge 2 commits into
devfrom
sacchoud/topology-change-event-count-metric
Open

Add per-event-type topology change metrics for controller#185
thestreak101 wants to merge 2 commits into
devfrom
sacchoud/topology-change-event-count-metric

Conversation

@thestreak101
Copy link
Copy Markdown
Collaborator

Issues

  • This PR is tracked by ACTIONITEM-18683 (LinkedIn-internal). It addresses the PRR-30 reviewer ask for a metric that surfaces total topology change event count and the count of events being processed by the Helix controller.

Description

Adds per-ClusterEventType ReceivedCounter / ProcessedCounter MBean attributes under ClusterStatus:eventName=TopologyChangeEvent for the five event types the rebalancer treats as topology-affecting:

  • IdealStateChange
  • InstanceConfigChange
  • ResourceConfigChange
  • LiveInstanceChange
  • ClusterConfigChange

The set mirrors the topology classification already used by ResourceChangeDetector.determinePropertyMapByType, so metric semantics line up with how the rebalancer itself defines a topology change.

Received is incremented in GenericHelixController.pushToEventQueues before the cluster event queue's dedup. Processed is incremented at the end of handleEvent on successful non-task pipeline runs. The gap between the two reflects coalescing under load -- the signal needed to correlate controller load with rebalance throughput.

Non-topology event types (MessageChange, CurrentStateChange, etc.) are filtered at the ClusterStatusMonitor boundary so callers stay simple. All five MBeans are registered eagerly in active() so dashboards and OTel scrapers see a stable schema before the first event of a given type arrives.

MBean shape

domain : ClusterStatus
keys   : cluster=<cluster>, eventName=TopologyChangeEvent, eventType=<one of 5>
attrs  : ReceivedCounter (Long), ProcessedCounter (Long), SensorName (String)

Files changed

  • controller/stages/ClusterEventType.java -- adds isTopologyChange() predicate and topologyChangeEventTypes() set as the single source of truth for the filter.
  • monitoring/mbeans/TopologyChangeEventMonitor.java (new) -- per-event-type MBean built on the existing DynamicMBeanProvider pattern (sibling of ClusterEventMonitor).
  • monitoring/mbeans/ClusterStatusMonitor.java -- map of monitors, eager registration in active(), unregister in reset(), public incrementTopologyChangeEvent{Received,Processed} methods that no-op on non-topology types.
  • controller/GenericHelixController.java -- two hook lines (Received in pushToEventQueues, Processed at end of handleEvent).

Tests

The following tests are written for this issue:

  • TestTopologyChangeEventMonitor (unit, 3 tests):
    • testEagerRegistrationAndCounters -- all 5 MBeans registered at active(), counters start at zero, increments route to the matching event type, non-topology types are silently dropped, MBeans unregistered on reset().
    • testConstructorRejectsNonTopologyType -- monitor construction rejects non-topology ClusterEventType values.
    • testTopologyChangeEventTypesCoverage -- locks the topology set in tests so future enum additions are a conscious choice.
  • TestTopologyChangeEventMetric (integration, 1 test): drives a real controller + ZK + 2 participants and asserts that:
    • All 5 topology MBeans are registered as soon as the controller becomes leader.
    • LiveInstanceChange and IdealStateChange Received/Processed counters move on cluster startup.
    • IdealStateChange counters advance past baseline when a new resource is added.
    • InstanceConfigChange counters advance past baseline when an instance is disabled.
    • MessageChange and other non-topology types never get a topology MBean registered.

The following is the result of running mvn -pl helix-core test against the new and adjacent monitor tests:

Tests run: 12, Failures: 0, Errors: 0, Skipped: 0

Tests included: TestTopologyChangeEventMonitor (3), TestTopologyChangeEventMetric (1), TestClusterStatusMonitor (7), TestClusterEventStatusMonitor (1). No regressions in adjacent monitor tests.

Changes that Break Backward Compatibility (Optional)

None. Purely additive: a new MBean class, new public methods on ClusterStatusMonitor, two new lines in GenericHelixController. No existing attributes, methods, or behaviors are changed.

Documentation (Optional)

None for this PR. Internal follow-up will land an OtelJmxAdaptor.SENSOR_DIMENSION_NAMES entry so OTel labels read Cluster.Name / Event.Name / Event.Type instead of the raw lowercase ObjectName keys; metric works without it but with inconsistent label casing.

Commits

Single commit, follows the standard guidelines (subject in imperative mood under 50 chars, body wraps at 72, explains what and why).

Code Quality

Diff follows the existing helix-style.xml conventions. New code matches the established _fieldName underscore convention used throughout the surrounding ClusterStatusMonitor / ClusterEventMonitor / monitor MBean classes.

Surfaces a per-ClusterEventType ReceivedCounter / ProcessedCounter MBean
under ClusterStatus:eventName=TopologyChangeEvent for the five event types
the rebalancer treats as topology-affecting (IdealStateChange,
InstanceConfigChange, ResourceConfigChange, LiveInstanceChange,
ClusterConfigChange). Mirrors the topology classification already used
by ResourceChangeDetector so metric semantics line up with rebalancer
behavior.

Received is incremented in pushToEventQueues before the cluster event
queue's dedup, processed at the end of handleEvent on successful
non-task pipeline runs. The gap between the two reflects coalescing
under load -- the signal PRR-30 reviewers asked for to correlate
controller load with downstream rebalance throughput.

Non-topology event types are filtered at the ClusterStatusMonitor
boundary so callers stay simple. All five MBeans are registered eagerly
in active() so dashboards (and OTel scrapers) see a stable schema
before the first event of a given type arrives.

Tests:
- TestTopologyChangeEventMonitor: MBean registration, increment, filter,
  enum coverage lock.
- TestTopologyChangeEventMetric: real controller + ZK integration test
  asserting counters advance on participant join, resource add, and
  instance disable through the full ZK -> CallbackHandler -> pipeline
  path; also verifies non-topology MBeans are never registered.

ACTIONITEM-18683
Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few comments

Comment on lines +74 to +80
public void incrementReceived() {
_receivedCounter.updateValue(_receivedCounter.getValue() + 1);
}

public void incrementProcessed() {
_processedCounter.updateValue(_processedCounter.getValue() + 1);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushToEventQueues runs on independent ZK callback threads (one per listener:
onLiveInstanceChange, onIdealStateChange, onInstanceConfigChange,
onResourceConfigChange, onClusterConfigChange, plus forceRebalance from the
scheduled rebalance task). Concurrent increments for the same event type
lose updates.

Back both counters with AtomicLong, expose via a SimpleDynamicMetric
subclass whose getValue() reads from the atomic.

"ReceivedCounter", icReceivedBefore + 1);
awaitCounterAtLeast(server, clusterName, ClusterEventType.InstanceConfigChange,
"ProcessedCounter", icProcessedBefore + 1);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add test for a ResourceConfig write for ResourceConfigChange, a cluster-config toggle for ClusterConfigChange. each followed by awaitCounterAtLeast on both counters.

// Count topology-change events received from ZK before they get coalesced in the
// cluster event queue. Counted once per logical event (not per pipeline fan-out into
// DEFAULT/TASK queues) so the metric reflects ZK-driven churn, not internal queueing.
if (_isMonitoring && _clusterStatusMonitor != null) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. _clusterStatusMonitor is final, initialized in the ctor at line 697. null check is dead. line 977 omits it. drop && _clusterStatusMonitor != null.

_clusterStatusMonitor
.updateClusterEventDuration(ClusterEventMonitor.PhaseName.TotalProcessed.name(),
_lastPipelineEndTimestamp - startTime);
// Count topology-change events that completed the resource pipeline successfully.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "completed the resource pipeline" is inaccurate as this fires for the management pipeline too

public static final String EVENT_NAME_KEY = "eventName";
public static final String EVENT_TYPE_KEY = "eventType";

private static final String SENSOR_DN_KEY = "TopologyChangeEvent";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: EVENT_NAME and SENSOR_DN_KEY are both "TopologyChangeEvent". one constant.

Comment on lines +86 to +92
long getReceivedCounter() {
return _receivedCounter.getValue();
}

long getProcessedCounter() {
return _processedCounter.getValue();
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code.

// Paired with the received counter incremented in pushToEventQueues; the gap
// between the two reflects coalescing in the cluster event queue under load.
if (!rebalanceFail) {
_clusterStatusMonitor.incrementTopologyChangeEventProcessed(event.getEventType());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processed is also incremented for management pipeline events.

When _inManagementMode=true, topology events only go to
_managementModeEventQueue (line 1317). Management registry (lines 667-673)
only has a pipeline for LiveInstanceChange. For the other 4 types,
getPipelinesForEvent returns emptyList, the for-loop is a no-op,
rebalanceFail stays false, and we still bump processed here.

this will have controller stuck in management mode shows received == processed,
looks healthy on dashboards. Exactly what PRR-30 was trying to detect.

Fix:

if (!rebalanceFail && dataProvider instanceof ResourceControllerDataProvider) {
  _clusterStatusMonitor.incrementTopologyChangeEventProcessed(event.getEventType());
}

Also add an integration case: force management mode, fire an
IdealStateChange, assert processed does NOT move while received does.

Concurrency fix: pushToEventQueues runs on independent ZK callback
threads (one per controller listener), so the Long-backed
SimpleDynamicMetric increments were racy and lost updates. Switch both
counters to an AtomicLong-backed SimpleDynamicMetric subclass so
incrementAndGet() makes them safe under concurrent writers.

Management-mode gate: in management mode, topology events go to the
management mode event queue. The management registry only has a
pipeline for LiveInstanceChange; the other four topology types get an
empty pipeline list, the for-loop in handleEvent runs zero times, and
rebalanceFail stays false -- so the previous gate would falsely credit
those events as "processed". Restrict the increment to
ResourceControllerDataProvider runs (extracted as a package-private
shouldCountTopologyEventAsProcessed predicate so the gate logic is
directly unit-testable). Without this fix a controller stuck in
management mode would show received==processed -- exactly the
false-healthy signal the metric was meant to surface.

Test coverage expanded: TestTopologyChangeEventMetric now exercises
ResourceConfigChange (resource-config write) and ClusterConfigChange
(cluster-config update) end-to-end through the controller, in addition
to the existing IdealStateChange / LiveInstanceChange / InstanceConfigChange
cases. New TestGenericHelixControllerTopologyMetricGate unit-tests the
predicate across resource success, resource failure, management,
task-framework, and unknown-data-provider paths.

Nits: drop dead getReceivedCounter / getProcessedCounter accessors,
collapse duplicated EVENT_NAME / SENSOR_DN_KEY string into one constant,
remove dead "_clusterStatusMonitor != null" check in pushToEventQueues
(field is final, initialized in the controller constructor).

ACTIONITEM-18683
@thestreak101
Copy link
Copy Markdown
Collaborator Author

Thanks @LZD-PratyushBhatt — addressed all 7 comments in a091201.

Substantive

  • Concurrency — both counters now back onto AtomicLong via a small AtomicLongMetric subclass of SimpleDynamicMetric<Long>, so increments from independent ZK callback threads use incrementAndGet(). JMX attribute type stays Long so the LinkedIn OTel adaptor / dashboards consume it identically.
  • Management-mode gate — restricted the processed-increment to ResourceControllerDataProvider runs. Extracted the predicate as a package-private static shouldCountTopologyEventAsProcessed(rebalanceFail, dataProvider) so the gate logic is directly testable. New TestGenericHelixControllerTopologyMetricGate (5 cases) pins the four scenarios that should NOT count: resource-pipeline failure, management pipeline, task-framework pipeline, and unknown data provider. The reasoning behind the management-mode case is captured in the test javadoc.
  • ResourceConfig / ClusterConfig test coverageTestTopologyChangeEventMetric now exercises ResourceConfigChange (resource-config write) and ClusterConfigChange (cluster-config update) end-to-end alongside the existing IdealStateChange / LiveInstanceChange / InstanceConfigChange cases.

Nits

  • Dropped dead getReceivedCounter / getProcessedCounter accessors.
  • Collapsed EVENT_NAME / SENSOR_DN_KEY (both "TopologyChangeEvent") into one constant.
  • Dropped the dead _clusterStatusMonitor != null check in pushToEventQueues (field is final, initialized in ctor).

Test results: 17/17 pass across the 5 affected test classes (TestTopologyChangeEventMonitor 3, TestTopologyChangeEventMetric 1 with 6 phases, TestGenericHelixControllerTopologyMetricGate 5, TestClusterStatusMonitor 7, TestClusterEventStatusMonitor 1). No regressions.

On the management-mode integration test

I went with a focused unit test on the extracted predicate rather than a full ZK-freeze integration scenario for two reasons:

  1. The gate is now an explicit named function with clear branch coverage; the test pins the exact (rebalanceFail, dataProvider) matrix.
  2. Forcing management mode end-to-end requires writing freeze-signal ZK plumbing into a test that wouldn't otherwise need it.

Happy to add the heavier ZK-freeze integration case if you'd still prefer it — let me know.

On CI: the PR_CI failure is a pre-existing codecov upload-coverage token-on-protected-branch issue plus 9 flaky/unrelated tests in helix-rest (TestClusterAccessor, TestCustomActiveStatesStoppableCheck, TestMaintenanceManagementService, TestInstancesAccessor, TestCanSwapCompleteAPI) and helix-view-aggregator (TestViewClusterRefresher). My changes touch none of those modules; helix-core ran 1808 tests with 0 failures / 0 errors.

Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants