Skip to content

Stream provisioning metric changes#1014

Open
dhananjay-sawner wants to merge 8 commits into
masterfrom
dsawner/provisioning-time-metric
Open

Stream provisioning metric changes#1014
dhananjay-sawner wants to merge 8 commits into
masterfrom
dsawner/provisioning-time-metric

Conversation

@dhananjay-sawner
Copy link
Copy Markdown
Collaborator

@dhananjay-sawner dhananjay-sawner commented May 19, 2026

Summary

  • timeToReadyMs histogram (Coordinator): records the duration between datastream creation (system.creation.ms) and the INITIALIZING -> READY transition. Uses a 15-minute sliding window
    reservoir; exposed via the standard BrooklinHistogramInfo attributes (Count, Max, Mean, p50/p75/p95/p99, etc.). Enables alerting on long provisioning via Max.
  • slowProvisioningCount counter (Coordinator): incremented whenever a datastream's time-to-ready exceeds a configurable threshold.
  • New config brooklin.server.coordinator.slowProvisioningThresholdMs: controls the threshold above which a provisioning is counted as slow. Default 10 minutes. Threshold lives in config, so brooklin-server can tune per fabric without a library bump.

Testing Done

  • testTimeToReadyHistogramOnCreate — histogram count increases and sample is non-negative after INITIALIZING -> READY
  • testTimeToReadyHistogramNotEmittedOnResume — histogram count does not change on STOPPED -> READY resume
  • testSlowProvisioningCounterIncrementsWhenAboveThreshold — with threshold = 0 ms, counter increments after stream goes READY
  • testSlowProvisioningCounterDoesNotIncrementWhenBelowThreshold — with threshold = 1 h, counter stays unchanged after stream goes READY
  • mint build

Dhananjay Sawner added 8 commits May 9, 2026 14:39
Captures the duration between datastream creation (system.creation.ms)
and the INITIALIZING -> READY transition in handleDatastreamAddOrDelete.
Aggregate-only sliding-window histogram (15 min) on the Coordinator
module, registered via BrooklinHistogramInfo in CoordinatorMetrics so
the metrics bridge surfaces it (per #945 precedent).
Resume from STOPPED bypasses this code path (DatastreamResources.resume
sets READY directly), so the metric only fires on initial creation.
Adds an INFO log line in recordTimeToReadyMs alongside the histogram
update so per-stream provisioning times are recoverable from logs.
The histogram aggregates across the sliding window and cannot expose
individual sample values; the log line preserves per-datastream
attribution for retrospective analysis.
setup._datastreamKafkaCluster.shutdown();
}

@Test
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update PR description to reflect the latest changes in the diff.

public enum Counter {
NUM_HEARTBEATS("numHeartbeats");
NUM_HEARTBEATS("numHeartbeats"),
SLOW_PROVISIONING_COUNT("slowProvisioningCount");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counter is monotonic and cumulative. Maybe we need a Meter?

// The metric is created lazily on first update via createOrUpdateSlidingWindowHistogram.
// Registering the BrooklinHistogramInfo here is required for the external metrics bridge
// to pick up the metric name
_metricInfos.add(new BrooklinHistogramInfo(_coordinator.buildMetricName(MODULE, TIME_TO_READY_MS)));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to check if 15 min histogram will help and how. Datastream creation qps can be very very low right?

public static final int DEFAULT_MARK_DATASTREMS_STOPPED_TIMEOUT_MS = 60 * 1000;
public static final int DEFAULT_MARK_DATASTREMS_STOPPED_RETRY_PERIOD_MS = 10 * 1000;
public static final int DEFAULT_LOG_SIZE_LIMIT_IN_BYTES = 1024 * 1024;
public static final long DEFAULT_SLOW_PROVISIONING_THRESHOLD_MS = Duration.ofMinutes(10).toMillis();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excluding validations right? should we split and account differently?

}

private void recordTimeToReadyMs(Datastream ds) {
String creationMsStr = Objects.requireNonNull(ds.getMetadata()).get(CREATION_MS);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add in-line comment/javadoc in the code on when the creation_ms is populated for clarity

coordinator.getDatastreamCache().getZkclient().close();
zkClient.close();
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a regression test - metric should't be updated in the STOPPED → READY path. this should be emitted only on creation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants