Stream provisioning metric changes#1014
Conversation
Captures the duration between datastream creation (system.creation.ms) and the INITIALIZING -> READY transition in handleDatastreamAddOrDelete. Aggregate-only sliding-window histogram (15 min) on the Coordinator module, registered via BrooklinHistogramInfo in CoordinatorMetrics so the metrics bridge surfaces it (per #945 precedent). Resume from STOPPED bypasses this code path (DatastreamResources.resume sets READY directly), so the metric only fires on initial creation.
Adds an INFO log line in recordTimeToReadyMs alongside the histogram update so per-stream provisioning times are recoverable from logs. The histogram aggregates across the sliding window and cannot expose individual sample values; the log line preserves per-datastream attribution for retrospective analysis.
| setup._datastreamKafkaCluster.shutdown(); | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
can you update PR description to reflect the latest changes in the diff.
| public enum Counter { | ||
| NUM_HEARTBEATS("numHeartbeats"); | ||
| NUM_HEARTBEATS("numHeartbeats"), | ||
| SLOW_PROVISIONING_COUNT("slowProvisioningCount"); |
There was a problem hiding this comment.
Counter is monotonic and cumulative. Maybe we need a Meter?
| // The metric is created lazily on first update via createOrUpdateSlidingWindowHistogram. | ||
| // Registering the BrooklinHistogramInfo here is required for the external metrics bridge | ||
| // to pick up the metric name | ||
| _metricInfos.add(new BrooklinHistogramInfo(_coordinator.buildMetricName(MODULE, TIME_TO_READY_MS))); |
There was a problem hiding this comment.
Good to check if 15 min histogram will help and how. Datastream creation qps can be very very low right?
| public static final int DEFAULT_MARK_DATASTREMS_STOPPED_TIMEOUT_MS = 60 * 1000; | ||
| public static final int DEFAULT_MARK_DATASTREMS_STOPPED_RETRY_PERIOD_MS = 10 * 1000; | ||
| public static final int DEFAULT_LOG_SIZE_LIMIT_IN_BYTES = 1024 * 1024; | ||
| public static final long DEFAULT_SLOW_PROVISIONING_THRESHOLD_MS = Duration.ofMinutes(10).toMillis(); |
There was a problem hiding this comment.
This is excluding validations right? should we split and account differently?
| } | ||
|
|
||
| private void recordTimeToReadyMs(Datastream ds) { | ||
| String creationMsStr = Objects.requireNonNull(ds.getMetadata()).get(CREATION_MS); |
There was a problem hiding this comment.
add in-line comment/javadoc in the code on when the creation_ms is populated for clarity
| coordinator.getDatastreamCache().getZkclient().close(); | ||
| zkClient.close(); | ||
| } | ||
|
|
There was a problem hiding this comment.
maybe add a regression test - metric should't be updated in the STOPPED → READY path. this should be emitted only on creation.
Summary
timeToReadyMshistogram (Coordinator): records the duration between datastream creation (system.creation.ms) and theINITIALIZING -> READYtransition. Uses a 15-minute sliding windowreservoir; exposed via the standard
BrooklinHistogramInfoattributes (Count, Max, Mean, p50/p75/p95/p99, etc.). Enables alerting on long provisioning viaMax.slowProvisioningCountcounter (Coordinator): incremented whenever a datastream's time-to-ready exceeds a configurable threshold.brooklin.server.coordinator.slowProvisioningThresholdMs: controls the threshold above which a provisioning is counted as slow. Default 10 minutes. Threshold lives in config, so brooklin-server can tune per fabric without a library bump.Testing Done
testTimeToReadyHistogramOnCreate— histogram count increases and sample is non-negative afterINITIALIZING -> READYtestTimeToReadyHistogramNotEmittedOnResume— histogram count does not change onSTOPPED -> READYresumetestSlowProvisioningCounterIncrementsWhenAboveThreshold— with threshold = 0 ms, counter increments after stream goes READYtestSlowProvisioningCounterDoesNotIncrementWhenBelowThreshold— with threshold = 1 h, counter stays unchanged after stream goes READYmint build