Add eventsCommitToAckLatencyMs SLA metric #1018

Merged: akshayrai merged 2 commits into linkedin:master from akshayrai:akrai/e2e-commit-to-ack-sla on May 21, 2026

Conversation

@akshayrai
Collaborator

Summary

Adds a new SLA metric — eventsCommitToAckLatencyMs — that measures source DB commit → destination ack latency.

The existing eventsLatencyMs is computed from DatastreamProducerRecord.eventsSourceTimestamp, which on some CDC connectors reflects when Brooklin read the event (e.g. from an intermediate Kafka hop) rather than the original DB commit. This change introduces a separate, opt-in timestamp on the producer record and a parallel metric so connectors that can supply a true commit time expose a clean end-to-end SLA without changing existing metric semantics.
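The opt-in timestamp described above can be pictured with a minimal sketch. It is not the Brooklin implementation: `ProducerRecordSketch` and its `copy` helper are hypothetical stand-ins, and only the names `setEventsCommitTimestamp(long)` and `getEventsCommitTimestamp()` are taken from this PR.

```java
import java.util.Optional;

// Simplified stand-in for a producer record. This is NOT the Brooklin class;
// only the commit-timestamp setter and getter names come from the PR.
final class ProducerRecordSketch {
  private final long eventsSourceTimestamp;
  private final Optional<Long> eventsCommitTimestamp;

  private ProducerRecordSketch(long eventsSourceTimestamp, Optional<Long> eventsCommitTimestamp) {
    this.eventsSourceTimestamp = eventsSourceTimestamp;
    this.eventsCommitTimestamp = eventsCommitTimestamp;
  }

  long getEventsSourceTimestamp() {
    return eventsSourceTimestamp;
  }

  // Empty unless the connector opted in by supplying a commit time.
  Optional<Long> getEventsCommitTimestamp() {
    return eventsCommitTimestamp;
  }

  // Hypothetical copy helper: carries the commit timestamp over when present
  // and leaves it absent otherwise.
  static ProducerRecordSketch copy(ProducerRecordSketch original) {
    Builder builder = new Builder().setEventsSourceTimestamp(original.getEventsSourceTimestamp());
    original.getEventsCommitTimestamp().ifPresent(builder::setEventsCommitTimestamp);
    return builder.build();
  }

  static final class Builder {
    private long eventsSourceTimestamp;
    private Optional<Long> eventsCommitTimestamp = Optional.empty();

    Builder setEventsSourceTimestamp(long timestampMs) {
      this.eventsSourceTimestamp = timestampMs;
      return this;
    }

    Builder setEventsCommitTimestamp(long timestampMs) {
      this.eventsCommitTimestamp = Optional.of(timestampMs);
      return this;
    }

    ProducerRecordSketch build() {
      return new ProducerRecordSketch(eventsSourceTimestamp, eventsCommitTimestamp);
    }
  }
}
```

Because the field defaults to `Optional.empty()`, a connector that never calls the setter produces records that look exactly as before to any consumer that ignores the new getter.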

Tests

New unit tests covering both the data path and the metric path:

TestDatastreamProducerRecordBuilder

  • eventsCommitTimestamp absent by default
  • round-trips through the builder
  • propagated by copyProducerRecord
  • absence preserved by copyProducerRecord

TestEventProducer

  • metric is a no-op when no commit timestamp is supplied; existing eventsLatencyMs SLA still fires (regression guard)
  • eventsCommitWithinSla increments when latency is inside the threshold
  • eventsCommitOutsideSla increments when latency exceeds the threshold (tight 1ms threshold + 100ms-old commit)
  • during grace, histogram redirects to eventsCommitToAckLatencyMsSlaIneligible and both counters are suppressed
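The four TestEventProducer cases above correspond to four branches in the reporting path. The sketch below is a hedged illustration of those branches only: `CommitToAckReporterSketch`, its list-backed "histograms", the explicit `nowMs` and `inGracePeriod` parameters, and the inclusive `<=` comparison are all assumptions made for the example, not the EventProducer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical reporter; lists stand in for real histogram objects.
final class CommitToAckReporterSketch {
  final List<Long> latencyHistogram = new ArrayList<>();
  final List<Long> slaIneligibleHistogram = new ArrayList<>();
  long withinSla;
  long outsideSla;
  private final long thresholdMs;

  CommitToAckReporterSketch(long thresholdMs) {
    this.thresholdMs = thresholdMs;
  }

  void report(Optional<Long> eventsCommitTimestamp, long nowMs, boolean inGracePeriod) {
    // Case 1: no commit timestamp supplied, so the metric is a no-op.
    if (!eventsCommitTimestamp.isPresent()) {
      return;
    }
    long commitTs = eventsCommitTimestamp.get();
    if (commitTs <= 0) {
      return;
    }
    long latencyMs = nowMs - commitTs;
    // Case 4: during grace, record to the SLA-ineligible histogram and
    // suppress both counters.
    if (inGracePeriod) {
      slaIneligibleHistogram.add(latencyMs);
      return;
    }
    latencyHistogram.add(latencyMs);
    // Cases 2 and 3: classify against the threshold (inclusive bound assumed).
    if (latencyMs <= thresholdMs) {
      withinSla++;
    } else {
      outsideSla++;
    }
  }
}
```

With a 1 ms threshold, `report(Optional.of(900L), 1000L, false)` sees a 100 ms latency and increments `outsideSla`, which mirrors the tight-threshold test described above.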

akshayrai and others added 2 commits May 21, 2026 14:21
Adds a parallel end-to-end latency metric measuring source DB commit time
to destination ack, distinct from the existing eventsLatencyMs which on
some CDC connectors (Espresso, TiDB) reflects the time Brooklin read the
event from an intermediate Kafka hop rather than the original DB commit.

DatastreamProducerRecord gains an Optional<Long> eventsCommitTimestamp
that connectors capable of supplying a true commit time populate via
DatastreamProducerRecordBuilder.setEventsCommitTimestamp(long). The field
is absent for non-CDC sources and bootstrap paths, so the new metric only
emits when a connector opts in; existing eventsLatencyMs behavior and
counters are unchanged.

EventProducer threads the optional timestamp through send -> onSendCallback
-> reportMetrics and emits eventsCommitToAckLatencyMs (histogram) plus
eventsCommitWithinSla / eventsCommitOutsideSla counter pairs (primary and
alternate). Thresholds are configurable via commitToAckThresholdSlaMs
(default 5m) and commitToAckThresholdAlternateSlaMs (default 15m); defaults
are wider than the existing source-to-ack thresholds because commit-to-ack
includes upstream CDC pipeline lag. Emission is gated by the same
shouldEmitMetric() suppression as the existing SLA metric, so grace-period
and disableSlaMetric semantics carry over.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
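
The primary and alternate thresholds named in the commit message above could be resolved and applied roughly as follows. This is a sketch under stated assumptions: the two key names and the 5 minute / 15 minute defaults come from the commit message, while `CommitToAckThresholdsSketch`, the use of `java.util.Properties`, the unprefixed key lookup, and the inclusive comparison are hypothetical choices for illustration.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical holder for the two commit-to-ack thresholds.
final class CommitToAckThresholdsSketch {
  static final String SLA_KEY = "commitToAckThresholdSlaMs";
  static final String ALTERNATE_SLA_KEY = "commitToAckThresholdAlternateSlaMs";
  static final long DEFAULT_SLA_MS = Duration.ofMinutes(5).toMillis();
  static final long DEFAULT_ALTERNATE_SLA_MS = Duration.ofMinutes(15).toMillis();

  final long slaMs;
  final long alternateSlaMs;

  // Falls back to the documented defaults when a key is not set.
  CommitToAckThresholdsSketch(Properties props) {
    this.slaMs = Long.parseLong(props.getProperty(SLA_KEY, Long.toString(DEFAULT_SLA_MS)));
    this.alternateSlaMs =
        Long.parseLong(props.getProperty(ALTERNATE_SLA_KEY, Long.toString(DEFAULT_ALTERNATE_SLA_MS)));
  }

  // Each threshold drives its own within/outside counter pair.
  boolean withinPrimarySla(long latencyMs) {
    return latencyMs <= slaMs;
  }

  boolean withinAlternateSla(long latencyMs) {
    return latencyMs <= alternateSlaMs;
  }
}
```

Under the defaults, a 10 minute commit-to-ack latency falls outside the primary SLA but inside the alternate one, so the two counter pairs can diverge for the same event.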
Four tests covering the new commit-to-ack latency path:

- metric not emitted when commit timestamp is absent (non-CDC / bootstrap)
- within-SLA counter increments when commit timestamp is recent
- outside-SLA counter increments when latency exceeds threshold
- histogram redirects to SLA-ineligible and counters suppress during grace

All assert against aggregate metric names so no per-task DropWizard plumbing
is needed. The non-CDC source URI is used in the first three tests to keep
the grace gate disengaged; the fourth deliberately re-enables it with a
fresh CDC stream to exercise the suppression path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

return;
}
long commitTs = eventsCommitTimestamp.get();
if (commitTs <= 0) {
Collaborator

Nit: In reportCommitToAckMetrics, we guard against commitTs <= 0 but not against a connector supplying a timestamp in the future. In that case System.currentTimeMillis() - commitTs becomes negative, and we’ll emit a negative histogram value or classify the SLA incorrectly. Do we need to add that check?

Collaborator Author

commitTs cannot be in the future, since the event has already happened, right?
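
For reference, the case raised in this thread would only arise if the clock that stamped the commit ran ahead of the clock on the producer host. If a guard were wanted, one possible shape is sketched below. It is purely hypothetical: `CommitLatencySketch` and the choice to clamp a negative difference to zero are assumptions for illustration, and nothing in this thread shows such a guard being adopted.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of this PR.
final class CommitLatencySketch {
  // Returns empty for a non-positive commit timestamp (same guard as the
  // snippet under review) and clamps a future timestamp to zero latency.
  static OptionalLong commitToAckLatencyMs(long commitTs, long nowMs) {
    if (commitTs <= 0) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(Math.max(0L, nowMs - commitTs));
  }
}
```

Clamping keeps the histogram non-negative but still counts the event as within SLA; skipping the event entirely would be an equally defensible alternative.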

Collaborator

@mittalprince left a comment

LGTM

* Get the source DB commit timestamp (Epoch-millis) if the connector supplied one. Present for CDC connectors
* that surface a true commit time; absent otherwise.
*/
public Optional<Long> getEventsCommitTimestamp() {
Collaborator

So this field will only get set for CDC events?

Collaborator Author

Yes, this applies only to CDC events.

@akshayrai merged commit 43dc83d into linkedin:master on May 21, 2026
1 check passed
@mittalprince mentioned this pull request on May 22, 2026
1 task