feat(kafka): add MessageBroker/Kafka/Cluster/{id}/Topic/{topic} metrics#2914

Draft
shashank-reddy-nr wants to merge 7 commits into newrelic:main from shashank-reddy-nr:feat/kafka-cluster-id

@shashank-reddy-nr shashank-reddy-nr commented Jun 1, 2026

Summary

Adds three new instrumentation modules that record per-cluster, per-topic Kafka metrics across all supported Kafka client versions. The cluster UUID is read from the client's own already-fetched Metadata — no extra broker connection is opened.

What's changed

  • instrumentation/kafka-clients-cluster-metrics-0.11.0.0/: covers Kafka producers [0.11.0.0, 2.0.0). ClusterIdHelper uses reflection to walk metadata.fetch().clusterResource().clusterId(). KafkaProducer_Instrumentation stores the resolved UUID in a @NewField volatile String nrClusterId and records MessageBroker/Kafka/Cluster/{id}/Topic/{topic}/Produce on each doSend().
  • instrumentation/kafka-clients-cluster-metrics-2.0.0/ — covers producers and consumers [2.0.0, 3.7.0). Same reflection approach. KafkaConsumer_Instrumentation instruments both poll(Duration) and poll(long).
  • instrumentation/kafka-clients-cluster-metrics-3.7.0/ — covers producers and consumers [3.7.0, ∞). Kafka 3.7 refactored KafkaConsumer into a thin facade; this module targets LegacyKafkaConsumer via the Weaver field-reference pattern (private final ConsumerMetadata metadata = Weaver.callOriginal()) to avoid reflection on the hot path.
  • settings.gradle — registers all three new modules.
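
The reflective walk described for ClusterIdHelper can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the Fake* classes are hypothetical stand-ins for the Kafka client types, and the class name ClusterIdReflectionSketch and the method walk are invented for this example. Only the method chain fetch().clusterResource().clusterId() comes from the description above.

```java
import java.lang.reflect.Method;

/** Sketch: resolve a cluster ID by reflectively walking a chain of no-arg methods. */
final class ClusterIdReflectionSketch {

    // Hypothetical stand-ins for the Kafka client types; the real classes are not used here.
    public static final class FakeClusterResource {
        public String clusterId() { return "abc-123"; }
    }

    public static final class FakeCluster {
        public FakeClusterResource clusterResource() { return new FakeClusterResource(); }
    }

    public static final class FakeMetadata {
        public FakeCluster fetch() { return new FakeCluster(); }
    }

    /** Invokes each named no-arg method in turn; returns null if any step fails. */
    static String walk(Object root, String... methodNames) {
        Object current = root;
        try {
            for (String name : methodNames) {
                if (current == null) {
                    return null;
                }
                Method method = current.getClass().getMethod(name);
                method.setAccessible(true);
                current = method.invoke(current);
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Best effort: a failed lookup must never break the instrumented client.
            return null;
        }
        return current instanceof String ? (String) current : null;
    }

    public static void main(String[] args) {
        System.out.println(walk(new FakeMetadata(), "fetch", "clusterResource", "clusterId"));
    }
}
```

Returning null on any failure keeps the lookup best-effort, which matches the fallback behaviour the description mentions for clients that have not yet fetched metadata.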

Metric shape

MessageBroker/Kafka/Cluster/{clusterId}/Topic/{topic}/Produce
MessageBroker/Kafka/Cluster/{clusterId}/Topic/{topic}/Consume

The cluster UUID is available once the Kafka client has completed its first metadata refresh. Until then, calls fall back to the existing topic-only metric path.
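
As a rough illustration of the metric shape and the fallback, the name could be assembled as below. The cluster-scoped format is taken from the two lines above; the topic-only name used for the fallback is a hypothetical placeholder, since this PR does not spell out the existing metric's exact name, and KafkaMetricNameSketch is an invented class.

```java
/** Sketch: build the per-cluster metric name, falling back when no cluster ID is known yet. */
final class KafkaMetricNameSketch {

    // Hypothetical placeholder for the pre-existing topic-only metric prefix (assumption).
    static final String TOPIC_ONLY_PREFIX = "MessageBroker/Kafka/Topic/";

    /** operation is expected to be "Produce" or "Consume". */
    static String metricName(String clusterId, String topic, String operation) {
        if (clusterId == null || clusterId.isEmpty()) {
            // No metadata refresh has completed yet: use the topic-only path.
            return TOPIC_ONLY_PREFIX + topic + "/" + operation;
        }
        return "MessageBroker/Kafka/Cluster/" + clusterId + "/Topic/" + topic + "/" + operation;
    }

    public static void main(String[] args) {
        System.out.println(metricName("abc-123", "orders", "Produce"));
        System.out.println(metricName(null, "orders", "Consume"));
    }
}
```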

Before contributing, please read our contributing guidelines and code of conduct.

Overview

Describe the changes present in the pull request

Related GitHub Issue

Include a link to the related GitHub issue, if applicable

Testing

The agent includes a suite of tests which should be used to verify your changes don't break existing functionality. These tests will run with GitHub Actions when a pull request is made. More details on running the tests locally can be found here.

Checks

  • Your contributions are backwards compatible with relevant frameworks and APIs.
  • Your code does not contain any breaking changes. Otherwise please describe.
  • Your code does not introduce any new dependencies. Otherwise please describe.

codecov-commenter commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.76%. Comparing base (1859c17) to head (b177dc7).
⚠️ Report is 91 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (1859c17) and HEAD (b177dc7).

HEAD has 1 upload less than BASE

| Flag | BASE (1859c17) | HEAD (b177dc7) |
|------|----------------|----------------|
|      | 3              | 2              |

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #2914       +/-   ##
=============================================
- Coverage     70.71%   37.76%   -32.95%     
+ Complexity    10647     5466     -5181     
=============================================
  Files           881      879        -2     
  Lines         42947    43004       +57     
  Branches       6501     6512       +11     
=============================================
- Hits          30368    16240    -14128     
- Misses         9651    24501    +14850     
+ Partials       2928     2263      -665     


@jasonjkeller jasonjkeller moved this from Triage to Needs Review in Java Engineering Board Jun 2, 2026
@shashank-reddy-nr shashank-reddy-nr marked this pull request as ready for review June 15, 2026 17:08
Records per-cluster Kafka metrics (MessageBroker/Kafka/Cluster/{clusterId}/Topic/{topic}/Produce
and MessageBroker/Kafka/Cluster/{clusterId}/Topic/{topic}/Consume) to let customers track
throughput broken out by Kafka cluster, not just by topic. The cluster ID is fetched
once per unique broker set via a background AdminClient call; the hot produce/consume path
has no additional overhead. The feature is always-on and best-effort — it does not inject
anything into Kafka wire headers and does not add span or custom attributes.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
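
The "once per unique broker set, in the background" behaviour described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: ClusterIdCacheSketch, normalize, lookupAsync and clusterIdOrNull are invented names, and the lookup function is a stand-in for the real AdminClient call (in Kafka that would be describeCluster().clusterId()), so the example runs without a broker.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Sketch: one background cluster-ID lookup per unique broker set. */
final class ClusterIdCacheSketch {

    private final ConcurrentMap<String, CompletableFuture<String>> byBrokerSet =
            new ConcurrentHashMap<>();

    // Stand-in for the AdminClient describe-cluster call (assumption for this sketch).
    private final Function<String, String> lookup;

    ClusterIdCacheSketch(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Builds an order-insensitive key so "b,a" and "a, b" share one lookup. */
    static String normalize(String bootstrapServers) {
        String[] hosts = bootstrapServers.split(",");
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = hosts[i].trim();
        }
        Arrays.sort(hosts);
        return String.join(",", hosts);
    }

    /** Starts the lookup on a background thread the first time a broker set is seen. */
    CompletableFuture<String> lookupAsync(String bootstrapServers) {
        return byBrokerSet.computeIfAbsent(normalize(bootstrapServers),
                key -> CompletableFuture.supplyAsync(() -> lookup.apply(key)));
    }

    /** Hot-path accessor: never blocks; returns null while the lookup is pending or failed. */
    String clusterIdOrNull(String bootstrapServers) {
        CompletableFuture<String> pending = lookupAsync(bootstrapServers);
        boolean ok = pending.isDone() && !pending.isCompletedExceptionally();
        return ok ? pending.getNow(null) : null;
    }

    public static void main(String[] args) {
        ClusterIdCacheSketch cache = new ClusterIdCacheSketch(brokers -> "cluster-for-" + brokers);
        cache.lookupAsync("b:9092, a:9092").join();
        System.out.println(cache.clusterIdOrNull("a:9092,b:9092"));
    }
}
```

The produce/consume path would only call the non-blocking accessor, so a slow or failing lookup costs at most a missing cluster-scoped metric.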
@shashank-reddy-nr shashank-reddy-nr marked this pull request as draft June 22, 2026 20:25
Set Enabled: true on both kafka-clients-config-1.1.0 and
kafka-clients-spans-0.11.0.0 weave JARs so the instrumentation
activates automatically without requiring newrelic.yml configuration.
The metrics are best-effort and carry no overhead when AdminClient
connections fail.

Assisted-by: Claude Sonnet 4.6
…ics modules

Previously the cluster ID was fetched once (nrClusterId == null) and never
re-fetched if the cluster restarted or the metadata refreshed. This adds
CLUSTER_ID_TTL_MS (30 min) and nrClusterIdFetchedAt to all four
kafka-clients-cluster-metrics modules so the ID is periodically re-validated.

Also fixes Java 8 source-level compatibility in 0.11.0.0 and 2.0.0:
- getOrDefault → explicit get + null check (no default methods on old runtimes)
- HashMap<> diamond → HashMap<String, Integer> explicit type args

Assisted-by: Claude Sonnet 4.6
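
The two source-level rewrites named in this commit message can be illustrated with a small sketch. What the maps actually hold in the PR is not stated, so the per-topic counter here is an assumption, as are the names ExplicitMapAccessSketch, countFor and tally.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: the two rewrites, shown on a hypothetical per-topic counter. */
final class ExplicitMapAccessSketch {

    /** Equivalent of counts.getOrDefault(topic, 0), written as get + null check. */
    static int countFor(Map<String, Integer> counts, String topic) {
        Integer value = counts.get(topic);
        return value == null ? 0 : value.intValue();
    }

    static Map<String, Integer> tally(String... topics) {
        // Explicit type arguments instead of the diamond form new HashMap<>().
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String topic : topics) {
            counts.put(topic, Integer.valueOf(countFor(counts, topic) + 1));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tally("orders", "payments", "orders"));
    }
}
```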
…ka modules

- Add missing KafkaProducer_Instrumentation and ClusterIdHelper to the 4.0.0
  module so produce metrics fire for Kafka 4.x (was silently absent)
- Add poll(long timeoutMs) overload to LegacyKafkaConsumer_Instrumentation
  (3.7.0) so legacy-timeout callers also record consume metrics
- Remove the `nrClusterId == null ||` short-circuit from the TTL guard in all
  eight producer/consumer files across modules 0.11, 2.0, 3.7, and 4.0;
  the null check bypassed TTL and triggered reflection on every message until
  the cluster ID was first obtained — TTL alone is sufficient because
  nrClusterIdFetchedAt == 0 on first call, always exceeding the 1-hour window

Assisted-by: Claude Sonnet 4.6
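
A TTL-only guard of the kind this commit message describes might be sketched as follows. The field names nrClusterId, nrClusterIdFetchedAt and the constant CLUSTER_ID_TTL_MS come from the commit messages; everything else is an assumption for illustration, including the 30-minute value (one message says 30 min, another refers to a 1-hour window), the Supplier-based resolver, and the choice to keep the previous ID when a refresh returns null.

```java
import java.util.function.Supplier;

/** Sketch: re-resolve the cluster ID only when the TTL has elapsed. */
final class ClusterIdTtlSketch {

    // Assumed value; the commit messages mention both 30 minutes and a 1-hour window.
    static final long CLUSTER_ID_TTL_MS = 30L * 60L * 1000L;

    private volatile String nrClusterId;
    private volatile long nrClusterIdFetchedAt; // 0 until the first attempt

    /**
     * No "nrClusterId == null ||" short-circuit: the timestamp starts at 0, so the
     * first call always passes the TTL check, and a failed attempt is not retried
     * on every message.
     */
    String clusterId(long nowMs, Supplier<String> resolver) {
        if (nowMs - nrClusterIdFetchedAt >= CLUSTER_ID_TTL_MS) {
            // Stamp before resolving so a null result also waits out the TTL.
            nrClusterIdFetchedAt = nowMs;
            String fresh = resolver.get();
            if (fresh != null) {
                nrClusterId = fresh; // keep the last known ID if the refresh fails (assumption)
            }
        }
        return nrClusterId;
    }

    public static void main(String[] args) {
        ClusterIdTtlSketch sketch = new ClusterIdTtlSketch();
        System.out.println(sketch.clusterId(System.currentTimeMillis(), () -> "abc-123"));
    }
}
```

The trade-off is that a client whose first lookup fails records only topic-level metrics until the next TTL window, in exchange for keeping reflection off the per-message path.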
Add kafka-clients-cluster-metrics: enabled: false to the class_transformer
section so the feature is opt-in. Users can enable it when the AdminClient
DescribeCluster overhead is acceptable.

Assisted-by: Claude Sonnet 4.6