Skip to content

[feat][broker] PIP-441:Add broker-level metrics for non-recoverable data skips#24726

Open
codelipenghui wants to merge 2 commits into
apache:masterfrom
codelipenghui:feat/add-broker-non-recoverable-data-metrics
Open

[feat][broker] PIP-441:Add broker-level metrics for non-recoverable data skips#24726
codelipenghui wants to merge 2 commits into
apache:masterfrom
codelipenghui:feat/add-broker-non-recoverable-data-metrics

Conversation

@codelipenghui

@codelipenghui codelipenghui commented Sep 10, 2025

Copy link
Copy Markdown
Contributor

Summary

Implementation for #24716

This PR implements broker-level metrics for tracking non-recoverable data skips in Apache Pulsar, providing operational visibility for data loss events.

Key Features Added

  • NonRecoverableDataMetricsCallback Interface: New callback interface in managed-ledger module to avoid cross-module dependencies
  • ManagedLedger Integration: Updated ManagedLedgerImpl.skipNonRecoverableLedger() to trigger callback when ledgers are skipped
  • ManagedCursor Integration: Updated ManagedCursorImpl.skipNonRecoverableEntries() to count and report skipped entries
  • BrokerService Configuration: Automatic callback setup during managed ledger creation for topics
  • Dual Metrics Support: Both Prometheus and OpenTelemetry metrics implementation

New Metrics

  1. pulsar.broker.non.recoverable.ledgers.skipped.count - Tracks the number of entire ledgers skipped due to non-recoverable issues
  2. pulsar.broker.non.recoverable.entries.skipped.count - Tracks the number of individual entries skipped due to non-recoverable issues

Design Decisions

  • Callback Pattern: Uses callback interface to avoid introducing dependencies between managed-ledger and broker modules
  • Automatic Setup: Callbacks are automatically configured by BrokerService during topic creation
  • Null-Safe: All implementations handle null callback gracefully

Testing Coverage

  • 7 Unit Tests: Comprehensive testing of callback integration in NonRecoverableDataCallbackTest
  • 3 Integration Tests: End-to-end validation including real topic/subscription creation in OpenTelemetryBrokerOperabilityStatsTest

Implementation Details

The implementation uses a callback pattern where:

  1. BrokerService sets up the NonRecoverableDataMetricsCallback during managed ledger configuration
  2. ManagedLedgerImpl calls the callback when skipNonRecoverableLedger() is invoked
  3. ManagedCursorImpl calls the callback when skipNonRecoverableEntries() is invoked with actual entry counts
  4. The callback updates BrokerOperabilityMetrics which maintains both Prometheus and OpenTelemetry counters

Test plan

  • Unit tests pass (7/7 in managed-ledger module)
  • Integration tests pass (3/3 in broker module)
  • Checkstyle validation passes (0 violations)
  • End-to-end testing with real topics, subscriptions, and message positions
  • Callback integration verified through actual broker service topic creation

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

This change implements GitHub PR apache#24716 to add operational visibility for non-recoverable data loss events in Apache Pulsar brokers.

Key Changes:
- Add NonRecoverableDataMetricsCallback interface in managed-ledger module
- Integrate callback in ManagedLedgerImpl.skipNonRecoverableLedger()
- Integrate callback in ManagedCursorImpl.skipNonRecoverableEntries()
- Configure callback in BrokerService during managed ledger creation
- Add two new metrics in BrokerOperabilityMetrics:
  - pulsar.broker.non.recoverable.ledgers.skipped.count
  - pulsar.broker.non.recoverable.entries.skipped.count
- Support both Prometheus and OpenTelemetry metrics
- Comprehensive test coverage including end-to-end integration tests

The metrics are always available regardless of ManagedLedgerInterceptor configuration,
providing reliable operational monitoring for data loss scenarios.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@codelipenghui codelipenghui changed the title [feat][broker] Add broker-level metrics for non-recoverable data skips [feat][broker] PIP-441:Add broker-level metrics for non-recoverable data skips Sep 10, 2025
@codelipenghui codelipenghui self-assigned this Sep 10, 2025
@codelipenghui codelipenghui added this to the 4.2.0 milestone Sep 10, 2025
@codelipenghui codelipenghui added the type/feature The PR added a new feature or issue requested a new feature label Sep 10, 2025
@apache apache deleted a comment from github-actions Bot Sep 10, 2025
@github-actions github-actions Bot added doc-required Your PR changes impact docs and you will update later. and removed doc-label-missing labels Sep 10, 2025
@codelipenghui

Copy link
Copy Markdown
Contributor Author

/pulsarbot run-failure-checks

@lhotari lhotari left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter

codecov-commenter commented Sep 11, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.58333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.25%. Comparing base (a2b69cc) to head (ac63aae).
⚠️ Report is 442 commits behind head on master.

Files with missing lines Patch % Lines
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 33.33% 0 Missing and 2 partials ⚠️
.../pulsar/broker/stats/BrokerOperabilityMetrics.java 94.44% 2 Missing ⚠️
...rg/apache/pulsar/broker/service/BrokerService.java 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24726      +/-   ##
============================================
+ Coverage     74.17%   74.25%   +0.08%     
- Complexity    33470    33628     +158     
============================================
  Files          1895     1902       +7     
  Lines        148029   148512     +483     
  Branches      17142    17216      +74     
============================================
+ Hits         109796   110282     +486     
+ Misses        29479    29430      -49     
- Partials       8754     8800      +46     
Flag Coverage Δ
inttests 26.46% <56.25%> (+0.21%) ⬆️
systests 22.72% <47.91%> (+0.05%) ⬆️
unittests 73.78% <89.58%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...apache/bookkeeper/mledger/ManagedLedgerConfig.java 96.57% <ø> (ø)
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 81.09% <100.00%> (+0.16%) ⬆️
...rg/apache/pulsar/broker/service/BrokerService.java 83.65% <85.71%> (+0.24%) ⬆️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 78.70% <33.33%> (+0.29%) ⬆️
.../pulsar/broker/stats/BrokerOperabilityMetrics.java 93.84% <94.44%> (+0.22%) ⬆️

... and 138 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java Outdated

@Shawyeok Shawyeok left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-required Your PR changes impact docs and you will update later. ready-to-test release/4.2.3 type/feature The PR added a new feature or issue requested a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants