
OCPBUGS-90708: Fix NTP-failover dual-publisher for CLOCK_REALTIME #710

Merged
vitus133 merged 1 commit into redhat-cne:main from vitus133:fix-ntp-dual-publisher-clock-realtime
Jun 29, 2026

Conversation

@vitus133

Member

In NTP-failover profiles (both chronyd and phc2sys configured), a phc2sys process-down was emitting a spurious OsClockSyncStateChange FREERUN event for CLOCK_REALTIME. This conflicted with chronyd, which is the sole E3 authority in NTP mode, and caused:

  • Double CLOCK_REALTIME values in os-clock-sync-state events
  • A stale openshift_ptp_clock_state{process="phc2sys"} metric

Fix: skip the OsClockSyncStateChange emission in processDownEvent when ChronydEnabled() is true for the profile. Also update the Prometheus gauge on the non-chronyd path to prevent stale metrics.
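
For readers who want to see the shape of the guard, here is a minimal, self-contained sketch. It is not the code from metrics.go: `profileOpts`, `recorder`, and `processDown` are hypothetical stand-ins, and only the two predicates named in this PR (`Phc2SysEnabled()` and `ChronydEnabled()`) are modeled. The gauge is reduced to a map so the example runs with the standard library alone.

```go
package main

import "fmt"

// SyncState is a hypothetical stand-in for the repository's sync-state type.
type SyncState string

const FREERUN SyncState = "FREERUN"

// profileOpts is a hypothetical stand-in for the per-profile options object.
// Only the two predicates named in the PR description are modeled.
type profileOpts struct {
	phc2sys bool
	chronyd bool
}

func (o profileOpts) Phc2SysEnabled() bool { return o.phc2sys }
func (o profileOpts) ChronydEnabled() bool { return o.chronyd }

// recorder captures what the phc2sys process-down path would publish.
type recorder struct {
	events []string             // emitted os-clock-sync-state events
	gauge  map[string]SyncState // stand-in for the clock-state gauge, keyed by process
}

// processDown sketches the guarded branch: the phc2sys path publishes
// CLOCK_REALTIME FREERUN only when chronyd is not configured for the profile.
func processDown(opts profileOpts, r *recorder) {
	if opts.Phc2SysEnabled() && !opts.ChronydEnabled() {
		// Refresh the gauge first so the metric cannot go stale,
		// then emit the state-change event.
		r.gauge["phc2sys"] = FREERUN
		r.events = append(r.events, "CLOCK_REALTIME="+string(FREERUN))
	}
}

func main() {
	for _, opts := range []profileOpts{
		{phc2sys: true, chronyd: false}, // plain PTP profile
		{phc2sys: true, chronyd: true},  // NTP-failover profile
	} {
		r := &recorder{gauge: map[string]SyncState{}}
		processDown(opts, r)
		fmt.Printf("chronyd=%v events=%d gauge=%q\n",
			opts.chronyd, len(r.events), r.gauge["phc2sys"])
	}
}
```

In the NTP-failover case the recorder stays empty, which matches the intent that chronyd alone publishes CLOCK_REALTIME state in that mode.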

@openshift-ci-robot


@vitus133: This pull request references Jira Issue OCPBUGS-90708, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@coderabbitai

coderabbitai Bot commented Jun 28, 2026



ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: afe75927-1f34-4548-bb33-9ba372a57fb3

📥 Commits

Reviewing files that changed from the base of the PR and between b46ae31 and 8a79a22.

📒 Files selected for processing (2)
  • plugins/ptp_operator/metrics/metrics.go
  • plugins/ptp_operator/metrics/ntp_test.go
📝 Walkthrough


processDownEvent in metrics.go adds a !opts.ChronydEnabled() guard and an UpdateSyncStateMetrics call for the ClockRealTime FREERUN path. A new ntp_test.go file adds six tests covering phc2sys-down behavior with and without chronyd, chronyd source selection, node sync-state, end-to-end NTP failover, and GNSS holdover.

Changes

NTP Failover Guard and Tests

| Layer / File(s) | Summary |
|---|---|
| processDownEvent chronyd guard (plugins/ptp_operator/metrics/metrics.go) | Adds !opts.ChronydEnabled() condition to the existing opts.Phc2SysEnabled() check and adds UpdateSyncStateMetrics(phc2sysProcessName, ClockRealTime, ptp.FREERUN) call in the same block. |
| NTP failover test suite (plugins/ptp_operator/metrics/ntp_test.go) | Adds six test functions covering: phc2sys-down with chronyd enabled (no FREERUN event), phc2sys-down without chronyd (FREERUN event emitted), chronyd "Selected source" locking CLOCK_REALTIME, node sync-state with master FREERUN, end-to-end NTP failover sequence, and GNSS holdover resolving to HOLDOVER. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the NTP-failover CLOCK_REALTIME fix, which matches the main change. |
| Description check | ✅ Passed | The description matches the change by explaining the chronyd/phc2sys conflict and the metric update fix. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
plugins/ptp_operator/metrics/ntp_test.go (1)

112-123: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the Prometheus gauge on this path too.

This test covers the only branch that now calls UpdateSyncStateMetrics(..., CLOCK_REALTIME, FREERUN), but it never verifies the gauge value. Without that assertion, the stale-metric regression this PR fixes can return unnoticed.
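
The kind of assertion being asked for can be sketched as follows. This is an illustration under stated assumptions, not the repository's test code: `fakeGaugeVec`, `gaugeKey`, `checkGauge`, the label names, and the numeric state encoding are all hypothetical, chosen so the example runs with the standard library only.

```go
package main

import "fmt"

// gaugeKey models the two labels relevant here; the label names are assumptions.
type gaugeKey struct {
	process string
	node    string
}

// fakeGaugeVec is a hypothetical in-memory stand-in for a labeled gauge.
type fakeGaugeVec struct {
	values map[gaugeKey]float64
}

// Set stores the latest sample for a label set.
func (g *fakeGaugeVec) Set(k gaugeKey, v float64) { g.values[k] = v }

// Value returns the stored sample and whether one exists at all.
func (g *fakeGaugeVec) Value(k gaugeKey) (float64, bool) {
	v, ok := g.values[k]
	return v, ok
}

// Hypothetical numeric encoding of sync states; the real mapping may differ.
const (
	stateFreerun float64 = 0
	stateLocked  float64 = 1
)

// checkGauge reports an error when the sample is missing or has the wrong value.
// A missing sample is treated as a failure so that a never-written gauge
// cannot pass a check whose expected value happens to be zero.
func checkGauge(g *fakeGaugeVec, k gaugeKey, want float64) error {
	got, ok := g.Value(k)
	if !ok {
		return fmt.Errorf("no sample recorded for %+v", k)
	}
	if got != want {
		return fmt.Errorf("gauge for %+v = %v, want %v", k, got, want)
	}
	return nil
}

func main() {
	g := &fakeGaugeVec{values: map[gaugeKey]float64{}}
	key := gaugeKey{process: "phc2sys", node: "CLOCK_REALTIME"}

	g.Set(key, stateLocked)  // state before phc2sys goes down
	g.Set(key, stateFreerun) // what the process-down path is expected to write

	if err := checkGauge(g, key, stateFreerun); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("gauge reflects FREERUN")
}
```

With the real Prometheus Go client, `testutil.ToFloat64` is one way to read a single gauge value inside a test; whether the existing mock metrics object exposes a collector that can be passed to it is an assumption to verify against the code.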

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/metrics/ntp_test.go` around lines 112 - 123, This test
path already checks the sync state and emitted event, but it does not verify the
Prometheus gauge updated by UpdateSyncStateMetrics for CLOCK_REALTIME. Add an
assertion in ntp_test.go alongside the existing
ptpStats[metrics.ClockRealTime].LastSyncState check to confirm the gauge
reflects FREERUN on the phc2sys process-down path when chronyd is not enabled,
using the existing mock metrics object and CLOCK_REALTIME symbol.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/ptp_operator/metrics/ntp_test.go`:
- Around line 24-27: The test in ntp_test.go is using repeated string literals
that trigger the linter, so move the shared values into file-level constants.
Introduce constants for the profile name, config name, and the repeated
chrony/PTP argument strings, then replace the inline literals in the relevant
test setup and assertions. Update the ntp test helpers around initPubSubTypes
and metrics.NewPTPEventManager to use those constants consistently so the
repeated-literal lint warning is removed.
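
The refactor described in the inline comment above might look like the sketch below. The constant names and their values are hypothetical placeholders, as are `profile` and `newTestProfile`; the actual literals in ntp_test.go are not shown in this conversation, so nothing here should be read as the file's real contents.

```go
package main

import "fmt"

// File-level constants replace string literals that were repeated across
// test setup and assertions. Names and values are hypothetical placeholders.
const (
	testProfileName = "example-profile"
	testConfigName  = "example.config"
)

// profile is a hypothetical stand-in for whatever the test helpers build.
type profile struct {
	name   string
	config string
}

// newTestProfile builds the fixture from the shared constants, so setup and
// assertions cannot drift apart through a typo in one copy of a literal.
func newTestProfile() profile {
	return profile{name: testProfileName, config: testConfigName}
}

func main() {
	p := newTestProfile()
	// Assertions compare against the same constants used during setup.
	fmt.Println(p.name == testProfileName, p.config == testConfigName)
}
```

Beyond silencing the lint warning, a single definition per value means a later rename touches one line instead of every test.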


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 6cd242d2-0b47-44dc-b989-71326c3c4c95

📥 Commits

Reviewing files that changed from the base of the PR and between c33fcc3 and b46ae31.

📒 Files selected for processing (2)
  • plugins/ptp_operator/metrics/metrics.go
  • plugins/ptp_operator/metrics/ntp_test.go

Comment thread plugins/ptp_operator/metrics/ntp_test.go Outdated
In NTP-failover profiles (chronyd + phc2sys configured), phc2sys
process-down was emitting a spurious OsClockSyncStateChange FREERUN
event for CLOCK_REALTIME, conflicting with chronyd which is the sole
E3 authority in NTP mode. This caused:
- Double CLOCK_REALTIME values in os-clock-sync-state events
- Stale openshift_ptp_clock_state{process="phc2sys"} metric

Fix: skip the OsClockSyncStateChange emission in processDownEvent when
ChronydEnabled() is true for the profile. Also update the Prometheus
gauge on the non-chronyd path to prevent stale metrics.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vitus133 vitus133 force-pushed the fix-ntp-dual-publisher-clock-realtime branch from b46ae31 to 8a79a22 Compare June 28, 2026 08:13
@vitus133

Member Author

/jira refresh

@openshift-ci-robot


@vitus133: This pull request references Jira Issue OCPBUGS-90708, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @klaskosk


@openshift-ci

openshift-ci Bot commented Jun 28, 2026

Contributor

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: klaskosk.

Note that only redhat-cne members and repo collaborators can review this PR, and authors cannot review their own PRs.


@vitus133

Member Author

/cherrypick release-4.22

@openshift-cherrypick-robot


@vitus133: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.


@nocturnalastro nocturnalastro left a comment


/lgtm

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nocturnalastro, vitus133

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [nocturnalastro,vitus133]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vitus133 vitus133 merged commit 3abfadd into redhat-cne:main Jun 29, 2026
13 of 14 checks passed
@openshift-ci-robot


@vitus133: Jira Issue OCPBUGS-90708: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-90708 has been moved to the MODIFIED state.


@openshift-cherrypick-robot


@vitus133: new pull request created: #711

