Adopt conmon-rs as default container monitor #2034

Open
saschagrunert wants to merge 1 commit into openshift:master from saschagrunert:conmon-rs-enhancement

@saschagrunert saschagrunert commented Jun 5, 2026

Member

Propose switching CRI-O from the traditional per-container conmon to the pod-level conmon-rs monitor in OpenShift.

  • Introduce ConmonRS feature gate in TechPreviewNoUpgrade
  • Extend ContainerRuntimeConfig with a containerMonitor field (PascalCase enum: Conmon, ConmonRS)
  • Move the conmon-rs package from the RHCOS layer to the Node Layer (OCP-aligned versioning)
  • Graduate from Tech Preview (opt-in) to GA (default) over one minor release cycle
  • Tracking: OCPNODE-1288
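
To make the opt-in versus default behaviour concrete, here is a minimal sketch of how the `containerMonitor` value could be resolved. It is modelled in Python purely for illustration; the real field would be defined in the OpenShift API types. The names `ContainerMonitor` and `effective_monitor`, the `ga` flag, and the assumption that an unset field falls back to the release default are all hypothetical, and the ConmonRS feature gate is not modelled.

```python
from enum import Enum


class ContainerMonitor(Enum):
    """The two PascalCase values named in the proposal."""

    CONMON = "Conmon"
    CONMON_RS = "ConmonRS"


def effective_monitor(requested, ga):
    """Resolve the monitor to use.

    requested: value of the containerMonitor field, or None if unset.
    ga: True once conmon-rs is the default (assumption: unset means
        "use the release default").
    """
    if requested is None:
        return ContainerMonitor.CONMON_RS if ga else ContainerMonitor.CONMON
    # Enum lookup is case-sensitive, so "conmonrs" raises ValueError.
    return ContainerMonitor(requested)
```

Under these assumptions, an unset field resolves to `Conmon` during Tech Preview and to `ConmonRS` after GA, while an explicit value always wins.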

/cc @haircommander @rphillips

@openshift-ci openshift-ci Bot requested review from haircommander and rphillips June 5, 2026 09:14
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label Jun 5, 2026
openshift-ci Bot commented Jun 5, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rvanderp3 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@saschagrunert saschagrunert force-pushed the conmon-rs-enhancement branch 9 times, most recently from 8c3987a to ca49d66 Compare June 5, 2026 09:54
@saschagrunert saschagrunert changed the title WIP: enhancement: adopt conmon-rs as default container monitor Adopt conmon-rs as default container monitor Jun 5, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 5, 2026
@saschagrunert saschagrunert force-pushed the conmon-rs-enhancement branch from ca49d66 to 672b352 Compare June 5, 2026 09:59
@saschagrunert

Member Author

@bitoku @harche @QiWang19 PTAL, too

Comment thread enhancements/machine-config/conmon-rs-default-container-monitor.md
Comment thread enhancements/machine-config/conmon-rs-default-container-monitor.md Outdated
Comment thread enhancements/machine-config/conmon-rs-default-container-monitor.md Outdated
Comment thread enhancements/machine-config/conmon-rs-default-container-monitor.md Outdated
@saschagrunert saschagrunert force-pushed the conmon-rs-enhancement branch 2 times, most recently from 5eda281 to 0b18266 Compare June 5, 2026 11:56
Comment thread enhancements/machine-config/conmon-rs-default-container-monitor.md Outdated
@saschagrunert saschagrunert force-pushed the conmon-rs-enhancement branch 6 times, most recently from bba7476 to 2c852b2 Compare June 8, 2026 12:32
Comment thread enhancements/machine-config/conmon-rs-default-container-monitor.md Outdated
@saschagrunert saschagrunert force-pushed the conmon-rs-enhancement branch from 2c852b2 to cf19ef3 Compare June 8, 2026 13:19

**Negative and failure cases:**

- Kill the conmon-rs process for a running pod and verify that the kubelet detects the pod failure and restarts it

Member

Should we also verify that the containers in the pods are shut down gracefully?

Member Author

Good catch. Added a "Graceful shutdown" test case that verifies SIGTERM delivery, preStop hook execution, and graceful exit before SIGKILL, compared against a conmon baseline. This is important since conmon-rs changes the signal forwarding path (pod-level vs. per-container).
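
For reference, the SIGTERM-then-SIGKILL sequence that this test case checks can be sketched as follows. This is only an illustration of the ordering, not the actual test: the child is a hypothetical stand-in Python process rather than a container behind conmon-rs, the helper names `spawn_child` and `stop_gracefully` are made up, and preStop hooks are not covered.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a container process: exits 0 on SIGTERM.
CHILD = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_child(code=CHILD):
    """Start the stand-in process and wait until its handler is installed."""
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # blocks until the child prints 'ready'
    return proc


def stop_gracefully(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns True if the process exited before SIGKILL was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return False
```

A comparison against the conmon baseline would run the same check under both monitors and expect the same result from each.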

Comment on lines +516 to +519
Once conmon-rs is GA and the default for at least two OCP minor releases, the traditional conmon can be deprecated:

- Announce deprecation of conmon as container monitor.
- Remove the conmon binary from the node image.

Member

What would happen to clusters using the machine config override during the upgrade to the version where the conmon binary is removed?

Also, is there a plan to remove the ability to set runtime_type = "oci" in CRI-O, or will upstream continue to support both?

Member Author

Updated the "Removing a deprecated feature" section to address both questions.

For the override case: before the conmon binary is removed, the MCO must detect MachineConfig overrides referencing runtime_type = "oci" or a conmon monitor_path and block the upgrade during preflight checks. This prevents clusters from upgrading into a state where the override references a missing binary.
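
As a rough illustration of what such a preflight check looks for, the sketch below scans CRI-O configuration text for the two settings mentioned above. It is not the MCO implementation: the function name `find_blocking_overrides` is hypothetical, and the scanner is a deliberately simplified line matcher rather than a full TOML parser.

```python
import posixpath
import re

# Matches TOML table headers such as [crio.runtime.runtimes.runc].
_SECTION = re.compile(r'^\s*\[([^\]]+)\]\s*$')
# Matches simple string assignments such as key = "value".
_ASSIGN = re.compile(r'^\s*([A-Za-z0-9_]+)\s*=\s*"([^"]*)"')


def find_blocking_overrides(config_text):
    """Return human-readable findings that should block the upgrade.

    Simplified line scanner, not a full TOML parser: it only understands
    table headers and double-quoted string values.
    """
    findings = []
    section = "<top level>"
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments (naive)
        header = _SECTION.match(line)
        if header:
            section = header.group(1).strip()
            continue
        assign = _ASSIGN.match(line)
        if not assign:
            continue
        key, value = assign.groups()
        if key == "runtime_type" and value == "oci":
            findings.append(f'{section}: runtime_type = "oci"')
        elif key == "monitor_path" and posixpath.basename(value) == "conmon":
            findings.append(f"{section}: monitor_path = {value!r}")
    return findings
```

An empty result would let the upgrade proceed; any finding names the table and setting that an administrator needs to remove first.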

For upstream CRI-O: runtime_type = "oci" will continue to be supported upstream for the foreseeable future. conmon remains the default monitor in upstream CRI-O and in other distributions (Fedora, RHEL standalone). Removing the conmon binary from the OpenShift node image is an OCP-specific decision that doesn't affect upstream.


## Support Procedures

### Detecting conmon-rs issues

Member

Do we have a way to find out why conmon-rs crashed, if that ever happens?

Member Author

Added a "Crash diagnostics" bullet to the support procedures. Three main sources of information:

  1. CRI-O logs the exit code and signal for the conmon-rs process when it exits unexpectedly.
  2. conmon-rs logs its own operational output to journald by default (LogDriver::Systemd), so journalctl -t conmonrs on the affected node shows activity leading up to the crash. Panic messages go to stderr, which CRI-O captures. conmon-rs uses panic = "abort" in release builds, so panics terminate with SIGABRT immediately.
  3. On RHCOS, systemd-coredump captures core dumps via the default kernel.core_pattern pipe. coredumpctl list / coredumpctl info on the node can be used for post-crash analysis.
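
To show how an exit code is told apart from a terminating signal, which is what makes a `panic = "abort"` crash recognizable as SIGABRT, here is a small sketch. It uses Python's `subprocess` return-code convention for illustration only; CRI-O's own logging is not reproduced here, and `describe_exit` is a hypothetical helper.

```python
import signal


def describe_exit(returncode):
    """Describe a subprocess-style return code.

    Follows the Python subprocess convention: a negative value -N means the
    child was terminated by signal N; otherwise the value is the exit code.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {returncode}"
```

With this convention, a process that aborted is reported as killed by SIGABRT, which is the cue to look for a matching entry in `coredumpctl list`.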

Address review feedback:
- Add graceful shutdown test case for signal forwarding
- Document MCO preflight check for MachineConfig overrides
  before conmon binary removal
- Add crash diagnostics to support procedures (journald,
  panic=abort behavior, coredumpctl)

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@saschagrunert saschagrunert force-pushed the conmon-rs-enhancement branch from cf19ef3 to 62ad95f Compare June 25, 2026 06:27
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Contributor

@saschagrunert: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

5 participants