Adopt conmon-rs as default container monitor#2034
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
8c3987a to
ca49d66
Compare
ca49d66 to
672b352
Compare
5eda281 to
0b18266
Compare
bba7476 to
2c852b2
Compare
2c852b2 to
cf19ef3
Compare
|
|
||
| **Negative and failure cases:** | ||
|
|
||
| - Kill the conmon-rs process for a running pod and verify that the kubelet detects the pod failure and restarts it |
There was a problem hiding this comment.
should we also verify that the containers in the pods are gracefully shutdown?
There was a problem hiding this comment.
Good catch. Added a "Graceful shutdown" test case that verifies SIGTERM delivery, preStop hook execution, and graceful exit before SIGKILL, compared against a conmon baseline. This is important since conmon-rs changes the signal forwarding path (pod-level vs. per-container).
| Once conmon-rs is GA and the default for at least two OCP minor releases, the traditional conmon can be deprecated: | ||
|
|
||
| - Announce deprecation of conmon as container monitor. | ||
| - Remove the conmon binary from the node image. |
There was a problem hiding this comment.
What would happen to clusters using the machine config override during the upgrade to the version where the conmon binary is removed?
Also, is there a plan to remove the capacity to set runtime_type = "oci" in cri-o or will upstream continue to support both?
There was a problem hiding this comment.
Updated the "Removing a deprecated feature" section to address both questions.
For the override case: before the conmon binary is removed, the MCO must detect MachineConfig overrides referencing runtime_type = "oci" or a conmon monitor_path and block the upgrade during preflight checks. This prevents clusters from upgrading into a state where the override references a missing binary.
For upstream CRI-O: runtime_type = "oci" will continue to be supported upstream for the foreseeable future. conmon remains the default monitor in upstream CRI-O and in other distributions (Fedora, RHEL standalone). Removing the conmon binary from the OpenShift node image is an OCP-specific decision that doesn't affect upstream.
|
|
||
| ## Support Procedures | ||
|
|
||
| ### Detecting conmon-rs issues |
There was a problem hiding this comment.
do we have a way to know why conmon-rs crashed if it ever happens?
There was a problem hiding this comment.
Added a "Crash diagnostics" bullet to the support procedures. Three main sources of information:
- CRI-O logs the exit code and signal for the conmon-rs process when it exits unexpectedly.
- conmon-rs logs its own operational output to journald by default (
LogDriver::Systemd), sojournalctl -t conmonrson the affected node shows activity leading up to the crash. Panic messages go to stderr, which CRI-O captures. conmon-rs usespanic = "abort"in release builds, so panics terminate with SIGABRT immediately. - On RHCOS, systemd-coredump captures core dumps via the default
kernel.core_patternpipe.coredumpctl list/coredumpctl infoon the node can be used for post-crash analysis.
Address review feedback: - Add graceful shutdown test case for signal forwarding - Document MCO preflight check for MachineConfig overrides before conmon binary removal - Add crash diagnostics to support procedures (journald, panic=abort behavior, coredumpctl) Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
cf19ef3 to
62ad95f
Compare
|
@saschagrunert: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Propose switching CRI-O from the traditional per-container conmon to the pod-level conmon-rs monitor in OpenShift.
ConmonRSfeature gate inTechPreviewNoUpgradeContainerRuntimeConfigwith acontainerMonitorfield (PascalCase enum:Conmon,ConmonRS)/cc @haircommander @rphillips