MCO: GOMAXPROCS Injection for Go Containers#2047
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
a28aa84 to
04a98c3
Compare
|
|
||
| 2. The Machine Config Operator renders the CRI-O configuration setting `min_injected_gomaxprocs = 1` | ||
|
|
||
| 3. CRI-O calculates GOMAXPROCS for each container as `max(ceil(cpu_request_in_cores) * 2, 1)` and injects it as an environment variable before the container starts |
There was a problem hiding this comment.
Could we double check the order of operations here? On master the hook doubles the CPU shares first and then takes the ceil, not ceil then double. calculateGOMAXPROCS returns max((shares*2+1023)/1024, 1), and per the function doc kubelet sets shares = cores * 1024, so it reduces to max(ceil(cores * 2), 1, floor). The unit tests confirm the doubling: 3 cores to 6, 4 cores to 8, 5 cores to 10, 8 cores to 16. So 1500m lands on 3 (ceil(1.5 * 2)), not 4.
| - Container with `requests.cpu: 1500m` → `GOMAXPROCS=4` (ceil(1.5) * 2 = 4) | ||
| - Container with `requests.cpu: 250m` → `GOMAXPROCS=2` (ceil(0.25) = 1, 1 * 2 = 2) |
There was a problem hiding this comment.
Following from the formula, would these values need updating? The unit tests give the exact answers for the floor of 1 that this enhancement uses: 250m (shares=256), floor=1 returns 1. And 1500m works out to 3 via the same doubling the tests show (3 cores to 6, so 1.5 cores to 3). So I make these 3 and 1 rather than 4 and 2. Does that match what you see?
| - Container with `requests.cpu: 1500m` → `GOMAXPROCS=4` (ceil(1.5) * 2 = 4) | ||
| - Container with `requests.cpu: 250m` → `GOMAXPROCS=2` (ceil(0.25) = 1, 1 * 2 = 2) | ||
| - Container with `requests.cpu: 100m` → `GOMAXPROCS=2` (ceil(0.1) = 1, 1 * 2 = 2) | ||
| - Container with no CPU request → No GOMAXPROCS injection |
There was a problem hiding this comment.
Small thing: best-effort pods (no CPU request) do get an injection on master, using the floor value directly rather than being skipped. The hook only returns early for pods outside burstable or besteffort (cgroup parent check), and the calculation returns the floor for shares=2 (best-effort). The config doc also states the floor is used directly for best-effort, and injecting a value like 1 is exercised in tests. Should this say they receive the floor value instead?
|
|
||
| ### Implementation Details/Notes/Constraints | ||
|
|
||
| CRI-O's GOMAXPROCS injection feature (https://github.com/cri-o/cri-o/pull/9961) sets a minimum floor value via the `min_injected_gomaxprocs` configuration option. When set to `1`, CRI-O injects GOMAXPROCS as `max(ceil(cpu_request_in_cores) * 2, 1)` for each container. |
There was a problem hiding this comment.
Could we point this at the base feature PR? #9961 is the later "run hook if pod is in workload partition" change; the min_injected_gomaxprocs option itself landed in cri-o/cri-o#9860. Also, the option reads as a floor rather than an enable flag: the hook is only wired in when the value is greater than 0, and that same value becomes the fallback floor in the calculation. Would it read more clearly to say injection runs whenever it is greater than 0, with 1 setting the floor to 1?
|
|
||
| **Behavior for edge cases:** | ||
| - Container with no CPU request: CRI-O does not inject GOMAXPROCS (Go default behavior applies) | ||
| - Container with CPU limit but no explicit request: Kubernetes defaults the request to equal the limit, so CRI-O will inject based on that implicit request value |
There was a problem hiding this comment.
This one looks inverted on master. When a container has a CPU limit (quota greater than 0) the hook skips injection entirely, the stated reason being that Go 1.25+ already derives GOMAXPROCS from the cgroup quota. Could we flip this bullet? It might also be worth noting the Go 1.25 behavior in Motivation, since it narrows the scope to burstable requests without a limit.
| **Behavior for edge cases:** | ||
| - Container with no CPU request: CRI-O does not inject GOMAXPROCS (Go default behavior applies) | ||
| - Container with CPU limit but no explicit request: Kubernetes defaults the request to equal the limit, so CRI-O will inject based on that implicit request value | ||
| - Container with CPU request < 500m: Receives `GOMAXPROCS=2` (minimum from formula: ceil(cpu < 1.0) = 1, doubled = 2) |
There was a problem hiding this comment.
With the double then ceil formula, requests below 0.5 cores resolve to 1 rather than 2. The unit tests show this directly: 250m (shares=256), floor=1 returns 1. The same holds at 500m: the calculated value is 1 before the floor applies, visible in 500m (shares=512), floor=4 returns 4 (with a floor of 1 it returns 1). Could we adjust this bullet to match?
|
|
||
| - Language detection or runtime-specific injection (CRI-O injects GOMAXPROCS for all containers; non-Go applications ignore it) | ||
| - Dynamic adjustment of GOMAXPROCS based on runtime CPU throttling or actual usage | ||
| - Per-pod or per-container override mechanisms (this is cluster-level configuration; users can override via pod spec environment variables) |
There was a problem hiding this comment.
cri-o does ship a per-pod opt out via the skip-gomaxprocs.crio.io annotation, which skips injection for that pod even when enabled globally. The hook checks it first in PreCreate. Would it be worth mentioning here so readers know a pod level escape hatch exists?
04a98c3 to
f16f0da
Compare
harche
left a comment
There was a problem hiding this comment.
Thanks for the quick turnaround, the formula and edge cases read accurately now. On a fresh pass I noticed the old "no CPU request means no injection" assumption still lingers in a couple of sections, plus a few scope and wording points. Links to the relevant CRI-O code and tests are inline. As before, happy to be corrected if I am misreading anything.
| - Deploy ContainerRuntimeConfig with injection enabled, verify GOMAXPROCS is set in containers with CPU requests | ||
| - Deploy ContainerRuntimeConfig with injection disabled, verify GOMAXPROCS is not injected | ||
| - Verify containers with explicit GOMAXPROCS environment variable use their own value (not injected) | ||
| - Verify containers without CPU requests do not receive GOMAXPROCS injection |
There was a problem hiding this comment.
This e2e check looks like it now conflicts with the corrected edge case behavior. Lines 76 and 123 say best-effort pods (no CPU request) receive the floor value, and CRI-O agrees: best-effort (shares=2) returns the floor, and injecting a value like 1 is exercised. Should this verify that no-request pods receive the floor rather than no injection?
|
|
||
| - **GOMAXPROCS not injected**: Check CPU request is set on container, verify feature gate is enabled, check CRI-O config | ||
| - **Wrong GOMAXPROCS value**: Verify CPU request matches expected value, check for explicit GOMAXPROCS in pod spec overriding injection | ||
| - **Container using default GOMAXPROCS**: Likely no CPU request set, or explicit override in pod/image |
There was a problem hiding this comment.
Same point as the test plan note: a best-effort pod with no CPU request still gets the floor injected (shares=2 returns the floor), so "no CPU request set" would not explain a container running on Go defaults. Could we drop or reword this bullet?
|
|
||
| ### Non-Goals | ||
|
|
||
| - Language detection or runtime-specific injection (CRI-O injects GOMAXPROCS for all containers; non-Go applications ignore it) |
There was a problem hiding this comment.
Small accuracy point: CRI-O does not inject into all containers. It only runs for burstable and best-effort pods (cgroup gate) that have no CPU limit (quota greater than 0 is skipped) and no pre-existing GOMAXPROCS. Guaranteed and limited pods are excluded. Could we soften "all containers" here?
|
|
||
| ### Drawbacks | ||
|
|
||
| - **Universal injection may be unexpected**: CRI-O injects GOMAXPROCS into all containers when enabled, regardless of whether they are Go applications. Non-Go containers simply ignore the environment variable, but this may confuse users inspecting container environments. |
There was a problem hiding this comment.
Same "all containers" wording as in Non-Goals. Injection is limited to burstable and best-effort pods without a CPU limit, see the cgroup gate and the limit skip. Worth aligning this line with that scope?
|
|
||
| **Behavior for edge cases:** | ||
| - Container with no CPU request (best-effort): Receives the floor value directly (`GOMAXPROCS=1`) | ||
| - Container with CPU limit but no explicit request: CRI-O **skips injection entirely**. Go 1.25+ automatically derives GOMAXPROCS from the cgroup CPU quota, making injection redundant for limited containers |
There was a problem hiding this comment.
Could we generalize this? On master any CPU limit skips injection (quota greater than 0 returns early), not only the "limit but no explicit request" case. So a normal burstable pod that sets both a request and a limit is skipped too, which is a large share of real workloads. It might also help to state explicitly that Guaranteed pods are excluded, since they hit both the limit skip and the burstable or besteffort gate. Relatedly, would it be worth adding "no CPU limit" to the examples on lines 73 to 76 so readers do not expect injection when a limit is present?
|
|
||
| **Common troubleshooting scenarios**: | ||
|
|
||
| - **GOMAXPROCS not injected**: Check CPU request is set on container, verify feature gate is enabled, check CRI-O config |
There was a problem hiding this comment.
For "GOMAXPROCS not injected", could we add the two most likely causes on master: the container has a CPU limit, or the pod is Guaranteed QoS. Both skip injection even when the feature is enabled, so a CPU request alone is not sufficient.
|
|
||
| ### Open Questions | ||
|
|
||
| None. |
There was a problem hiding this comment.
Would it be worth listing a couple here? Two come to mind.
First, how this composes with the high-performance and CPU pinning hooks. CRI-O assembles hooks into a list and can attach more than one: a pod on a high-performance runtime handler gets the HighPerformanceHooks, and separately, when min_injected_gomaxprocs > 0, it also gets the GomaxprocsHooks, with both wrapped in CompositeHooks. The CompositeHooks doc calls out the exact overlap: a burstable pod on a high-performance handler runs both. So once injection is on cluster wide, a burstable no-limit pod on that path gets GOMAXPROCS set to ceil(2 * request), which can exceed the cpuset it is actually pinned to, and the GOMAXPROCS skip is decided on the CPU quota, which the high-performance hook (running first in the chain) may itself modify. It might be fine, or the answer might be to skip injection on high-performance runtimes, but it seems worth reasoning about rather than leaving implicit.
Second, the GA default flip to Inject (line 211). With Inject as the platform default, every cluster that does not set Disabled starts injecting into every eligible burstable or best-effort workload on upgrade. A Go service that relies on the default GOMAXPROCS = node CPU count would silently switch to ceil(2 * request). The Risks section argues this is safe because an explicit GOMAXPROCS wins, but the workloads that get changed are exactly the ones that do not set it, so on by default at GA feels like a decision worth recording here rather than assuming.
| ) | ||
| ``` | ||
|
|
||
| The field is optional. When omitted, the platform chooses a reasonable default (currently `Inject`). |
There was a problem hiding this comment.
Optional and non-blocking: since the API is an Inject or Disabled enum that maps to min_injected_gomaxprocs 1 or 0, the integer floor is hidden and fixed at 1. That seems reasonable given the Alternatives section, but an enum cannot carry a value later, so if a configurable floor or multiplier is ever wanted the API would need a new field. Would a sentence on why the enum is preferred over an optional integer be worth adding?
f16f0da to
769e98c
Compare
Signed-off-by: Peter Hunt <pehunt@redhat.com>
769e98c to
f22b5ca
Compare
|
@haircommander: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
This enhancement proposes adding a
gomaxprocsInjectionBehaviorfield to the ContainerRuntimeConfig API that enables CRI-O to automatically inject the GOMAXPROCS environment variable into containers based on their CPU resource requests.Details
gomaxprocsInjectionBehavior(enum:Inject|Disabled)min_injected_gomaxprocsin CRI-O configmax(ceil(cpu_request_in_cores) * 2, 1)GomaxprocsInjection(TechPreviewNoUpgrade)Example Usage
Result:
requests.cpu: 1500m→GOMAXPROCS=4requests.cpu: 250m→GOMAXPROCS=2Related Issue
Tracking: https://redhat.atlassian.net/browse/OCPNODE-4600
Test Plan
🤖 Generated with Claude Code