Bug 0000: Use server-side apply for observedConfig to prevent bootstrap livelock #1643

Open
redhat-chai-bot wants to merge 2 commits into openshift:main from redhat-chai-bot:fix-ssa-observed-config

redhat-chai-bot commented Jun 29, 2026

Summary

During OCP 5.0 bootstrap, the ConfigObserver in the etcd operator cannot write cipher suite configuration to .spec.observedConfig because of persistent HTTP 409 Conflict errors. This prevents etcd static pods from ever being created, causing a bootstrap timeout (~28% failure rate across 20+ GCP jobs).

Root Cause

The library-go ConfigObserver uses UpdateSpec() — a read-modify-write pattern with optimistic concurrency (resourceVersion check). During bootstrap, 10+ controllers simultaneously update the same Etcd CR's .status section, causing rapid resourceVersion churn. Every ConfigObserver write attempt fails with 409 Conflict because another controller has already bumped the resourceVersion between the read and write.
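
The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the library-go code: `store`, `object`, and `writeObservedConfig` are hypothetical names, and the in-memory `Update` only imitates the API server's resourceVersion check. The `interleave` callback models another controller writing `.status` between the observer's read and its write.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for an HTTP 409 Conflict from the API server.
var ErrConflict = errors.New("409 Conflict: resourceVersion mismatch")

// object is a toy stand-in for the Etcd CR: one spec field, one status field.
type object struct {
	ResourceVersion int
	ObservedConfig  string
	Status          string
}

// store is a toy, single-goroutine stand-in for the API server (assumption).
type store struct{ obj object }

// Get returns a copy of the current object, like a GET request.
func (s *store) Get() object { return s.obj }

// Update replaces the whole object, but only if the caller's
// resourceVersion still matches the stored one.
func (s *store) Update(o object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// writeObservedConfig performs one read-modify-write attempt. interleave
// runs between the read and the write to model other controllers' updates.
func writeObservedConfig(s *store, cfg string, interleave func()) error {
	o := s.Get()
	if interleave != nil {
		interleave()
	}
	o.ObservedConfig = cfg
	return s.Update(o)
}

func main() {
	s := &store{}
	// A status writer that bumps resourceVersion while the observer is mid-flight.
	statusWriter := func() {
		o := s.Get()
		o.Status = "Degraded"
		_ = s.Update(o)
	}
	fmt.Println(writeObservedConfig(s, "ciphers", statusWriter)) // fails with ErrConflict
	fmt.Println(writeObservedConfig(s, "ciphers", nil))          // succeeds when nothing interleaves
}
```

In this toy model a single interleaved status write is enough to reject the spec write, which is the pattern the description attributes to the 10+ controllers active during bootstrap.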

This creates a vicious cycle:

  1. ConfigObserver fails to write cipher suites → observedConfig stays empty
  2. EnvVarController can't find cipher suites → goes Degraded
  3. Degraded status update bumps resourceVersion → feeds the conflict storm
  4. Livelock persists for the entire 59-minute bootstrap window
  5. Zero etcd static pods are created → everything downstream fails

Fix

Replace the library-go configobserver.NewConfigObserver() with a custom ssaConfigObserver that uses Server-Side Apply (SSA) to write .spec.observedConfig. SSA sidesteps these resourceVersion conflicts because an apply request does not need to carry a matching resourceVersion; the API server merges changes by field ownership, so a conflict arises only if another field manager owns the same field.

Key changes in pkg/operator/configobservation/configobservercontroller/observe_config_controller.go:

  • New ssaConfigObserver struct that implements the same observer logic (randomized merge, determinism check) but writes via ApplyOperatorSpec() with a dedicated field manager "cluster-etcd-operator-config-observer"
  • Status condition updates also use SSA via ApplyOperatorStatus()
  • No changes to observer functions themselves — ObserveTLSSecurityProfile and ObserveControlPlaneReplicas are unchanged
  • No library-go changes required
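
To illustrate the ownership idea behind the fix, here is a toy model of field-level apply. It is a sketch under stated assumptions, not the Kubernetes implementation and not the code in this PR: `applyStore` and its `Apply` method are hypothetical, fields are flattened to string paths, and shared ownership of a field is not modelled. The manager name "cluster-etcd-operator-config-observer" comes from the description above; "status-controller" and "other-manager" are made up for the example.

```go
package main

import "fmt"

// applyStore is a toy model of server-side apply: it tracks which field
// manager owns each field and merges applies field by field.
type applyStore struct {
	fields map[string]string // field path -> value
	owners map[string]string // field path -> owning field manager
}

func newApplyStore() *applyStore {
	return &applyStore{fields: map[string]string{}, owners: map[string]string{}}
}

// Apply merges patch on behalf of manager. No resourceVersion is involved:
// a conflict occurs only when a different manager owns one of the fields
// with a different value and force is false. On conflict nothing is written.
func (s *applyStore) Apply(manager string, patch map[string]string, force bool) error {
	for path, value := range patch {
		owner, owned := s.owners[path]
		if owned && owner != manager && s.fields[path] != value && !force {
			return fmt.Errorf("field %q is owned by %q", path, owner)
		}
	}
	for path, value := range patch {
		s.fields[path] = value
		s.owners[path] = manager
	}
	return nil
}

func main() {
	s := newApplyStore()
	// Two managers write disjoint fields, so neither blocks the other.
	fmt.Println(s.Apply("status-controller",
		map[string]string{".status.conditions": "Degraded"}, false))
	fmt.Println(s.Apply("cluster-etcd-operator-config-observer",
		map[string]string{".spec.observedConfig": "ciphers"}, false))
	// A different manager changing an owned field is rejected unless forced.
	fmt.Println(s.Apply("other-manager",
		map[string]string{".spec.observedConfig": "other"}, false))
}
```

The point of the sketch is that writers touching disjoint fields never collide, however often the object changes in between; only a second writer to the same field can still conflict.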

Testing

  • go build ./... — passes
  • make verify (gofmt + go vet) — passes
  • go test ./pkg/... — all tests pass (except pre-existing pkg/cmd/monitor failure unrelated to this change)

Summary by CodeRabbit

  • New Features

    • Improved operator config observation for more reliable merging of observed settings.
    • Observed config updates now apply via server-side mechanisms and reflect changes consistently.
  • Bug Fixes

    • Added determinism checks to prevent inconsistent merged configurations.
    • Enhanced error handling during observation and update flows, including clearer status updates and emitted success/warning events.
  • Chores

    • Updated a Go dependency in the build configuration.

During OCP 5.0 bootstrap, 10+ controllers simultaneously update the
etcd operator CR's .status section, causing rapid resourceVersion churn.
The ConfigObserver's standard read-modify-write (optimistic concurrency)
pattern fails repeatedly with HTTP 409 Conflict because the
resourceVersion is bumped by other controllers' status updates before
the spec update can complete. This livelock persists for the entire
59-minute bootstrap window, preventing the observedConfig (cipher
suites, TLS version, control plane replicas) from ever being written,
which blocks etcd static pod creation.

This commit replaces the library-go ConfigObserver's UpdateSpec-based
write path with a custom sync loop that uses Server-Side Apply (SSA)
via ApplyOperatorSpec. SSA eliminates optimistic concurrency conflicts
because the API server handles field-level ownership merging rather
than requiring a matching resourceVersion. Each controller owns only
the fields it manages (the config observer owns .spec.observedConfig,
other controllers own .status fields), so concurrent updates no longer
interfere with each other.

The fix is entirely within the etcd operator's codebase — no changes
to library-go are required. The custom controller replicates the
observer execution, merging, and determinism-checking logic from the
library-go ConfigObserver, but writes the result using
operatorClient.ApplyOperatorSpec with a dedicated field manager
("cluster-etcd-operator-config-observer").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Jun 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 52f40315-726e-4ccb-b94e-302c47f4d1ba

📥 Commits

Reviewing files that changed from the base of the PR and between ef5f641 and fcad14f.

📒 Files selected for processing (1)
  • go.mod

Walkthrough

Replaces the library-go config observer wiring with an in-file SSA-based observer, adds randomized observer execution and merge determinism checks, and applies observedConfig and degraded status via Server-Side Apply.

Changes

SSA config observer controller

| Layer / File(s) | Summary |
| --- | --- |
| Controller wiring and dependency setup (pkg/operator/configobservation/configobservercontroller/observe_config_controller.go, go.mod) | Builds explicit listers and observer lists, constructs ssaConfigObserver, rewires the controller to call ssaObserver.sync, and adds the mergo dependency used by the merge path. |
| Sync, apply, and informer adapter (pkg/operator/configobservation/configobservercontroller/observe_config_controller.go) | Reads operator state, runs observers in random order, merges and compares observed configs for determinism, updates degraded status via SSA, applies observedConfig via SSA, and adapts lister sync checks into controller informers. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: switching observedConfig writes to server-side apply to avoid bootstrap livelock. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test titles were added or modified; the changed controller file contains no It/Describe/Context/When calls. |
| Test Structure And Quality | ✅ Passed | No Ginkgo test files were modified in this PR, so the test-structure check is not applicable. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; the PR only changes observe_config_controller.go and go.mod, with no It/Describe/Context/When additions. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; the PR changes only controller code and go.mod, so SNO-specific test compatibility is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PASS: The PR only refactors config observation to SSA and updates go.mod; no new affinity, nodeSelector, spread, PDB, or replica-scheduling logic was added. |
| Ote Binary Stdout Contract | ✅ Passed | Touched files don’t change OTE entrypoint/setup; the new controller code logs only in controller runtime, and test/e2e stdout writes stay inside It bodies. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PASS: The PR only changes controller logic/go.mod; the touched Go file has no Ginkgo specs or IPv4/external-network assumptions. |
| No-Weak-Crypto | ✅ Passed | The PR’s changed controller file adds no weak crypto imports or algorithms; searches found no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB or secret/token comparisons. |
| Container-Privileges | ✅ Passed | PR only changes a Go controller and go.mod; no manifest fields like privileged/hostPID/allowPrivilegeEscalation are present. |
| No-Sensitive-Data-In-Logs | ✅ Passed | New log/event messages only emit TLS profile values and replica counts; no passwords, tokens, PII, hostnames, or customer data appear. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions



openshift-ci Bot commented Jun 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ingvagabund for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@pkg/operator/configobservation/configobservercontroller/observe_config_controller.go`:
- Around line 190-195: The non-determinism check in the config observation flow
is correct, but `applyObservedConfig` still applies `mergedObservedConfig`,
which can be a random winner from `rand.Perm`. Update `observeConfigController`
so that when `equality.Semantic.DeepEqual` detects conflicting observer output,
the controller does not write that random merged result into
`.spec.observedConfig`; instead, skip the apply or preserve the last stable
observed config, and keep the degraded error path intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 405950cb-2c79-40eb-b7c4-187735a8b887

📥 Commits

Reviewing files that changed from the base of the PR and between 2b9daad and ef5f641.

📒 Files selected for processing (1)
  • pkg/operator/configobservation/configobservercontroller/observe_config_controller.go

Comment on lines +190 to +195
	if !equality.Semantic.DeepEqual(mergedObservedConfig, reverseMergedObservedConfig) {
		errs = append(errs, errors.New("non-deterministic config observation detected"))
	}

	if err := c.applyObservedConfig(ctx, syncCtx, existingConfig, mergedObservedConfig); err != nil {
		errs = []error{err}

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not apply a random merge result after detecting non-determinism.

Line 190 detects conflicting observer output, but Line 194 still applies mergedObservedConfig, whose winning values depend on rand.Perm. That can make .spec.observedConfig flap across syncs even while reporting degraded.

Suggested fix
+	nonDeterministic := false
 	if !equality.Semantic.DeepEqual(mergedObservedConfig, reverseMergedObservedConfig) {
+		nonDeterministic = true
 		errs = append(errs, errors.New("non-deterministic config observation detected"))
 	}
 
-	if err := c.applyObservedConfig(ctx, syncCtx, existingConfig, mergedObservedConfig); err != nil {
+	if !nonDeterministic {
+		if err := c.applyObservedConfig(ctx, syncCtx, existingConfig, mergedObservedConfig); err != nil {
-		errs = []error{err}
+			errs = []error{err}
+		}
 	}
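
The behaviour this suggestion asks for can be sketched in isolation. This is a simplified illustration, not the controller code: `observer`, `mergeInOrder`, and `observe` are hypothetical names, a plain map overwrite stands in for the mergo-based merge, `reflect.DeepEqual` stands in for `equality.Semantic.DeepEqual`, and the merge order is passed in explicitly (the PR is described as drawing it from `rand.Perm`) so the example stays reproducible.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// observer is a hypothetical stand-in for a config observer function.
type observer func() map[string]string

// mergeInOrder merges observer outputs in the given order; on a key
// collision the later observer overwrites the earlier one.
func mergeInOrder(observers []observer, order []int) map[string]string {
	merged := map[string]string{}
	for _, i := range order {
		for k, v := range observers[i]() {
			merged[k] = v
		}
	}
	return merged
}

// observe merges once in the given order and once in the reverse order.
// If the two results differ, the observers disagree on some key, so no
// config is returned and the caller has nothing to apply.
func observe(observers []observer, order []int) (map[string]string, error) {
	reversed := make([]int, len(order))
	for i, idx := range order {
		reversed[len(order)-1-i] = idx
	}
	forward := mergeInOrder(observers, order)
	backward := mergeInOrder(observers, reversed)
	if !reflect.DeepEqual(forward, backward) {
		return nil, errors.New("non-deterministic config observation detected")
	}
	return forward, nil
}

func main() {
	tlsProfile := func() map[string]string { return map[string]string{"minTLSVersion": "VersionTLS12"} }
	replicas := func() map[string]string { return map[string]string{"controlPlaneReplicas": "3"} }
	rogue := func() map[string]string { return map[string]string{"minTLSVersion": "VersionTLS13"} }

	// Disjoint keys: both orders agree, so the merged config is returned.
	cfg, err := observe([]observer{tlsProfile, replicas}, []int{0, 1})
	fmt.Println(cfg, err)

	// Conflicting keys: the orders disagree, so nothing is returned to apply.
	cfg, err = observe([]observer{tlsProfile, rogue}, []int{0, 1})
	fmt.Println(cfg, err)
}
```

Returning no config on disagreement leaves the previously applied value in place, which matches the "skip the apply" option in the review comment.
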

@redhat-chai-bot can you fix this?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci Bot commented Jun 30, 2026

@redhat-chai-bot: all tests passed!

@tjungblu

@redhat-chai-bot does this really cause bootstrap timeouts? how many conflicts do we run into while updating the status?
