Skip to content

[shell-operator] fix: retry webhook manager initialization and submit calls#899

Merged
ldmonster merged 4 commits into
mainfrom
fix/webhook-init-retry
May 28, 2026
Merged

[shell-operator] fix: retry webhook manager initialization and submit calls#899
ldmonster merged 4 commits into
mainfrom
fix/webhook-init-retry

Conversation

@fuldaxxx
Copy link
Copy Markdown
Member

@fuldaxxx fuldaxxx commented May 26, 2026

Overview

Improve webhook registration resilience by adding context-aware retry with exponential backoff for admission webhook configuration submit calls, and harden kube-events informer error logging paths.

What this PR does / why we need it

During shell-operator startup, the admission webhook manager must register ValidatingWebhookConfiguration and MutatingWebhookConfiguration via the Kubernetes API.
If the API server is briefly unavailable (rolling restart, leader election, etcd hiccup), these calls could previously fail immediately and abort startup.

This PR adds resilience at the admission resource submit layer:

  • Adds retry with exponential backoff for submit operations of:
    • ValidatingWebhookConfiguration
    • MutatingWebhookConfiguration
  • Retry settings:
    • MaxRetries: 5 (up to 6 total attempts),
    • backoff from 2s up to 15s max,
    • retry loop is context-aware and stops on context cancellation.
  • list/create/update errors are now returned (fail-fast) instead of being logged and silently ignored.

Additional hardening included in this PR:

  • WebhookManager.Start() now accepts context.Context and passes it into Register(ctx).
  • In kube_events_manager, error-path logging is made safer:
    • namespace_informer and resource_informer now use their own logger field instead of relying on MonitorConfig.Logger directly.
    • AddMonitor ensures MonitorConfig.Logger is initialized when missing.
  • Introduces reusable retry utility:
    • pkg/utils/retry/retry.go
    • with unit tests in pkg/utils/retry/retry_test.go.

Signed-off-by: Ruslan Gorbunov <ruslan.gorbunov@flant.com>
@fuldaxxx fuldaxxx self-assigned this May 26, 2026
@fuldaxxx fuldaxxx added the go Pull requests that update Go code label May 26, 2026
@fuldaxxx fuldaxxx requested review from ipaqsa and ldmonster May 26, 2026 14:38
@fuldaxxx fuldaxxx added the bug Something isn't working label May 26, 2026
fuldaxxx added 2 commits May 27, 2026 03:56
Signed-off-by: Ruslan Gorbunov <ruslan.gorbunov@flant.com>
Signed-off-by: Ruslan Gorbunov <ruslan.gorbunov@flant.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds retry-with-backoff layers around webhook manager initialization and admission webhook configuration submission so that brief Kubernetes API server unavailability during shell-operator startup does not cause a fatal exit. Introduces a new shared pkg/utils/retry helper, threads context.Context through the admission webhook registration path, and also tidies up logger propagation in the kube events manager so monitor/namespace informers don't rely on MonitorConfig.Logger being populated by callers.

Changes:

  • New context-aware retry.WithBackoff helper with tests, used in two new places: admission webhook submit (5 retries, 2–15 s) and bootstrap-level webhook manager init (8 retries, 1–15 s).
  • Admission Register/submit now take a context.Context; create/update errors are propagated (previously logged and swallowed).
  • Kube events manager: AddMonitor defaults MonitorConfig.Logger; namespaceInformer carries its own logger; resource_informer and namespace_informer log sync errors via their informer/manager logger.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/utils/retry/retry.go New retry-with-backoff helper honoring context cancellation.
pkg/utils/retry/retry_test.go Tests for success/exhaustion/context-cancel/backoff cap.
pkg/webhook/admission/resource.go Adds retry around list/create/update; propagates create/update errors; threads context.
pkg/webhook/admission/manager.go Start now accepts context and passes it to Register.
pkg/shell-operator/operator.go Passes op.ctx into AdmissionWebhookManager.Start.
pkg/shell-operator/bootstrap.go Wraps both webhook init calls with retry.WithBackoff + config constants.
pkg/kube_events_manager/kube_events_manager.go Ensures MonitorConfig.Logger is populated in AddMonitor.
pkg/kube_events_manager/monitor.go Passes m.logger into NewNamespaceInformer.
pkg/kube_events_manager/namespace_informer.go Stores informer-owned logger and uses it on sync failure.
pkg/kube_events_manager/resource_informer.go Uses ei.logger instead of Monitor.Logger on sync failure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/shell-operator/bootstrap.go Outdated
Comment thread pkg/utils/retry/retry.go Outdated
Comment thread pkg/utils/retry/retry.go Outdated
Signed-off-by: Ruslan Gorbunov <ruslan.gorbunov@flant.com>
@fuldaxxx fuldaxxx marked this pull request as ready for review May 28, 2026 14:19
@ldmonster ldmonster merged commit bf70605 into main May 28, 2026
9 checks passed
@ldmonster ldmonster deleted the fix/webhook-init-retry branch May 28, 2026 14:20
@fuldaxxx fuldaxxx restored the fix/webhook-init-retry branch May 28, 2026 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants