
feat(helm): opt-in PodDisruptionBudget and probes for CNPG plugin sidecars#384

Draft
WentingWu666666 wants to merge 1 commit into documentdb:main from WentingWu666666:developer/wentingwu/helm-pdb-and-probes

@WentingWu666666
Collaborator

Draft PR for review/discussion only. Part of the GA-readiness chart audit (#381).

What this PR does

Two small, additive, opt-in improvements for pod availability.

PodDisruptionBudget (M9)

Adds templates/11_pdb.yaml, gated by podDisruptionBudget.enabled (default: false).

Disabled by default on purpose: the operator currently runs at replicaCount: 1. A PDB with minAvailable: 1 (or maxUnavailable: 0) on a single-replica deployment allows zero voluntary evictions, so node drains block indefinitely, which is worse than having no PDB. Users running multi-replica with leader election should set podDisruptionBudget.enabled=true. When the operator chart gets multi-replica defaults in the future, this default can be flipped.
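
For reviewers who want a concrete picture, a rendered PDB for a multi-replica install might look roughly like the sketch below. The resource name and selector label are assumptions for illustration only; the actual output of templates/11_pdb.yaml may differ, including whether it uses minAvailable or maxUnavailable.

```yaml
# Hypothetical rendering with podDisruptionBudget.enabled=true and replicaCount: 2.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: documentdb-operator          # assumed name, not taken from the chart
spec:
  minAvailable: 1                    # with 2 replicas, one pod may be evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: documentdb-operator   # assumed label
```

With replicaCount: 1 the same spec would leave zero allowed disruptions, which is the drain-blocking case described above.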

Plugin probes (M12)

The sidecar-injector and wal-replica deployments are gRPC servers on port 9090 with no probes today, so pods are marked Ready as soon as the container starts, regardless of whether the gRPC endpoint is actually serving.

Adds tcpSocket readiness + liveness probes on port 9090, gated by pluginProbes.enabled (default: true) with tunable initialDelaySeconds / periodSeconds / failureThreshold.

A TCP socket probe is used because the plugins don't expose an HTTP health endpoint. It verifies that the gRPC server has bound the port and is accepting connections. That is better than nothing, but not as strong as a real gRPC health check (deferred until the plugins implement grpc.health.v1).
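
As a rough illustration of what the rendered container spec could contain when pluginProbes.enabled=true, here is a minimal sketch. The timing values are placeholders, not the chart defaults, and the exact layout in the plugin deployment templates may differ.

```yaml
# Sketch of the probe stanza on a plugin container (values are placeholders).
readinessProbe:
  tcpSocket:
    port: 9090               # gRPC listen port of the plugin
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  tcpSocket:
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

Once the plugins implement grpc.health.v1, the tcpSocket handler could presumably be swapped for Kubernetes' built-in grpc probe handler on the same port; that change is out of scope here.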

Local verification

  • helm lint clean.
  • helm template renders PDB only when podDisruptionBudget.enabled=true.
  • helm template renders TCP probes by default; omits them with pluginProbes.enabled=false.
  • helm upgrade on kind with --set podDisruptionBudget.enabled=true creates the PDB; subsequent --set podDisruptionBudget.enabled=false removes it cleanly.
  • Sidecar-injector pod becomes Ready (TCP probe passes against the actually-running gRPC server, confirming the probe target is correct).
  • Pre-existing user-supplied values are preserved across helm upgrade (Helm default behavior; documented in PR for reviewer awareness).
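
To make the toggles above concrete, a values override for a multi-replica install might look like the sketch below. The file name, the placement of replicaCount, the nesting of the tunables directly under pluginProbes, and all numeric values are assumptions; check the chart's values.yaml for the real keys.

```yaml
# values-ha.yaml (hypothetical file name)
replicaCount: 2              # assumes the operator runs with leader election
podDisruptionBudget:
  enabled: true              # opt in; the chart default is false
pluginProbes:
  enabled: true              # chart default
  initialDelaySeconds: 5     # assumed key placement, placeholder value
  periodSeconds: 10          # placeholder value
  failureThreshold: 3        # placeholder value
```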

Tracking

PodDisruptionBudget (M9 from documentdb#381)
==================================

Add an optional PodDisruptionBudget for the operator, gated by
podDisruptionBudget.enabled (default: false). Disabled by default
because the operator currently ships with replicaCount: 1 and a PDB on
a single-replica deployment blocks node drains rather than helping
availability. Users running multi-replica with leader election should
enable it.

Plugin probes (M12 from documentdb#381)
=============================

The sidecar-injector and wal-replica deployments are gRPC servers on
port 9090 that previously had no probes, so pods were marked Ready as
soon as the container started, regardless of whether the gRPC endpoint
was actually serving. Add tcpSocket readiness + liveness probes on
port 9090, gated by pluginProbes.enabled (default: true) with tunable
initialDelaySeconds, periodSeconds, and failureThreshold.

TCP socket probe is used because the plugins do not expose an HTTP
health endpoint. The probe verifies the gRPC server is bound and
accepting connections.

Verified locally on kind:
  - helm template renders PDB only when enabled; renders probes by default.
  - helm upgrade applies PDB; second upgrade with --set ... enabled=false
    removes it as expected.
  - sidecar-injector pod becomes Ready (TCP probe passes against the
    actually-running gRPC server).
  - helm lint clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@documentdb-triage-tool (bot) added the enhancement label (New feature or request) on May 19, 2026
@documentdb-triage-tool

🤖 Auto-triaged by documentdb-triage-tool.

Applied: enhancement
Project fields suggested: Component manifests · Priority P2 · Effort M · Status In Progress
Confidence: 0.85 (mixed)

Reasoning

effort from diff stats (70+0 LOC, 4 files); LLM: Additive opt-in Helm chart improvements (PodDisruptionBudget and plugin probes) as part of GA-readiness audit, touching manifests/templates with no schema or cross-component changes.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.


Labels

enhancement (New feature or request), triage
