Skip to content

fix(valkey): harden failover safety, node/sentinel identity, and secret handling#664

Merged
mberlofa merged 5 commits into
helmforgedev:mainfrom
mreho:fix/valkey/failoverHardening
Jul 4, 2026
Merged

fix(valkey): harden failover safety, node/sentinel identity, and secret handling#664
mberlofa merged 5 commits into
helmforgedev:mainfrom
mreho:fix/valkey/failoverHardening

Conversation

@mreho

@mreho mreho commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • harden Valkey Sentinel failover safety with graceful master shutdown failover and startup guard defaults
  • stabilize node and Sentinel identity, add Sentinel probes, bound/re-resolve startup waits, and move auth out of process argv
  • document Sentinel residual data-loss conditions and add schema/tests for the new failover hardening controls

Validation

  • helm lint charts/valkey --strict
  • helm unittest charts/valkey
  • make validate-chart CHART=valkey TIMEOUT=900: FULLY VALIDATED (21 layers)
  • make site-sync-check CHART=valkey
  • make release-check REPO=charts
  • make attribution-check REPO=charts

Site PR: helmforgedev/site#365
Issue: #633

Notes

  • The branch was rebased on current main.
  • The non-conventional revert commit subject was amended to fix(valkey): revert sentinel publish-not-ready bootstrap change.
  • No merge performed.

Summary by CodeRabbit

  • New Features

    • Added configurable Sentinel pod management and failover safeguards for smoother master transitions.
    • Introduced updated startup, readiness, and liveness checks for Sentinel-aware nodes.
  • Bug Fixes

    • Improved handling of restarts during failover windows to reduce empty-master resyncs and data loss risk.
    • Added graceful shutdown behavior to trigger failover before pod termination.
  • Documentation

    • Clarified Sentinel persistence and recovery behavior, including remaining edge-case data-loss scenarios.

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Sentinel-mode Valkey chart templates move bootstrap, probe, and shutdown logic from inline StatefulSet command/args into mounted ConfigMap scripts (node-entrypoint.sh, node-prestop.sh, sentinel-entrypoint.sh). New probe helpers and Sentinel failover values/schema fields are added, tests are re-targeted to assert generated script content, and documentation is updated.

Changes

Sentinel scripted bootstrap and failover

Layer / File(s) Summary
Failover configuration and documentation
charts/valkey/values.yaml, charts/valkey/values.schema.json, charts/valkey/docs/sentinel.md
Adds sentinel.podManagementPolicy, gracefulFailover, and startupFailoverGuard values with schema validation, and expands persistence/data-loss documentation.
Probe command helpers
charts/valkey/templates/_helpers.tpl
Adds valkey.nodeProbeCommand, valkey.sentinelProbeCommand, and valkey.sentinelReadyCommand for exec-based TLS-aware probes without argv passwords.
ConfigMap entrypoint and shutdown scripts
charts/valkey/templates/configmap.yaml
Adds node-entrypoint.sh (discovery, runtime config generation), node-prestop.sh (graceful failover on shutdown), and sentinel-entrypoint.sh (config bootstrap, persistence, auth, myid).
Node StatefulSet wiring
charts/valkey/templates/sentinel-node-statefulset.yaml
Switches init/main containers to mounted scripts, adds a conditional preStop hook, VALKEYCLI_AUTH env, updated probe commands, and a scripts ConfigMap volume.
Sentinel StatefulSet wiring
charts/valkey/templates/sentinel-statefulset.yaml
Switches the sentinel container to sentinel-entrypoint.sh, adds podManagementPolicy, simplified auth env vars, exec probes, and a scripts volume mount.
Script-based test assertions
charts/valkey/tests/cluster_domain_test.yaml, charts/valkey/tests/ha_extension_test.yaml, charts/valkey/tests/sentinel_node_test.yaml
Re-targets assertions from StatefulSet container args to generated ConfigMap script content, mounted script paths, and auth env wiring.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Kubelet
  participant InitContainer as Init Container
  participant NodeEntrypoint as node-entrypoint.sh
  participant Sentinel
  participant MainContainer as Node Container

  Kubelet->>InitContainer: run node-entrypoint.sh --wait-only
  InitContainer->>Sentinel: sentinel get-master-addr-by-name
  Sentinel-->>InitContainer: master address (or none)
  InitContainer-->>Kubelet: exit when master discovered
  Kubelet->>MainContainer: run node-entrypoint.sh
  MainContainer->>Sentinel: discover role / master
  MainContainer->>MainContainer: write /data/.valkey-runtime.conf
  MainContainer->>MainContainer: exec valkey-server
Loading
sequenceDiagram
  participant Kubelet
  participant MainContainer as Node Container
  participant PreStop as node-prestop.sh
  participant Sentinel

  Kubelet->>PreStop: preStop hook (pod terminating)
  PreStop->>MainContainer: check master role and replicas
  alt is master with replicas
    PreStop->>Sentinel: request failover
    Sentinel-->>PreStop: new master elected
  end
  PreStop-->>Kubelet: allow shutdown
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main changes: failover safety, stable node/Sentinel identity, and improved secret handling.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e96adb3260

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread charts/valkey/templates/configmap.yaml
Comment thread charts/valkey/templates/configmap.yaml Outdated
@mreho mreho marked this pull request as draft July 3, 2026 17:38
@mreho mreho marked this pull request as ready for review July 4, 2026 09:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/valkey/templates/service.yaml`:
- Around line 160-164: The Sentinel service configuration is using
publishNotReadyAddresses on a ClusterIP service, which will not expose per-pod
DNS records for bootstrap. Update the service definition used by
valkey.sentinelServiceName so it is headless, or move publishNotReadyAddresses
to the governing headless service instead. Use the service template and the
valkey.sentinelServiceName configuration to locate the correct place to apply
the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2469b026-0b49-424d-bce2-9366ec2f8361

📥 Commits

Reviewing files that changed from the base of the PR and between 6f38537 and 185924a.

📒 Files selected for processing (2)
  • charts/valkey/templates/configmap.yaml
  • charts/valkey/templates/service.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • charts/valkey/templates/configmap.yaml

Comment thread charts/valkey/templates/service.yaml Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 185924a580

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread charts/valkey/templates/service.yaml Outdated
@mreho

mreho commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Ready for merge @mberlofa 😉

@mberlofa mberlofa force-pushed the fix/valkey/failoverHardening branch from 2495b82 to c62670c Compare July 4, 2026 13:36
@mberlofa mberlofa merged commit 740b7b4 into helmforgedev:main Jul 4, 2026
18 checks passed
@mberlofa

mberlofa commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

merged @mreho

@mreho mreho deleted the fix/valkey/failoverHardening branch July 4, 2026 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants