Proposal for GA version of automated etcd backups by bhperry · Pull Request #2035 · openshift/enhancements

bhperry · 2026-06-08T15:17:17Z

Enhancement building on the tech preview version of automated etcd backups. Standardizes CRD names and updates them to be more flexible for different use-cases, including support for on-install backup schedules saved on local storage on each master node without using a separate backup mechanism.

CC @dusk125 @atiratree

openshift-ci · 2026-06-08T15:17:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashcrow for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dusk125 · 2026-06-08T15:20:45Z

/cc @hasbro17
/cc @tjungblu
/cc @jubittajohn

dusk125 · 2026-06-08T15:21:56Z

Supersedes #1646

tjungblu · 2026-06-09T07:14:37Z

+
+- Is there any need for `EtcdBackupPolicy` to be generalized for components other than etcd? In the original design doc, the `Backup` API had an etcd field so it could support additional types later. In my opinion this API is cleaner when streamlined for the specific use-case we are designing for, and if a new type of backup was needed in future that should be its own separate API.
+
+- Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize).


I think for starters, you should just introduce the minimum quantity as the only retention option and cull by age.

We had a lot of trouble getting the retention to work correctly due to:

ordering issues: you need to run retention before taking a backup to free space - but that means you may go over the max size and count limit after running the backups (IIRC we have funny assertions in our integration tests for this)

unknown backup size: we don't have the information about how big the incoming backup snapshot will be, so it's very difficult to keep it within MaxSize. This is particularly annoying when a customer defines a PV with 10 gigs of space, sets the max size to 10 gigs but then still runs out of space.

Creating such a big test matrix with different rules chained together is probably confusing for our customers and very difficult to get right and properly test. Imagine you have to constantly take backups and prune them as well.

The design here is more of a threshold to trigger pruning than a hard cap. Because yes it gets tricky when you try to preemptively prune for the snapshot that hasn't been taken yet. My idea was to do all prune operations at the API level, and then reconcile that with the state of files in storage. Totally separate from snapshotting schedule, no promises made about order of operations.

But I'm fine with simplifying retention options here so long as the API design allows for extensions in the future. I think that makes sense, and will certainly make testing much easier.

I leave this up to you guys, just wanted to share what a PITA it was to be smart about it before ;)

tjungblu · 2026-06-09T07:16:37Z

+- Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize).
+   - Would prevent cases where new backups have failed and all the existing backups get aged out
+
+- Creating an EtcdBackup could allow a user to escape their namespace and privileges. This feature requires an auth check to only allow cluster-admins to interact with the backups.


is that so? you'd need to have admin rbac to create those, won't you?

This was a note from @atiratree. A user granted RBAC permission to create an EtcdBackup would effectively be able to read anything in the cluster by backing up to a volume they control.

tjungblu · 2026-06-09T07:21:03Z

+- Use an existing open source backup solution such as Velero
+   - To be discussed further with the team that made the original enhancement


I know we're going to discuss this on Friday, but unless you find the last three PMs that were involved in this - I don't think we'll really find an answer.

Figured, that's fine. Included here just in case anyone had something to say about it before we have our meeting.

racedo · 2026-07-01T07:48:42Z

+The procedure also requires gaining a root shell on a control plane node. Shell access to OpenShift control plane nodes access is generally
+discouraged due to the potential for affecting the reliability of the node.
+
+### Goals


For the GA I want to ensure we include Prometheus metrics within the scope of this feature. Observability is critical for enterprise and regulated customers to monitor backup health and satisfy compliance audits.

I propose we introduce the following metrics:

Metric Name Type Description

etcd_backup_schedule_enabled Gauge Indicates whether the automated backup schedule is active (1) or suspended/disabled (0).

etcd_backup_job_last_execution_status Gauge Status of the most recent backup execution (1 = Success, 0 = Failed).

etcd_backup_job_last_success_timestamp_seconds Gauge Unix timestamp of the last successful backup. Crucial for RPO alerting (time() - metric > max_allowed_gap).

etcd_backup_job_duration_seconds Histogram The execution time of the backup process.

etcd_backup_last_size_bytes Gauge The file size of the last successful backup snapshot.

For these regulated corporate environments, verifying data integrity and avoiding silent failures is a strict audit requirement. If a backup script technically succeeds but outputs a truncated or empty file, tracking etcd_backup_last_size_bytes allows infrastructure teams to catch and alert on e.g. drops in backup size or other drastic changes.

Let me know if incorporating these into the GA scope looks feasible or if we should discuss further!

The idea is to have our EtcdBackup CRD include important status information such as size and status reported back from the backups jobs. Would it make sense to collect metrics over all existing backups rather than just the last one? That way you could run queries to sum the total backup size or count the number of failures over some time period.

It's also possible for there to be multiple backups that run at the same time (e.g. with the default backup that targets hostPath on all control plane nodes), so tracking only the "last" with these metrics won't always paint the full picture.

A note about backup validation: since the main part of our backup is the snapshot of the etcd database, we can validate its integrity with etcdutl snapshot status which we do in our cluster-backup script (which is used by the backup jobs) https://github.com/openshift/cluster-etcd-operator/blob/b4786ae33ca2aceae72d67e6644e9928ead71f18/bindata/etcd/cluster-backup.sh#L130

so status should be an accurate reflection of success/failure regardless of size right?

Should be, the etcdutl snapshot status exits !0 if the snapshot is invalid and the cluster-backup script exits 1 in turn.

I agree with Ben, that we could have multiple backup schedules so it may make more sense to have successes and failures. Prom queries can be written to see the rates of these based on the successes/failures of the individual jobs. "Last" doesn't make much sense; all of these could just be general buckets of successes, failures, durations, etc.

Added a metrics section. Modified a bit to follow the standards of other k8s metrics and allow for richer queries

racedo

Add observability metrics

openshift-ci · 2026-07-01T16:25:32Z

@bhperry: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from dtantsur and flavianmissi June 8, 2026 15:17

openshift-ci Bot requested review from hasbro17, jubittajohn and tjungblu June 8, 2026 15:20

bhperry force-pushed the automated-etcd-backups-ga branch from 8d307ef to 04ea6e2 Compare June 8, 2026 17:07

tjungblu mentioned this pull request Jun 9, 2026

[Auto-Generated] Personal Workflow Brief tjungblu/agent#1

Open

tjungblu reviewed Jun 9, 2026

View reviewed changes

Comment thread enhancements/etcd/automated-backups-ga.md Outdated

tjungblu reviewed Jun 9, 2026

View reviewed changes

Comment thread enhancements/etcd/automated-backups-ga.md Outdated

tjungblu reviewed Jun 9, 2026

View reviewed changes

Comment thread enhancements/etcd/automated-backups-ga.md Outdated

tjungblu reviewed Jun 9, 2026

View reviewed changes

bhperry force-pushed the automated-etcd-backups-ga branch from 04ea6e2 to 1dda643 Compare June 11, 2026 15:18

dusk125 suggested changes Jun 30, 2026

View reviewed changes

Comment thread enhancements/etcd/automated-backups-ga.md Outdated

Comment thread enhancements/etcd/automated-backups-ga.md Outdated

bhperry force-pushed the automated-etcd-backups-ga branch from 1dda643 to f1a5de0 Compare June 30, 2026 17:18

racedo reviewed Jul 1, 2026

View reviewed changes

Proposal for GA version of automated etcd backups

44f0c15

bhperry force-pushed the automated-etcd-backups-ga branch from f1a5de0 to 44f0c15 Compare July 1, 2026 16:04


		- Is there any need for `EtcdBackupPolicy` to be generalized for components other than etcd? In the original design doc, the `Backup` API had an etcd field so it could support additional types later. In my opinion this API is cleaner when streamlined for the specific use-case we are designing for, and if a new type of backup was needed in future that should be its own separate API.

		- Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize).

		- Use an existing open source backup solution such as Velero
		- To be discussed further with the team that made the original enhancement

Metric Name	Type	Description
`etcd_backup_schedule_enabled`	Gauge	Indicates whether the automated backup schedule is active (`1`) or suspended/disabled (`0`).
`etcd_backup_job_last_execution_status`	Gauge	Status of the most recent backup execution (`1` = Success, `0` = Failed).
`etcd_backup_job_last_success_timestamp_seconds`	Gauge	Unix timestamp of the last successful backup. Crucial for RPO alerting (`time() - metric > max_allowed_gap`).
`etcd_backup_job_duration_seconds`	Histogram	The execution time of the backup process.
`etcd_backup_last_size_bytes`	Gauge	The file size of the last successful backup snapshot.

Uh oh!

Conversation

bhperry commented Jun 8, 2026

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

dusk125 commented Jun 8, 2026

Uh oh!

dusk125 commented Jun 8, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bhperry Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

racedo left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bhperry Jul 1, 2026 •

edited

Loading