Skip to content

Proposal for GA version of automated etcd backups#2035

Open
bhperry wants to merge 1 commit into
openshift:masterfrom
bhperry:automated-etcd-backups-ga
Open

Proposal for GA version of automated etcd backups#2035
bhperry wants to merge 1 commit into
openshift:masterfrom
bhperry:automated-etcd-backups-ga

Conversation

@bhperry

@bhperry bhperry commented Jun 8, 2026

Copy link
Copy Markdown

Enhancement building on the tech preview version of automated etcd backups. Standardizes CRD names and updates them to be more flexible for different use-cases, including support for on-install backup schedules saved on local storage on each master node without using a separate backup mechanism.

CC @dusk125 @atiratree

@openshift-ci openshift-ci Bot requested review from dtantsur and flavianmissi June 8, 2026 15:17
@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashcrow for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dusk125

dusk125 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

/cc @hasbro17
/cc @tjungblu
/cc @jubittajohn

@dusk125

dusk125 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Supersedes #1646

Comment thread enhancements/etcd/automated-backups-ga.md Outdated

- Is there any need for `EtcdBackupPolicy` to be generalized for components other than etcd? In the original design doc, the `Backup` API had an etcd field so it could support additional types later. In my opinion this API is cleaner when streamlined for the specific use-case we are designing for, and if a new type of backup was needed in future that should be its own separate API.

- Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for starters, you should just introduce the minimum quantity as the only retention option and cull by age.

We had a lot of trouble getting the retention to work correctly due to:

  • ordering issues: you need to run retention before taking a backup to free space - but that means you may go over the max size and count limit after running the backups (IIRC we have funny assertions in our integration tests for this)
  • unknown backup size: we don't have the information about how big the incoming backup snapshot will be, so it's very difficult to keep it within MaxSize. This is particularly annoying when a customer defines a PV with 10 gigs of space, sets the max size to 10 gigs but then still runs out of space.

Creating such a big test matrix with different rules chained together is probably confusing for our customers and very difficult to get right and properly test. Imagine you have to constantly take backups and prune them as well.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design here is more of a threshold to trigger pruning than a hard cap. Because yes it gets tricky when you try to preemptively prune for the snapshot that hasn't been taken yet. My idea was to do all prune operations at the API level, and then reconcile that with the state of files in storage. Totally separate from snapshotting schedule, no promises made about order of operations.

But I'm fine with simplifying retention options here so long as the API design allows for extensions in the future. I think that makes sense, and will certainly make testing much easier.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave this up to you guys, just wanted to share what a PITA it was to be smart about it before ;)

Comment thread enhancements/etcd/automated-backups-ga.md Outdated
- Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize).
- Would prevent cases where new backups have failed and all the existing backups get aged out

- Creating an EtcdBackup could allow a user to escape their namespace and privileges. This feature requires an auth check to only allow cluster-admins to interact with the backups.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that so? you'd need to have admin rbac to create those, won't you?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a note from @atiratree. A user granted RBAC permission to create an EtcdBackup would effectively be able to read anything in the cluster by backing up to a volume they control.

Comment thread enhancements/etcd/automated-backups-ga.md Outdated
Comment on lines +549 to +550
- Use an existing open source backup solution such as Velero
- To be discussed further with the team that made the original enhancement

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we're going to discuss this on Friday, but unless you find the last three PMs that were involved in this - I don't think we'll really find an answer.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured, that's fine. Included here just in case anyone had something to say about it before we have our meeting.

@bhperry bhperry force-pushed the automated-etcd-backups-ga branch from 04ea6e2 to 1dda643 Compare June 11, 2026 15:18
Comment thread enhancements/etcd/automated-backups-ga.md Outdated
Comment thread enhancements/etcd/automated-backups-ga.md Outdated
@bhperry bhperry force-pushed the automated-etcd-backups-ga branch from 1dda643 to f1a5de0 Compare June 30, 2026 17:18
The procedure also requires gaining a root shell on a control plane node. Shell access to OpenShift control plane nodes access is generally
discouraged due to the potential for affecting the reliability of the node.

### Goals

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the GA I want to ensure we include Prometheus metrics within the scope of this feature. Observability is critical for enterprise and regulated customers to monitor backup health and satisfy compliance audits.

I propose we introduce the following metrics:

Metric Name Type Description
etcd_backup_schedule_enabled Gauge Indicates whether the automated backup schedule is active (1) or suspended/disabled (0).
etcd_backup_job_last_execution_status Gauge Status of the most recent backup execution (1 = Success, 0 = Failed).
etcd_backup_job_last_success_timestamp_seconds Gauge Unix timestamp of the last successful backup. Crucial for RPO alerting (time() - metric > max_allowed_gap).
etcd_backup_job_duration_seconds Histogram The execution time of the backup process.
etcd_backup_last_size_bytes Gauge The file size of the last successful backup snapshot.

For these regulated corporate environments, verifying data integrity and avoiding silent failures is a strict audit requirement. If a backup script technically succeeds but outputs a truncated or empty file, tracking etcd_backup_last_size_bytes allows infrastructure teams to catch and alert on e.g. drops in backup size or other drastic changes.

Let me know if incorporating these into the GA scope looks feasible or if we should discuss further!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to have our EtcdBackup CRD include important status information such as size and status reported back from the backups jobs. Would it make sense to collect metrics over all existing backups rather than just the last one? That way you could run queries to sum the total backup size or count the number of failures over some time period.

It's also possible for there to be multiple backups that run at the same time (e.g. with the default backup that targets hostPath on all control plane nodes), so tracking only the "last" with these metrics won't always paint the full picture.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note about backup validation: since the main part of our backup is the snapshot of the etcd database, we can validate its integrity with etcdutl snapshot status which we do in our cluster-backup script (which is used by the backup jobs) https://github.com/openshift/cluster-etcd-operator/blob/b4786ae33ca2aceae72d67e6644e9928ead71f18/bindata/etcd/cluster-backup.sh#L130

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so status should be an accurate reflection of success/failure regardless of size right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be, the etcdutl snapshot status exits !0 if the snapshot is invalid and the cluster-backup script exits 1 in turn.

I agree with Ben, that we could have multiple backup schedules so it may make more sense to have successes and failures. Prom queries can be written to see the rates of these based on the successes/failures of the individual jobs. "Last" doesn't make much sense; all of these could just be general buckets of successes, failures, durations, etc.

@bhperry bhperry Jul 1, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a metrics section. Modified a bit to follow the standards of other k8s metrics and allow for richer queries

@racedo racedo left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add observability metrics

@bhperry bhperry force-pushed the automated-etcd-backups-ga branch from f1a5de0 to 44f0c15 Compare July 1, 2026 16:04
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@bhperry: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants