Proposal for GA version of automated etcd backups#2035
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/cc @hasbro17 |
|
Supersedes #1646 |
8d307ef to
04ea6e2
Compare
|
|
||
| - Is there any need for `EtcdBackupPolicy` to be generalized for components other than etcd? In the original design doc, the `Backup` API had an etcd field so it could support additional types later. In my opinion this API is cleaner when streamlined for the specific use-case we are designing for, and if a new type of backup was needed in future that should be its own separate API. | ||
|
|
||
| - Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize). |
There was a problem hiding this comment.
I think for starters, you should just introduce the minimum quantity as the only retention option and cull by age.
We had a lot of trouble getting the retention to work correctly due to:
- ordering issues: you need to run retention before taking a backup to free space - but that means you may go over the max size and count limit after running the backups (IIRC we have funny assertions in our integration tests for this)
- unknown backup size: we don't have the information about how big the incoming backup snapshot will be, so it's very difficult to keep it within MaxSize. This is particularly annoying when a customer defines a PV with 10 gigs of space, sets the max size to 10 gigs but then still runs out of space.
Creating such a big test matrix with different rules chained together is probably confusing for our customers and very difficult to get right and properly test. Imagine you have to constantly take backups and prune them as well.
There was a problem hiding this comment.
The design here is more of a threshold to trigger pruning than a hard cap. Because yes it gets tricky when you try to preemptively prune for the snapshot that hasn't been taken yet. My idea was to do all prune operations at the API level, and then reconcile that with the state of files in storage. Totally separate from snapshotting schedule, no promises made about order of operations.
But I'm fine with simplifying retention options here so long as the API design allows for extensions in the future. I think that makes sense, and will certainly make testing much easier.
There was a problem hiding this comment.
I leave this up to you guys, just wanted to share what a PITA it was to be smart about it before ;)
| - Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize). | ||
| - Would prevent cases where new backups have failed and all the existing backups get aged out | ||
|
|
||
| - Creating an EtcdBackup could allow a user to escape their namespace and privileges. This feature requires an auth check to only allow cluster-admins to interact with the backups. |
There was a problem hiding this comment.
is that so? you'd need to have admin rbac to create those, won't you?
There was a problem hiding this comment.
This was a note from @atiratree. A user granted RBAC permission to create an EtcdBackup would effectively be able to read anything in the cluster by backing up to a volume they control.
| - Use an existing open source backup solution such as Velero | ||
| - To be discussed further with the team that made the original enhancement |
There was a problem hiding this comment.
I know we're going to discuss this on Friday, but unless you find the last three PMs that were involved in this - I don't think we'll really find an answer.
There was a problem hiding this comment.
Figured, that's fine. Included here just in case anyone had something to say about it before we have our meeting.
04ea6e2 to
1dda643
Compare
1dda643 to
f1a5de0
Compare
| The procedure also requires gaining a root shell on a control plane node. Shell access to OpenShift control plane nodes access is generally | ||
| discouraged due to the potential for affecting the reliability of the node. | ||
|
|
||
| ### Goals |
There was a problem hiding this comment.
For the GA I want to ensure we include Prometheus metrics within the scope of this feature. Observability is critical for enterprise and regulated customers to monitor backup health and satisfy compliance audits.
I propose we introduce the following metrics:
| Metric Name | Type | Description |
|---|---|---|
etcd_backup_schedule_enabled |
Gauge | Indicates whether the automated backup schedule is active (1) or suspended/disabled (0). |
etcd_backup_job_last_execution_status |
Gauge | Status of the most recent backup execution (1 = Success, 0 = Failed). |
etcd_backup_job_last_success_timestamp_seconds |
Gauge | Unix timestamp of the last successful backup. Crucial for RPO alerting (time() - metric > max_allowed_gap). |
etcd_backup_job_duration_seconds |
Histogram | The execution time of the backup process. |
etcd_backup_last_size_bytes |
Gauge | The file size of the last successful backup snapshot. |
For these regulated corporate environments, verifying data integrity and avoiding silent failures is a strict audit requirement. If a backup script technically succeeds but outputs a truncated or empty file, tracking etcd_backup_last_size_bytes allows infrastructure teams to catch and alert on e.g. drops in backup size or other drastic changes.
Let me know if incorporating these into the GA scope looks feasible or if we should discuss further!
There was a problem hiding this comment.
The idea is to have our EtcdBackup CRD include important status information such as size and status reported back from the backups jobs. Would it make sense to collect metrics over all existing backups rather than just the last one? That way you could run queries to sum the total backup size or count the number of failures over some time period.
It's also possible for there to be multiple backups that run at the same time (e.g. with the default backup that targets hostPath on all control plane nodes), so tracking only the "last" with these metrics won't always paint the full picture.
There was a problem hiding this comment.
A note about backup validation: since the main part of our backup is the snapshot of the etcd database, we can validate its integrity with etcdutl snapshot status which we do in our cluster-backup script (which is used by the backup jobs) https://github.com/openshift/cluster-etcd-operator/blob/b4786ae33ca2aceae72d67e6644e9928ead71f18/bindata/etcd/cluster-backup.sh#L130
There was a problem hiding this comment.
so status should be an accurate reflection of success/failure regardless of size right?
There was a problem hiding this comment.
Should be, the etcdutl snapshot status exits !0 if the snapshot is invalid and the cluster-backup script exits 1 in turn.
I agree with Ben, that we could have multiple backup schedules so it may make more sense to have successes and failures. Prom queries can be written to see the rates of these based on the successes/failures of the individual jobs. "Last" doesn't make much sense; all of these could just be general buckets of successes, failures, durations, etc.
There was a problem hiding this comment.
Added a metrics section. Modified a bit to follow the standards of other k8s metrics and allow for richer queries
f1a5de0 to
44f0c15
Compare
|
@bhperry: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Enhancement building on the tech preview version of automated etcd backups. Standardizes CRD names and updates them to be more flexible for different use-cases, including support for on-install backup schedules saved on local storage on each master node without using a separate backup mechanism.
CC @dusk125 @atiratree