Skip to content

feat: add EtcdBackup CR for on-demand snapshots to object storage (S3/GCS, pluggable)#386

Draft
xrl wants to merge 9 commits into
etcd-io:mainfrom
xrl:pr/etcd-backup-snapshot
Draft

feat: add EtcdBackup CR for on-demand snapshots to object storage (S3/GCS, pluggable)#386
xrl wants to merge 9 commits into
etcd-io:mainfrom
xrl:pr/etcd-backup-snapshot

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

Refs #385

What

Adds a new namespaced EtcdBackup CRD that takes a point-in-time snapshot of a referenced EtcdCluster and uploads it to an object store. This implements the roadmap item "Create on-demand backup of a cluster" (docs/roadmap.md).

apiVersion: operator.etcd.io/v1alpha1
kind: EtcdBackup
metadata:
  name: nightly
spec:
  clusterRef: my-cluster
  destination:
    provider: s3                 # s3 | gcs
    prefix: etcd/backups
    secretRef: { name: s3-creds } # optional; omit for IRSA / Workload Identity
    s3: { bucket: my-backups, region: us-east-1 }
  retention: { retainCount: 7 }
status:
  phase: Completed               # Pending|Snapshotting|Uploading|Completed|Failed
  snapshotLocation: s3://my-backups/etcd/backups/my-cluster/nightly-20260617T010203Z.db
  snapshotSizeBytes: 20480
  completionTime: "2026-06-17T01:02:05Z"

Architecture decision (delegated) — and rationale

The task was to choose between a standalone backup operator and growing the core operator, with a recommendation to isolate the heavy cloud SDKs. I chose a dedicated controller + an isolated provider package, wired into the existing operator binarynot a second binary, for now.

Two seams keep the cloud SDKs and exec machinery out of the core reconcile path and make everything unit-testable without cloud creds or a live cluster:

  1. Snapshotter (internal/controller/snapshotter.go) produces the snapshot byte stream; the default execs etcdctl snapshot save in a member pod via the exec subresource and streams stdout.
  2. pkg/objectstore.Store (interface.go + s3.go + gcs.go) is the minimal Upload/List/Delete surface. A Factory registry dispatches on the provider name, so a new backend is a new file calling objectstore.Register. The AWS and GCP SDKs are imported only from this package — the controller and API types pull in zero cloud SDK.

The snapshot is streamed snapshotter→uploader through an io.Pipe for backpressure/bounded memory.

Why in-binary rather than cmd/backup-manager: because the SDKs are already quarantined behind the interface and the controller is a self-contained file set, the code is trivially liftable into a separate binary later if the cloud-SDK CVE surface becomes a problem for the core operator. Doing it in-binary now keeps this one independently-mergeable PR with one Deployment / one RBAC set / one image (the standard kubebuilder multi-controller layout), and avoids paying the operational cost of a second binary before there's a reason to. The trade-off (core image gains the AWS/GCP SDK dependency tree) is documented and reversible. See docs/snapshot-backups.md.

Credentials

secretRef is optional. Absent it, providers use their ambient chain (IRSA on AWS, Workload Identity on GCP) — the recommended production posture. Present, the keys are accessKeyID/secretAccessKey(/sessionToken) for s3 and serviceAccountJSON for gcs.

Testing

  • Unit (pkg/objectstore): JoinKey/destination validation, factory dispatch + duplicate-register panic, and the S3 and GCS stores exercised against in-memory fakes of the narrowed SDK interfaces — no cloud credentials required.
  • Unit/envtest (internal/controller): full reconcile happy-path, missing-cluster / no-ready-members / snapshot-error / upload-error → Failed, terminal no-op, snapshot timeout, credential resolution (s3/gcs/ambient/missing-secret), retention pruning, and CRD→destination mapping — all via a fake Snapshotter + fake Store.
  • e2e (test/e2e/backup_test.go): a MinIO-backed (S3-compatible, no real cloud) end-to-end test, compile-verified and gated behind ETCD_E2E_BACKUP=true so it doesn't run in the default kind matrix. All its helpers are local to the file to avoid colliding with other in-flight e2e PRs.
  • Sample CRs in config/samples/operator_v1alpha1_etcdbackup.yaml.

go build ./..., go vet ./..., golangci-lint run, the objectstore + controller (envtest) suites, and go test -c ./test/e2e/ all pass locally; go mod tidy is a no-op.

Notes for reviewers

  • CRDs/deepcopy/RBAC were regenerated with the repo-pinned controller-gen. That pinned version (v0.21.0) is newer than the one that last generated operator.etcd.io_etcdclusters.yaml (v0.20.1), so that file shows a one-line version-annotation bump — a side effect of using the pinned tool, not a hand edit.
  • Restore (new cluster from a backup) and scheduled backups (EtcdBackupSchedule) are intentionally out of scope; this CRD + provider interface are the foundation for both.

🤖 Generated with Claude Code


PR series — operability fixes & TLS

Small single-purpose PRs from live kind-cluster testing of the operator. Each stands alone unless an After is listed. → = this PR.

PR Lands After
Reviewable now — small, any order
🟢 #374 Requeue instead of swallowing the client-cert provisioning error
🟢 #391 Grant events.k8s.io RBAC so operator Events are actually recorded
🟢 #392 Correlate members[]/leaderID from one health snapshot (consistent leader)
🟢 #393 Propagate user-supplied altNames.ipAddresses into certificates
🟢 #394 Accept day-suffix validityDuration (365d, 100d12h) as documented
🟢 #395 Surface early reconcile errors as a Degraded condition (was empty status)
🟢 #379 Configurable reconcile worker pool (--max-concurrent-reconciles)
🟢 #369 kind-based stress/scale e2e harness (1/3/7 members, churn, quorum watcher)
TLS stack — in order
🟢 #376 Independent spec.tls.{peer,client} surfaces (breaking alpha API)
#377 TLSReady condition + TLS lifecycle Events #376
#378 Multi-member TLS quorum e2e + PeerCANotShared #377
Parked as drafts pending #363
#382 Per-cluster domain metrics on the operator /metrics endpoint
#384 EtcdCluster admission webhooks (consolidating with #328) #363
⚪→ #386 EtcdBackup CR → object storage (S3/GCS) #363
#387 Automatic quorum-loss disaster recovery (bootstrap-latch guarded) #363

🟢 ready · ⚪ draft

@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign jberkus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

xrl and others added 9 commits July 1, 2026 22:27
Implements the roadmap item "Create on-demand backup of a cluster" via a
new namespaced EtcdBackup CRD that snapshots a referenced EtcdCluster and
uploads the snapshot to an object store.

Architecture:
- EtcdBackup CRD (operator.etcd.io/v1alpha1): destination {s3|gcs} with
  bucket/prefix/region/secretRef, optional retention; status reports phase,
  snapshotLocation, size, completionTime and a Succeeded condition.
- A dedicated controller execs `etcdctl snapshot save` in a member pod and
  streams it through an io.Pipe into the uploader (bounded memory).
- The cloud SDKs (aws-sdk-go-v2, cloud.google.com/go/storage) are isolated
  behind pkg/objectstore.Store with a Factory registry; S3 (incl.
  S3-compatible endpoints) and GCS are implemented, others register
  themselves. The controller imports no cloud SDK directly, so the code is
  liftable into a standalone cmd/backup-manager later if desired.
- Credentials default to ambient (IRSA / Workload Identity); secretRef is
  optional.

Testing:
- Unit tests for the provider interface (mocked stores, no cloud creds),
  the snapshot/upload orchestration, credential resolution, retention, and
  CRD-to-destination mapping.
- A MinIO-backed e2e (compile-only, gated by ETCD_E2E_BACKUP) plus sample CRs.

Refs etcd-io#385

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…rt deterministically

Two retention data-loss bugs in the S3 and GCS List implementations:

- The list prefix was a raw string prefix, so listing cluster "etcd-a"'s
  snapshots also matched (and retention then deleted) objects under
  "etcd-a-2" and any other cluster whose name has "etcd-a" as a string
  prefix. Append a trailing "/" after JoinKey (which trims trailing
  slashes) so the prefix matches a directory boundary per cluster.

- sort.Slice is unstable and keyed only on LastModified, whose S3
  granularity is one second. On a modtime tie the just-uploaded snapshot
  could be ordered into the objs[RetainCount:] deletion set. Use
  sort.SliceStable with a descending-key tiebreaker; keys embed a
  sortable UTC timestamp so the newest is deterministically first.

Add regression tests: an S3 list across etcd-a/etcd-a-2 proving the
boundary, an all-equal-modtime list proving the stable order, and a full
Reconcile with RetainCount=1 proving the just-uploaded snapshot survives.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The in-pod snapshot command redirected etcdctl's output with
">/dev/null 2>&1", discarding all diagnostics. On failure the carefully
captured exec stderr buffer was empty, so every failure (TLS, missing
endpoint, etc.) collapsed to the opaque "empty snapshot produced
(stderr: )". Redirect etcdctl's own output to fd 2 ("1>&2") instead:
etcdctl writes the snapshot to the temp file, not stdout, so only its
progress/error lines move to stderr where they are captured, while the
cat'd snapshot bytes still flow to stdout.

Also correct two misleading comments: the snapshotter doc claimed
"snapshot save /dev/stdout" and "never lands on the operator's disk",
and the controller claimed the snapshot "never lands on the operator's
disk" — true for GCS streaming but false for S3, which spools the stream
to an ephemeral temp file in the operator pod (PutObject needs a known
length).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
EtcdBackupStatus.SnapshotRevision was declared (and surfaced in the CRD)
but never written by any code path, so operators always read 0 — which
reads as "revision 0", not "unknown". Wiring it properly needs parsing
`etcdctl snapshot status`, which is out of scope for this PR. Remove the
always-zero v1alpha1 status field before it ships rather than maintain a
misleading one; it can be reintroduced wired and tested later.

Regenerated the CRD via `make manifests`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…reds

Completes the snapshot backup story with restore and a security pass over
the whole backup+restore object-store path.

EtcdRestore:
- New CRD: restore a snapshot into a NEW/empty EtcdCluster, sourcing it
  either by a completed EtcdBackup ref or an explicit object-store location.
  Status phases (Pending/Downloading/Restoring/Completed/Failed), a
  Succeeded condition, and Events.
- New Restorer seam (exec etcdctl snapshot restore into the genesis member,
  stamping the member identity to match the bootstrapping pod) mirrors the
  Snapshotter design so orchestration is unit-testable with no cloud/etcd.
- Empty-target guarantee: creates the target cluster fresh (owned by the
  restore) or refuses to restore over a cluster with ready members.

objectstore:
- Add Download to the Store interface with streaming S3 (GetObject) and GCS
  (NewReader) impls; missing objects map to a new ErrNotFound sentinel.

Security hardening (backup + restore share one seam):
- resolveStoreCredentials reads creds ONLY from secretRef (or ambient
  identity), emits an audit log line with the secret NAME only, never logs
  or error-echoes credential values, and validates required keys.
- validateDestination / validateRestoreSpec reject misconfiguration before
  any I/O.

Tests: restore orchestration (mock store+restorer) covering both sources,
empty-target rejection, not-found/download/restore errors and terminal
no-op; a cred-never-logged guarantee test (funcr log capture); Download
tests for s3/gcs/mem. Sample CRs, docs (restore + security + least-priv
IAM), RBAC, and a compile-only restore e2e.

Deferred (noted in docs): cron scheduling, encryption-at-rest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The exec-based snapshotter shelled out with `sh -c "etcdctl snapshot save ...
| cat ..."` inside the member pod. The etcd images the operator deploys
(gcr.io/etcd-development/etcd:v3.6.x) are distroless and ship no shell, so the
backup snapshot exec failed at runtime with:

    OCI runtime exec failed: exec: "sh": executable file not found in $PATH

i.e. EtcdBackup could never reach Completed against the operator's own
clusters. Caught by the object-storage e2e (ETCD_E2E_BACKUP=true).

Replace the exec snapshotter with a clientv3 Maintenance Snapshot() stream from
the operator process: it reaches the member over the client port the operator
can already address in-cluster and works regardless of what binaries the member
image carries. The Snapshotter interface is unchanged, so the fake and all
backup/restore unit tests are unaffected.

The restorer still execs `sh -c` (a deeper redesign is needed to write the
member data dir without a shell); that path remains gated/skipped in e2e and is
documented as a known gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Live-validated against kind (ETCD_E2E_BACKUP=true). Improvements:

Backup (now genuinely passes, ~43s):
- Independently verify the snapshot really landed in the bucket: `mc stat`
  the object and assert its size == status.SnapshotSizeBytes, plus a minimum
  -size floor. Previously only the controller's self-reported status fields
  were checked, so a controller that set them without uploading would pass.
- Seed a 50-key content-addressable keyspace (single source of truth shared
  with the restore proof) instead of one key.

Reliability / flakiness (standing rule: readiness watches must catch failure):
- Pod waits detect CrashLoopBackOff/ImagePullBackOff/Failed and surface pod
  logs instead of spinning to the timeout with no diagnosis.
- Never call t.Fatalf inside a wait.For poll closure (racey from the poller
  goroutine); return an error and fail from the calling goroutine.
- Fix the shared-singleton bucket pod collision ("pods mc-mkbucket already
  exists"): name the bootstrap pod per-MinIO-instance and clear any stale one.
- Add feature.Teardown to both tests (MinIO, secret, clusters, mc pod) so the
  shared namespace does not leak between runs.
- Pin MinIO/mc images to RELEASE tags instead of :latest.
- Tighten poll intervals (5s -> 2-3s).

Restore: the data-path is known-incomplete against distroless etcd images
(restorer execs `sh -c` with no shell; restores to /var/run/etcd not
/var/lib/etcd; no member restart). Gate the restore round-trip behind a second
ETCD_E2E_RESTORE=true flag and document the three product gaps, so the working
backup e2e stays green while the test remains an executable spec. When enabled
it now asserts the FULL keyspace round-trips with exact values, polls for it,
and adds a negative control (a never-written key must be absent) to rule out a
vacuous pass on an empty member.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…ackup e2e

Enable hermetic, credential-free GCS testing and add a comprehensive
dual-provider object-storage backup e2e that proves a real, intact etcd
snapshot lands in each backend.

GCS emulator endpoint support (mirrors the existing S3 endpoint seam):
- api: add GCSDestinationSpec.Endpoint (+optional) so a destination can
  target a GCS-compatible emulator (fake-gcs-server / storage testbench).
- objectstore: relax Destination.validate() to accept Endpoint for GCS while
  still rejecting the S3-only ForcePathStyle.
- helpers: map GCS.Endpoint through toObjectStoreDestination.
- gcs.go: when Endpoint is set, build the client with option.WithEndpoint +
  option.WithoutAuthentication (the documented unauthenticated-emulator
  recipe), leaving the real-Google credential paths untouched.
- regenerated CRDs (controller-gen v0.21.0); deepcopy unchanged (scalar field).
- unit: validate() now accepts a GCS emulator endpoint and still rejects
  forcePathStyle.

Comprehensive backup e2e (test/e2e, gated by ETCD_E2E_BACKUP=true):
- Factor the MinIO blueprint into a backupBackend seam and run the SAME full
  backup cycle against both providers: MinIO (S3) and an in-cluster
  fake-gcs-server (GCS), neither needing real cloud creds.
- Seed a 53-key keyspace: 50 content-addressable keys (value=hash(key)) plus
  a ~100 KiB large value and a multibyte UTF-8 value, and a pre-backup
  provenance sentinel.
- Prove the object is present-and-valid independently of controller status:
  stat the object directly on the backend (size == status), apply the 16 KiB
  real-snapshot floor, then download the bytes and run `etcdutl snapshot
  status`, asserting it is an intact snapshot whose totalKey == seeded count.
- Adversarial safety guards (live-green; fire before/independent of the
  known-incomplete restore data-path): restore into a populated cluster is
  rejected and the existing data is preserved; restore from a missing object
  fails cleanly within a bounded deadline (no hang). A corrupt-snapshot
  rejection test is added but honestly gated behind ETCD_E2E_RESTORE, since it
  cannot be distinguished from the broken restorer until the restore data-path
  is fixed.
- jsonInt64Field tolerates the GCS JSON API's quoted "size" string as well as
  mc's bare integer.

Verified live on kind (ETCD_E2E_BACKUP=true), two independent runs: both
provider cycles pass with totalKey=53, plus the populated-target and
missing-object guards. go build/vet and unit+envtest are green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…und-trip)

The previous execRestorer was dead-on-arrival three ways: it shelled into a
distroless etcd image (no `sh`), restored to /var/run/etcd instead of the
member's real /var/lib/etcd data dir, and exec'd a live member with no
force-new-cluster semantics. Replace it with an init-container restore.

Architecture:
- EtcdRestore controller now stamps an `operator.etcd.io/restore-source`
  annotation (JSON: snapshot addressing + creds Secret *name* + operator image
  + a per-snapshot generation token) onto the target EtcdCluster, then watches
  the genesis member boot. Completed now truthfully means "a member booted from
  the restored data" (gated on StatefulSet ReadyReplicas>0), not "spec injected".
  Missing-object is verified eagerly (terminal, prompt failure); a wedged restore
  init-container (corrupt snapshot, bad creds) is detected and marked Failed.
- The cluster controller injects, ONLY when that annotation is present, a single
  restore init-container running the operator image as `manager restore-localize`.
  A normal cluster's pod template is byte-identical (zero init-containers), so the
  existing cluster e2e is unaffected and no fleet-wide rolling restart is triggered.
- restore-localize (shell-free, single image): ordinal-0 guard, per-generation
  idempotency marker (a pod restart never re-wipes-and-restores), downloads the
  snapshot via pkg/objectstore, then runs etcd's snapshot-restore library
  in-process into /var/lib/etcd matching the member identity etcd advertises.
  A corrupt snapshot fails the library's hash check -> exit non-zero (corrupt
  rejection now fails for the RIGHT reason). Creds reach this init-container only,
  via optional secretKeyRefs, never the long-running etcd container.
- Dockerfile/Makefile build the whole ./cmd package (not just main.go) so the
  subcommand is linked; OPERATOR_IMAGE plumbed via flag/env + e2e patch.

Validated live on kind (both providers), ungating the full round-trip:
- S3 (MinIO) + GCS (fake-gcs-server): 50-key keyspace restored byte-identical,
  never-snapshotted key absent.
- corrupt-snapshot, populated-target, missing-object guards green.
go build/vet + unit/envtest (incl. cluster bring-up) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@xrl xrl force-pushed the pr/etcd-backup-snapshot branch from e0d4484 to ff3ef84 Compare July 2, 2026 02:56
@kubernetes-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign jberkus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xrl

xrl commented Jul 2, 2026

Copy link
Copy Markdown
Author

Rebased onto current main; converting to draft while #363 lands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants