
feat: machineconfig-backup capability + PKI kubeconfig fix (session/24) #40

Merged
ontave merged 6 commits into main from session/24-machineconfig-backup on May 6, 2026


@ontave (Contributor) commented May 6, 2026

Summary

  • PLATFORM-BL-MACHINECONFIG-BACKUP (conductor side): adds the machineconfig-backup named capability in platform_machineconfig.go. It iterates nodes via EndpointsFromTalosconfig + NodeContext, calls GetMachineConfig per node, and uploads each config to s3://{bucket}/{cluster}/machineconfigs/{TIMESTAMP}/{hostname}.yaml. The hostname is extracted from the config YAML, with the sanitized node IP as the fallback. CapabilityMachineConfigBackup = "machineconfig-backup" is added to runnerlib/constants.go, and the capability is registered in stubs.go.
  • PLATFORM-BL-KUBECONFIG-CANONICAL (conductor side): pkiRotateHandler now writes only seam-mc-{cluster}-kubeconfig; removed secondary target-cluster-kubeconfig write.
  • 5 unit tests: nil clients, missing CR, success with hostname extraction, upload failure, hostname fallback.
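
For reviewers, the object key layout and hostname fallback described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in platform_machineconfig.go: the helper names (hostnameFromConfig, sanitizeIP, backupObjectKey) are hypothetical, the line scan for `hostname:` stands in for real YAML parsing, and replacing `.` and `:` with `-` is an assumed sanitization rule that the PR does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// hostnameFromConfig returns the value of the first "hostname:" key found in
// the machine config, or "" if there is none. A plain line scan is used here
// for illustration only; it does not understand YAML structure.
func hostnameFromConfig(configYAML []byte) string {
	for _, line := range strings.Split(string(configYAML), "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "hostname:") {
			value := strings.TrimSpace(strings.TrimPrefix(trimmed, "hostname:"))
			return strings.Trim(value, `"'`)
		}
	}
	return ""
}

// sanitizeIP makes a node IP usable as a file name. The replacement rule is
// an assumption: dots (IPv4) and colons (IPv6) both become dashes.
func sanitizeIP(ip string) string {
	return strings.NewReplacer(".", "-", ":", "-").Replace(ip)
}

// backupObjectKey builds {cluster}/machineconfigs/{TIMESTAMP}/{hostname}.yaml,
// falling back to the sanitized node IP when the config carries no hostname.
func backupObjectKey(cluster, timestamp, nodeIP string, configYAML []byte) string {
	name := hostnameFromConfig(configYAML)
	if name == "" {
		name = sanitizeIP(nodeIP)
	}
	return fmt.Sprintf("%s/machineconfigs/%s/%s.yaml", cluster, timestamp, name)
}
```

Under these assumptions, a config containing `hostname: cp-1` yields a key ending in `cp-1.yaml`, while a config with no hostname from node 10.0.0.5 yields a key ending in `10-0-0-5.yaml`.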

Test plan

  • go test ./... -- all tests green
  • TalosMachineConfigBackup live: Conductor executor Job completes and uploads per-node YAML to S3 (requires ccs-mgmt recovery)

ontave added 6 commits May 3, 2026 19:44
…rd, tenant machineConfigPaths, lineage status
Governor directive (session/21): CODEBASE.md eliminated from all repos.
The graphify knowledge graph at ~/ontai/graphify-out/graph.json is the
sole authoritative source for codebase understanding. See root CONTEXT.md
and CLAUDE.md for the Graphify Source of Truth Protocol.
…tarts

Two bugs in hardeningApplyHandler that could destroy cluster nodes or take
down the VIP:

1. VIP filtering in EndpointsFromTalosconfig (adapters.go)
   Adds clusterEndpoint field to talosConfigCtx. When set, the VIP is
   excluded from the endpoint fallback list before returning. Without this,
   the VIP address was included in the per-node iteration, causing
   GetMachineConfig to read from the VIP-holding node and ApplyConfiguration
   to apply only to that node -- silently skipping all other control-plane
   nodes. If the talosconfig contains only the VIP after filtering, an error
   is returned rather than an empty list that would silently skip all nodes.

2. Stabilization wait between nodes (platform_security.go)
   After applying machineconfig patches to a node, waitForNodeStable polls
   Health() until the node is responsive before proceeding to the next node.
   No-reboot applies can briefly restart kubelet or other services. Without
   the wait, sequential rapid application across all control-plane nodes can
   produce overlapping restarts, losing etcd quorum and taking down the VIP.
   The wait is skipped after the last node.

New tests: TestEndpointsFromTalosconfig_ClusterEndpointFiltered,
TestEndpointsFromTalosconfig_ClusterEndpointOnlyReturnsError,
TestHardeningApply_StabilizationWaitBetweenNodes.
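
The endpoint filtering from item 1, which the first two tests exercise, can be sketched as follows. This is an illustrative reduction, not the code in adapters.go: filterClusterEndpoint and errNoNodeEndpoints are hypothetical names, and endpoints are compared as plain strings, which assumes the VIP appears in the talosconfig in exactly the same form as clusterEndpoint.

```go
package main

import "errors"

// errNoNodeEndpoints is returned when nothing is left to iterate over once
// the cluster endpoint (VIP) has been excluded.
var errNoNodeEndpoints = errors.New("no per-node endpoints remain after excluding the cluster endpoint")

// filterClusterEndpoint removes the VIP from the endpoint list so that
// per-node iteration only visits real nodes. An empty clusterEndpoint
// disables filtering. An empty result is an error rather than an empty
// list, so callers cannot silently skip every node.
func filterClusterEndpoint(endpoints []string, clusterEndpoint string) ([]string, error) {
	if clusterEndpoint == "" {
		return endpoints, nil
	}
	nodes := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if ep != clusterEndpoint {
			nodes = append(nodes, ep)
		}
	}
	if len(nodes) == 0 {
		return nil, errNoNodeEndpoints
	}
	return nodes, nil
}
```

Returning an error for the VIP-only case mirrors the behaviour described above: a talosconfig that lists only the VIP fails loudly instead of producing a no-op run.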
Execute mode dispatches via Resolve; agent mode uses RegisteredNames for
the capability manifest only and never calls Execute. One registry keeps
the manifest and implementation set in sync by construction.

Replace /tmp/envtest-bins/1.35.0 (ephemeral, stale version) with the
canonical ontai root Makefile target: make envtest-setup && export
KUBEBUILDER_ASSETS=$(make -s envtest-path). Pinned to K8s 1.32.x.

…session/24)

- machineconfig-backup named capability: iterates all cluster nodes via
  EndpointsFromTalosconfig + NodeContext, reads GetMachineConfig per node,
  uploads to S3 at {cluster}/machineconfigs/{TIMESTAMP}/{hostname}.yaml.
  Hostname extracted from config YAML; sanitized node IP as fallback.
  5 unit tests cover nil clients, missing CR, success, upload failure, no-hostname.
  CapabilityMachineConfigBackup constant added to runnerlib.

- PKI rotation kubeconfig: removed secondary target-cluster-kubeconfig write
  from pkiRotateHandler (upsertKubeconfigSecret now writes only
  seam-mc-{cluster}-kubeconfig per PLATFORM-BL-KUBECONFIG-CANONICAL).
@ontave merged commit 85943e4 into main May 6, 2026
2 checks passed
@ontave deleted the session/24-machineconfig-backup branch May 6, 2026 17:56