
fix(talos): migrate configmanager to alpha.2 multi-document config API#5775

Draft
devantler wants to merge 2 commits into main from claude/talos-alpha2-configmanager



devantler commented Jul 3, 2026

Contributor

🤖 Generated by the Daily AI Assistant

Why

Talos v1.14.0-alpha.2 removed the top-level network and API-server accessors from its config interface, breaking ksail's build and leaving the Dependabot bumps #5765/#5766 red.

What

Migrates the Talos configmanager to read CNI/pod-CIDRs/apiserver-image via Talos's new multi-document accessors. Unit tests, lint and build are green.

⚠️ Blocked — needs a product decision (do not promote as-is)

E2E shows the machinery bump cannot land alone: with the module on alpha.2 but ksail's pinned Talos node image still at the stable v1.13.5, every Talos cluster creation fails at boot (setupSharedFilesystems: invalid argument), because the alpha.2-generated machine config is incompatible with the v1.13.5 node. The same suite is fully green on the alpha.1 baseline (#5773), confirming this is the bump, not a flake.

Making this green requires also bumping ksail's default Talos node image to v1.14.0-alpha.2 — i.e. shipping a moving Talos 1.14 alpha as the default local node image. That's a stability call I won't make unilaterally.

Recommendation: hold the Talos 1.14 bump until v1.14.0 stable ships, then bump machinery + node image together. Kept as a draft with the prepared migration code; see #5771. Promote only if you want ksail to track Talos 1.14-alpha now (I'll add the node-image bump on request).

Part of #5771

Talos v1.14.0-alpha.2 removed Network() and APIServer() from the
config.ClusterConfig interface, moving them to the K8sNetworkConfig,
K8sFlannelCNIConfig and K8sAPIServerConfig documents. Read CNI presence,
pod CIDRs and the kube-apiserver image via the new accessors (available on
config.Provider through the machinery's v1alpha1 bridge, so config
generation is unchanged).

Unblocks the Talos alpha.2 dependency bumps #5765 and #5766.

Fixes #5771
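To make the accessor move concrete, here is a minimal Go sketch of how configmanager helpers such as IsCNIDisabled and NetworkCIDR might read from a network document instead of Cluster().Network(). It assumes simplified, hypothetical stand-in interfaces: the method names CNIName and PodCIDRs, the "none" value, and the nil-when-absent behaviour are illustrations, not the real Talos machinery signatures, and the sketch folds K8sNetworkConfig and K8sFlannelCNIConfig into a single interface.

```go
package main

// NOTE: hypothetical, simplified stand-ins for the alpha.2 multi-document
// accessors. The real Talos machinery types and method names may differ.

// K8sNetworkConfig models the network document (assumed shape).
type K8sNetworkConfig interface {
	CNIName() string    // e.g. "flannel", "custom", "none" (assumed values)
	PodCIDRs() []string // pod CIDRs declared in the document
}

// Provider models the slice of config.Provider used here (assumption).
// K8sNetworkConfig returns nil when the document is absent.
type Provider interface {
	K8sNetworkConfig() K8sNetworkConfig
}

const cniNone = "none"

// IsCNIDisabled reports whether the config opts out of a built-in CNI.
// Pre-alpha.2 code would have reached this through Cluster().Network().
func IsCNIDisabled(p Provider) bool {
	nc := p.K8sNetworkConfig()
	return nc != nil && nc.CNIName() == cniNone
}

// NetworkCIDR returns the first declared pod CIDR, if any.
func NetworkCIDR(p Provider) (string, bool) {
	nc := p.K8sNetworkConfig()
	if nc == nil || len(nc.PodCIDRs()) == 0 {
		return "", false
	}
	return nc.PodCIDRs()[0], true
}

// fakeNetwork and fakeProvider are test doubles for this sketch only.
type fakeNetwork struct {
	cni   string
	cidrs []string
}

func (f fakeNetwork) CNIName() string    { return f.cni }
func (f fakeNetwork) PodCIDRs() []string { return f.cidrs }

type fakeProvider struct{ net K8sNetworkConfig }

// K8sNetworkConfig returns a nil interface when no document was set.
func (f fakeProvider) K8sNetworkConfig() K8sNetworkConfig { return f.net }
```

Treating an absent document as "CNI not disabled" is a design choice of this sketch; the real code may default differently.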

coderabbitai Bot commented Jul 3, 2026


📝 Walkthrough


This PR updates indirect Go module versions in go.mod and desktop/go.mod, and migrates Talos configmanager/provisioner code and tests to the alpha.2 multi-document config accessors for network, API server, and related CNI handling.

Changes

Go module dependency bumps

  • Root go.mod dependency bumps (go.mod): Bumps Talos, AWS SDK, containerd typeurl, rtnetlink, ethtool, go-blockdevice/go-talos-support, go-tuf, etcd, yaml, genproto, and Kubernetes indirect dependencies; removes gogo/protobuf.
  • Desktop go.mod dependency bumps (desktop/go.mod): Applies the same indirect dependency bumps and gogo/protobuf removal in desktop/go.mod.

Talos alpha.2 config API migration

  • CNI and Pod CIDR accessors via K8sNetworkConfig (pkg/fsutil/configmanager/talos/configs.go, pkg/fsutil/configmanager/talos/doc.go): IsCNIDisabled and NetworkCIDR now read from K8sNetworkConfig()/K8sFlannelCNIConfig(), and the package example is updated to match.
  • API server config access via K8sAPIServerConfig (pkg/fsutil/configmanager/talos/version.go, pkg/fsutil/configmanager/talos/configs_test.go, pkg/fsutil/configmanager/talos/apiserver_feature_gates_test.go, pkg/svc/provisioner/cluster/talos/provisioner_hetzner_floating_ip_test.go): KubernetesVersionFromProvider and related tests switch from Cluster().APIServer()/Cluster() to K8sAPIServerConfig() for image, extraArgs, and CertSANs reads.
  • Disruptive CNI change detection (pkg/svc/provisioner/cluster/talos/detect_disruptive.go, pkg/svc/provisioner/cluster/talos/detect_disruptive_test.go): machineClusterConfig and cniName use the alpha.2 accessors with a new cniNone constant, and the CNI change tests are updated for the new non-flannel comparison behavior.
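A hypothetical Go sketch of the disruptive-CNI comparison summarised in the last item: normalise both sides to an effective CNI name (with a cniNone constant) and flag any difference. The networkDoc type and the rule that an absent or empty name means Talos's default flannel CNI are assumptions for illustration, not ksail's actual detect_disruptive.go logic.

```go
package main

// Hypothetical sketch: names echo the summary above, the logic is assumed.
const (
	cniFlannel = "flannel" // Talos's default CNI
	cniNone    = "none"
)

// networkDoc is a hypothetical stand-in for the alpha.2 network document.
type networkDoc struct {
	CNI string // empty means the document leaves the CNI unset
}

// cniName returns the effective CNI name. An absent document or an empty
// name is treated as the default flannel CNI (assumption).
func cniName(doc *networkDoc) string {
	if doc == nil || doc.CNI == "" {
		return cniFlannel
	}
	return doc.CNI
}

// isDisruptiveCNIChange reports whether moving from oldDoc to newDoc changes
// the effective CNI, which this sketch treats as requiring a recreate.
func isDisruptiveCNIChange(oldDoc, newDoc *networkDoc) bool {
	return cniName(oldDoc) != cniName(newDoc)
}
```

Normalising before comparing keeps "implicit flannel" and "explicit flannel" from being reported as a change.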

Estimated code review effort: 3 (Moderate) | ~25 minutes


Suggested reviewers: devantler

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately summarizes the main Talos configmanager migration to the alpha.2 multi-document config API.
Description check ✅ Passed The description clearly discusses the Talos alpha.2 config migration and its impact on ksail, matching the changeset.


github-actions Bot commented Jul 3, 2026

Contributor

MegaLinter analysis: Success

✅ Linters with no issues

actionlint, bash-exec, git_diff, hadolint, jscpd, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, shellcheck, shfmt, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint

Notices


See detailed reports in MegaLinter artifacts



github-code-quality Bot commented Jul 3, 2026

Contributor

Code Coverage Overview

Languages: Go

Go / code-coverage/go

The overall coverage in the branch remains at 65%, unchanged from the base branch.

Code coverage summary of the most impacted files:
File f60dfea f934da6 +/-
pkg/fsutil/conf...alos/version.go 77% 76% -1%
pkg/cli/cluster...ocal_service.go 92% 92% 0%
pkg/fsutil/conf...alos/configs.go 86% 87% +1%
pkg/client/reconciler/poll.go 80% 84% +4%
pkg/svc/provisi...t_disruptive.go 82% 87% +5%

Updated July 04, 2026 04:40 UTC

@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

⚠️ Do not promote yet — this draft has a confirmed Talos-bootstrap regression.

CI shows a clean split: every leg that boots a Talos node fails, everything else passes.

  • All 6 standalone Talos System Test legs → FAILURE (bootstrap failed: … authentication handshake failed: connection reset by peer / EOF / connection refused).
  • The 3 bare Vanilla/K3s/VCluster (Docker, true) legs → FAILURE with the same bootstrap error. These legs run the kubernetes-provider nested-cluster test, which itself creates a nested Talos cluster.
  • All non-Talos legs (KWOK, and every Vanilla/K3s/VCluster variant that does not boot Talos) → SUCCESS. Build, Test, Coverage, and both linters → SUCCESS.

Not a flake: the concurrent #5773 run (same CI window, same registry, no Talos-config change) had all its Talos legs pass. So the alpha.2 config migration itself breaks the generated Talos machine config at real bring-up — the "config generation is unchanged" assumption holds for the unit-tested structure but not for a live boot (unit tests don't boot a node).

Next: root-cause the config incompatibility introduced by the alpha.1→alpha.2 machinery bump (likely a generated machine-config, certs, or apid-facing change relative to the Talos image the tests boot; a lockstep Talos test-image bump may be required). This draft stays blocked, and the Dependabot alpha.2 bumps #5765/#5766 stay blocked with it, until a Talos node bootstraps green.

@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

Root-cause of the failing System Tests — confirmed upstream, not fixable in ksail.

The accessor-API migration in this PR is correct and unit-green. The System Test failures are a Talos v1.14.0-alpha.2 boot regression in Docker (container) mode, unrelated to this diff:

[talos] task setupSharedFilesystems (1/1): failed: invalid argument
[talos] phase sharedFilesystems (5/8): failed
[talos] boot sequence: failed
✗ failed to create cluster: bootstrap failed

Talos's SetupSharedFilesystems task runs mount("", t, "", MS_SHARED|MS_REC, "") over ["/", "/var", "/etc/cni", "/run"]. Making a path shared requires it to already be a mount point, else the kernel returns EINVAL ("invalid argument"). The task code is byte-identical between alpha.1 and alpha.2, yet alpha.1 boots fine in Docker and alpha.2 fails here — so an earlier alpha.2 boot-sequence change leaves one of those targets unmounted in container mode. This is the Talos image's internal boot, which ksail cannot influence via machine config.
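The EINVAL condition described above can be approximated without calling mount at all: a path only accepts a propagation change if it appears as a mount point in /proc/self/mountinfo, where the fifth whitespace-separated field of each line is the mount point. The following is a hypothetical diagnostic helper (unmountedTargets is not part of Talos or ksail) that lists which of the four SetupSharedFilesystems targets are missing from a mountinfo dump, for example one captured from inside a failing node container.

```go
package main

import (
	"bufio"
	"strings"
)

// unmountedTargets returns the entries of targets that do not appear as a
// mount point in the given /proc/self/mountinfo contents. Those are the
// paths where mount("", t, "", MS_SHARED|MS_REC, "") would fail with EINVAL.
// Hypothetical diagnostic helper, not part of Talos or ksail.
func unmountedTargets(mountinfo string, targets []string) []string {
	mounted := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Field 5 (index 4) is the mount point. Octal escapes such as \040
		// are not decoded, which is fine for the paths checked here.
		if len(fields) >= 5 {
			mounted[fields[4]] = true
		}
	}

	var missing []string
	for _, t := range targets {
		if !mounted[t] {
			missing = append(missing, t)
		}
	}
	return missing
}
```

Fed a dump in which "/", "/var" and "/run" are mounted but "/etc/cni" is not, it reports only "/etc/cni".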

Disposition: the migration stays a draft until the upstream Docker-mode boot regression is fixed (or a fixed alpha ships). The superseded Dependabot bumps #5766/#5765 (both targeting the same alpha.2) are being closed in favour of this PR, which does the required API adaptation they lack.

@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

Correction to my previous comment — the root cause is a node-image/module version mismatch, not an upstream regression.

My earlier note framed the failing System Tests as an "upstream Talos alpha.2 Docker-mode boot regression." That is inaccurate and I'm superseding it. The accurate root cause:

ksail pins the Talos node image separately from the machinery module. pkg/fsutil/configmanager/talos/Dockerfile pins ghcr.io/siderolabs/talos:v1.13.5 (DefaultTalosImage, read via go:embed). This PR bumps the machinery module to v1.14.0-alpha.2, so ksail now generates an alpha.2-format machine config and applies it to a v1.13.5 node. That node's boot sequence rejects the newer config format at setupSharedFilesystems (invalid argument), which cascades into apid being down and then a bootstrap failure. The #5773 alpha.1 baseline on identical infra is fully green, confirming it's the bump, not a flake.
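Because the module and the node image are coupled, one option is a unit test that fails as soon as they drift apart. A minimal sketch, assuming the Dockerfile contents and the machinery module version are passed in as plain strings (how ksail would obtain them, for instance from the go:embed variable and go.mod, is left open); imageFromDockerfile, minorVersion and checkImageMatchesModule are hypothetical helpers that compare only the major.minor line.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// imageFromDockerfile returns the image reference on the first FROM line,
// skipping flags such as --platform=... (hypothetical helper).
func imageFromDockerfile(dockerfile string) (string, error) {
	for _, line := range strings.Split(dockerfile, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || !strings.EqualFold(fields[0], "FROM") {
			continue
		}
		for _, f := range fields[1:] {
			if !strings.HasPrefix(f, "--") {
				return f, nil
			}
		}
	}
	return "", errors.New("no FROM line found")
}

// minorVersion reduces "v1.14.0-alpha.2" or "ghcr.io/x/talos:v1.13.5" to
// "v1.14" / "v1.13". Digest-pinned references are not handled.
func minorVersion(s string) string {
	if i := strings.LastIndex(s, ":"); i >= 0 {
		s = s[i+1:] // keep only the tag part of an image reference
	}
	parts := strings.SplitN(s, ".", 3)
	if len(parts) < 2 {
		return s
	}
	return parts[0] + "." + parts[1]
}

// checkImageMatchesModule fails when the pinned node image and the
// machinery module are on different major.minor lines.
func checkImageMatchesModule(dockerfile, moduleVersion string) error {
	img, err := imageFromDockerfile(dockerfile)
	if err != nil {
		return err
	}
	if got, want := minorVersion(img), minorVersion(moduleVersion); got != want {
		return fmt.Errorf("node image %s is on %s but machinery module is on %s", img, got, want)
	}
	return nil
}
```

Comparing only major.minor would still accept v1.14.0-alpha.2 machinery with a v1.14.0 image; a stricter variant could compare full tags.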

The module and node image are coupled and must move together. So this is our coupled-bump decision, not an external block:

  • Decision (conservative): HOLD the Talos 1.14 bump until v1.14.0 is STABLE, then bump the machinery module and the Dockerfile node image (the v1.14.0-alpha.2 node image exists — a one-line Dockerfile change) in the same PR. Greening this now would ship a moving 1.14 alpha as ksail's default node image, which is unwarranted.
  • The accessor-API migration in this PR (the actual code change) is correct and unit-green; it's the go.mod version + missing node-image bump that make the E2E red.

Tracked by #5771. The superseded Dependabot bumps #5765/#5766 were closed in favour of this PR. Do not rerun the red E2E; it is expected to fail until the node image moves with the module.

Resolve go.mod/go.sum conflicts: keep the Talos machinery v1.14.0-alpha.2 bump,
take gopacket v1.6.1 from main (#5767), regenerate go.sum via go mod tidy for
both root and desktop modules. Clears the DIRTY merge state on this parked draft;
its E2E remains red-by-design (node-image/module version hold — see PR body).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
devantler marked this pull request as ready for review July 4, 2026 06:15
@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

Parked — blocker live-verified (2026-07-04). Confirmed against siderolabs/talos releases: v1.14.0 stable is not out — the newest tag is v1.14.0-alpha.2 (pre-release); current stable is v1.13.5. The failing Talos E2E is not a code bug: alpha.2's config API is incompatible with the pinned stable node image, and a stable CLI should not default its node image to a pre-release.

Decision: hold this migration until Talos v1.14.0 stable ships, then bump the module + node image together in one PR. Converting back to draft so it isn't accidentally merged (merging as-is breaks every Talos cluster create). Dependabot #5765/#5766 stay parked behind this. Re-evaluate when v1.14.0 stable is released.
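The "no pre-release default" rule could also be enforced mechanically. A minimal Go sketch, assuming semver-style tags: a tag is a pre-release if a "-" suffix follows its version core, while build metadata after "+" does not count. isPrerelease and validateDefaultNodeTag are hypothetical names, not existing ksail functions.

```go
package main

import (
	"fmt"
	"strings"
)

// isPrerelease reports whether a semver-style tag such as "v1.14.0-alpha.2"
// carries a pre-release suffix. Build metadata after "+" is stripped first,
// so "v1.14.0+build-1" still counts as a release.
func isPrerelease(tag string) bool {
	if i := strings.IndexByte(tag, '+'); i >= 0 {
		tag = tag[:i]
	}
	// The version core (MAJOR.MINOR.PATCH) never contains "-", so any "-"
	// left at this point introduces a pre-release identifier.
	return strings.Contains(tag, "-")
}

// validateDefaultNodeTag is a hypothetical guard: a stable CLI refuses to
// default to a pre-release node image unless explicitly allowed.
func validateDefaultNodeTag(tag string, allowPrerelease bool) error {
	if isPrerelease(tag) && !allowPrerelease {
		return fmt.Errorf("default node image tag %s is a pre-release", tag)
	}
	return nil
}
```

With this guard, v1.13.5 passes and v1.14.0-alpha.2 is rejected unless the caller opts in, which matches the hold decision above.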

devantler marked this pull request as draft July 4, 2026 07:27