fix(talos): accept the actions publish-app identity at node image verification #2457

Merged: devantler merged 1 commit into main from claude/talos-image-verification-actions-identity on Jul 4, 2026

@devantler


🤖 Generated by the Daily AI Assistant

Why

New app releases cannot start on prod: wedding-app's latest rollout is stuck in ImagePullBackOff because the Talos node-level image verification still only trusts the old reusable-workflows signing identity, while images are now signed by the actions workflow after the publish-app move. Every future app release hits the same wall until this lands.

What

Accepts both the actions and reusable-workflows publish-app signing identities in the Talos ImageVerificationConfig — the same transition alternation already in place for the Flux OCIRepository verification and the Kyverno verify-app-images policy (third and final verification layer to get this fix). Narrow to actions-only once every app has published a fresh actions-signed image.
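As a rough illustration of the transition alternation described above, the sketch below compares an old-identity-only pattern with one that accepts both repositories. The actual subjectRegex in talos/cluster/image-verification.yaml is not shown in this PR page, so both patterns and both sample subjects are assumptions, modeled on the usual shape of a keyless certificate subject for a GitHub Actions workflow (`https://github.com/<org>/<repo>/.github/workflows/<file>@<ref>`).

```python
import re

# ASSUMPTION: illustrative patterns only; the real subjectRegex in
# talos/cluster/image-verification.yaml may be anchored or scoped differently.
OLD_ONLY = re.compile(
    r"https://github\.com/devantler-tech/reusable-workflows/"
    r"\.github/workflows/publish-app\.yaml@.+"
)
TRANSITION = re.compile(
    r"https://github\.com/devantler-tech/(actions|reusable-workflows)/"
    r"\.github/workflows/publish-app\.yaml@.+"
)


def accepts(pattern, subject):
    """Return True when the whole certificate subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


# Hypothetical subjects for an image signed after and before the workflow move.
actions_subject = (
    "https://github.com/devantler-tech/actions/"
    ".github/workflows/publish-app.yaml@refs/heads/main"
)
legacy_subject = (
    "https://github.com/devantler-tech/reusable-workflows/"
    ".github/workflows/publish-app.yaml@refs/heads/main"
)

for name, pattern in (("old-only", OLD_ONLY), ("transition", TRANSITION)):
    print(name, accepts(pattern, actions_subject), accepts(pattern, legacy_subject))
```

Narrowing to actions-only later would amount to dropping the `reusable-workflows` branch of the alternation, which is safe only once no running workload still depends on a legacy-signed image.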

Deploys via the merge queue's ksail cluster update step; no manual action needed after merge.

…ification

The shared publish-app.yaml workflow moved from reusable-workflows into
actions (actions#425); the Talos ImageVerificationConfig catch-all still
pinned the old identity, so every newly published app image fails node-level
pull verification (ImagePullBackOff). Alternate both identities, same
transition as the Flux OCIRepository verify blocks and verify-app-images.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 4, 2026


📝 Walkthrough


Talos image verification config was updated to accept keyless signatures for publish-app.yaml from both devantler-tech/actions and devantler-tech/reusable-workflows, broadening the subjectRegex, along with updated comments describing the migration transition.

Changes

Image verification rule update

| Layer / File(s) | Summary |
| --- | --- |
| Dual signing identity support: talos/cluster/image-verification.yaml | Comments were updated to describe the migration to devantler-tech/actions, and the subjectRegex for first-party app images was broadened to accept publish-app.yaml signatures from both devantler-tech/actions and devantler-tech/reusable-workflows. |

Estimated code review effort: 1 (Trivial) | ~5 minutes

Suggested labels: documentation, security

Suggested reviewers: devantler

Poem
A rabbit hops through YAML rows so tight,
Two signers now approved, both signed just right.
From reusable-workflows to actions new,
The keyless trust now widens for the crew.
🐰 A tiny hop, a safer image night.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title is specific and accurately summarizes the Talos image verification change. |
| Description check | ✅ Passed | The description directly explains the verification identity migration and matches the changeset. |

@devantler


🤖 Generated by the Daily AI Assistant

One-click needed — same circularity as #2456. The merge queue is still wedged: every merge_group deploy fails its health gate on Deployment/wedding-app (ImagePullBackOff from the stale Talos identity this PR fixes), and the 🔄 Update cluster step that would apply this fix runs after that gate — so this PR cannot merge through the queue until the live config is patched first.
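The circularity can be modeled in a few lines. This is a toy sketch, not the real cd.yaml: the step names, the `cluster` dictionary, and its keys are all hypothetical, and the only property carried over from the description above is the ordering (health gate first, cluster update second).

```python
def new_wedged_cluster():
    """Hypothetical cluster state: nodes trust only the old signer."""
    return {
        "trusted": {"reusable-workflows"},
        "desired_trusted": {"actions", "reusable-workflows"},
        "image_signer": "actions",
    }


def health_gate(cluster):
    # The app becomes healthy only if nodes already trust the image's signer.
    return cluster["image_signer"] in cluster["trusted"]


def update_cluster(cluster):
    # Applies the desired verification config to the nodes.
    cluster["trusted"] = set(cluster["desired_trusted"])
    return True


def run_deploy(steps, cluster):
    """Run steps in order and stop at the first failure, like a CI job."""
    executed = []
    for name, step in steps:
        executed.append(name)
        if not step(cluster):
            return executed, False
    return executed, True


QUEUE_ORDER = [("health-gate", health_gate), ("update-cluster", update_cluster)]

cluster = new_wedged_cluster()
print(run_deploy(QUEUE_ORDER, cluster))  # the gate fails, the update never runs

update_cluster(cluster)                  # out-of-band live patch
print(run_deploy(QUEUE_ORDER, cluster))  # both steps now run and succeed
```

In this model the fix can only arrive from outside the pipeline, which is why the options that follow involve a live patch.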

Option 1 (preferred): live patch, then promote this PR. Dry-run verified on prod-worker-3: one-line diff, applied without reboot (--mode=no-reboot refuses to apply if a reboot would be needed). My apply was classifier-denied; run:

for n in 10.0.1.1 10.0.1.2 10.0.1.3 10.0.1.4 10.0.1.6 10.0.1.9 10.0.1.5 10.0.1.8 10.0.1.11; do
  talosctl -e 49.13.53.183 -n "$n" patch mc --mode=no-reboot --patch @talos/cluster/image-verification.yaml
done

(run from this PR's branch so the patch file already carries the alternation; the file is a complete-document patch, so patch mc replaces the rules list wholesale). Within ~5 min kubelet backoff retries the pull, wedding-app rolls out, the queue heals, and this PR then merges through it normally, re-asserting the same state declaratively.
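The "~5 min" figure lines up with kubelet's image pull back-off, which grows between attempts up to a documented cap of 300 seconds. The sketch below shows how such a schedule saturates; the 10-second base and the doubling factor are assumptions for illustration, and the function name is hypothetical.

```python
def pull_backoff_schedule(attempts, base_seconds=10, cap_seconds=300):
    """Delay before each retry: doubles per attempt, never above the cap.

    ASSUMPTION: base_seconds=10 and the doubling factor are illustrative;
    only the 300-second cap is taken from the Kubernetes documentation.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))
        delay *= 2
    return delays


print(pull_backoff_schedule(7))  # [10, 20, 40, 80, 160, 300, 300]
```

A pod that has been failing for a while is therefore already at the cap, so the next successful pull can lag the patch by up to that interval.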

Option 2 — merge this PR past the queue (admin bypass) and dispatch cd.yaml from main; the deploy's cluster-update step applies it. Heavier, and the deploy still needs the health gate relaxed or the same live patch to pass — Option 1 is cleaner.

Autoscaler nodes: patched live too (all 9), but nodes booted after the patch from the Hetzner snapshot get the old config until a ksail cluster update re-syncs — merging this PR closes that gap.

@devantler devantler marked this pull request as ready for review July 4, 2026 14:24
@devantler devantler added this pull request to the merge queue Jul 4, 2026
@devantler devantler removed this pull request from the merge queue due to a manual request Jul 4, 2026
@devantler devantler merged commit 50bd53f into main Jul 4, 2026
15 checks passed
@devantler devantler deleted the claude/talos-image-verification-actions-identity branch July 4, 2026 14:24
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jul 4, 2026
@botantler-1

botantler-1 Bot commented Jul 4, 2026


🎉 This PR is included in version 1.96.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler-1 botantler-1 Bot added the released label Jul 4, 2026
@devantler


🤖 Generated by the Daily AI Assistant

Thanks for promoting — holding the enqueue until the Option-1 live patch runs (the queue circularity above: this PR's own merge_group deploy fails the wedding-app health gate before reaching the cluster-update step, so queueing it now would only evict). Verified just now: nodes still carry the old identity and the pods are still in ImagePullBackOff.

I have a watcher armed: as soon as the patch is applied and wedding-app rolls out, I enqueue this PR (bare gh pr merge), then the parked promoted PRs #2443 and #2437, and re-queue #2434.

@devantler


🤖 Generated by the Daily AI Assistant

Merged, but the wedge persists — the fix can't reach the nodes on its own. Talos still runs the old reusable-workflows-only regex (read live 15:10Z), because the deploy's 🔄 Update cluster step runs after the reconcile health gate, which fails on wedding-app's ImagePullBackOff — pr-2427's merge-group deploy failed exactly there at 14:58Z and the queue keeps evicting.

One-click to unwedge (my attempt was permission-gated): from a checkout of current main (50bd53f), run

ksail --config ksail.prod.yaml cluster update

(or dispatch cd.yaml — same composite... though it health-gates too; the local cluster-update is the reliable path). The delta is the one-line subjectRegex alternation — In-Place, no reboot. After it applies, wedding-app pods pull, the health gate clears, and I'll re-enqueue the promoted fleet (#2443, #2437, #2434, #2452, …).

@devantler


🤖 Generated by the Daily AI Assistant

Cluster-update confirmed applied (thanks!) — images pull now, but the wedge has a 4th layer: wedding-app v1.14.1 itself never boots. New pods went ImagePullBackOff → CrashLoopBackOff (unsettled top-level await at server.init, Node exit 13), so the Deployment stays Failed and the merge-queue health gate still evicts. Root cause (bisect-verified locally): the @sveltejs/adapter-node 5.5.4→5.5.5 bump in v1.14.1 — the 5.5.5 output hangs at boot.

Fix = wedding-app#152 (pin 5.5.4, dependabot-ignore 5.5.5, restore the Lighthouse lane as the boot canary; boot-smoke verified locally). Promote #152 → v1.14.2 releases → prod rollout heals → I drain the queue (promoted fleet #2443/#2437/#2436/#2434/#2452/#2433/#2440, renovate #2458, re-queue #2427/#2430). The site itself is still up meanwhile — the previous ReplicaSet keeps serving.
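For readers unfamiliar with the idea of a boot canary, a minimal boot-smoke check might look like the sketch below. It is not the check from wedding-app#152, which is not shown here; the function name, the grace period, and the "still running means booted" criterion are all assumptions. It simply starts a command and fails if the process exits before the grace period ends, which is how an early exit such as code 13 would surface.

```python
import subprocess
import sys


def boot_smoke(cmd, grace_seconds=2.0):
    """Start cmd and report whether it survives the grace period.

    Returns (ok, exit_code): ok is True when the process is still running
    after grace_seconds; otherwise exit_code carries its early exit status.
    """
    proc = subprocess.Popen(cmd)
    try:
        code = proc.wait(timeout=grace_seconds)
        return False, code
    except subprocess.TimeoutExpired:
        # Still alive: treat as booted, then clean up the child process.
        proc.terminate()
        proc.wait()
        return True, None


# Stand-ins for a server that dies at boot and one that keeps running.
crashing = [sys.executable, "-c", "import sys; sys.exit(13)"]
healthy = [sys.executable, "-c", "import time; time.sleep(30)"]

print(boot_smoke(crashing, grace_seconds=5.0))
print(boot_smoke(healthy, grace_seconds=1.0))
```

A real canary would more likely probe an HTTP endpoint than rely on process liveness, since a server can hang without exiting.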
