Two related deploy-orchestration bugs surfaced while promoting noc-agent (the proactive-loop work). Sibling to #262.
Bug 1 — manual apply and auto app-promotion-deploy cancel each other
app-promotion-deploy.yml triggers on push to main (deploy paths) and calls apply.yml with dry_run: false, which waits at the production gate. Both that reusable call and any manual apply.yml workflow_dispatch land in the same live concurrency group.
When an auto-deploy is sitting at the gate and an operator also dispatches a manual apply (or vice versa), GitHub cancels one of the pending runs in the group. Result: applies silently end as cancelled with no job ever starting, and it looks random.
Worse, unapproved auto-deploys accumulate at the gate and hold the lane. Evidence this session:
Affected runs: 27648731930, 27654535843.
A stale auto-deploy from an earlier promotion merge (Promote noc-agent 84cb3c9 #261) sat waiting ~2h (27648482775). Because the called apply checks out that merge commit, approving it would have deployed the superseded (old) pin, not current main.
Re-dispatching only worked when the lane happened to be clear (27649053353, 27655203336).
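To make the "looks random" behaviour easier to reason about, here is a minimal Python model of a shared concurrency group. It assumes GitHub's default semantics when cancel-in-progress is not set: one run holds the group, at most one run is pending behind it, and a newly queued run cancels whatever was pending. The class and the run labels are hypothetical and only illustrate the interaction; they are not taken from the repository's workflows.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Simplified model: one active run and at most one pending run."""
    active: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)

    def enqueue(self, run: str) -> None:
        # The first run takes the lane (here: it sits at the approval gate).
        if self.active is None:
            self.active = run
            return
        # Any run already pending is cancelled when a newer one is queued.
        if self.pending is not None:
            self.cancelled.append(self.pending)
        self.pending = run


group = ConcurrencyGroup()
group.enqueue("auto-deploy (older merge)")  # holds the lane at the gate
group.enqueue("manual apply")               # queued behind it
group.enqueue("auto-deploy (newer merge)")  # displaces the manual apply

print(group.cancelled)
```

Under this model, which run is cancelled depends only on arrival order, which would explain why a re-dispatch succeeds only when the lane happens to be clear.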
Proposed fix
Make superseded promotion deploys auto-cancel (e.g. cancel-in-progress: true for the promotion-deploy lane, or have app-promotion-deploy cancel prior pending runs for the same target) so they don't pile up.
Document/enforce a single deploy path: after a promotion merge, approve the auto-deploy gate rather than manually dispatching apply; if a manual apply is needed, ensure the live lane is clear first.
Consider checking out current main rather than the push SHA so a queued promotion deploy can't ship a superseded pin.
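As a rough sketch of the "cancel prior pending runs for the same target" option, the selection logic could look like the following. The Run record and its fields (run_id, target, status, created_at) are hypothetical stand-ins for whatever the workflow can query about its own runs, and the IDs in the example are made up. Only the choice of which runs to cancel is shown, not the cancellation call itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    run_id: int
    target: str      # deploy target, e.g. a host or playbook name
    status: str      # "waiting" means parked at the approval gate
    created_at: int  # any monotonically increasing timestamp


def superseded_runs(runs: List[Run]) -> List[int]:
    """Return IDs of waiting runs that have a newer waiting run for the same target."""
    newest: Dict[str, Run] = {}
    for run in runs:
        if run.status != "waiting":
            continue
        current = newest.get(run.target)
        if current is None or run.created_at > current.created_at:
            newest[run.target] = run
    return [
        run.run_id
        for run in runs
        if run.status == "waiting" and newest[run.target].run_id != run.run_id
    ]


example = [
    Run(101, "noc", "waiting", 1),
    Run(102, "noc", "waiting", 2),
    Run(103, "mon", "waiting", 3),
]
print(superseded_runs(example))
```

Keeping only the newest waiting run per target means an approver can never ship an older pin by approving a stale gate.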
Bug 2 — detect maps a noc_agent_version change to icinga2/mon, not noc
The #264 merge changed only ansible/inventory/host_vars/noc.yml (noc_agent_version: 84cb3c9 → 7941d45). The detect job of the resulting auto-deploy (27654482813) produced a matrix of just apply (icinga2, mon), so the app pin bump would not have deployed noc-agent at all; it only re-rendered icinga on mon.
Expected: a noc_agent_version change in host_vars/noc.yml should map to playbook=noc, limit=noc (the app deploy), optionally also an icinga refresh — but it must include noc.
Proposed fix
Fix the detect path→playbook mapping (the add_once(playbook, limit) logic) so an app-version pin change in host_vars/<host>.yml maps to that host's app playbook.
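The intended mapping can be sketched in Python as below. This is not the actual detect implementation, whose language and rules are not shown in this issue. It assumes that the set of changed keys per file is available, that an app pin is any key ending in _version, and that a host's app playbook shares the host's name (as in the expected playbook=noc, limit=noc). Only add_once(playbook, limit) is a name taken from the issue; the matrix argument and everything else are illustrative.

```python
import re
from typing import Dict, List, Set, Tuple

# Matches ansible/inventory/host_vars/<host>.yml and captures <host>.
HOST_VARS = re.compile(r"^ansible/inventory/host_vars/(?P<host>[^/]+)\.yml$")

Matrix = List[Tuple[str, str]]


def add_once(matrix: Matrix, playbook: str, limit: str) -> None:
    """Append (playbook, limit) unless it is already in the matrix."""
    entry = (playbook, limit)
    if entry not in matrix:
        matrix.append(entry)


def detect(changes: Dict[str, Set[str]]) -> Matrix:
    """Map changed files (path -> changed top-level keys) to deploy targets."""
    matrix: Matrix = []
    for path, keys in changes.items():
        match = HOST_VARS.match(path)
        if match is None:
            continue
        host = match.group("host")
        # Assumption: an app-version pin change deploys that host's own playbook.
        if any(key.endswith("_version") for key in keys):
            add_once(matrix, host, host)
    return matrix


result = detect({"ansible/inventory/host_vars/noc.yml": {"noc_agent_version"}})
print(result)
```

An optional icinga refresh could be added with a second add_once call; the point of the sketch is that the host's own entry is always present.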
Acceptance
Merging a promotion PR that bumps noc_agent_version triggers an auto-deploy whose matrix includes (noc, noc).
A manual apply and an auto-deploy no longer silently cancel each other; superseded promotion deploys are cancelled rather than left waiting.
Refs: app-promotion-deploy.yml, apply.yml; surfaced promoting noc-agent for the proactive NOC loop (#264). Sibling to #262.