
app-promotion-deploy: manual/auto apply concurrency collisions + wrong detect matrix #265

Description

@Svaag

Two related deploy-orchestration bugs surfaced while promoting noc-agent (the proactive-loop work). Sibling to #262.

Bug 1 — manual apply and auto app-promotion-deploy cancel each other

app-promotion-deploy.yml triggers on push to main (deploy paths) and calls apply.yml with dry_run: false, which waits at the production gate. Both that reusable call and any manual apply.yml workflow_dispatch land in the same live concurrency group:

# apply.yml
concurrency:
  group: ${{ inputs.dry_run && format('production-infra-dry-run-{0}', github.run_id) || 'production-infra-live-v2' }}
  cancel-in-progress: false

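When an auto-deploy is sitting at the gate and an operator also dispatches a manual apply (or vice versa), GitHub cancels one of the pending runs in the group: a concurrency group holds at most one running and one pending run, so a newly queued run replaces (cancels) any run already pending, even with cancel-in-progress: false. Result: applies silently end as cancelled with no job ever starting, and because the survivor depends on arrival order, it looks random.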
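To make the "looks random" behaviour concrete, here is a toy model of a single concurrency group with cancel-in-progress: false. It is only a sketch of the documented queueing rule (one running, one pending, newest pending wins). The class, the function, and the run names are hypothetical. It assumes the run holding the lane counts as in progress, and it does not try to reproduce how reusable-workflow calls or environment gates interact with the group.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model of one concurrency group with cancel-in-progress: false."""
    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)

    def enqueue(self, run: str) -> None:
        # An empty lane starts the run immediately.
        if self.running is None:
            self.running = run
            return
        # The lane is held: any run already pending is cancelled and
        # the newcomer takes the single pending slot.
        if self.pending is not None:
            self.cancelled.append(self.pending)
        self.pending = run

    def finish(self) -> None:
        # The running run ends; the pending run (if any) is promoted.
        self.running, self.pending = self.pending, None


def simulate(runs: Iterable[str]) -> ConcurrencyGroup:
    """Enqueue runs in arrival order and return the resulting group state."""
    group = ConcurrencyGroup()
    for run in runs:
        group.enqueue(run)
    return group


# Hypothetical arrival order: an auto-deploy holds the lane, then a manual
# apply queues, then a second auto-deploy arrives and displaces it.
lane = simulate(["auto-deploy-1", "manual-apply", "auto-deploy-2"])
print(lane.running, lane.pending, lane.cancelled)
```

In this model the manual apply is cancelled without ever starting, purely because another run queued behind it while the lane was held.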

Worse, unapproved auto-deploys accumulate at the gate and hold the lane. Evidence this session:

  • Manual applies cancelled mid-gate: 27648731930, 27654535843.
  • A stale auto-deploy from an earlier promotion merge (Promote noc-agent 84cb3c9 #261) sat waiting ~2h: 27648482775 — and because the called apply checks out that merge commit, approving it would have deployed the superseded (old) pin, not current main.
  • Re-dispatching only worked when the lane happened to be clear (27649053353, 27655203336).

Proposed fix

  • Make superseded promotion deploys auto-cancel (e.g. cancel-in-progress: true for the promotion-deploy lane, or have app-promotion-deploy cancel prior pending runs for the same target) so they don't pile up.
  • Document/enforce a single deploy path: after a promotion merge, approve the auto-deploy gate rather than manually dispatching apply; if a manual apply is needed, ensure the live lane is clear first.
  • Consider checking out current main rather than the push SHA so a queued promotion deploy can't ship a superseded pin.

Bug 2 — detect maps a noc_agent_version change to icinga2/mon, not noc

The #264 merge changed only ansible/inventory/host_vars/noc.yml (noc_agent_version: 84cb3c9 → 7941d45). The resulting auto-deploy (27654482813) detect job produced a matrix of just apply (icinga2, mon) — so the app pin bump would not have deployed noc-agent at all; it only re-rendered icinga on mon.

Expected: a noc_agent_version change in host_vars/noc.yml should map to playbook=noc, limit=noc (the app deploy), optionally also an icinga refresh — but it must include noc.

Proposed fix

Fix the detect path→playbook mapping (the add_once(playbook, limit) logic) so an app-version pin change in host_vars/<host>.yml maps to that host's app playbook.
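One possible shape for the corrected mapping, as a minimal sketch rather than the actual detect implementation. Only add_once(playbook, limit), the host_vars path, and noc_agent_version come from this issue. The APP_PIN_PLAYBOOKS table, build_matrix, and the idea that detect can see which keys changed in each file are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches per-host variable files, e.g. ansible/inventory/host_vars/noc.yml.
HOST_VARS = re.compile(r"^ansible/inventory/host_vars/(?P<host>[^/]+)\.yml$")

# Hypothetical table: app-version pin variable -> that app's playbook.
APP_PIN_PLAYBOOKS: Dict[str, str] = {"noc_agent_version": "noc"}


def build_matrix(changes: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Map changed files (path -> changed top-level keys) to (playbook, limit) pairs.

    Assumes the detect job can list the keys that changed in each file.
    """
    matrix: List[Tuple[str, str]] = []

    def add_once(playbook: str, limit: str) -> None:
        # Keep the matrix free of duplicates while preserving order.
        entry = (playbook, limit)
        if entry not in matrix:
            matrix.append(entry)

    for path, changed_keys in changes.items():
        match = HOST_VARS.match(path)
        if match is None:
            continue  # other path rules would be handled elsewhere
        host = match.group("host")
        for key in changed_keys:
            playbook = APP_PIN_PLAYBOOKS.get(key)
            if playbook is not None:
                # An app pin change deploys that app on the host it is pinned for.
                add_once(playbook, host)
    return matrix


print(build_matrix({"ansible/inventory/host_vars/noc.yml": ["noc_agent_version"]}))
```

For the #264 change this yields a matrix containing (noc, noc). Whether an icinga refresh is added alongside it is left to the existing rules.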

Acceptance

  • Merging a promotion PR that bumps noc_agent_version triggers an auto-deploy whose matrix includes (noc, noc).
  • A manual apply and an auto-deploy no longer silently cancel each other; superseded promotion deploys are cancelled rather than left waiting.

Refs: app-promotion-deploy.yml, apply.yml; surfaced promoting noc-agent for the proactive NOC loop (#264). Sibling to #262.

🤖 Generated with Claude Code

Labels: loop:knowledge-gap (Knowledge context is missing, stale, or contradictory)
