fix(swift-vnet): retry az login on transient DNS error at container cold start (ARO-28135)#5918

Merged
openshift-merge-bot[bot] merged 1 commit into
Azure:mainfrom
raelga:rael/aro-28135-swift-vnet-login-retry
Jul 3, 2026

@raelga raelga commented Jul 3, 2026

Collaborator

Fixes ARO-28135 — follow-up hardening for the Swift VNet Shell step from #5889 (relates to ARO-28045).

What

Wrap the container's az login --identity and az account set in the existing retry() helper. This required hoisting retry() (and MAX_WAIT/POLL_INTERVAL) above the login block.
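For readers without the script open, the shape of the change can be sketched roughly as follows. This is not the actual helper from swift-vnet.sh: the body of retry(), the POLL_INTERVAL value, the log wording, and the SUBSCRIPTION_ID variable are assumptions made for illustration. Only the names retry, MAX_WAIT, and POLL_INTERVAL and the 180s window come from this PR.

```shell
#!/usr/bin/env bash
set -eu

# Assumed defaults: the PR states a 180s window; the poll interval is a guess.
MAX_WAIT="${MAX_WAIT:-180}"
POLL_INTERVAL="${POLL_INTERVAL:-10}"

# retry CMD [ARGS...]
# Re-runs the command until it succeeds or roughly MAX_WAIT seconds of
# sleeping have accumulated. A command used as an `until` condition is
# exempt from `set -e`, so a failed attempt does not abort the script.
retry() {
  local elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$MAX_WAIT" ]; then
      echo "giving up after ${elapsed}s: $*" >&2
      return 1
    fi
    echo "transient failure, retrying in ${POLL_INTERVAL}s: $*" >&2
    sleep "$POLL_INTERVAL"
    elapsed=$(( elapsed + POLL_INTERVAL ))
  done
}

# With retry() defined before the login block, the first two az calls can
# be guarded (SUBSCRIPTION_ID is a hypothetical variable name):
#   retry az login --identity
#   retry az account set --subscription "$SUBSCRIPTION_ID"
```

The ordering matters because a shell function has to be defined before the line that calls it is executed, which is why the helper and its constants need to sit above the login block.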

Why

swift-vnet.sh launches a container group that runs as globalMSI and logs in via az login --identity as its very first action. On an ACI cold start the network/DNS stack is occasionally not ready the instant the container runs, so login fails with a transient DNS error and set -e kills the whole step:

Logging in with the Swift-registered managed identity...
ERROR: <urllib3.connection.HTTPSConnection object at ...>: Failed to establish a new connection: [Errno -3] Try again
✗ swift-vnet-aks-net exited with code 1

[Errno -3] Try again is getaddrinfo's EAI_AGAIN, a temporary name-resolution failure. The retry() helper already absorbs eventually-consistent failures (RBAC propagation) for az group show, az resource tag, and az network vnet create, but it was defined after the login, so the two most network-sensitive first calls ran unguarded. A single cold-start blip therefore failed the step even though a retry a few seconds later would have succeeded.

Testing

bash -n syntax check passes. The change only reorders existing helper definitions and adds retry in front of two az calls; no behavioral change beyond retrying transient failures within the existing 180s window.

Special notes for your reviewer

Pure hardening — no change to what the step does, only its resilience to cold-start DNS races. Same class of eventual-consistency handling the rest of the script already relies on, now extended to the login path.

PR Checklist

  • Does this change require documentation? No — internal script resilience only.
  • Does this change require tests? Covered by existing CI + EV2 rollout of the step.

…old start (ARO-28135)

The swift-vnet container logs in via 'az login --identity' as the first thing
it does. On an ACI cold start the network/DNS stack is occasionally not ready,
so login fails with a transient DNS error ([Errno -3] Try again) and set -e
kills the whole step.

The existing retry() helper already absorbs eventually-consistent failures for
group show / resource tag / vnet create, but it was defined after the login, so
the two most network-sensitive first calls ran unguarded. Hoist retry() above
the login and wrap 'az login --identity' and 'az account set' in it so a
cold-start DNS blip self-heals within the existing 180s window.
Copilot AI review requested due to automatic review settings July 3, 2026 12:14
@openshift-ci openshift-ci Bot requested review from roivaz and stevekuznetsov July 3, 2026 12:14
@openshift-ci openshift-ci Bot added the approved label Jul 3, 2026

Copilot AI left a comment

Contributor

Pull request overview

Hardens the Swift management VNet provisioning shell script by making the container’s initial Azure CLI authentication resilient to transient DNS/network readiness issues during ACI cold starts (ARO-28135). This aligns the login path with the script’s existing retry-based handling for eventual consistency and transient failures.

Changes:

  • Hoists the existing retry() helper (and its timing constants) earlier in the container’s inline script so it can be used during initial authentication.
  • Wraps az login --identity and az account set with retry() to self-heal transient cold-start DNS failures.
  • Updates the retry log message to reflect both RBAC propagation and cold-start network transients.

mmazur commented Jul 3, 2026

Collaborator

/lgtm

openshift-ci Bot commented Jul 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mmazur, raelga

@openshift-merge-bot openshift-merge-bot Bot merged commit f28d3c0 into Azure:main Jul 3, 2026
17 checks passed