fix(swift-vnet): retry az login on transient DNS error at container cold start (ARO-28135)#5918
Merged
openshift-merge-bot[bot] merged 1 commit intoJul 3, 2026
Conversation
…old start (ARO-28135) The swift-vnet container logs in via 'az login --identity' as the first thing it does. On an ACI cold start the network/DNS stack is occasionally not ready, so login fails with a transient DNS error ([Errno -3] Try again) and set -e kills the whole step. The existing retry() helper already absorbs eventually-consistent failures for group show / resource tag / vnet create, but it was defined after the login, so the two most network-sensitive first calls ran unguarded. Hoist retry() above the login and wrap 'az login --identity' and 'az account set' in it so a cold-start DNS blip self-heals within the existing 180s window.
Contributor
There was a problem hiding this comment.
Pull request overview
Hardens the Swift management VNet provisioning shell script by making the container’s initial Azure CLI authentication resilient to transient DNS/network readiness issues during ACI cold starts (ARO-28135). This aligns the login path with the script’s existing retry-based handling for eventual consistency and transient failures.
Changes:
- Hoists the existing
retry()helper (and its timing constants) earlier in the container’s inline script so it can be used during initial authentication. - Wraps
az login --identityandaz account setwithretry()to self-heal transient cold-start DNS failures. - Updates the retry log message to reflect both RBAC propagation and cold-start network transients.
Collaborator
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mmazur, raelga The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This was referenced Jul 3, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes ARO-28135 — follow-up hardening for the Swift VNet Shell step from #5889 (relates to ARO-28045).
What
Wrap the container's
az login --identityandaz account setin the existingretry()helper. This required hoistingretry()(andMAX_WAIT/POLL_INTERVAL) above the login block.Why
swift-vnet.shlaunches a container group that runs asglobalMSIand logs in viaaz login --identityas its very first action. On an ACI cold start the network/DNS stack is occasionally not ready the instant the container runs, so login fails with a transient DNS error andset -ekills the whole step:[Errno -3] Try againisgetaddrinfoEAI_AGAIN — a temporary name-resolution failure. Theretry()helper already absorbs eventually-consistent failures (RBAC propagation) foraz group show,az resource tag, andaz network vnet create, but it was defined after the login, so the two most network-sensitive first calls ran unguarded. A single cold-start blip failed the step even though a retry seconds later succeeds.Testing
bash -nsyntax check passes. The change only reorders existing helper definitions and addsretryin front of twoazcalls; no behavioral change beyond retrying transient failures within the existing 180s window.Special notes for your reviewer
Pure hardening — no change to what the step does, only its resilience to cold-start DNS races. Same class of eventual-consistency handling the rest of the script already relies on, now extended to the login path.
PR Checklist