fix(ci): retry transient native-archive downloads in the assemble stage #18057

Open
radical wants to merge 2 commits into microsoft:main from radical:radical/fix-download-archives-retry

radical commented Jun 9, 2026

Member

The internal assemble stage fails when a single artifact download drops its
connection mid-transfer. Build 2995658 failed in "Download native archives
(parallel)" because one of seven downloads hit:

Download failed for 'native_archives_osx_x64' ... Unable to read data from
the transport connection: An existing connection was forcibly closed by the
remote host..

The other six succeeded, but download-native-archives.ps1 issued a single
Invoke-WebRequest per artifact with no retry, so one flaky download failed the
whole stage — cascading into the Assemble job and the non-production-output
check — and required a manual re-run.

Root cause

The failure is a transport-level exception with no HTTP response, so
Invoke-WebRequest -MaximumRetryCount (which only retries on HTTP status codes)
would not have caught it.
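
To illustrate the distinction, here is a minimal sketch in Python (the actual script is PowerShell; `http_status_of` is a hypothetical helper and not part of this PR). An HTTP error carries a status code that a status-based retry policy can inspect, whereas a dropped connection surfaces as an exception with no response at all:

```python
import urllib.error
from typing import Optional


def http_status_of(error: BaseException) -> Optional[int]:
    """Return the HTTP status attached to an error, or None if there was no response."""
    if isinstance(error, urllib.error.HTTPError):
        # The server answered, so a status code is available to a retry policy.
        return error.code
    # Connection resets, timeouts, DNS failures: no response, hence no status.
    return None
```

A retry policy keyed only on status codes never sees the `None` case, which is the case that broke the build.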

The fix

Retry each per-artifact download explicitly with exponential backoff, governed
by new -MaxDownloadAttempts (default 4) and -RetryBaseDelaySeconds
(default 3) parameters. The partial temp file is removed between attempts.
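
The retry loop described above could look roughly like the following Python sketch. It is illustrative only: the real change is in PowerShell, the function and parameter names are hypothetical, and the doubling schedule (3 s, 6 s, 12 s with the defaults) is an assumption, since the description says only "exponential backoff".

```python
import os
import time
from typing import Callable


def download_with_retry(
    download: Callable[[str], None],
    temp_path: str,
    max_attempts: int = 4,
    base_delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call download(temp_path) until it succeeds; return the attempt number used."""
    for attempt in range(1, max_attempts + 1):
        try:
            download(temp_path)
            return attempt
        except Exception:
            if attempt == max_attempts:
                # Retries exhausted: keep the last temp file and re-raise.
                raise
            # Remove the partial file so the next attempt starts clean.
            if os.path.exists(temp_path):
                os.remove(temp_path)
            # Assumed schedule: base * 2^(attempt - 1).
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets a test record the delays instead of actually waiting.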

Opening the downloaded zip is done inside the retry scope as an integrity
check: a mid-transfer drop usually throws from Invoke-WebRequest, but can also
leave a truncated-yet-complete-looking file that only fails at
ZipFile::OpenRead. Re-downloading is the right remedy there too, so a
corrupt-zip failure now retries instead of failing the stage.
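
As a rough analogue of that integrity check, the Python sketch below opens an archive and reads its entry list, treating a failure as "corrupt, download again". It uses Python's standard `zipfile` module rather than .NET's `ZipFile::OpenRead`, and `is_valid_zip` is a hypothetical name, so read it as an illustration of the idea rather than the script's actual code.

```python
import zipfile


def is_valid_zip(path_or_file) -> bool:
    """Return True if the archive's directory of entries can be read."""
    try:
        with zipfile.ZipFile(path_or_file) as archive:
            # Listing entries forces the central directory to be parsed.
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        # Truncated or non-zip content: the caller should re-download.
        return False
```

Because the check runs inside the retry scope, a `False` result can be handled the same way as a failed download.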

Tests

tests/Infrastructure.Tests/.../DownloadNativeArchivesTests.cs adds two
regression cases that both fail if the retry is reverted:

  • flaky artifact: HTTP 500 twice, then succeeds;
  • corrupt-then-valid: HTTP 200 with non-zip bytes, then a valid zip.

The always-failing case now asserts retries are exhausted. 8/8 tests pass.

Fixes #18055

The assemble stage's "Download native archives (parallel)" task failed when a
single artifact download dropped its connection mid-transfer:

    Download failed for 'native_archives_osx_x64' ... Unable to read data from
    the transport connection: An existing connection was forcibly closed by the
    remote host..

download-native-archives.ps1 issued one Invoke-WebRequest per artifact with no
retry, so any one flaky download failed the whole stage (and cascaded into the
Assemble job and non-production-output verification), requiring a manual re-run.

The failure is a transport-level exception with no HTTP response, so
Invoke-WebRequest -MaximumRetryCount (which only retries on HTTP status codes)
would not have helped. Retry the request explicitly with exponential backoff,
governed by new -MaxDownloadAttempts (default 4) and -RetryBaseDelaySeconds
(default 3) parameters. The partial temp file is removed between attempts.

Opening the downloaded zip is done inside the retry scope as an integrity
check: a mid-transfer drop usually throws from Invoke-WebRequest, but can also
leave a truncated-yet-complete-looking file that only fails at
ZipFile::OpenRead. Re-downloading is the correct remedy for that too, so a
corrupt-zip failure now retries instead of failing the stage.

Tests: added a flaky-artifact case (500 twice, then succeeds) and a
corrupt-then-valid case (200 with non-zip bytes, then a valid zip) proving both
the transport-failure and integrity-check retry paths recover; tightened the
always-failing case to assert retries are exhausted. Mock server gained
AddFlakyArtifact and AddCorruptThenValidArtifact support.

Fixes microsoft#18055

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented Jun 9, 2026

Contributor

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18057

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18057"

@radical radical added the area-engineering-systems infrastructure helix infra engineering repo stuff label Jun 9, 2026

github-actions Bot commented Jun 9, 2026

Contributor

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

@radical radical marked this pull request as ready for review June 9, 2026 19:39
Copilot AI review requested due to automatic review settings June 9, 2026 19:39

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens the eng/scripts/download-native-archives.ps1 assemble-stage artifact download flow against transient failures by adding per-artifact retries with exponential backoff and verifying zip integrity inside the retry scope, so a single flaky download doesn’t fail the whole assemble job.

Changes:

  • Added -MaxDownloadAttempts and -RetryBaseDelaySeconds parameters and implemented explicit per-artifact retry + backoff around Invoke-WebRequest and ZipFile::OpenRead.
  • Ensured partial/corrupt temp zip files are removed between retry attempts, while preserving the final failed temp file for debugging.
  • Added regression tests covering transient HTTP failures, corrupt-then-valid downloads, and updated existing failure assertions to account for retry exhaustion behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| eng/scripts/download-native-archives.ps1 | Adds configurable retry/backoff and zip-integrity verification inside the retry loop for artifact downloads. |
| tests/Infrastructure.Tests/PowerShellScripts/DownloadNativeArchivesTests.cs | Adds mock-server behaviors and tests validating retries for transient failures and corrupt downloads, plus updates existing failure assertions. |

Comment thread eng/scripts/download-native-archives.ps1 Outdated
The native archive download retry loop treated every failure as retryable,
including HTTP 4xx responses that indicate configuration, auth, or URL
problems. That delayed actionable failures and contradicted the retry policy,
which is intended for transient artifact-store failures.

Classify retryable download failures explicitly: transport/no-response errors,
HTTP 408, HTTP 429, HTTP 5xx responses, and corrupt zip integrity failures.
Other HTTP 4xx responses now fail immediately with a non-retryable error.

Add regression coverage for a 404 artifact download so this cannot regress
back to retrying all HTTP failures. The existing retry tests continue covering
HTTP 5xx, connection-close, and corrupt-then-valid downloads.
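
The classification above can be summarized as a small predicate. This is a Python sketch under stated assumptions: `is_retryable` and its parameters are hypothetical names, `status_code=None` stands for a transport failure with no HTTP response, and the real logic lives in the PowerShell script.

```python
from typing import Optional


def is_retryable(status_code: Optional[int], corrupt_zip: bool = False) -> bool:
    """Decide whether a failed artifact download is worth retrying."""
    if corrupt_zip:
        # The download "succeeded" but the zip failed the integrity check.
        return True
    if status_code is None:
        # Transport-level failure: no HTTP response was received.
        return True
    if status_code in (408, 429):
        # Request timeout and rate limiting are treated as transient.
        return True
    # Server errors retry; remaining 4xx responses fail immediately.
    return 500 <= status_code <= 599
```

With this policy a 404 fails on the first attempt instead of waiting through the full backoff schedule.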

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@radical radical closed this Jun 10, 2026
@radical radical reopened this Jun 10, 2026
@microsoft-github-policy-service microsoft-github-policy-service Bot added this to the 13.5 milestone Jun 10, 2026

Labels

area-engineering-systems infrastructure helix infra engineering repo stuff

Development

Successfully merging this pull request may close these issues.

download-native-archives.ps1 has no retry on transient download failures, failing the assemble stage

2 participants