fix(ci): retry transient native-archive downloads in the assemble stage#18057
fix(ci): retry transient native-archive downloads in the assemble stage#18057radical wants to merge 2 commits into
Conversation
The assemble stage's "Download native archives (parallel)" task failed when a
single artifact download dropped its connection mid-transfer:
Download failed for 'native_archives_osx_x64' ... Unable to read data from
the transport connection: An existing connection was forcibly closed by the
remote host..
download-native-archives.ps1 issued one Invoke-WebRequest per artifact with no
retry, so any one flaky download failed the whole stage (and cascaded into the
Assemble job and non-production-output verification), requiring a manual re-run.
The failure is a transport-level exception with no HTTP response, so
Invoke-WebRequest -MaximumRetryCount (which only retries on HTTP status codes)
would not have helped. Retry the request explicitly with exponential backoff,
governed by new -MaxDownloadAttempts (default 4) and -RetryBaseDelaySeconds
(default 3) parameters. The partial temp file is removed between attempts.
Opening the downloaded zip is done inside the retry scope as an integrity
check: a mid-transfer drop usually throws from Invoke-WebRequest, but can also
leave a truncated-yet-complete-looking file that only fails at
ZipFile::OpenRead. Re-downloading is the correct remedy for that too, so a
corrupt-zip failure now retries instead of failing the stage.
Tests: added a flaky-artifact case (500 twice, then succeeds) and a
corrupt-then-valid case (200 with non-zip bytes, then a valid zip) proving both
the transport-failure and integrity-check retry paths recover; tightened the
always-failing case to assert retries are exhausted. Mock server gained
AddFlakyArtifact and AddCorruptThenValidArtifact support.
Fixes microsoft#18055
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18057Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18057" |
|
Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
|
There was a problem hiding this comment.
Pull request overview
This PR hardens the eng/scripts/download-native-archives.ps1 assemble-stage artifact download flow against transient failures by adding per-artifact retries with exponential backoff and verifying zip integrity inside the retry scope, so a single flaky download doesn’t fail the whole assemble job.
Changes:
- Added
-MaxDownloadAttemptsand-RetryBaseDelaySecondsparameters and implemented explicit per-artifact retry + backoff aroundInvoke-WebRequestandZipFile::OpenRead. - Ensured partial/corrupt temp zip files are removed between retry attempts, while preserving the final failed temp file for debugging.
- Added regression tests covering transient HTTP failures, corrupt-then-valid downloads, and updated existing failure assertions to account for retry exhaustion behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| eng/scripts/download-native-archives.ps1 | Adds configurable retry/backoff and zip-integrity verification inside the retry loop for artifact downloads. |
| tests/Infrastructure.Tests/PowerShellScripts/DownloadNativeArchivesTests.cs | Adds mock-server behaviors and tests validating retries for transient failures and corrupt downloads, plus updates existing failure assertions. |
The native archive download retry loop treated every failure as retryable, including HTTP 4xx responses that indicate configuration, auth, or URL problems. That delayed actionable failures and contradicted the retry policy, which is intended for transient artifact-store failures. Classify retryable download failures explicitly: transport/no-response errors, HTTP 408, HTTP 429, HTTP 5xx responses, and corrupt zip integrity failures. Other HTTP 4xx responses now fail immediately with a non-retryable error. Add regression coverage for a 404 artifact download so this cannot regress back to retrying all HTTP failures. The existing retry tests continue covering HTTP 5xx, connection-close, and corrupt-then-valid downloads. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The internal assemble stage fails when a single artifact download drops its
connection mid-transfer. Build 2995658 failed in "Download native archives
(parallel)" because one of seven downloads hit:
The other six succeeded, but
download-native-archives.ps1issued a singleInvoke-WebRequestper artifact with no retry, so one flaky download failed thewhole stage — cascading into the
Assemblejob and the non-production-outputcheck — and required a manual re-run.
Root cause
The failure is a transport-level exception with no HTTP response, so
Invoke-WebRequest -MaximumRetryCount(which only retries on HTTP status codes)would not have caught it.
The fix
Retry each per-artifact download explicitly with exponential backoff, governed
by new
-MaxDownloadAttempts(default 4) and-RetryBaseDelaySeconds(default 3) parameters. The partial temp file is removed between attempts.
Opening the downloaded zip is done inside the retry scope as an integrity
check: a mid-transfer drop usually throws from
Invoke-WebRequest, but can alsoleave a truncated-yet-complete-looking file that only fails at
ZipFile::OpenRead. Re-downloading is the right remedy there too, so acorrupt-zip failure now retries instead of failing the stage.
Tests
tests/Infrastructure.Tests/.../DownloadNativeArchivesTests.csadds tworegression cases that both fail if the retry is reverted:
The always-failing case now asserts retries are exhausted. 8/8 tests pass.
Fixes #18055