Batch small Helix work items to reduce per-item overhead (-53% compute)#66808

Draft
mmitche wants to merge 4 commits into main from dev/helix-work-item-batching

Conversation

mmitche (Member) commented May 22, 2026

Batch small Helix work items to reduce per-item overhead

Problem

Each ASP.NET Core CI build sends ~503 Helix work items, but most have only 2-10 seconds of actual test execution with ~20-40 seconds of per-item overhead (tool installs, vstest startup, result upload). This wastes ~5.6 compute-hours per build.
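The overhead estimate can be reproduced with a quick calculation. This is only a rough sketch using the figures quoted above (503 work items, 20-40 seconds of overhead each); the constant and function names are illustrative and do not come from the PR. The ~5.6 compute-hour figure appears to correspond to the upper end of the range.

```python
# Back-of-the-envelope estimate of per-item overhead per build.
# Inputs are the approximate figures quoted in the Problem section.
WORK_ITEMS = 503             # Helix work items per build
OVERHEAD_SECONDS = (20, 40)  # per-item overhead range, in seconds


def overhead_hours(items: int, seconds_per_item: float) -> float:
    """Total overhead in compute-hours for `items` work items."""
    return items * seconds_per_item / 3600


low, high = (overhead_hours(WORK_ITEMS, s) for s in OVERHEAD_SECONDS)
print(f"overhead per build: {low:.1f} to {high:.1f} compute-hours")
# prints "overhead per build: 2.8 to 5.6 compute-hours"
```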

Solution

Batch compatible small test assemblies into groups of ~20 per Helix work item. Assemblies with special dependencies (IIS, Playwright, Java, Node, MSSQL) remain as individual items.

Measured Results

| Metric | Baseline | Batched | Change |
|---|---|---|---|
| Work items | 503 | 47 | -91% |
| Total compute | 634 min (10.6 hrs) | 300 min (5.0 hrs) | -53% |

Per-queue breakdown:

| Queue | Baseline (items / compute min) | Batched (items / compute min) | Compute savings |
|---|---|---|---|
| Windows (vs2026) | 189 / 300 min | 23 / 148 min | -51% |
| macOS (osx.15) | 156 / 239 min | 11 / 88 min | -63% |
| Ubuntu (2404) | 158 / 95 min | 13 / 64 min | -32% |
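As a sanity check, the per-queue rows can be summed and compared against the overall totals. The sketch below only re-derives numbers already given in the two tables; the helper names are illustrative.

```python
# (work items, compute minutes) per queue, copied from the per-queue breakdown.
baseline = {
    "Windows (vs2026)": (189, 300),
    "macOS (osx.15)": (156, 239),
    "Ubuntu (2404)": (158, 95),
}
batched = {
    "Windows (vs2026)": (23, 148),
    "macOS (osx.15)": (11, 88),
    "Ubuntu (2404)": (13, 64),
}


def totals(queues):
    """Sum work items and compute minutes across all queues."""
    items = sum(i for i, _ in queues.values())
    minutes = sum(m for _, m in queues.values())
    return items, minutes


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


base_items, base_min = totals(baseline)
batch_items, batch_min = totals(batched)
print(base_items, base_min)                           # 503 634
print(batch_items, batch_min)                         # 47 300
print(f"{pct_change(base_items, batch_items):.0f}%")  # -91%
print(f"{pct_change(base_min, batch_min):.0f}%")      # -53%
```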

How it works

  1. eng/helix/helix.proj - new BatchSmallWorkItems MSBuild inline task runs after the existing Gather target. Groups eligible items by TFM, creates combined payload directories, writes a targets.txt manifest.

  2. eng/tools/HelixTestRunner - accepts --targets-file targets.txt to run dotnet test sequentially for each assembly. Tool installs happen once per batch. Test results are merged.

  3. eng/helix/content/runtests.cmd / runtests.sh - detect @targets.txt prefix for batched mode. Fully backward compatible.
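The batched run described in the steps above can be sketched roughly as follows. This is not the HelixTestRunner code (which lives in eng/tools/HelixTestRunner); it is a Python illustration that assumes targets.txt lists one test assembly per line, and the function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def read_targets(manifest: Path) -> list[str]:
    """Read a manifest, assuming one target per line.

    Blank lines and lines starting with '#' are skipped (an assumption,
    not something the PR specifies).
    """
    lines = manifest.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]


def run_batch(targets: list[str], command: list[str]) -> dict[str, int]:
    """Run `command + [target]` for each target, one after another.

    A failing target does not stop the loop, so every assembly in the
    batch still gets a chance to run and produce results.
    """
    results = {}
    for target in targets:
        completed = subprocess.run(command + [target], check=False)
        results[target] = completed.returncode
    return results


def failed_targets(results: dict[str, int]) -> list[str]:
    """Targets whose process exited with a non-zero code."""
    return [target for target, code in results.items() if code != 0]
```

A call such as `run_batch(read_targets(Path("targets.txt")), ["dotnet", "test"])` would mirror the sequential `dotnet test` loop; any one-time setup, such as tool installs, would happen once before the loop rather than once per assembly.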

Batching rules

  • Eligible: Items with no special PreCommands (no IIS, Playwright, Java, Node, MSSQL deps)
  • Group by: Target framework (net11.0, net472, etc.)
  • Max batch size: 20 assemblies per work item
  • Failed tests: Batched items exit 0 so all test results are reported. Individual failures visible via test results XML.
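The grouping rules above can be illustrated with a short sketch. The real logic is an MSBuild inline task in eng/helix/helix.proj; the Python below is only an approximation, and the `WorkItem` type and its `pre_commands` field are hypothetical stand-ins for the item metadata.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_BATCH_SIZE = 20  # "Max batch size" from the batching rules


@dataclass(frozen=True)
class WorkItem:
    name: str
    tfm: str                             # target framework, e.g. "net11.0"
    pre_commands: tuple[str, ...] = ()   # hypothetical: special setup steps


def batch_work_items(items, max_batch_size=MAX_BATCH_SIZE):
    """Return a list of batches, each a list of WorkItem.

    Items with special pre-commands stay alone; the rest are grouped by
    target framework and split into chunks of at most `max_batch_size`.
    """
    by_tfm = defaultdict(list)
    batches = []
    for item in items:
        if item.pre_commands:
            batches.append([item])        # not eligible: keep as its own item
        else:
            by_tfm[item.tfm].append(item)
    for group in by_tfm.values():
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches
```

For example, 45 eligible net11.0 assemblies would produce three batches (20, 20, and 5), while an assembly that needs IIS setup would remain a single-item batch.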

CI validation

mmitche and others added 4 commits May 21, 2026 14:06
Reduces ~503 work items per build to ~50 by batching compatible small test assemblies (those without special dependencies like IIS/Playwright) into groups of up to 20.

Each batched work item runs dotnet test sequentially for each assembly, sharing the per-item setup overhead (tool installs, env config, vstest launcher) across the batch.

Expected impact: ~60-70% reduction in total compute per build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The batch task was using non-existent metadata fields (RuntimeVersion,
QueueName, etc.) resulting in empty command arguments. Ubuntu items
were submitted with empty runtime/queue args and never picked up by
agents. Now parses these values from the first item's existing Command
string.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When one assembly in a batch fails, the entire batch exited with code 1,
causing the Helix SDK to treat it as a failed work item and not report
test results from the other 19 passing assemblies to AzDO. This caused
~1300 missing tests in the test count.

Now batched runs always exit 0 so all results are reported. Individual
test failures are visible through the test results XML. This also
prevents wasteful retries of entire 20-assembly batches for a single
flaky test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the area-infrastructure label May 22, 2026
dotnet-policy-service (Contributor) commented

Hey @dotnet/aspnet-build, looks like this PR is something you want to take a look at.

mmitche (Member, Author) commented May 22, 2026

@wtgodbe This is an attempt to reduce the test overhead in aspnetcore. It is NOT ready (the prototype included some hacks for testing).

wtgodbe (Member) commented May 22, 2026

Love this idea! Will take a closer look next week, but worth noting that the helix tests are currently broken into 2 subsets:

aspnetcore/eng/Build.props

Lines 236 to 297 in 27c660e

<ProjectsWithTestsSubset1 Include="
                    $(RepoRoot)src\Framework\AspNetCoreAnalyzers\test\Microsoft.AspNetCore.App.Analyzers.Test.csproj;
                    $(RepoRoot)src\Framework\test\Microsoft.AspNetCore.App.UnitTests.csproj;
                    $(RepoRoot)src\Caching\**\*.*proj;
                    $(RepoRoot)src\DefaultBuilder\**\*.*proj;
                    $(RepoRoot)src\Features\**\*.*proj;
                    $(RepoRoot)src\DataProtection\**\*.*proj;
                    $(RepoRoot)src\Antiforgery\**\*.*proj;
                    $(RepoRoot)src\Hosting\**\*.*proj;
                    $(RepoRoot)src\Http\**\*.*proj;
                    $(RepoRoot)src\HttpClientFactory\**\*.*proj;
                    $(RepoRoot)src\Html.Abstractions\**\*.*proj;
                    $(RepoRoot)src\Identity\**\*.*proj;
                    $(RepoRoot)src\Servers\**\*.csproj;
                    $(RepoRoot)src\Security\**\*.*proj;
                    $(RepoRoot)src\SiteExtensions\Microsoft.Web.Xdt.Extensions\**\*.csproj;
                    $(RepoRoot)src\SiteExtensions\LoggingAggregate\test\**\*.csproj;
                    $(RepoRoot)src\Shared\**\*.*proj;
                    $(RepoRoot)src\Tools\**\*.*proj;
                    $(RepoRoot)src\Logging.AzureAppServices\**\src\*.csproj;
                    $(RepoRoot)src\Middleware\**\*.csproj;
                    "
                  Exclude="
                    @(ProjectToBuild);
                    @(ProjectToExclude);
                    $(RepoRoot)**\node_modules\**\*;
                    $(RepoRoot)**\bin\**\*;
                    $(RepoRoot)**\obj\**\*;"
                  Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />
<ProjectsWithTestsSubset2 Include="
                    $(RepoRoot)src\Razor\**\*.*proj;
                    $(RepoRoot)src\Mvc\**\*.*proj;
                    $(RepoRoot)src\Azure\**\*.*proj;
                    $(RepoRoot)src\SignalR\**\*.csproj;
                    $(RepoRoot)src\StaticAssets\**\*.csproj;
                    $(RepoRoot)src\Components\**\*.csproj;
                    $(RepoRoot)src\Analyzers\**\*.csproj;
                    $(RepoRoot)src\FileProviders\**\*.csproj;
                    $(RepoRoot)src\Configuration.KeyPerFile\**\*.csproj;
                    $(RepoRoot)src\Localization\**\*.csproj;
                    $(RepoRoot)src\ObjectPool\**\*.csproj;
                    $(RepoRoot)src\JSInterop\**\*.csproj;
                    $(RepoRoot)src\WebEncoders\**\*.csproj;
                    $(RepoRoot)src\HealthChecks\**\*.csproj;
                    $(RepoRoot)src\Testing\**\*.csproj;
                    $(RepoRoot)src\Grpc\**\*.csproj;
                    $(RepoRoot)src\ProjectTemplates\**\*.csproj;
                    $(RepoRoot)src\Extensions\**\*.csproj;
                    $(RepoRoot)src\OpenApi\**\*.csproj;
                    $(RepoRoot)src\Validation\**\*.csproj;
                    "
                  Exclude="
                    @(ProjectToBuild);
                    @(ProjectToExclude);
                    $(RepoRoot)**\node_modules\**\*;
                    $(RepoRoot)**\bin\**\*;
                    $(RepoRoot)**\obj\**\*;"
                  Condition=" '$(BuildMainlyReferenceProviders)' != 'true' " />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '1'" Include="@(ProjectsWithTestsSubset1)" />
<DotNetProjects Condition=" '$(HelixSubset)' == '' OR '$(HelixSubset)' == '2'" Include="@(ProjectsWithTestsSubset2)" />
,
# Helix x64 subset 1
- template: jobs/default-build.yml
  parameters:
    jobName: Helix_x64_Subset_1
    jobDisplayName: 'Tests: Helix x64 Subset 1'
    agentOs: Windows
    timeoutInMinutes: 240
    steps:
    # Build the shared framework
    - script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -pack -arch x64
              /p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
              /p:VsTestUseMSBuildOutput=false $(HelixSubset1LogArgs)
      displayName: Build shared fx
    # -noBuildNative -noBuild to avoid repeating work done in the previous step.
    - script: ./eng/build.cmd -ci -prepareMachine -nativeToolsOnMachine -all -noBuildNative -noBuild -test
              -projects eng\helix\helix.proj /p:IsHelixPRCheck=true /p:IsHelixJob=true
              /p:CrossgenOutput=false /p:ASPNETCORE_TEST_LOG_DIR=artifacts/log $(_InternalRuntimeDownloadArgs)
              /p:VsTestUseMSBuildOutput=false /p:RunTemplateTests=false /p:HelixSubset=1
      displayName: Run build.cmd helix target
      env:
        HelixApiAccessToken: $(HelixApiAccessToken) # Needed for internal queues
        SYSTEM_ACCESSTOKEN: $(System.AccessToken) # We need to set this env var to publish helix results to Azure Dev Ops
    artifacts:
    - name: Helix_Subset_1_Logs_Attempt_$(System.JobAttempt)
      path: artifacts/log/
      publishOnError: true
      includeForks: true
# Helix x64 subset 2
We did this because parallelizing the Helix tests cut about 40 minutes off the build, but the 2 groups were chosen completely arbitrarily. There might be some extra time to be saved by re-subsetting based on whatever "compatibility" heuristic you're currently using.

Labels

area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework
