Skip to content

ci: file GitHub issues when scheduled, outerloop, and quarantine CI fails#18058

Draft
radical wants to merge 11 commits into
microsoft:mainfrom
radical:ankj/workflow-failure-issues
Draft

ci: file GitHub issues when scheduled, outerloop, and quarantine CI fails#18058
radical wants to merge 11 commits into
microsoft:mainfrom
radical:ankj/workflow-failure-issues

Conversation

@radical

@radical radical commented Jun 9, 2026

Copy link
Copy Markdown
Member

What's missing

Scheduled GitHub Actions workflows fail silently. GitHub only emails whoever last edited the workflow file, so a broken nightly automation job — or a failing outerloop / quarantine run — can sit unnoticed for days. There's no durable, deduplicated signal that "this scheduled thing is broken."

What this adds

Two complementary, fully deterministic mechanisms (ordinary Actions + JS — no model budget) on one shared engine:

Shared engine — tracking-issue.js
One deduplicated issue per subject, identified by a hidden HTML-comment marker, carrying a fenced run-table that accrues one row per failed run (append, cap at 50, collapse overflow). Repo/label/workflow-agnostic; owns the mechanics, callers own content + policy.

Scanner — monitor-scheduled-workflows.yml (every 2h)
Watches a config-driven list of otherwise-silent scheduled automation workflows. On a failed latest scheduled run of main, files/appends one automation-broken issue per workflow; auto-closes on the next green run. The watch-list lives in monitor-scheduled-workflows.config.json (one flag stops watching a workflow).

In-pipeline reporter (embedded in tests-outerloop.yml / tests-quarantine.yml)
On a scheduled failure, files either a failing-test issue listing the failed tests, or an automation-broken infra issue when the run broke before producing results. Quarantine swallows test failures upstream, so a failed quarantine run is always infra.

GenerateTestSummary --failed-tests-json
Extracts distinct failed test names from TRX ({ failedTests, count, extractionFailed }), so the reporter can tell a real test failure from an infrastructure break.

Why two mechanisms, not one

The scanner only knows "the latest run failed." Distinguishing an infrastructure break from a genuine test failure needs the run's TRX results, which are only available inside the pipeline. So outerloop/quarantine get an embedded reporter that reads their TRX; the rest of the silent automations get the external scanner.

Design notes

  • Pure decision/formatting logic lives in unit-tested modules (node harnesses invoked from xUnit, tests/Infrastructure.Tests/WorkflowScripts/); the thin octokit runners are intentionally not unit-tested.
  • Dedup is via hidden markers. The two mechanisms use disjoint markers (automation-broken:<file> vs ci-failure:<file>:<kind>), so they never manage each other's issues.
  • Only scheduled runs file issues — PR-triggered runs (e.g. editing a workflow) never do.

Call-outs

  • The paths: filters on the quarantine/outerloop workflows mean those test workflows will run on this PR — verify they pass before merging.

radical and others added 7 commits June 8, 2026 19:20
…ine fail

Scheduled GitHub Actions workflows fail silently. GitHub only emails whoever
last edited the workflow file, so a broken nightly job (generate diffs, refresh
manifests, update models, deployment cleanup) — or a broken outerloop /
quarantine run — can sit unnoticed for days.

Adds two complementary mechanisms, both filing/appending a deduplicated issue
per source and following the repo's workflow-script convention (pure JS logic +
Node harness + xUnit behavior test + doc).

Scanner (monitor-scheduled-workflows.yml, every 2h)

- Watches the 14 currently-silent automation workflows. On a failed latest run
  of `main`, files/appends one `automation-broken` issue per workflow; closes it
  automatically on the next green run.
- Excludes workflows that already self-notify, workflow_call building blocks, and
  the agentic *.lock.yml workflows.

In-pipeline reporter (tests-outerloop.yml, tests-quarantine.yml)

- Embedded in the test pipelines because telling an infrastructure break from a
  test failure needs the run's test results.
- Outerloop: downloads the run's TRX, extracts failed test names, and files a
  `failing-test` issue listing them — or an `automation-broken` infra issue when
  the run broke before producing results.
- Quarantine passes ignoreTestFailures, so failing quarantined tests never red
  the run (run-tests.yml swallows them and only checks TRX were produced). A
  failed quarantine run is therefore always infrastructure, and only an
  `automation-broken` issue is filed.
- Dedup is one open issue per (workflow, kind) with a row appended per failed
  run; a human closes it once fixed (no auto-close-on-green). Markers use a
  distinct `ci-failure:` prefix so the scanner and reporter never manage each
  other's issues. Only scheduled runs file issues; PR-triggered runs do not.

To support the reporter, GenerateTestSummary gains a `--failed-tests-json` mode
that lists distinct failed test names (Failed/Error/Timeout) from a set of TRX
files.

Docs: docs/ci/monitor-scheduled-workflows.md,
docs/ci/specialized-test-failure-issues.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Captures the design direction for follow-on CI-health and test-triage
automation so the individual pieces can be designed and built separately:

- CI-health report (repo-pulse-shaped, read-only) — do first.
- Auto-quarantine / un-quarantine driven by a deterministic flakiness
  analyzer over per-test history kept in gh-aw repo-memory (durable),
  with the human-facing quarantine PR authored via capped safe-outputs.
- CI-health that launches work (run-time trends, transient-failure
  rerun-list additions) plus a self-tuning feedback companion, modeled
  on dotnet/runtime's ci-failure-scan pair.

The doc consumes the existing failure-issue plumbing (automation-broken /
ci-failure / failing-test issues) rather than rebuilding it, and records
two findings from verifying current state:

- Agentic (gh-aw) billing is already resolved for this repo — repo-pulse,
  a purely agentic workflow, succeeds daily. The blocker is capacity, not
  enablement: milestone-changelog.lock.yml is currently failing on the
  per-run effective-tokens rate limit.
- Agentic .lock.yml workflows are a monitoring blind spot — the
  scheduled-workflow scanner excludes them, so milestone-changelog's
  ~24h regression surfaced nothing. The CI-health report should treat
  agentic-workflow run health as a first-class monitored surface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses five issues found in review of the failure-issue automation:

1. (High) GenerateTestSummary --failed-tests-json read TRX via
   TrxReader.GetTestResultsFromTrx, which TimeSpan.Parse-es the
   UnitTestResult startTime/endTime attributes. Real MTP --report-trx
   files emit those as ISO-8601 DateTimeOffset, on which TimeSpan.Parse
   throws FormatException — the per-file catch then skipped every real
   .trx and reported zero failures, so outerloop filed an infra issue
   with no test list instead of a failing-test issue. Switch to
   GetDetailedTestResultsFromTrx (no timestamp parsing; resolves the
   fully-qualified CanonicalName). The existing tests passed only because
   TestTrxBuilder did not emit startTime/endTime; add that capability and
   a regression test that fails on the old reader.

2. (Medium) A genuinely-red outerloop run whose results could not be
   extracted (artifact-download or tool flake) was indistinguishable from
   "zero failed tests" and was misfiled as infra, losing the failing-test
   list and opening a second mismatched issue. The extract step now
   records extractionFailed, and classifyFailure treats an unknown-result
   red run as test-failures (with a "could not enumerate — see artifacts"
   comment) rather than infra. Drops the `|| echo count:0` mask.

3. (Medium) The scheduled-workflow scanner runs every 2h but most watched
   workflows run less often, so the same still-latest failed run was
   re-appended and re-commented on every tick. decideAction now no-ops
   when the latest failed run is already recorded in the issue body.

4. (Medium) Both reporter jobs set a permissions block without
   contents: read, then run actions/checkout. Add contents: read.

5. (Low) Failed test names were inserted into single-backtick Markdown
   unescaped; a name containing a backtick/newline could break out of the
   code span and inject Markdown. Normalize newlines and wrap in a fence
   longer than any backtick run in the name.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The scheduled-workflow scanner and the specialized-test reporter each carried
their own copy of the same issue mechanics — marker-based dedup lookup, the
fenced failures/runs table (parse, append, cap, collapse), and the octokit
create/append/comment/close calls. That duplication is where the recent review
found bugs, and a third consumer (a CI-health report) is on the roadmap.

Extract the reusable, repo-agnostic mechanics into a single engine,
tracking-issue.js:

- pure: findOpenIssueForMarker (oldest-wins), tableHeader, parseTable,
  appendRow (cap + collapse), nextIndex, bodyIncludesRun
- octokit primitives: ensureLabel, listOpenIssuesByLabel, createIssue,
  updateIssueBody, addComment, closeIssue

The engine holds no repo/label/workflow/product specifics. Each consumer keeps
only its own content and decisions (marker namespace, titles, body, row layout,
classify) and composes the engine via thin wrappers, so their tested public
contracts are unchanged. The collapse summary noun is unified to
"N earlier rows omitted" across both.

The scanner's inline github-script orchestration moves into
monitor-scheduled-workflows-runner.js (mirroring the reporter's orchestrator),
and its watch list moves out of the script into
monitor-scheduled-workflows.config.json — an array of { file, name, enabled }
entries. Watching can be turned off per workflow with "enabled": false without
deleting the entry.

Adds TrackingIssueTests (+ harness) for the engine and a selectEnabled test for
the config filter; adjusts the two consumers' collapse-wording assertions. Net
~210 fewer lines in the consumers. All four workflow-script test classes pass
(53 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The roadmap covered only the agentic (gh-aw) direction. Capture the
complementary deterministic work next to it: finish moving the repo's
remaining bespoke failure-issue reporters onto the shared
tracking-issue.js engine.

- A: promote the test-run failure policy off the outerloop/quarantine
  shape and add a per-consumer content hook (folds into this branch).
- B/C: migrate deployment-tests.yml and tests-daily-smoke.yml from their
  hand-rolled, date-titled create-issue jobs to engine consumers, using
  GenerateTestSummary --failed-tests-json instead of inline TRX parsing.
- D: file a "main is red" issue from the auto-rerun tail (reuse its
  classifier), not a higher-frequency cron scanner.

Documents the embedded-vs-observer mechanism split and the boundary with
the agentic scanner (no classification intelligence in the deterministic
layer).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- A `timed_out` run DOES file an issue (it is in FAILURE_CONCLUSIONS in
  monitor-scheduled-workflows.js), but the doc listed timeouts among the
  ignored conclusions alongside cancelled/skipped. Corrected, and noted
  that timeouts are treated as failures.
- The initial issue is created without a comment (only failures after the
  first post comments), so "full history preserved in the issue's
  comments" overstated it. Reworded the row-collapse note accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
H1 (monitor watchdog acted on non-scheduled runs): listWorkflowRuns
filtered only branch/status, so the "latest completed run" could be a
workflow_dispatch or push run. A manual/push success would auto-close a
real scheduled-failure issue (masking the silent failure the watchdog
exists to catch); a manual/push failure would file a false issue. Add
event: 'schedule' to the request.

M1 (specialized reporter double-posted on re-run): unlike the monitor,
the reporter appended a row + posted a comment without checking whether
the run was already recorded. context.runId is stable across "Re-run all
jobs", so a re-run duplicated the row and re-fired the notification. Add
a tested isRunRecorded guard (mirrors the monitor) and no-op when the run
is already in the body.

M2 (red run misfiled as infra when all TRX unreadable): GenerateTestSummary
caught per-file parse errors and still exited 0 with count:0, never
signaling extraction failure, so classifyFailure filed an infra issue and
dropped the failing-test list. Emit extractionFailed:true when reads
errored and zero failures were collected; the reporter then classifies it
as a test failure.

Adds falsifiable tests for M1 and M2; updates the reporter doc shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18058

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18058"

radical and others added 4 commits June 9, 2026 13:49
The roadmap captured only proposed/future work (the agentic gh-aw direction
and a deterministic-consolidation plan), which does not belong in the PR that
implements the failure-issue automation. The implemented mechanisms remain
documented in docs/ci/monitor-scheduled-workflows.md and
docs/ci/specialized-test-failure-issues.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ests

Self-review hardening of the new scheduled/outerloop/quarantine
failure-issue automation. Two issues could silently drop the
failing-test signal the automation exists to surface:

- The specialized-test reporter wrote the run URL (the dedup key) into
  the issue body before posting the comment that carries the failing-test
  list. A transient addComment failure left the run recorded, so the
  re-run short-circuited via isRunRecorded and never posted the list.
  Now the issue is created with an empty table, the comment is posted
  first, and the run is recorded last - a failed comment leaves the run
  unrecorded so the next run re-posts.

- GenerateTestSummary --failed-tests-json omitted "Aborted" from the
  failed-outcome set (the sibling CreateFailingTestIssue includes it), so
  a red run whose only failures aborted reported zero failures and was
  misfiled as infra. Added "Aborted".

Lower-severity hardening:

- Treat a missing FAILED_TESTS_PATH on a red outerloop run as an
  extraction failure (test-failures, not infra).
- Filter pull requests out of listForRepo so a labelled PR carrying a
  tracking marker is never mistaken for the managed issue.
- Skip the ensureLabel mutation under dry-run.
- Don't post a notification comment when the managed table fence is
  missing (the run can't be recorded, so commenting would re-notify
  every tick).
- Soften the findOpenIssueForMarker comment to match actual behavior;
  add the trailing newline the editorconfig rule requires.

Adds fake-octokit harnesses and integration tests for both runners
(previously untested), covering the comment-before-record ordering, the
dry-run no-mutation contract, the corrupted-table guard, PR filtering,
and the missing-results-path path. Updates the specialized-test docs for
the Aborted outcome and the missing-path extractionFailed synthesis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@radical radical mentioned this pull request Jun 9, 2026
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant