165 changes: 165 additions & 0 deletions docs/ci/internal-build-failure-notifications.md
@@ -0,0 +1,165 @@
# Internal build failure notifications

The internal Azure DevOps pipeline (`microsoft-aspire`, definition 1602,
defined in [`eng/pipelines/azure-pipelines.yml`](../../eng/pipelines/azure-pipelines.yml))
files a GitHub issue on [microsoft/aspire](https://github.com/microsoft/aspire/issues)
when it breaks on a publishing branch, and closes that issue when the next
build of the same branch goes green.

This document describes the contract so future maintainers can reason about
the behavior without re-reading the pipeline YAML.

## What gets notified

Two stages run at the end of every non-PR internal build:

- `notify_failure` — files or updates a GitHub issue when at least one
of `build_sign_native` / `build` / `prepare_installers` ends with
`Failed`.
- `notify_success` — closes any open `ci-broken` issue for the branch
when all three upstream stages end with `Succeeded` or
`SucceededWithIssues` (`prepare_installers` may also legitimately
end with `Skipped` on stable GA release builds — that is accepted
as success).

Both stages gate on the branch being either:

- `refs/heads/main` (exact — the trigger uses the wildcard `main*` so an
exact match here is load-bearing to avoid sweeping in branches like
`main-something`), or
- `refs/heads/release/*`.

`internal/release/*` is deliberately excluded so internal branch names
don't leak into the public issue tracker. Pull-request builds are also
excluded.
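
The gating rules above can be modeled as a small decision function. This is only an illustrative Python sketch, not the pipeline's actual YAML condition; the function names and the `results` dictionary shape are assumptions made for the example.

```python
FAILED = "Failed"
GREEN = {"Succeeded", "SucceededWithIssues"}
UPSTREAM = ("build_sign_native", "build", "prepare_installers")


def is_publishing_branch(ref: str) -> bool:
    """Exact match on main; prefix match on release/*."""
    return ref == "refs/heads/main" or ref.startswith("refs/heads/release/")


def notify_action(ref: str, is_pull_request: bool, results: dict) -> str:
    """Return 'failure', 'success', or 'none' for a finished build.

    `results` maps each upstream stage name to its result string
    (hypothetical shape, chosen for this sketch).
    """
    if is_pull_request or not is_publishing_branch(ref):
        return "none"
    if any(results.get(stage) == FAILED for stage in UPSTREAM):
        return "failure"
    # prepare_installers may legitimately be Skipped on stable GA release builds.
    all_green = all(
        results.get(stage) in GREEN
        or (stage == "prepare_installers" and results.get(stage) == "Skipped")
        for stage in UPSTREAM
    )
    # Anything else (for example Canceled) triggers neither stage.
    return "success" if all_green else "none"
```

Note that `refs/heads/main-something` and `refs/heads/internal/release/*` both fall through to `"none"` in this model, matching the exclusions described above.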

The two stages must be at the stage level (not as two jobs in a single
stage) because cross-stage dependency results can only be referenced
from a stage condition via `dependencies.<stage>.result`; from a job
condition the only available form is
`stageDependencies.<stage>.<job>.result`, which has no stage-aggregate
equivalent.

## What gets filed

When the `notify_failure` stage fires, it creates (or appends a comment to)
a single GitHub issue per affected branch:

- **Title:** `Internal build broken on <branch>`
- **Labels:** `area-engineering-systems`, `ci-broken`, `blocking-clean-ci`
- **Assignees:** `joperezr`, `radical`
- **Body marker:** the first line is a hidden HTML comment
`<!-- aspire-internal-build-broken:<branch> -->` used for dedup.

Only one open issue per branch exists at a time.

The body contains a managed markdown table inside a fenced region
delimited by `<!-- ci-broken-failures:begin -->` /
`<!-- ci-broken-failures:end -->`. Each row records one failure: index,
UTC timestamp, build link, commit SHA (linked to the GitHub commit),
and the comma-separated list of failed upstream stages (`build`,
`build_sign_native`, `prepare_installers`).

On each subsequent failure the script:

1. **Updates the issue body** to append a new row to the table. Only
content between the fenced markers is rewritten — any human-added
prose elsewhere in the body is preserved.
2. **Posts a follow-up comment** containing the new build link, commit
SHA, and `cc @joperezr @radical`. The comment is what fires
notifications — body edits don't.

Visible rows in the table are capped (currently 50). Older rows are
collapsed into a `_N earlier failures omitted_` summary line; the full
per-failure history remains in the issue's comments.
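
The marker-scoped body update can be sketched as follows. This is a minimal Python illustration of the idea, not the real script (which is PowerShell); the function name is hypothetical, and the row cap and summary line are omitted for brevity.

```python
from typing import Optional

BEGIN = "<!-- ci-broken-failures:begin -->"
END = "<!-- ci-broken-failures:end -->"


def append_failure_row(body: str, row: str) -> Optional[str]:
    """Insert `row` just before the end marker.

    Returns None when either marker is missing, so the caller can leave
    the body alone and only post the follow-up comment.
    """
    start = body.find(BEGIN)
    end = body.find(END, start + len(BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        return None
    # Only the text between the markers is rewritten; everything before
    # BEGIN and after END (including human-added prose) is copied verbatim.
    managed = body[start + len(BEGIN):end].rstrip("\n")
    return body[:start + len(BEGIN)] + managed + "\n" + row + "\n" + body[end:]
```

Because the slices outside the markers are copied unchanged, notes that maintainers add elsewhere in the issue body survive every update.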

`Canceled` stage results (operator cancellation, 1ES timeouts) intentionally
do not file an issue — the stage condition uses explicit `in(..., 'Failed')`
checks which exclude `Canceled`.

## What gets closed

The `notify_success` stage lists open `ci-broken` issues, filters by the
branch marker, and for each match posts a "build is green again" comment
and closes the issue with `state_reason: completed`.

## Dedup and race handling

Issue lookup uses `GET /repos/microsoft/aspire/issues?labels=ci-broken&state=open`
(strongly consistent) plus a local body-marker filter. The Search API is
intentionally avoided because its 1–2 minute eventual-consistency window
would cause near-simultaneous failed builds to each see "0 hits" and file
duplicate issues.

After the create-issue path, the script re-lists immediately. If the
just-created issue is not the oldest one carrying the marker, the script
closes it as a duplicate of the older issue. This self-heals the rare race
when two builds fail within seconds of each other.
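
The marker filter and the post-create duplicate check might look like the sketch below. It assumes the open issues have already been fetched as a list of dictionaries with `number` and `body` keys, and it treats the lowest issue number as the oldest (issue numbers increase monotonically within a repository). The function names are hypothetical, and the exact form of `<branch>` inside the marker is whatever the real script writes.

```python
def marker_for(branch: str) -> str:
    """Hidden body marker used to tie an issue to a branch."""
    return f"<!-- aspire-internal-build-broken:{branch} -->"


def canonical_issue(open_issues: list, branch: str):
    """Oldest open issue whose body carries the branch marker, or None."""
    marker = marker_for(branch)
    matches = [i for i in open_issues if marker in (i.get("body") or "")]
    # Assumption: lowest issue number == oldest issue.
    return min(matches, key=lambda i: i["number"], default=None)


def is_duplicate(created_number: int, open_issues: list, branch: str) -> bool:
    """True when a just-created issue lost the race to an older one."""
    oldest = canonical_issue(open_issues, branch)
    return oldest is not None and oldest["number"] != created_number
```

Filtering locally on the body marker keeps the lookup on the strongly consistent list endpoint instead of the Search API.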

## Auth

The script mints an installation access token for the **aspire-repo-bot**
GitHub App via [`Get-AspireBotInstallationToken.ps1`](../../eng/pipelines/scripts/Get-AspireBotInstallationToken.ps1)
(the same helper used by the release pipeline's
`dispatch-release-github-tasks.ps1`). The token is immediately marked as a
secret AzDO pipeline variable so any incidental log echo is redacted.

The App's `aspire-bot-app-id` and `aspire-bot-private-key` secrets come
from the `Aspire-Release-Secrets` variable group, imported at pipeline
scope in `eng/pipelines/azure-pipelines.yml` and gated on non-PR builds
of `refs/heads/main` or `refs/heads/release/*` — the same condition the
notify stages use. Manual runs on feature branches and PR builds skip
the import entirely.

**Prerequisite**: the aspire-repo-bot install on microsoft/aspire must have
`issues:write` permission. If it is missing, every GitHub API call the script
makes returns 403 (but this never breaks the build; see below).

## Disabling for a single run

Queue the pipeline manually and set `Notify on failure: dry-run` to true.
In dry-run mode, both stages log the `gh` CLI commands they *would* run
without mutating anything on GitHub. This applies to both the failure
and success paths — a green-build dry-run will not accidentally close
real open issues.

Dry-run mode is fully decoupled from the aspire-repo-bot credentials:
the wrapper omits the `ASPIRE_BOT_APP_ID` / `ASPIRE_BOT_PRIVATE_KEY` env
block and the script's `-AppId` / `-PrivateKeyPem` parameters are
non-mandatory, so a dry-run validation works without Aspire-Release-Secrets
variable group access and never mints a token.
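
A dry-run switch of this kind can be sketched as a thin wrapper around command execution. This Python example is illustrative only; `run_gh`, the injected `runner`, and the log prefix are assumptions, not names from the actual script.

```python
import shlex


def run_gh(args, dry_run, runner, log=print):
    """Run a `gh` command via `runner`, or only log it in dry-run mode.

    `runner` is any callable that takes the argv list (for example a
    wrapper around subprocess.run); it is injected so nothing touches
    GitHub unless the caller explicitly provides it.
    """
    command = ["gh", *args]
    if dry_run:
        # No token is needed here because no command is executed.
        log("[dry-run] would run: " + shlex.join(command))
        return None
    return runner(command)
```

Routing every mutation through one wrapper is what lets the same switch cover both the failure path and the success path.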

## Why this never breaks the build

[`Notify-GitHubOnBuildResult.ps1`](../../eng/pipelines/scripts/Notify-GitHubOnBuildResult.ps1)
wraps the entire body in `try`/`catch` and always exits 0. Any GitHub API
error, network blip, or 401/403 from a missing App permission produces a
`Write-Warning` in the job log but leaves the build result unchanged. A
flaky notification path must never turn an otherwise-correct build red.

However, a silently-skipped notification is its own failure mode — operators
need to see when the notification path itself broke (e.g., revoked App
permission, GitHub API shape change, deleted label). The catch block emits
AzDO logging commands so failures are visible without breaking the build:

- `##vso[task.logissue type=warning]` surfaces the warning in the build
summary, in 1ES dashboards, and on the badge.
- `##vso[task.complete result=SucceededWithIssues;]` bumps the job result
to `SucceededWithIssues`, which renders as a yellow badge instead of
green. Notifications and dashboards can filter on this.

A build that finishes "green-but-yellow" means the upstream build itself
succeeded, but the notify stage's call to GitHub failed for some reason —
worth investigating, but does not block anything that depends on the build.
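
The "warn loudly, never fail" pattern described in this section can be sketched as below. The real implementation is the PowerShell script linked above; this Python version only illustrates the shape, and `run_notification` and its `emit` parameter are hypothetical names.

```python
def run_notification(action, emit=print) -> int:
    """Run `action`; on any error emit AzDO logging commands and return 0."""
    try:
        action()
    except Exception as exc:
        # Logging commands are line-based, so flatten the message.
        message = str(exc).replace("\r", " ").replace("\n", " ")
        # Surfaces a warning in the build summary.
        emit(f"##vso[task.logissue type=warning]Notification failed: {message}")
        # Marks the job yellow (SucceededWithIssues) instead of red.
        emit("##vso[task.complete result=SucceededWithIssues;]Notification failed")
    # The exit code is 0 on every path, so the build result is never flipped.
    return 0
```

A caller would pass the whole notification routine as `action` and use the return value as the process exit code.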

## Manually filing or closing

If you need to file or close a `ci-broken` issue by hand (e.g. during
recovery), use the existing label and add the marker `<!-- aspire-internal-build-broken:<branch> -->`
as the first line of the body. The script's next run will treat it as the
canonical open issue and append/close accordingly.

For the failures table to grow, also include the fenced region
(`<!-- ci-broken-failures:begin -->` / `<!-- ci-broken-failures:end -->`)
somewhere in the body. If the markers are absent the script logs a
warning, leaves the body alone, and still posts the follow-up comment.
6 changes: 6 additions & 0 deletions eng/pipelines/README.md
@@ -40,3 +40,9 @@ This pipeline:
## Template Structure

The public pipelines (`azure-pipelines-public.yml` and `azdo-tests.yml`) use a shared template (`templates/public-pipeline-template.yml`) to avoid code duplication while maintaining the same functionality.

## Build-result notifications

`azure-pipelines.yml` files a GitHub issue on microsoft/aspire when the
internal build breaks on `main` or `release/*`, and closes it when the
next build is green. See [docs/ci/internal-build-failure-notifications.md](../../docs/ci/internal-build-failure-notifications.md).