ci(trigger): tolerate per-call HTTP 401 in cleanup step (fixes #5811) #5812
`CI-trigger`'s "Trigger workflow_run[completed]" step does opportunistic housekeeping by listing every skipped run in the repo and trying to delete each one. Some of those runs (typically from forked PRs, other branches, or older retention windows) are not deletable by the PR-scoped `GITHUB_TOKEN`, which surfaces as `HTTP 401: Bad credentials` on the individual `gh run delete` call.

The original pipeline used `xargs -r -n1 gh run delete`, which propagates the first non-zero exit and fails the whole step. That in turn marks the `CI-trigger` job conclusion as `failure`, and every downstream test workflow that gates on `workflow_run.conclusion == 'success'` (legacy-g*, mysql84-*, mysql84-gr-*, set_parser_algorithm_3-g1, ...) is silently skipped.

Net result: a non-critical housekeeping failure blocks the entire test matrix for the PR. It's been the recurring cause of "why is CI failing on my PR" today, biting PR #5803, PR #5809, and others.

Switch from `xargs` to a `while read` loop so per-call failures are caught individually, logged for visibility, and don't abort the step. Cleanup scope is unchanged (still repo-wide, still capped at the most recent 100 skipped runs) -- only the error-handling behavior changes.
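The difference in error propagation can be reproduced without touching the GitHub API. The following is a minimal sketch, assuming GNU `xargs`; `fake_delete` and the hard-coded ids 1, 2, 3 are hypothetical stand-ins for `gh run delete` and real run ids, not part of the workflow:

```shell
#!/bin/sh
# Hypothetical stand-in for `gh run delete`: fails for id 2, succeeds otherwise.
fake_delete() {
  [ "$1" != "2" ]
}

# xargs form: one failing invocation makes xargs itself exit non-zero.
# (xargs cannot call a shell function, so the same check is inlined via sh -c.)
xargs_status=0
printf '1\n2\n3\n' | xargs -r -n1 sh -c '[ "$1" != "2" ]' _ || xargs_status=$?

# while-read form: each failure is handled inside the loop body,
# so the loop as a whole still finishes successfully.
skipped=""
while read -r id; do
  if ! fake_delete "$id"; then
    skipped="$skipped $id"
  fi
done <<EOF
1
2
3
EOF
loop_status=$?

echo "xargs exit status: $xargs_status"
echo "loop exit status: $loop_status, skipped ids:$skipped"
```

The loop handles each failure where it happens, which is why a single undeletable run no longer decides the status of the whole step.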
📝 Walkthrough
The CI-trigger workflow's housekeeping cleanup step has been refactored to tolerate permission errors when deleting skipped workflow runs. The single …
Changes: CI Housekeeping Robustness
🧹 Nitpick comments (2)
.github/workflows/ci-trigger.yml (2)
50-55: ⚡ Quick win: Consider protecting against list command failures for complete robustness.
The current implementation correctly tolerates individual `gh run delete` failures via the `if ! ... then` construct. However, with GitHub Actions' default `set -eo pipefail`, if the `gh run list` command itself fails (due to transient API errors, network issues, etc.), the pipeline will still fail and mark the CI-trigger job as failed.
Since the PR describes this as "opportunistic housekeeping" that must not block downstream workflows, consider adding full protection:
🛡️ Option 1: Add `|| true` to tolerate any pipeline failure

     gh -R ${{ github.repository }} run list -s skipped -L 100 --json databaseId -q '.[].databaseId' \
       | while read -r id; do
           if ! gh -R ${{ github.repository }} run delete "$id" 2>&1; then
             echo " skip run $id (no rights / already gone)"
           fi
    -    done
    +    done || true

🛡️ Option 2: Disable exit-on-error for the entire step

     echo "Cleanup skipped ..."
    +set +e
     # Housekeeping only: opportunistically delete skipped runs. Per-call

Both approaches ensure that even if listing fails, the CI-trigger job succeeds and downstream test workflows can proceed.
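The effect of Option 1 can be checked locally. The sketch below assumes the step's shell has `pipefail` enabled, as the review comment states; `fake_list` is a hypothetical stand-in for a `gh run list` call that fails, and `set -e` is deliberately left off so both exit statuses can be captured:

```shell
#!/usr/bin/env bash
set -o pipefail   # assumption: mirrors the shell options the review describes

# Hypothetical stand-in for a `gh run list` call hitting a transient API error.
fake_list() {
  return 1
}

# Unprotected: the loop body never runs, but under pipefail the pipeline
# still reports the failure of its first command.
fake_list | while read -r id; do :; done
unprotected=$?

# Option 1: `|| true` applies to the whole pipeline and forces success.
fake_list | while read -r id; do :; done || true
protected=$?

echo "unprotected pipeline status: $unprotected"
echo "protected pipeline status:   $protected"
```

With `set -e` also active, the unprotected form would end the step at that line, which is the case both options guard against. Option 1 scopes the tolerance to the one pipeline, while Option 2 relaxes error handling for everything that follows in the step.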
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci-trigger.yml around lines 50 - 55, The gh run list pipeline can fail and cause the step to exit under set -eo pipefail; modify the step so a failing list does not break the job by making the pipeline tolerate failure (e.g., ensure the gh -R ... run list ... | while ... command returns success even if the list fails by appending a safe-fallback like "|| true" to the pipeline or temporarily disabling exit-on-error around the sequence with "set +e" before and "set -e" after), keeping the existing per-id deletion check (the if ! gh -R ... run delete "$id" ... then) intact; target the gh run list invocation and the surrounding shell step to implement this resilience.
52-52: 💤 Low value: Optional: Consider suppressing GitHub CLI error output for cleaner logs.
The `2>&1` redirection merges stderr to stdout, so both the GitHub CLI error message and your custom skip message will be printed for each failed deletion:

    Error: HTTP 401: Bad credentials (https://api.github.com/...)
    skip run 12345 (no rights / already gone)

If you prefer less verbose output, you could suppress the CLI's error and show only your message:

    if ! gh -R ${{ github.repository }} run delete "$id" 2>/dev/null; then

However, keeping the actual error visible can be helpful for debugging, so the current approach is reasonable.
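The two redirections can be compared side by side. This is a small sketch in which `fake_delete` is a hypothetical stand-in for a failing `gh run delete`, and its error text is only an imitation of the message quoted above:

```shell
#!/bin/sh
# Hypothetical stand-in for `gh run delete` failing with a 401.
fake_delete() {
  echo "Error: HTTP 401: Bad credentials" >&2
  return 1
}

# 2>&1: the CLI error lands on stdout next to the skip message.
merged=$(
  if ! fake_delete 42 2>&1; then
    echo "skip run 42 (no rights / already gone)"
  fi
)

# 2>/dev/null: the CLI error is discarded; only the skip message remains.
quiet=$(
  if ! fake_delete 42 2>/dev/null; then
    echo "skip run 42 (no rights / already gone)"
  fi
)

printf 'with 2>&1:\n%s\n\nwith 2>/dev/null:\n%s\n' "$merged" "$quiet"
```

Discarding stderr also hides failures that are not permission-related, which is the trade-off the comment points out.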
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci-trigger.yml at line 52, The gh CLI delete command currently redirects stderr into stdout (`2>&1`) which causes both the CLI error and your custom skip message to appear; to suppress the CLI error and show only your skip message, change the redirection on the `gh -R ${{ github.repository }} run delete "$id"` invocation from `2>&1` to `2>/dev/null` so stderr is discarded, leaving your custom skip message as the sole output when deletion fails.
📒 Files selected for processing (1)
.github/workflows/ci-trigger.yml
📜 Review details
🔇 Additional comments (1)
.github/workflows/ci-trigger.yml (1)
43-49: LGTM!
PR #5812 (merged into GH-Actions) fixes the recurring HTTP 401 from CI-trigger.yml's cleanup step. This commit exists only to force a fresh CI run so PR #5809 picks up the new ci-trigger.yml and stops flipping the whole test matrix to red because one housekeeping API call lost the auth lottery. No source changes.



Summary
Fixes #5811.
The "Trigger workflow_run[completed]" step in `.github/workflows/ci-trigger.yml` opportunistically deletes skipped runs as housekeeping. The original implementation used `xargs -r -n1 gh run delete`.

`xargs -r -n1` propagates the first non-zero exit. Some `gh run delete` calls return `HTTP 401: Bad credentials` (typically on runs from forked PRs, other branches, or runs the PR-scoped `GITHUB_TOKEN` doesn't have rights to). One 401 → `xargs` exits 123 → step fails → `CI-trigger` job concludes `failure` → every downstream test workflow gated on `workflow_run.conclusion == 'success'` (legacy-g*, mysql84-*, mysql84-gr-*, set_parser_algorithm_3-g1, …) silently skips.

This has been today's recurring "why is CI failing on my PR": it hit at least PR #5803 and PR #5809 (multiple times), and is documented with the smoking-gun log lines in #5811.

The fix
Replace `xargs` with a `while read` loop so per-call failures are caught individually, logged, and don't abort the step.

What changes
- `CI-trigger` now reflects whether the trigger worked, not whether housekeeping was completely clean.

Test plan
- `CI-trigger`: the cleanup step should run cleanly (or log "skip" lines for runs it can't delete) and the job should conclude `success` instead of `failure`.

Alternatives considered (from #5811)
1. `continue-on-error: true` on the step: cheapest, but loses signal entirely.

Going with the minimal-behavior-change option (per-call tolerance) here. (1) can land as a separate PR if desired.
Summary by CodeRabbit