Skip to content

ci(trigger): tolerate per-call HTTP 401 in cleanup step (fixes #5811)#5812

Merged
renecannao merged 1 commit into
GH-Actionsfrom
ci-trigger-tolerate-cleanup-401
May 23, 2026
Merged

ci(trigger): tolerate per-call HTTP 401 in cleanup step (fixes #5811)#5812
renecannao merged 1 commit into
GH-Actionsfrom
ci-trigger-tolerate-cleanup-401

Conversation

@renecannao

@renecannao renecannao commented May 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #5811.

The "Trigger workflow_run[completed]" step in .github/workflows/ci-trigger.yml opportunistically deletes skipped runs as housekeeping. The original implementation:

gh run list -s skipped -L 100 --json databaseId -q '.[].databaseId' \
    | xargs -r -n1 gh run delete

xargs -r -n1 propagates the first non-zero exit. Some gh run delete calls return HTTP 401: Bad credentials (typically on runs from forked PRs, other branches, or runs the PR-scoped GITHUB_TOKEN doesn't have rights to). One 401 → xargs exits 123 → step fails → CI-trigger job concludes failure → every downstream test workflow gated on workflow_run.conclusion == 'success' (legacy-g*, mysql84-*, mysql84-gr-*, set_parser_algorithm_3-g1, …) silently skips.

This has been today's recurring "why is CI failing on my PR" — hit at least PR #5803, PR #5809 (multiple times), and is documented with the smoking-gun log lines in #5811.

The fix

Replace xargs with a while read loop so per-call failures are caught individually, logged, and don't abort the step:

-        gh run list -s skipped -L 100 --json databaseId -q '.[].databaseId' \
-            | xargs -r -n1 gh run delete
+        gh run list -s skipped -L 100 --json databaseId -q '.[].databaseId' \
+            | while read -r id; do
+                if ! gh run delete "$id" 2>&1; then
+                    echo "  skip run $id (no rights / already gone)"
+                fi
+            done

What changes

  • Cleanup scope: unchanged (still repo-wide, still the most recent 100 skipped runs).
  • Error handling: per-call failures (auth, already-deleted, etc.) are now caught with a single log line each, instead of aborting the entire step.
  • Job conclusion: CI-trigger now reflects whether the trigger worked, not whether housekeeping was completely clean.

Test plan

  • Diff is minimal — only the error-handling shape of the cleanup pipeline.
  • Verified by next PR push that exercises CI-trigger — the cleanup step should run cleanly (or log "skip" lines for runs it can't delete) and the job should conclude success instead of failure.

Alternatives considered (from #5811)

  1. Scope cleanup to current branch — cleaner long-term (eliminates the 401s instead of tolerating them), but changes behavior. Worth doing as a follow-up.
  2. continue-on-error: true on the step — cheapest, but loses signal entirely.
  3. Delete the cleanup step — relies on GitHub's retention. Worth considering if the housekeeping isn't actually needed.

Going with the minimal-behavior-change option (per-call tolerance) here. (1) can land as a separate PR if desired.

Summary by CodeRabbit

  • Chores
    • Enhanced CI workflow robustness by improving the housekeeping process for skipped workflow runs with more graceful error handling.

Review Change Stack

`CI-trigger`'s "Trigger workflow_run[completed]" step does
opportunistic housekeeping by listing every skipped run in the repo
and trying to delete each one. Some of those runs (typically from
forked PRs, other branches, or older retention windows) are not
deletable by the PR-scoped `GITHUB_TOKEN`, which surfaces as
`HTTP 401: Bad credentials` on the individual `gh run delete`
call.

The original pipeline used `xargs -r -n1 gh run delete`, which
propagates the first non-zero exit and fails the whole step. That
in turn marks the `CI-trigger` job conclusion as `failure`, and
every downstream test workflow that gates on
`workflow_run.conclusion == 'success'` (legacy-g*, mysql84-*,
mysql84-gr-*, set_parser_algorithm_3-g1, ...) is silently skipped.

Net result: a non-critical housekeeping failure blocks the entire
test matrix for the PR. It's been the recurring cause of "why is
CI failing on my PR" today, biting PR #5803, PR #5809, and others.

Switch from `xargs` to a `while read` loop so per-call failures
are caught individually, logged for visibility, and don't abort the
step. Cleanup scope is unchanged (still repo-wide, still capped at
the most recent 100 skipped runs) -- only the error-handling
behavior changes.
@gemini-code-assist

Copy link
Copy Markdown

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@coderabbitai

coderabbitai Bot commented May 23, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

The CI-trigger workflow's housekeeping cleanup step has been refactored to tolerate permission errors when deleting skipped workflow runs. The single xargs pipeline is replaced with an explicit loop that handles deletion failures gracefully per run instead of failing the entire step.

Changes

CI Housekeeping Robustness

Layer / File(s) Summary
Skipped run deletion error handling
.github/workflows/ci-trigger.yml
Skipped-run deletion logic is reworked from a compact xargs pipeline into a guarded per-run deletion loop with explanatory comments. Individual deletion failures (e.g., HTTP 401 due to insufficient token permissions) are caught and logged as "skip run" messages instead of causing the entire cleanup step to fail.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A cleanup that stumbled on each stubborn run,
Now loops through with grace when delete won't play nice,
No more 401 blocking the fun,
The CI pipeline finally rolls the dice. 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately describes the main change: replacing xargs with a loop that tolerates HTTP 401 failures in the cleanup step, directly addressing the linked issue.
Linked Issues check ✅ Passed The PR implements the second preferred fix from issue #5811: tolerating per-call auth failures by replacing xargs with a while-read loop that catches failures and continues deletion.
Out of Scope Changes check ✅ Passed The PR is narrowly scoped to the cleanup step in ci-trigger.yml; no unrelated changes or ancillary modifications are present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ci-trigger-tolerate-cleanup-401

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
.github/workflows/ci-trigger.yml (2)

50-55: ⚡ Quick win

Consider protecting against list command failures for complete robustness.

The current implementation correctly tolerates individual gh run delete failures via the if ! ... then construct. However, with GitHub Actions' default set -eo pipefail, if the gh run list command itself fails (due to transient API errors, network issues, etc.), the pipeline will still fail and mark the CI-trigger job as failed.

Since the PR describes this as "opportunistic housekeeping" that must not block downstream workflows, consider adding full protection:

🛡️ Option 1: Add `|| true` to tolerate any pipeline failure
 gh -R ${{ github.repository }} run list -s skipped -L 100 --json databaseId -q '.[].databaseId' \
     | while read -r id; do
         if ! gh -R ${{ github.repository }} run delete "$id" 2>&1; then
             echo "  skip run $id (no rights / already gone)"
         fi
-    done
+    done || true
🛡️ Option 2: Disable exit-on-error for the entire step
 echo "Cleanup skipped ..."
+set +e
 # Housekeeping only: opportunistically delete skipped runs. Per-call

Both approaches ensure that even if listing fails, the CI-trigger job succeeds and downstream test workflows can proceed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci-trigger.yml around lines 50 - 55, The gh run list
pipeline can fail and cause the step to exit under set -eo pipefail; modify the
step so a failing list does not break the job by making the pipeline tolerate
failure (e.g., ensure the gh -R ... run list ... | while ... command returns
success even if the list fails by appending a safe-fallback like "|| true" to
the pipeline or temporarily disabling exit-on-error around the sequence with
"set +e" before and "set -e" after), keeping the existing per-id deletion check
(the if ! gh -R ... run delete "$id" ... then) intact; target the gh run list
invocation and the surrounding shell step to implement this resilience.

52-52: 💤 Low value

Optional: Consider suppressing GitHub CLI error output for cleaner logs.

The 2>&1 redirection merges stderr to stdout, so both the GitHub CLI error message and your custom skip message will be printed for each failed deletion:

Error: HTTP 401: Bad credentials (https://api.github.com/...)
  skip run 12345 (no rights / already gone)

If you prefer less verbose output, you could suppress the CLI's error and show only your message:

if ! gh -R ${{ github.repository }} run delete "$id" 2>/dev/null; then

However, keeping the actual error visible can be helpful for debugging, so the current approach is reasonable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci-trigger.yml at line 52, The gh CLI delete command
currently redirects stderr into stdout (`2>&1`) which causes both the CLI error
and your custom skip message to appear; to suppress the CLI error and show only
your skip message, change the redirection on the `gh -R ${{ github.repository }}
run delete "$id"` invocation from `2>&1` to `2>/dev/null` so stderr is
discarded, leaving your custom skip message as the sole output when deletion
fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.github/workflows/ci-trigger.yml:
- Around line 50-55: The gh run list pipeline can fail and cause the step to
exit under set -eo pipefail; modify the step so a failing list does not break
the job by making the pipeline tolerate failure (e.g., ensure the gh -R ... run
list ... | while ... command returns success even if the list fails by appending
a safe-fallback like "|| true" to the pipeline or temporarily disabling
exit-on-error around the sequence with "set +e" before and "set -e" after),
keeping the existing per-id deletion check (the if ! gh -R ... run delete "$id"
... then) intact; target the gh run list invocation and the surrounding shell
step to implement this resilience.
- Line 52: The gh CLI delete command currently redirects stderr into stdout
(`2>&1`) which causes both the CLI error and your custom skip message to appear;
to suppress the CLI error and show only your skip message, change the
redirection on the `gh -R ${{ github.repository }} run delete "$id"` invocation
from `2>&1` to `2>/dev/null` so stderr is discarded, leaving your custom skip
message as the sole output when deletion fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e90193c2-48be-4f2f-8044-f44b29bd8c59

📥 Commits

Reviewing files that changed from the base of the PR and between 4b53979 and 4f041de.

📒 Files selected for processing (1)
  • .github/workflows/ci-trigger.yml
📜 Review details
🔇 Additional comments (1)
.github/workflows/ci-trigger.yml (1)

43-49: LGTM!

@renecannao renecannao merged commit 350c2b4 into GH-Actions May 23, 2026
1 check passed
@sonarqubecloud

Copy link
Copy Markdown

renecannao added a commit that referenced this pull request May 23, 2026
PR #5812 (merged into GH-Actions) fixes the recurring HTTP 401 from
CI-trigger.yml's cleanup step. This commit exists only to force a
fresh CI run so PR #5809 picks up the new ci-trigger.yml and stops
flipping the whole test matrix to red because one housekeeping API
call lost the auth lottery.

No source changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant