feat(prow-job-executor): abort gating E2E job on rollout cancellation (AROSLSRE-1339)#259
Open
raelga wants to merge 3 commits into
Open
feat(prow-job-executor): abort gating E2E job on rollout cancellation (AROSLSRE-1339)#259raelga wants to merge 3 commits into
raelga wants to merge 3 commits into
Conversation
… (AROSLSRE-1339) When the EV2 gating step is cancelled, the executor receives SIGTERM but previously left the postsubmit Prow E2E job running to completion. Propagate the cancellation by aborting the job via Gangway BulkJobStatusChange. Because one EV2 pipeline fans out to several regional E2E jobs that share the same job_name and refs, the bulk selector is region-blind and metav1.Time only serializes to one-second precision. A region-aware uniqueness guard lists concurrent same-job executions and skips the abort (fail-safe) if any sibling region shares the target's start-second, so a sibling region is never aborted. Gated behind --abort-on-cancel (default true).
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates prow-job-executor so that when the executor is externally cancelled (e.g., EV2 rollout cancellation leading to SIGTERM), it makes a best-effort attempt to abort the corresponding gating Prow E2E job via Gangway’s bulk status-change API, while adding a safety guard to avoid accidentally aborting a sibling regional execution.
Changes:
- Add cancellation-aware monitoring logic that distinguishes external cancellation from internal monitor timeout and triggers best-effort aborts on external cancellation.
- Implement Gangway bulk abort support (
BulkJobStatusChange) with an isolation guard based on concurrent executions and start-time second collisions. - Add
--abort-on-cancelflag (defaulttrue) and introduce focused unit tests covering abort/skip paths and monitor wiring.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/prow-job-executor/prowjob/monitor.go | Distinguishes parent cancellation vs internal timeout; triggers best-effort abort on cancellation. |
| tools/prow-job-executor/prowjob/client.go | Adds bulk-abort support, derives bulk endpoint URL, lists executions, and implements isolation guard logic. |
| tools/prow-job-executor/prowjob/abort_test.go | Adds unit tests for abort behavior, collision skip logic, and monitor cancellation behavior. |
| tools/prow-job-executor/options.go | Adds --abort-on-cancel flag and wires it into executor monitor creation. |
| tools/prow-job-executor/go.mod | Promotes protobuf dependency to direct requirement for protojson usage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ding Address review feedback on the cancellation-abort path: - handleCancellation now re-attaches the in-hand logger to the abort context via logr.NewContext. The client methods extract the logger with logr.FromContext and fail if it is absent, so the abort must not depend on the parent context carrying one. Added a regression test that aborts from a logger-less context. - ListExecutions builds its request URL through net/url instead of string concatenation, so a --gangway-url that already carries a query string or trailing '?' no longer corrupts the request.
…fs to abort Second-round review feedback: - deriveBulkURL now preserves any base-path prefix on --gangway-url, swapping the /v1/executions route for /v1/bulk-job-status-update instead of replacing the whole path. A prefixed URL (host/gangway/v1/executions) no longer drops the prefix and 404s the abort. - AbortJob refuses to abort when the job has no Spec.Refs. Refs are part of the bulk selector and the API cannot filter by job name, so a nil-refs abort could over-select. Skip instead. - Added tests for the prefixed-URL derivation and the no-refs no-op.
Comment on lines
+107
to
+109
| u.Path = strings.TrimSuffix(u.Path, executionsPath) + bulkStatusChangePath | ||
| u.RawQuery = "" | ||
| return u.String() |
Comment on lines
+294
to
+315
| isolated, err := c.abortWindowIsIsolated(ctx, logger, job, prowExecutionID) | ||
| if err != nil { | ||
| // If we cannot prove isolation we err on the side of caution and leave the | ||
| // job running rather than risk cancelling another region's execution. | ||
| logger.Error(err, "Could not verify the abort would be isolated to this execution; skipping abort") | ||
| return nil | ||
| } | ||
| if !isolated { | ||
| return nil | ||
| } | ||
|
|
||
| if job.Spec.Refs == nil { | ||
| // refs (org/repo) are part of the bulk selector and the API cannot filter | ||
| // by job name; aborting with nil refs could match a much broader set of | ||
| // jobs sharing only type/state. Refuse rather than risk over-selecting. | ||
| logger.Info("Job has no refs; skipping abort to avoid selecting unrelated jobs") | ||
| return nil | ||
| } | ||
| refs, err := prowgangway.FromCrdRefs(job.Spec.Refs) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to convert refs for job %s: %w", prowExecutionID, err) | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Jira: AROSLSRE-1339 · Sub-task: AROSLSRE-1349
Problem
When an EV2 gating rollout step is cancelled,
prow-job-executorreceivesSIGTERMbut previously just exited, leaving the postsubmit Prow E2E job running to completion. That wastes CI capacity and lets a job for an abandoned rollout keep reporting status.Goal
Propagate the rollout cancellation to the gating Prow job: when the executor is cancelled, abort the E2E job it submitted, without ever touching a sibling region's job.
What changes
BulkJobStatusChange(POST /v1/bulk-job-status-update) — there is no per-execution abort endpoint.SIGTERM) from an internal monitor timeout inWaitForCompletion; only the former triggers an abort, sent withcontext.WithoutCancel+ a 30s budget so it survives the shutdown grace period.job_nameandrefs, andmetav1.Timeonly serializes to one-second precision. Before aborting, list concurrent same-job/same-state executions and skip the abort (fail-safe) if any sibling shares the target's start-second.--abort-on-cancelflag (defaulttrue) to gate the behaviour.Example
Validation
From
tools/prow-job-executor:New tests cover abort of a running job, skip-on-region-collision, abort when regions differ in time, the no-op terminal paths, monitor wiring, and the bulk-URL/terminal-state helpers.
Follow-ups
github.com/Azure/ARO-Tools/tools/prow-job-executorpseudo-version in ARO-HCPtest/go.mod(consumed without areplace). Feature is on by default, so no ARO-HCP code change is needed.job_name, samerefs) are out of scope and documented as a residual; the guard only protects same-job_nameregional siblings.