fix(ci): convert .gitlab-ci.yml to pure DAG pipeline#279
Draft
wdconinc wants to merge 5 commits into
Draft
Conversation
Remove the top-level stages: list so that execution ordering is enforced exclusively through the needs: dependency graph. Keep individual stage: labels on jobs for UI grouping only. Add needs: [] to root-node jobs that have no upstream dependencies: - version - nvidia-smi - status:pending - .prune (and derived prune:gpu, prune:docker-new) - clean_internal_tag - .clean_unstable_mr (and derived clean_unstable_mr:gpu, :docker-new) - status:success - status:failure Add needs: [version] to spack-cache-cleanup so it can access the INTERNAL_TAG artifact produced by version. Remove the redundant dependencies: [] from status:success and status:failure, since needs: [] already prevents artifact download. Without stages:, GitLab cannot fall back to stage-based ordering when a needs: entry is accidentally omitted, making DAG violations immediately visible as pipeline validation errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GitLab requires stages: to be defined when jobs use custom stage names. Without it, GitLab falls back to the built-in stages only (.pre, build, test, deploy, .post), causing pipeline validation to fail for any job with stage: config, base, eic, etc. Keep stages: for stage name registration; all job ordering is still enforced purely through needs:. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the GitLab CI configuration to avoid mixed stage-ordered + DAG execution by moving the pipeline toward a “pure DAG” model using needs, aiming to prevent downstream container builds from running without the version job’s artifacts.
Changes:
- Removes the top-level
stages:list and keeps per-jobstage:labels for UI grouping. - Adds
needs: []to several root-node jobs (e.g.,version,nvidia-smi, and status jobs) and addsneeds: [version]tospack-cache-cleanup. - Replaces
dependencies: []withneeds: []for status reporting jobs.
Comments suppressed due to low confidence (2)
.gitlab-ci.yml:57
- Removing the top-level
stages:list may make the customstage:values used throughout this file (e.g.,config,status-pending,finalize, etc.) invalid in GitLab CI, depending on the runner/GitLab version (default stages are typically onlybuild/test/deploy). Please confirm the config still lints on your GitLab instance; if not, keepstages:(it can remain purely cosmetic ordering if every job usesneeds) or adjust stages to match GitLab defaults.
stages:
- status-pending
- config
- base ## base OS image
- eic ## EIC container images
.gitlab-ci.yml:896
- With
needs: [], this job becomes a DAG root and will be scheduled immediately, sostatus:successcan report "Succeeded!" before the build/test jobs finish (and even before later failures occur). To keep it as an end-of-pipeline status, make it depend on the terminal jobs in the pipeline (e.g., the last build/benchmark jobs, plus any always-run cleanup you want to wait for), or otherwise gate it so it cannot start until the pipeline is effectively complete.
stage: finalize
needs: [version]
when: always
allow_failure: true
script:
- docker buildx build
--no-cache
--target spack_cache_cleanup
Comment on lines
898
to
905
| . | ||
|
|
||
| status:success: | ||
| stage: status-report | ||
| dependencies: [] | ||
| needs: [] | ||
| extends: .status | ||
| variables: | ||
| STATE: "success" |
With needs: [], status:success ran immediately at pipeline start. Enumerate all non-manual terminal jobs (those nothing else transitively depends on) as explicit needs: so status:success only runs after the entire DAG has completed. Terminal jobs listed in needs:: - nvidia-smi (config; allow_failure: true) - user_spack_environment (benchmarks) - cuda:torch (benchmarks; allow_failure: true) - eic_xl:singularity:default/nightly (deploy) - benchmarks:geoviewer/detector/phyiscs:default (benchmarks) - benchmarks:detector/physics:nightly (benchmarks) - clean_pipeline:gpu/docker-new (finalize; when: always) - clean_unstable_mr:gpu/docker-new (finalize; when: always) - spack-cache-cleanup (finalize; when: always) status:failure keeps needs: [] + when: on_failure which in GitLab fires when any job in the pipeline fails (special behaviour of needs: [] + on_failure). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Many terminal jobs are conditional (rules:) and may not exist in every pipeline variant. GitLab requires optional: true for needs entries that may be absent. Mark all fifteen terminal-job needs as optional so the validator does not reject the config when any of them is skipped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The clean_pipeline and clean_unstable_mr jobs remove Docker images from runners. If they run before benchmarks/singularity jobs finish, those jobs can fail when their images are gone. Add all benchmark and singularity terminal jobs as optional needs (optional: true because they are conditional on rules) to both .clean_pipeline and .clean_unstable_mr base templates. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
.gitlab-ci.yml:892
spack-cache-cleanupextends.build, whoserulesforcewhen: on_success. The job-levelwhen: alwayshere won’t take effect, so this cleanup job won’t run when the pipeline fails (contrary to the intent implied bywhen: always). Consider overridingrules:inspack-cache-cleanup(or removingwhen: alwaysif it’s not meant to run on failures).
if [ "$status" == "failed" ] ; then docker rmi $repository:$tag ; fi ;
if [ "$status" == "canceled" ] ; then docker rmi $repository:$tag ; fi ;
done
allow_failure: true
clean_pipeline:gpu:
.gitlab-ci.yml:912
- The job name
benchmarks:phyiscs:defaultlooks like a typo (inconsistent withbenchmarks:physics:nightly). This makesneedslists and pipeline UX harder to reason about. Consider renaming the job tobenchmarks:physics:defaultand updating references accordingly (including thisneedsentry).
when: always
allow_failure: true
script:
- docker buildx build
.gitlab-ci.yml:903
- PR description says
status:successis a root-node job that should getneeds: [], but the implementation adds a non-emptyneeds:list here. Please align the description with the actual behavior (or adjust the job) so future readers don’t misinterpret how the pipeline is intended to work.
clean_pipeline:docker-new:
extends: .clean_pipeline
tags:
- docker-new
Comment on lines
900
to
905
| clean_pipeline:docker-new: | ||
| extends: .clean_pipeline | ||
| tags: | ||
| - docker-new | ||
|
|
||
| spack-cache-cleanup: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
ci: convert to pure DAG pipeline
Problem
The pipeline uses a mix of stage-ordered and DAG jobs, which produces
unexpected behavior when an early job fails.
Observed symptoms (e.g. job #7790629):
waterfall:upload(stagewaterfall, noneeds:) fails.version(stageconfig, noneeds:) is stage-ordered,GitLab blocks the entire
configstage —versionnever runs.base,eic, andbenchmarksare DAG jobs (they haveneeds:), GitLab treats their stuck dependencies as satisfied andschedules them immediately — container builds run without
versionartifacts, producing wrong or missing tags.
Root cause
In GitLab CI's mixed stage+DAG mode:
needs:follow strict stage ordering. A failure instage N blocks all subsequent stages for those jobs.
needs:bypass stage ordering entirely. If aneeds:dependency is stuck in a blocked (never-scheduled) stage, GitLab
treats it as satisfied, and the dependent job starts anyway.
Having even a single job without
needs:is enough to trigger thissplit behaviour.
Fix
Convert the pipeline to pure DAG mode:
stages:list (required for GitLab to recognise customstage names), but it no longer controls execution order — that is
enforced entirely by
needs:. With all jobs usingneeds:, stageordering has no effect, and accidentally omitting
needs:from anew job becomes an immediately visible error rather than silent wrong
behaviour.
needs: []to every root-node job (no upstreamdependencies):
version,nvidia-smi,status:pending,.prune,clean_internal_tag,.clean_unstable_mr,status:success,status:failure.needs: [version]tospack-cache-cleanupso it correctlyinherits the
INTERNAL_TAGdotenv artifact fromversion.dependencies: []fromstatus:successandstatus:failure— withneeds: [], no artifacts are downloadedanyway, so this was redundant.
The existing
needs:chains that enforce build ordering(
version → base → eic → benchmarks) are unchanged.Expected behaviour after this MR
version,nvidia-smi,status:pendingbaseversionsucceedseicbasesucceedsbenchmarkseicsucceedsspack-cache-cleanupversionsucceedsstatus:successstatus:failurewhen: on_failurewithneeds: []— fires when any pipeline job failsA failure in any single job (e.g. a future
waterfall:uploadfrom!278) no longer blocks unrelated jobs in the build chain.