
fix(ci): convert .gitlab-ci.yml to pure DAG pipeline #279

Draft
wdconinc wants to merge 5 commits into master from dag-pipeline

@wdconinc wdconinc commented May 10, 2026

ci: convert to pure DAG pipeline

Problem

The pipeline uses a mix of stage-ordered and DAG jobs, which produces
unexpected behavior when an early job fails.

Observed symptoms (e.g. job #7790629):

  1. waterfall:upload (stage waterfall, no needs:) fails.
  2. Because version (stage config, no needs:) is stage-ordered,
    GitLab blocks the entire config stage — version never runs.
  3. Because base, eic, and benchmarks are DAG jobs (they have
    needs:), GitLab treats their stuck dependencies as satisfied and
    schedules them immediately — container builds run without version
    artifacts, producing wrong or missing tags.

Root cause

In GitLab CI's mixed stage+DAG mode:

  • Jobs without needs: follow strict stage ordering. A failure in
    stage N blocks all subsequent stages for those jobs.
  • Jobs with needs: bypass stage ordering entirely. If a needs:
    dependency is stuck in a blocked (never-scheduled) stage, GitLab
    treats it as satisfied, and the dependent job starts anyway.

Having even a single job without needs: is enough to trigger this
split behaviour.
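
The mixed configuration described above can be pictured with a reduced sketch. The job and stage names come from this description; the stage order, the scripts, and the `build.env` file name are assumptions made for illustration, not the contents of the real `.gitlab-ci.yml`.

```yaml
# Hypothetical, reduced config showing the mixed stage + DAG layout.
stages:
  - waterfall
  - config
  - base

waterfall:upload:
  stage: waterfall
  script:
    - exit 1          # placeholder: simulate the failing early job

version:
  stage: config       # no needs:, so this job follows stage ordering
  script:
    - echo "INTERNAL_TAG=example" > build.env   # assumed file name
  artifacts:
    reports:
      dotenv: build.env

base:
  stage: base
  needs: [version]    # DAG job: scheduled from needs:, not from stage order
  script:
    - echo "building base with tag ${INTERNAL_TAG}"
```

Here `version` is the only link between the stage-ordered part and the DAG part, which is the situation the symptoms above were observed in.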

Fix

Convert the pipeline to pure DAG mode:

  • Keep the stages: list (required for GitLab to recognise custom
    stage names), but it no longer controls execution order — that is
    enforced entirely by needs:. With all jobs using needs:, stage
    ordering has no effect, and accidentally omitting needs: from a
    new job becomes an immediately visible error rather than silent wrong
    behaviour.
  • Add needs: [] to every root-node job (no upstream
    dependencies): version, nvidia-smi, status:pending, .prune,
    clean_internal_tag, .clean_unstable_mr, status:success,
    status:failure.
  • Add needs: [version] to spack-cache-cleanup so it correctly
    inherits the INTERNAL_TAG dotenv artifact from version.
  • Remove dependencies: [] from status:success and
    status:failure — with needs: [], no artifacts are downloaded
    anyway, so this was redundant.

The existing needs: chains that enforce build ordering
(version → base → eic → benchmarks) are unchanged.
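
A minimal sketch of the resulting layout, assuming placeholder scripts and a hypothetical `build.env` dotenv file; only the `needs:` entries, the stage names, and the `when:`/`allow_failure:` settings are taken from this PR.

```yaml
# Hypothetical, reduced config after the conversion to pure DAG mode.
stages:               # kept so the custom stage names stay valid
  - config
  - base
  - finalize

version:
  stage: config
  needs: []           # root node: no upstream dependencies
  script:
    - echo "INTERNAL_TAG=example" > build.env   # assumed file name
  artifacts:
    reports:
      dotenv: build.env

base:
  stage: base
  needs: [version]    # unchanged build-ordering chain
  script:
    - echo "building base with tag ${INTERNAL_TAG}"

spack-cache-cleanup:
  stage: finalize
  needs: [version]    # receives the INTERNAL_TAG dotenv artifact
  when: always
  allow_failure: true
  script:
    - echo "cleaning cache for ${INTERNAL_TAG}"
```

Every job carries a `needs:` key, so none of them waits on a whole stage.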

Expected behaviour after this MR

| Job | Starts when |
|---|---|
| version, nvidia-smi, status:pending | Pipeline created |
| base | version succeeds |
| eic | base succeeds |
| benchmarks | eic succeeds |
| spack-cache-cleanup | version succeeds |
| status:success | All non-manual terminal jobs complete successfully |
| status:failure | when: on_failure with needs: []; fires when any pipeline job fails |

A failure in any single job (e.g. a future waterfall:upload from
!278) no longer blocks unrelated jobs in the build chain.

Remove the top-level stages: list so that execution ordering is
enforced exclusively through the needs: dependency graph. Keep
individual stage: labels on jobs for UI grouping only.

Add needs: [] to root-node jobs that have no upstream dependencies:
- version
- nvidia-smi
- status:pending
- .prune (and derived prune:gpu, prune:docker-new)
- clean_internal_tag
- .clean_unstable_mr (and derived clean_unstable_mr:gpu, :docker-new)
- status:success
- status:failure

Add needs: [version] to spack-cache-cleanup so it can access the
INTERNAL_TAG artifact produced by version.

Remove the redundant dependencies: [] from status:success and
status:failure, since needs: [] already prevents artifact download.

Without stages:, GitLab cannot fall back to stage-based ordering
when a needs: entry is accidentally omitted, making DAG violations
immediately visible as pipeline validation errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 10, 2026 20:24
@wdconinc wdconinc changed the title from "ci: convert to pure DAG pipeline" to "fix(ci): convert .gitlab-ci.yml to pure DAG pipeline" May 10, 2026
GitLab requires stages: to be defined when jobs use custom stage
names. Without it, GitLab falls back to the built-in stages only
(.pre, build, test, deploy, .post), causing pipeline validation to
fail for any job with stage: config, base, eic, etc.

Keep stages: for stage name registration; all job ordering is still
enforced purely through needs:.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR updates the GitLab CI configuration to avoid mixed stage-ordered + DAG execution by moving the pipeline toward a “pure DAG” model using needs, aiming to prevent downstream container builds from running without the version job’s artifacts.

Changes:

  • Removes the top-level stages: list and keeps per-job stage: labels for UI grouping.
  • Adds needs: [] to several root-node jobs (e.g., version, nvidia-smi, and status jobs) and adds needs: [version] to spack-cache-cleanup.
  • Replaces dependencies: [] with needs: [] for status reporting jobs.
Comments suppressed due to low confidence (2)

.gitlab-ci.yml:57

  • Removing the top-level stages: list may make the custom stage: values used throughout this file (e.g., config, status-pending, finalize, etc.) invalid in GitLab CI, depending on the runner/GitLab version (default stages are typically only build/test/deploy). Please confirm the config still lints on your GitLab instance; if not, keep stages: (it can remain purely cosmetic ordering if every job uses needs) or adjust stages to match GitLab defaults.
stages:
  - status-pending
  - config
  - base            ## base OS image
  - eic             ## EIC container images

.gitlab-ci.yml:896

  • With needs: [], this job becomes a DAG root and will be scheduled immediately, so status:success can report "Succeeded!" before the build/test jobs finish (and even before later failures occur). To keep it as an end-of-pipeline status, make it depend on the terminal jobs in the pipeline (e.g., the last build/benchmark jobs, plus any always-run cleanup you want to wait for), or otherwise gate it so it cannot start until the pipeline is effectively complete.
  stage: finalize
  needs: [version]
  when: always
  allow_failure: true
  script:
    - docker buildx build
        --no-cache
        --target spack_cache_cleanup

Comment thread .gitlab-ci.yml
Comment on lines 898 to 905
 status:success:
   stage: status-report
-  dependencies: []
+  needs: []
   extends: .status
   variables:
     STATE: "success"
wdconinc and others added 2 commits May 10, 2026 15:30
With needs: [], status:success ran immediately at pipeline start.
Enumerate all non-manual terminal jobs (those nothing else transitively
depends on) as explicit needs: so status:success only runs after the
entire DAG has completed.

Terminal jobs listed in needs:
  - nvidia-smi (config; allow_failure: true)
  - user_spack_environment (benchmarks)
  - cuda:torch (benchmarks; allow_failure: true)
  - eic_xl:singularity:default/nightly (deploy)
  - benchmarks:geoviewer/detector/phyiscs:default (benchmarks)
  - benchmarks:detector/physics:nightly (benchmarks)
  - clean_pipeline:gpu/docker-new (finalize; when: always)
  - clean_unstable_mr:gpu/docker-new (finalize; when: always)
  - spack-cache-cleanup (finalize; when: always)

status:failure keeps needs: [] + when: on_failure, which in GitLab
fires when any job in the pipeline fails (special behaviour of
needs: [] + on_failure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Many terminal jobs are conditional (rules:) and may not exist in every
pipeline variant. GitLab requires optional: true for needs entries that
may be absent. Mark all fifteen terminal-job needs as optional so the
validator does not reject the config when any of them is skipped.
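
As a sketch of what such a `needs:` list can look like, the fragment below shows three of the fifteen entries using GitLab's `job:` / `optional:` form. The job names are taken from the list in the previous commit message; the surrounding keys mirror the `status:success` snippet in this review, and the remaining entries are left out rather than guessed.

```yaml
# Fragment only: three of the terminal jobs, each marked optional so the
# config stays valid when rules: removes a job from the pipeline.
status:success:
  stage: status-report
  extends: .status
  needs:
    - job: nvidia-smi
      optional: true
    - job: user_spack_environment
      optional: true
    - job: spack-cache-cleanup
      optional: true
    # remaining terminal jobs follow the same pattern
  variables:
    STATE: "success"
```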

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 10, 2026 20:39
The clean_pipeline and clean_unstable_mr jobs remove Docker images from
runners. If they run before benchmarks/singularity jobs finish, those
jobs can fail when their images are gone. Add all benchmark and
singularity terminal jobs as optional needs (optional: true because they
are conditional on rules) to both .clean_pipeline and .clean_unstable_mr
base templates.
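
A hedged sketch of the template change for `.clean_pipeline`. The stage, `when: always`, and the optional-needs idea come from the commit messages above; the fully spelled-out job names are an assumed expansion of the shorthand `eic_xl:singularity:default/nightly` and `benchmarks:detector/physics:nightly`, and the script is a placeholder.

```yaml
# Hypothetical template fragment: cleanup waits for image consumers.
.clean_pipeline:
  stage: finalize
  when: always          # still run cleanup when earlier jobs failed
  needs:
    - job: "eic_xl:singularity:default"    # assumed expansion
      optional: true
    - job: "eic_xl:singularity:nightly"    # assumed expansion
      optional: true
    - job: "benchmarks:detector:nightly"   # assumed expansion
      optional: true
    - job: "benchmarks:physics:nightly"
      optional: true
    # remaining benchmark jobs follow the same pattern
  script:
    - echo "remove images for this pipeline"   # placeholder
```

Jobs that extend the template, such as `clean_pipeline:gpu`, inherit the `needs:` list unless they define their own.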

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

.gitlab-ci.yml:892

  • spack-cache-cleanup extends .build, whose rules force when: on_success. The job-level when: always here won’t take effect, so this cleanup job won’t run when the pipeline fails (contrary to the intent implied by when: always). Consider overriding rules: in spack-cache-cleanup (or removing when: always if it’s not meant to run on failures).
        if [ "$status" == "failed" ] ; then docker rmi $repository:$tag ; fi ;
        if [ "$status" == "canceled" ] ; then docker rmi $repository:$tag ; fi ;
      done
  allow_failure: true

clean_pipeline:gpu:
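
One way to act on the suggestion above, sketched under the assumption that `.build` defines a `rules:` list whose entries set `when: on_success` (its actual contents are not shown on this page). Because `extends` replaces lists rather than merging them, a job-level `rules:` drops every condition inherited from `.build`, so any conditions that should still apply would need to be restated.

```yaml
# Hypothetical override: replace the inherited rules so the job also
# runs after failures. The script is a placeholder.
spack-cache-cleanup:
  extends: .build
  stage: finalize
  needs: [version]
  allow_failure: true
  rules:
    - when: always      # replaces, not extends, the rules from .build
  script:
    - echo "cleanup"
```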

.gitlab-ci.yml:912

  • The job name benchmarks:phyiscs:default looks like a typo (inconsistent with benchmarks:physics:nightly). This makes needs lists and pipeline UX harder to reason about. Consider renaming the job to benchmarks:physics:default and updating references accordingly (including this needs entry).
  when: always
  allow_failure: true
  script:
    - docker buildx build

.gitlab-ci.yml:903

  • PR description says status:success is a root-node job that should get needs: [], but the implementation adds a non-empty needs: list here. Please align the description with the actual behavior (or adjust the job) so future readers don’t misinterpret how the pipeline is intended to work.
clean_pipeline:docker-new:
  extends: .clean_pipeline
  tags:
    - docker-new

Comment thread .gitlab-ci.yml
Comment on lines 900 to 905
clean_pipeline:docker-new:
  extends: .clean_pipeline
  tags:
    - docker-new

spack-cache-cleanup:
@wdconinc wdconinc marked this pull request as draft May 10, 2026 23:34
