rfc: execution lifecycle consolidation #3
Open
lewisjared wants to merge 5 commits into
Consolidate diagnostic execution lifecycle (allocation, dispatch, run, classify, publish, ingest, finalise) into one deep module behind a single Transport port. Surface ResourceHint on Diagnostic so providers can declare memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future adaptive scheduling without a schema change.
Mermaid classDiagram is idiomatic for the Protocol + adapters pattern. LR layout fits PR width; method signatures stay legible without HTML hacks.
Compress prose, drop per-design subsections in Rationale (table tells the story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three diagrams kept; technical substance preserved.
Summary
Consolidate the lifecycle of one diagnostic execution
(allocation, dispatch, run, classify, publish, ingest, finalise)
into a single deep module (`ExecutionLifecycle`) sitting behind one
`Transport` port.
Surface a declarative `ResourceHint` on every `Diagnostic` so providers can
express memory / CPU / wall-clock once, and capture per-execution `Telemetry`
so future adaptive scheduling lands as a feature addition rather than a
schema migration.
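A minimal sketch of how these pieces could fit together, assuming a `typing.Protocol` for the port. The field names, the `submit` and `execute` methods, and the `InlineTransport` adapter are all hypothetical illustrations and are not taken from the RFC text.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass(frozen=True)
class ResourceHint:
    """Declarative resource expectations (field names are assumptions)."""

    memory_gb: float = 4.0
    cpus: int = 1
    wall_clock_s: float = 3600.0


@dataclass(frozen=True)
class Telemetry:
    """Per-execution measurements (fields are assumptions)."""

    wall_clock_s: float
    succeeded: bool


class Transport(Protocol):
    """Hypothetical port: the single seam the lifecycle talks to."""

    def submit(self, task: Callable[[], None], hint: ResourceHint) -> Telemetry: ...


class InlineTransport:
    """Toy adapter that runs the task in-process and ignores the hint."""

    def submit(self, task: Callable[[], None], hint: ResourceHint) -> Telemetry:
        start = time.monotonic()
        try:
            task()
            ok = True
        except Exception:
            ok = False
        return Telemetry(wall_clock_s=time.monotonic() - start, succeeded=ok)


@dataclass
class Diagnostic:
    """Stand-in for a diagnostic that declares its hint once."""

    name: str
    run: Callable[[], None]
    resource_hint: ResourceHint = field(default_factory=ResourceHint)


class ExecutionLifecycle:
    """Owns the steps around the transport; here only dispatch and record."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self.history: list[Telemetry] = []

    def execute(self, diagnostic: Diagnostic) -> Telemetry:
        telemetry = self._transport.submit(diagnostic.run, diagnostic.resource_hint)
        self.history.append(telemetry)  # kept so a scheduler could adapt later
        return telemetry
```

Under these assumptions, swapping `InlineTransport` for a SLURM- or Celery-backed adapter would not change `ExecutionLifecycle`, which is the point of putting a single port in front of the scheduler.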
Motivation in one paragraph
Today the lifecycle of one execution is fragmented across ~8 files in 2
packages.
`_is_system_error` lives in `climate-ref-core/executor.py`;
`CondaCommandError` handling lives beside it; missing-log handling lives in
`result_handling.py`; per-task timeout enforcement only exists in
`LocalExecutor`; `CeleryExecutor` enforces no per-task timeout at all;
`ExecutionGroup.dirty` is toggled in three places.
Providers have nowhere to declare resource expectations: ESMValTool
diagnostics that need 16 GB of memory or 8 hours of wall-clock cannot say
so, which blocks any future SLURM/PBS/K8s transport from doing its job.
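To illustrate why a declared hint matters to a batch transport, here is a rough sketch of translating one into `sbatch` options. The `ResourceHint` fields and the `sbatch_args` helper are hypothetical; only the `--mem`, `--cpus-per-task`, and `--time` flags are standard Slurm options, and the CPU count is an illustrative value not given in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceHint:
    """Hypothetical shape of a provider-declared hint."""

    memory_gb: int
    cpus: int
    wall_clock_minutes: int


def sbatch_args(hint: ResourceHint) -> list[str]:
    """Translate a hint into sbatch command-line options."""
    return [
        f"--mem={hint.memory_gb}G",            # memory per node, in gigabytes
        f"--cpus-per-task={hint.cpus}",        # CPUs for the single task
        f"--time={hint.wall_clock_minutes}",   # a bare number is read as minutes
    ]


# The 16 GB / 8 hour case from the paragraph above; cpus=4 is illustrative.
esmvaltool_hint = ResourceHint(memory_gb=16, cpus=4, wall_clock_minutes=8 * 60)
print(sbatch_args(esmvaltool_hint))
# prints ['--mem=16G', '--cpus-per-task=4', '--time=480']
```

A PBS or K8s transport would presumably map the same hint onto its own vocabulary, which is why the hint is declared once on the diagnostic rather than per transport.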
Reading order
Direct link to the rendered RFC:
text/0000-execution-lifecycle.md
Key sections:
Scope guard
This RFC is not about replacing SLURM, PBS, K8s, or Celery as schedulers.
It is about defining the seam between climate-ref and a scheduler so the
scheduler has enough information to do its job, and so the lifecycle around
the scheduler is one well-tested module instead of eight thin ones.
Process
Following the instructions in this repo's README:
once a PR number is assigned, the RFC file and the RFC PR link inside the
file will be renamed/updated in a follow-up commit on this branch.