feat(standalone_driver): single-node NetCDF/UGRID output via the common io module#1307
Conversation
…prognostics + diagnostics) via common io module
…together in the same output
Common io module: - remove the 1 ms output-time tolerance: the model clock advances by exact datetime arithmetic, so no drift is possible (keep >= so a step landing past a scheduled output time is not skipped) - drop redundant mode=0o777 from mkdir calls (it is the default, umask-masked) - add GridIdentifier type alias (uuid.UUID | str) for uuidOfHGrid - IOMonitor accepts pathlib.Path for the grid file - to_data_array copies the attrs dict internally instead of mutating the caller's (fixes latent pollution of the shared CF tables); add to_host_data_array, moved from the driver, which forces buffers to host - add is_on_interface to OptionalMetaData Standalone driver: - replace the ad-hoc prognostic spec tuple with an explicit _PrognosticFieldSpec (state attribute name + FieldMetaData) - output is on by default, so import the io stack unconditionally: lazy imports, TYPE_CHECKING guards and pytest.importorskip removed (broken installs now fail loudly); uv.lock now actually pins icon4py-common[io] - rename the output subfolder from io/ to output/ - create_io_monitor rejects sub-second dtime instead of silently truncating the interval to 0 seconds; drop the unused process_properties parameter - tighten annotations (gtx.Field, gtx.typing.Backend | None, wpfloat) - main() takes all arguments explicitly; defaults live only in cli() - tests: dims-based expected layouts, backend-parameterized host-transfer test, remove "Phase 0" comments
- driver_io: drop the _PrognosticFieldSpec dataclass and getattr; assemble prognostics via direct, type-checked attribute access keyed by output name - io: type IOMonitor.grid_file_name strictly as pathlib.Path (callers convert) - driver_io: re-add the process_props parameter to create_io_monitor (named per convention, reserved for the distributed IO path) - standalone_driver main: replace the click() console-script shim with a module-level typer.Typer() app; point the pyproject entry point at main:app - tests: express expected prognostic output layout with FieldMetaData
Revert the GridIdentifier (uuid.UUID | str) union and type the grid id strictly as uuid.UUID throughout the IO layer: - writers: uuidOfHGrid back to uuid.UUID; drop the GridIdentifier alias - io: IOMonitor / FieldGroupMonitor take grid_id: uuid.UUID - driver: convert at the boundary with uuid.UUID(grid.id) (Grid.id holds the file's uuidOfHGrid as a string; the synthetic simple grid is not a UUID) - tests: stamp a UUID onto the simple grid for the monitor-construction tests Verified: ruff/mypy/tach, driver + common io unit tests, and the JW output integration datatest (gtfn_cpu) all pass.
common/io: - schedule output by step count: FieldGroupIOConfig.output_interval_steps (write every N calls to store); remove the time-based path (to_delta, OutputInterval, start_time, next_output_time). Output now lands on exact step boundaries regardless of the (possibly non-integer) time step length - rename FieldMetaData.is_on_interface -> is_on_half_levels - merge to_host_data_array into to_data_array(..., to_host=...) - update the module docstring examples to the step-based config standalone_driver: - drive prognostic output from states.data.PROGNOSTIC_CF_ATTRIBUTES directly (CF names, icon_var_name lookup, is_on_half_levels); drop the parallel metadata table - rename DiagnosticInputs -> DiagnosticFields; derive them from the field factories instead of passing them into the driver constructor - add opt-in initial-conditions output (output_initial_state) - nc_title/nc_comment tidy-up; de-"boxed" docstrings - revert the main:app entry-point change back to the click() shim (the entry point fix comes with #1304) Verified: pre-commit, common io + driver unit tests, and the JW output integration datatest (gtfn_cpu) all pass.
- to_data_array: document that the result usually references the field buffer (a view), and that to_host only copies on GPU - _store_output: note the assembled DataArrays alias the live state and must be written before the next step mutates it
There was a problem hiding this comment.
Pull request overview
This PR wires icon4py.model.common.io into the standalone driver to produce single-node NetCDF/UGRID output for the full model state, and updates the common IO layer to be more robust and step-scheduled.
Changes:
- Add a standalone-driver IO bridge (
driver_io.py) that assembles prognostic state + computes diagnostics at output time, and creates anIOMonitor. - Enable (default-on) output in the standalone driver loop with CLI toggles, and disable output automatically for MPI runs.
- Update
common.ioto use step-based output scheduling and to support GPU-safe host-buffer extraction for NetCDF writing; add/adjust unit and integration tests.
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Update lockfile to include icon4py-common[io] for the standalone driver. |
| model/standalone_driver/pyproject.toml | Make icon4py-common[io] a required dependency (output default-on). |
| model/standalone_driver/src/icon4py/model/standalone_driver/main.py | Split CLI from programmatic main(), add --enable-output/--no-enable-output, add click() entry point. |
| model/standalone_driver/src/icon4py/model/standalone_driver/standalone_driver.py | Initialize IO monitor, store during time loop, close at end; disable output in MPI runs. |
| model/standalone_driver/src/icon4py/model/standalone_driver/driver_io.py | New driver-side bridge: DataArray assembly, diagnostic computation, and monitor factory. |
| model/standalone_driver/tests/standalone_driver/unit_tests/test_driver_io.py | Unit tests for DataArray assembly and IO monitor construction (no disk IO). |
| model/standalone_driver/tests/standalone_driver/integration_tests/test_driver_io_output.py | Datatest that runs one JW step and verifies NetCDF/UGRID output contents/shapes. |
| model/standalone_driver/tests/standalone_driver//test_.py | Update existing tests to pass new required args and disable output where not under test. |
| model/common/src/icon4py/model/common/io/io.py | Switch IO scheduling from time-based strings to step-based intervals; output-dir handling change. |
| model/common/src/icon4py/model/common/io/utils.py | Add to_host option and rename is_on_interface → is_on_half_levels; copy attrs defensively. |
| model/common/src/icon4py/model/common/io/ugrid.py | Rename vertical dim mapping flag to is_on_half_levels. |
| model/common/src/icon4py/model/common/io/init.py | Update module docs to reflect step-based IO config and fix links/examples. |
| model/common/src/icon4py/model/common/states/data.py | Mark upward_air_velocity as being on half-levels in metadata. |
| model/common/src/icon4py/model/common/states/model.py | Extend metadata TypedDict with optional is_on_half_levels. |
| model/common/tests/common/io/utils.py | Update test helper for renamed to_data_array arguments. |
| model/common/tests/common/io/unit_tests/test_io.py | Update tests for step-based scheduling and new config validation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _zero_full() -> gtx.Field: | ||
| return data_alloc.zero_field( | ||
| grid, dims.CellDim, dims.KDim, dtype=ta.wpfloat, allocator=backend | ||
| ) | ||
|
|
||
| def _zero_interface() -> gtx.Field: | ||
| return data_alloc.zero_field( | ||
| grid, | ||
| dims.CellDim, | ||
| dims.KDim, | ||
| dtype=ta.wpfloat, | ||
| extend={dims.KDim: 1}, | ||
| allocator=backend, | ||
| ) | ||
|
|
||
| # dry air: all hydrometeors are zero | ||
| qv, qc, qi, qr, qs, qg = (_zero_full() for _ in range(6)) | ||
| temperature = _zero_full() | ||
| virtual_temperature = _zero_full() | ||
| u = _zero_full() | ||
| v = _zero_full() | ||
| pressure = _zero_full() | ||
| # Typed as Any: gt4py's NDArrayObject protocol does not expose __setitem__, so the | ||
| # in-place buffer fill below would not type-check against the precise Field type. | ||
| pressure_ifc: Any = _zero_interface() | ||
| surface_pressure_k = _zero_interface() | ||
|
|
There was a problem hiding this comment.
Done — compute_diagnostics became a DiagnosticsComputer class that allocates the ~14 scratch/output buffers once and reuses them on every compute() call. The driver holds one instance (via a cached property), so the buffers are allocated only when output is actually enabled, then reused each step.
There was a problem hiding this comment.
This is already great, but possibly should be synced with however we decide to group states in the near/mid future (and IO should follow what the driver does and reuse from there).
can you add a todo as a reminder also mentioning physics (which heavily uses diagnostic vars)?
There was a problem hiding this comment.
or open an issue documenting this at https://github.com/orgs/C2SM/projects/27
# Conflicts: # model/standalone_driver/src/icon4py/model/standalone_driver/main.py # model/standalone_driver/src/icon4py/model/standalone_driver/standalone_driver.py # model/standalone_driver/tests/standalone_driver/integration_tests/test_initial_condition.py # model/standalone_driver/tests/standalone_driver/integration_tests/test_standalone_driver.py # model/standalone_driver/tests/standalone_driver/mpi_tests/test_parallel_initial_conditions.py # model/standalone_driver/tests/standalone_driver/mpi_tests/test_parallel_standalone_driver.py
…o the run dir - main: --enable-output/--no-enable-output toggles DriverConfig.enable_output (default off), so output is selectable from the console script - driver_io: drop the redundant OUTPUT_SUBDIR nesting; the monitor writes directly into the run output directory instead of <output_path>/output/
- common/io: FieldGroupIOConfig accepts exactly one of output_interval_steps (every N steps) or output_interval + start_time (time string, e.g. "10 HOURS"); restore OutputInterval/to_delta and a time-based capture path (>= so a step past the scheduled time still writes). Driver keeps using step-based. - driver: compute_diagnostics -> DiagnosticsComputer, which allocates its ~14 scratch/output buffers once and reuses them per step (avoids per-step allocation, notably on GPU); driver holds it via a cached_property. - tests: restore to_delta tests, add dual-mode validation + a write-free time-based capture test; document both schedule modes in the io docstring.
…s enabled Drop the output_initial_state flag from DriverConfig; the driver now always writes the initial state (before the first step) whenever output is enabled. Integration test asserts the two resulting time slices (initial state + one step).
Output is opt-in (--enable-output, default off), so the "enabled by default" comment was inaccurate. icon4py-common[io] stays a required dependency for now; making it an optional extra is left to a follow-up.
# Conflicts: # model/standalone_driver/src/icon4py/model/standalone_driver/standalone_driver.py
| return refinement_control_fields | ||
|
|
||
| @property | ||
| def file_path(self) -> str: |
There was a problem hiding this comment.
Argh, not new... but why do we store this as a str inside GridManager 😞
If you assume at the grid_manager.file_path call site that you can convert it to a Path, I'd force it already here to be a Path, or even better in GridManager.init (I'm not insisting since that touches parts way outside this PR).
There was a problem hiding this comment.
I agree that this is weird. However, let's leave it like this for now and deal with it in a follow up PR. It is a bit out of scope currently ;)
| # Exactly one scheduling mode is set (enforced by the config validation). | ||
| self._step_based = config.output_interval_steps is not None | ||
| if config.output_interval_steps is not None: | ||
| self._output_interval_steps = config.output_interval_steps | ||
| self._step_counter = 0 | ||
| else: | ||
| assert config.output_interval is not None and config.start_time is not None | ||
| self._next_output_time = dt.datetime.fromisoformat(config.start_time) | ||
| self._time_delta = to_delta(config.output_interval) |
There was a problem hiding this comment.
Same comment here about time or step intervals: I think it's nicer if it's normalized to steps only at this point, and then all the logic afterwards can use steps?
There was a problem hiding this comment.
Exactly what it does now: the interval is normalized to steps once (in _interval_in_steps, called from FieldGroupMonitor.__init__), and from there on the monitor only ever counts steps — _at_capture_time is a plain step_counter % N. The model-clock path is gone. (c95a7de)
There was a problem hiding this comment.
To do for later: use the type aliases from
.…write output - io: replace the dual output_interval_steps/output_interval(str)+start_time schedule with a single output_interval (int steps or datetime.timedelta), normalized to steps once at monitor construction via dtime; drop the model-clock path and the to_delta string parser. Add DeltaT/NumTimeSteps/OutputInterval type aliases. - io: FieldGroupMonitor refuses to overwrite an existing data file (the per-run file counter restarts at 0), so a rerun into a populated directory fails loudly. - driver: thread dtime through create_io_monitor -> IOMonitor -> FieldGroupMonitor. - driver_io: add TODO on sharing diagnostic buffers with future state-grouping/physics. - tests: stamp the fake grid uuid in the grid fixture instead of per-test; cover the timedelta->steps normalization, too-short-interval error, and overwrite refusal.
The IOMonitor source grid file is used solely to regenerate the UGRID topology (`_write_ugrid`); the grid identity comes from `grid_id` (the `Grid` object), not the file. Remove the unused `_read_grid_attrs` so the file path has one clear purpose, and leave a TODO to eventually build the UGRID topology from `Grid`/`GridGeometry` and drop the file dependency entirely.
Use the OutputInterval union alias directly in the isinstance check and render it into the error message, so both update automatically if the alias changes.
jcanton
left a comment
There was a problem hiding this comment.
lots of minor comments, it's looking good!
| self.close() | ||
| if not self._at_capture_time(): | ||
| return | ||
| # TODO(halungge): this should do a deep copy of the data |
There was a problem hiding this comment.
Kept it on purpose. It's not needed today because we write synchronously: store() reads the field buffers and appends them to the NetCDF file immediately, before the model takes the next step and mutates them. The deep copy only becomes relevant once IO is asynchronous/buffered (queueing captures or accumulating slices in memory) — likely with the parallel-IO work — so the TODO is a useful marker for that.
| return self._next_output_time == model_time | ||
| def _at_capture_time(self) -> bool: | ||
| # fire every N model steps (one call to ``store`` == one model step) | ||
| self._step_counter += 1 |
There was a problem hiding this comment.
this counter update should probably not live in this method
There was a problem hiding this comment.
Agreed — moved it out. _at_capture_time() is now a pure predicate (step_counter % N == 0); the increment happens in store() instead, where "one call == one model step". (878bb18)
| ], | ||
| ) | ||
| def test_fieldgroup_config_validate_filename(start_time, filename, interval, variables, message): | ||
| def test_fieldgroup_config_validate_filename(filename, output_interval, variables, message): |
There was a problem hiding this comment.
not you but missing type annotations
can we add some parameterized with the DeltaT output_interval?
There was a problem hiding this comment.
also: when parameterizing by output_interval: OutputInterval specify the parameterization as NumTimeSteps(1) / DeltaT(whatever)
safer for (disabled) type checking
There was a problem hiding this comment.
The whole file is type-checked now ;)
There was a problem hiding this comment.
not you but missing type annotations in several (all?) test defs
There was a problem hiding this comment.
can you enable type checking for this file?
The NETCDFWriter assigned gt4py/cupy device arrays directly into netCDF4 variables. On a GPU backend this raised `TypeError: Implicit conversion to a NumPy array is not allowed` from cupy, because netCDF4/HDF5 write from host memory and cupy forbids implicit device->host conversion. Route the three write sites (interface heights, new-variable init, and time-slice append) through `data_alloc.as_numpy(...)`, which performs an explicit D2H copy for cupy/gt4py fields and is a no-op for numpy (so CPU backends are unaffected). The CPU-only unit `test_writers.py` did not exercise this; the GPU integration test (`test_driver_io_output.py`) now passes on both gtfn_gpu and dace_gpu. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Rename the output dimension `interface_level` -> `half_level` everywhere (writers/ugrid/cf_utils + tests) to match the `is_on_half_levels` metadata flag; CF standard_name values are left untouched. (no UGRID/CF standard mandates the name) - simple_grid() now carries a fixed synthetic UUID id, so tests no longer patch it. - Use the OutputInterval/NumTimeSteps/DeltaT aliases explicitly; add a ruff extend-immutable-calls allowlist so they are valid as defaults (one noqa for RUF009). - Move the per-step counter increment out of the _at_capture_time predicate into store(). - Add a restart TODO at the refuse-to-overwrite check and a state-grouping/physics refactor TODO on DiagnosticsComputer. - standalone_driver: _store_output takes simulation_current_datetime: AbsoluteTime; annotate the try/finally; shorten the --enable-output help text. - Enable mypy type-checking for test_io.py: annotate all test defs, wrap grid_id with uuid.UUID, fix two errors in the io test utils helper.
| #: Output schedule: either a number of model steps (``int``) or a simulation-time | ||
| #: delta (``datetime.timedelta``); a delta is normalized to steps using the model time | ||
| #: step. Defaults to every step. |
There was a problem hiding this comment.
not necessary now, but the types in the comment could be updated
|
Mandatory Tests Please make sure you run these tests via comment before you merge!
Optional Tests To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe. |
|
cscs-ci run default |
|
cscs-ci run distributed |
|
cscs-ci run dace |
|
cscs-ci run extra |
What
Wires
icon4py.model.common.iointo the standalone driver for single-nodeNetCDF/UGRID output of the full model state (phase 0; parallel output is a follow-up).
Built on the config-driven driver (after merging
main).Output
Opt-in via
--enable-output(orDriverConfig.enable_output; default off). Writes theinitial state and then on a schedule, into one file per segment
<output_path>/icon4py_output_NNNN.nc(+ a*_ugrid.ncmesh) — all 10 fields together,by CF name:
air_density,exner_function,virtual_potential_temperature,upward_air_velocity,normal_velocityeastward_wind,northward_wind,temperature,virtual_temperature,pressureSchedule: every N model steps (
output_interval_steps, default 1) or asimulation-time interval (
output_interval+start_time, e.g."10 HOURS").How
driver_io.py— state → CF/UGRIDDataArrays (host-side, GPU-safe); metadata drivenfrom
states/data.py(CF names +icon_var_name+is_on_half_levels);DiagnosticsComputer(scratch/output buffers allocated once, reused per step).standalone_driver.py— monitor built ininitialize_driver(single-node guard),per-step store in the time loop,
close()infinally.common/io— step- or time-based scheduling;grid_idstrictlyuuid.UUID;to_data_array(to_host=...).pyproject.toml—icon4py-common[io]is a required driver dependency.Testing
Unit tests (
test_driver_io.py,test_io.py) + JW integration datatest(
test_driver_io_output.py). pre-commit, unit suites, and the datatest are green.Out of scope (follow-ups)