Summary
When running inference over multiple initial conditions (optionally × ensemble members), the time coordinate written to the output is not guaranteed to correspond to the initial condition that was actually loaded for that output. If the ordering used to assign output times differs from the order in which ICs are loaded, every output is silently mislabeled in time — no error is raised, and the predictions themselves are physically correct, so the bug is invisible unless you verify content against the label. This breaks any date-based verification against a reference dataset and any downstream analysis that reads the init time from the output's time[0].
This is a correctness bug in the inference output path (fme/ace/inference/ — the initial-condition data loading and the time coordinate propagated through loop.py → data_writer/), not a problem specific to one experiment. Any multi-IC and/or multi-member inference is potentially affected.
Expected vs actual
- Expected: an output for initial condition i (member j) carries IC i's own timestamp, i.e. output.time[0] == IC_i.time + lead_step.
- Actual: the output time is assigned from a source that can be ordered differently from the ICs as loaded, so output.time and the loaded IC content come from two different orderings of the same date list.
Likely mechanism
The output time appears to be derived from an index/base ordering rather than propagated from each loaded IC's timestamp. With N ICs and M ensemble members, the IC content is enumerated in one order while the time is enumerated in another (e.g. IC-major vs member-major), so they line up only by coincidence at the endpoints. The fix is to carry each loaded IC's timestamp through to the writer rather than reconstruct it from a counter.
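To make the suspected mismatch concrete, here is a minimal sketch of the two enumeration orders. It is not the fme code: the function names and the toy years are illustrative assumptions, and it only shows why the two orders agree at the first and last index.

```python
def ic_major(ic_years, n_members):
    """Enumerate (year, member) with the IC varying slowest."""
    return [(year, member) for year in ic_years for member in range(n_members)]


def member_major(ic_years, n_members):
    """Enumerate (year, member) with the member varying slowest."""
    return [(year, member) for member in range(n_members) for year in ic_years]


# Toy setup (hypothetical values): 3 ICs x 2 members.
ic_years = [2000, 2001, 2002]
content_order = ic_major(ic_years, 2)     # order the IC content is loaded in
label_order = member_major(ic_years, 2)   # order the time labels are generated in

# Output indices where the stamped year happens to equal the loaded year.
agree = [
    m
    for m, (content, label) in enumerate(zip(content_order, label_order))
    if content[0] == label[0]
]
print(agree)  # [0, 5]: only the first and last outputs line up
```

If the real code path behaves like this, any run with at least 2 ICs and at least 2 members would be mislabeled everywhere except the endpoints.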
Suggested fix
- Ensure the prediction output time coordinate is propagated from the initial condition that was loaded (single source of truth: the IC dataset's own time), through loop.py and the data_writer, for every (IC, member) combination.
- Add a regression test (e.g. alongside fme/ace/inference/data_writer/test_*.py or test_inference.py): run a small inference over ≥2 ICs × ≥2 members with distinct, known IC dates and assert each output's time[0] equals that IC's date + lead step. The current behavior would pass a single-IC test but fail this one, which is likely why it slipped through.
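The core assertion of such a test could look like the sketch below. It does not call any fme API: `mislabeled_outputs` is a hypothetical helper, the 6-hour lead step is an assumed value, and the "buggy" records are simulated by pairing IC-major content with member-major labels rather than produced by a real inference run.

```python
from datetime import datetime, timedelta


def mislabeled_outputs(records, lead_step):
    """Return indices of outputs whose first time is not IC time + lead_step.

    records: list of (ic_time, output_time0) pairs, one per (IC, member) output.
    """
    return [
        m
        for m, (ic_time, output_time0) in enumerate(records)
        if output_time0 != ic_time + lead_step
    ]


lead_step = timedelta(hours=6)  # assumed lead step, for illustration only
ic_times = [datetime(2000, 1, 1), datetime(2001, 1, 1)]
n_members = 2

# Correct behavior: each output is stamped from the IC that was actually loaded.
good = [(t, t + lead_step) for t in ic_times for _ in range(n_members)]

# Simulated bug: content loaded IC-major, labels generated member-major.
loaded = [t for t in ic_times for _ in range(n_members)]
stamped = [t for _ in range(n_members) for t in ic_times]
bad = [(ic, label + lead_step) for ic, label in zip(loaded, stamped)]

assert mislabeled_outputs(good, lead_step) == []
print(mislabeled_outputs(bad, lead_step))  # [1, 2]
```

In a real test the records would come from the written output datasets and the known IC dates. With a single IC both orderings are identical, so the check passes trivially.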
Concrete reproduction (the bug as observed)
Run: gs://vcm-ml-intermediate/2026-06-16-ace2s-land-feedback-inference/frameworkB-era5/segment_*/landfeedback_ic{NNNN}.zarr (branch exp/ace2s-land-feedback-inference), 96 outputs = 48 init-years (1977–2024) × 2 members.
For each output I correlated its day-1 deseasonalized anomaly against ERA5's same-calendar-day anomaly for every candidate year (spatial Pearson):
- Every output matches some ERA5 year at r ≈ 0.99 (content is valid).
- The correlation at the year written in time is ≈ 0 for 94/96 outputs (only the two endpoints coincide).
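The content-vs-label check can be sketched as follows. This is a simplified stand-in for the analysis described above, not the script that was actually run: fields are flattened to plain lists, the Pearson correlation is unweighted (no area weighting), and the anomaly values are made-up toy numbers.

```python
import math


def pearson(x, y):
    """Unweighted Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_matching_year(output_anom, reference_by_year):
    """Correlate one output against every candidate year; return the best year and all scores."""
    scores = {
        year: pearson(output_anom, ref) for year, ref in reference_by_year.items()
    }
    return max(scores, key=scores.get), scores


# Toy flattened anomaly fields (hypothetical values).
reference_by_year = {
    1977: [1.0, -1.0, 0.5, -0.5],
    1978: [-0.5, 0.5, 1.0, -1.0],
}
output_anom = [1.1, -0.9, 0.4, -0.6]  # resembles 1977 regardless of its label

year, scores = best_matching_year(output_anom, reference_by_year)
print(year)  # 1977
```

Comparing the best-matching year with the year stamped in time, output by output, is what exposes the permutation.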
The label vs content relationship is a deterministic permutation of the output index m (0–95):
TRUE year (content) = 1977 + floor(m/2) # IC-major: 1977,1977,1978,1978,…
STAMPED year (time) = 1977 + (m mod 48) # member-major: 1977…2024, then again
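The two formulas above can be checked directly. The sketch below only encodes them as stated (48 init-years from 1977, 2 members); the constant and function names are illustrative. The resulting per-index year offset is what a relabeling of the existing outputs would need, assuming the permutation holds for every output.

```python
BASE_YEAR = 1977
N_ICS = 48
N_MEMBERS = 2
N_OUTPUTS = N_ICS * N_MEMBERS  # 96


def true_year(m):
    """Year of the IC content actually loaded for output m (IC-major)."""
    return BASE_YEAR + m // N_MEMBERS


def stamped_year(m):
    """Year written to the time coordinate of output m (member-major)."""
    return BASE_YEAR + m % N_ICS


coincide = [m for m in range(N_OUTPUTS) if true_year(m) == stamped_year(m)]
print(coincide)                   # [0, 95]
print(N_OUTPUTS - len(coincide))  # 94 mislabeled outputs

# Years to add to each output's stamped time to recover the true init year.
year_offset = {m: true_year(m) - stamped_year(m) for m in range(N_OUTPUTS)}
```

This reproduces the 94/96 mismatch count, with only m = 0 and m = 95 coinciding.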
Likely also affects frameworkB-cm4-rs0 / frameworkB-cm4-rs1 (different IC count — not yet checked).
Impact / severity
- Silent: no error; predictions are physically valid, so it is undetectable without content-vs-label verification.
- Corrupts any lead-time skill verification (forecast aligned to the wrong truth year).
- Existing affected outputs are recoverable by relabeling (the true time is deterministically derivable), so data need not be regenerated — but the code must be fixed so future inference runs are correct, and a test added so it cannot regress.