Disable corrector for the first N epochs by elynnwu · Pull Request #1260 · ai2cm/ace

elynnwu · 2026-06-10T21:35:23Z

Adds a `corrector_disabled_epochs` option to corrector configs that skips the corrector for train-mode steps during the first N training epochs. The corrector is still always applied in eval mode (validation and inference), so only training is affected.

Some runs get frozen precipitation stuck at exactly zero (e.g. job): the force-positive clamp zeroes the sparse frozen precip channel whenever an early update makes the raw prediction negative, and gradients through the clamp are zero there, so it never recovers. Disabling the corrector for the first epoch lets the loss see the raw prediction so gradients can pull the channel positive before the clamp returns.
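
The failure mode described above can be reproduced with a toy scalar model. This is only an illustrative sketch, not code from the PR: the function `train`, the learning rate, and the step counts are assumptions chosen for the example, with a single weight standing in for the frozen precipitation channel.

```python
def train(w, target, steps, clamp_from_step, lr=0.1):
    """Gradient descent on a scalar 'channel' w with a squared-error loss.

    From step `clamp_from_step` onward the prediction is clamped to >= 0,
    mimicking a force-positive corrector; before that the raw value is used.
    """
    for step in range(steps):
        clamped = step >= clamp_from_step
        pred = max(w, 0.0) if clamped else w
        # d(pred)/d(w) is 0 where the clamp is active and w < 0, otherwise 1.
        dpred_dw = 0.0 if (clamped and w < 0.0) else 1.0
        grad = 2.0 * (pred - target) * dpred_dw
        w -= lr * grad
    return w


# Clamp active from the start: the negative weight never receives a gradient.
stuck = train(w=-0.5, target=0.2, steps=50, clamp_from_step=0)

# Clamp disabled for the first 10 steps: the raw loss pulls w positive first.
recovered = train(w=-0.5, target=0.2, steps=50, clamp_from_step=10)
```

In this sketch `stuck` stays at its initial value, while `recovered` converges toward the target because the weight is already positive when the clamp is switched back on.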

`corrector_disabled_epochs` is a field on `CorrectorConfigABC`. When the value is > 0, the config's `@final` `get_corrector` wraps the built corrector in a new `EpochScheduledCorrector`, which skips an arbitrary wrapped corrector in train mode during the disabled epochs. Concrete configs just implement `_get_corrector` and are unaware of scheduling. The flag is valid only on the corrector config: `CorrectorSelector` rejects it on itself and delegates to the wrapped config, so there's one place to set it and no double-wrapping. The disabled flag is persisted in checkpoint state so mid-epoch resume keeps the interrupted epoch's behavior. The option defaults to 0, so existing configs and checkpoints are unaffected.
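
A minimal sketch of the wrapper behavior described above, under stated assumptions. The method names `train`, `set_epoch`, `get_state`, and `load_state` come from this PR, but their signatures, the 0-indexed epoch convention, the state dictionary layout, and the classes `ForcePositive` and `EpochScheduledSketch` are hypothetical stand-ins operating on a scalar rather than on model output.

```python
class ForcePositive:
    """Stand-in for a real corrector: clamps a scalar to be non-negative."""

    def __call__(self, value: float) -> float:
        return max(value, 0.0)


class EpochScheduledSketch:
    """Skips the wrapped corrector in train mode during the first N epochs."""

    def __init__(self, corrector, disabled_epochs: int):
        self._corrector = corrector
        self._disabled_epochs = disabled_epochs
        self._training = True
        self._disabled = disabled_epochs > 0

    def train(self, mode: bool = True) -> None:
        self._training = mode

    def set_epoch(self, epoch: int) -> None:
        # Assumption: epochs are 0-indexed, so epochs 0..N-1 are disabled.
        self._disabled = epoch < self._disabled_epochs

    def get_state(self) -> dict:
        return {"disabled": self._disabled}

    def load_state(self, state: dict) -> None:
        self._disabled = state["disabled"]

    def __call__(self, value: float) -> float:
        if self._training and self._disabled:
            return value  # the raw prediction reaches the loss
        return self._corrector(value)


scheduled = EpochScheduledSketch(ForcePositive(), disabled_epochs=1)
scheduled.set_epoch(0)
raw_in_train = scheduled(-0.3)      # train mode, epoch 0: corrector skipped
scheduled.train(False)
clamped_in_eval = scheduled(-0.3)   # eval mode: corrector always applied

# Resume: a freshly built wrapper restores the saved flag from a checkpoint.
scheduled.train(True)
scheduled.set_epoch(1)
resumed = EpochScheduledSketch(ForcePositive(), disabled_epochs=1)
resumed.load_state(scheduled.get_state())
clamped_after_resume = resumed(-0.3)
```

Persisting only the boolean flag, as assumed here, is enough for a resumed run to behave like the interrupted epoch without re-deriving the schedule.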

Config

```yaml
corrector:
  conserve_dry_air: true
  force_positive_names:
  - frozen_precipitation_rate
  corrector_disabled_epochs: 1   # inside `config:` for selector-based correctors
```

Changes

CorrectorConfigABC: corrector_disabled_epochs field + @final get_corrector wrapping via new abstract _get_corrector; CorrectorABC gains default no-op train/set_epoch/get_state/load_state; new EpochScheduledCorrector wrapper.
CorrectorSelector rejects corrector_disabled_epochs (must be set on the wrapped config).
Atmosphere/Ocean/Ice configs implement _get_corrector; step layer + Stepper forward set_epoch/train/get_state/load_state to the corrector.
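
The config-side changes listed above follow a template-method pattern, which might look roughly like the sketch below. Every class here (`CorrectorConfigSketch`, `ForcePositiveConfig`, `SelectorSketch`, `ScheduledSketch`) is a hypothetical stand-in, and how the real `CorrectorSelector` rejects the flag is an assumption; only the field name, the `@final` `get_corrector`, and the abstract `_get_corrector` come from the PR description.

```python
import abc
import dataclasses
from typing import Callable, final

Corrector = Callable[[float], float]


class ScheduledSketch:
    """Placeholder for the scheduling wrapper; only records what it wraps."""

    def __init__(self, corrector: Corrector, disabled_epochs: int):
        self.corrector = corrector
        self.disabled_epochs = disabled_epochs

    def __call__(self, value: float) -> float:
        return self.corrector(value)


@dataclasses.dataclass
class CorrectorConfigSketch(abc.ABC):
    corrector_disabled_epochs: int = 0

    @final
    def get_corrector(self) -> Corrector:
        # Scheduling lives here once; subclasses never see it.
        corrector = self._get_corrector()
        if self.corrector_disabled_epochs > 0:
            return ScheduledSketch(corrector, self.corrector_disabled_epochs)
        return corrector

    @abc.abstractmethod
    def _get_corrector(self) -> Corrector:
        ...


@dataclasses.dataclass
class ForcePositiveConfig(CorrectorConfigSketch):
    def _get_corrector(self) -> Corrector:
        return lambda value: max(value, 0.0)


@dataclasses.dataclass
class SelectorSketch:
    config: CorrectorConfigSketch
    corrector_disabled_epochs: int = 0

    def __post_init__(self) -> None:
        # Assumed rejection mechanism: fail fast if set on the selector.
        if self.corrector_disabled_epochs != 0:
            raise ValueError("set corrector_disabled_epochs on the wrapped config")

    def get_corrector(self) -> Corrector:
        return self.config.get_corrector()


plain = ForcePositiveConfig().get_corrector()
wrapped = SelectorSketch(
    config=ForcePositiveConfig(corrector_disabled_epochs=1)
).get_corrector()
```

Because the selector only delegates, the wrapper is applied at most once regardless of how the config is reached.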

Adds a `TrainStepperABC.set_epoch(epoch)` hook with a no-op default and wires it from the trainer at fresh-epoch boundaries (mid-epoch resume preserves in-module state so partial-epoch accumulators continue from where they left off). `Stepper` and `CoupledStepper` implement `set_epoch` by walking submodules and invoking `request_latent_global_mean_envelope_reset` where present, giving model components a way to reset per-epoch in-module statistics without coupling the stepper to model internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
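
The submodule walk in this commit message can be illustrated with a small duck-typed sketch. It assumes nothing about the real stepper: `Block`, `EnvelopeBlock`, and the free function `set_epoch` are hypothetical, and `Block.modules()` merely imitates the depth-first traversal that a module tree would provide.

```python
class Block:
    """Hypothetical module node that can hold child modules."""

    def __init__(self, children=()):
        self.children = list(children)

    def modules(self):
        # Yield this node and then every descendant, depth first.
        yield self
        for child in self.children:
            yield from child.modules()


class EnvelopeBlock(Block):
    """A component that opts in to per-epoch resets by defining the method."""

    def __init__(self):
        super().__init__()
        self.reset_requested = False

    def request_latent_global_mean_envelope_reset(self):
        self.reset_requested = True


def set_epoch(root, epoch):
    # Call the reset hook wherever it exists; other modules are untouched,
    # so the caller needs no knowledge of model internals.
    for module in root.modules():
        reset = getattr(module, "request_latent_global_mean_envelope_reset", None)
        if callable(reset):
            reset()


model = Block([Block(), EnvelopeBlock(), Block([EnvelopeBlock()])])
set_epoch(model, epoch=1)
flags = [m.reset_requested for m in model.modules() if isinstance(m, EnvelopeBlock)]
```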

When enabled, the per-channel spatial mean of the post-encoder latent is tracked during training and, in eval, the latent is shifted so that its mean falls within the observed envelope (a no-op when the mean is already inside it). This bounds the global mean of the latent that the transformer blocks see at inference to the range observed in training. The envelope is reset at the start of each training epoch (lazily, on the next training-mode forward) via `request_latent_global_mean_envelope_reset`, which the stepper invokes through the `TrainStepperABC.set_epoch` hook. Exposed as a single `clip_latent_global_means: bool` option on `SFNONetConfig` and `NoiseConditionedSFNOBuilder`; defaults to `False` so existing models are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
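
A rough sketch of the envelope logic, using plain Python lists in place of tensors. The class `LatentMeanEnvelopeSketch`, its `request_reset` method, the uniform additive shift, and the choice to leave the latent untouched before any training data is seen are all assumptions made for illustration; the actual implementation may differ.

```python
import math


class LatentMeanEnvelopeSketch:
    """Tracks per-channel latent means in training and clips them in eval."""

    def __init__(self, n_channels: int):
        self._n = n_channels
        self._reset_pending = False
        self._clear()

    def _clear(self) -> None:
        self.lo = [math.inf] * self._n
        self.hi = [-math.inf] * self._n

    def request_reset(self) -> None:
        # Lazy: the envelope is cleared on the next training-mode call.
        self._reset_pending = True

    def __call__(self, latent, training: bool):
        # latent is a list of channels, each a list of spatial values.
        means = [sum(channel) / len(channel) for channel in latent]
        if training:
            if self._reset_pending:
                self._clear()
                self._reset_pending = False
            self.lo = [min(lo, m) for lo, m in zip(self.lo, means)]
            self.hi = [max(hi, m) for hi, m in zip(self.hi, means)]
            return latent
        shifted = []
        for channel, m, lo, hi in zip(latent, means, self.lo, self.hi):
            # Assumption: with an empty envelope, leave the channel as is.
            target = m if lo > hi else min(max(m, lo), hi)
            shifted.append([v + (target - m) for v in channel])
        return shifted


envelope = LatentMeanEnvelopeSketch(n_channels=1)
envelope([[0.0, 2.0]], training=True)   # channel mean 1.0
envelope([[2.0, 4.0]], training=True)   # channel mean 3.0
outside = envelope([[4.0, 6.0]], training=False)  # mean 5.0 is above 3.0
inside = envelope([[1.0, 3.0]], training=False)   # mean 2.0 is in range

envelope.request_reset()
envelope([[10.0, 12.0]], training=True)  # envelope restarts at mean 11.0
after_reset = envelope([[4.0, 6.0]], training=False)
```

Here the out-of-range eval latent is shifted down so its mean sits on the upper bound, and the in-range latent passes through unchanged.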

elynnwu · 2026-06-11T19:03:46Z

These two jobs are identical except that the new job disables the corrector for the first epoch: wandb. The new job does not show the zero frozen precipitation issue. Note that I was able to reproduce the zero frozen precip behavior in 3 experiments, all with n384 (one with equal loss, two with ACE2 loss weighting on Beaker and Perlmutter).

mcgibbon

LGTM, but consider doing the corrector refactor sooner rather than later. I can also act on it if you like; a Claude agent pointed at your issue and this PR should be able to handle it.

These changes move `corrector_disabled_epochs` out of `SingleModuleStep` and into the shared corrector configuration layer. Correctors are now wrapped by `EpochScheduledCorrector`, which can disable any corrector during the first N training epochs while still applying it during eval/validation/inference. This makes the behavior available to atmosphere, ocean, ice, and future correctors without copying logic into each Step. Step train/eval, epoch, and checkpoint state now forward through the corrector lifecycle so mid-epoch resume preserves the scheduler state. Resolves #1261

mcgibbon and others added 6 commits June 5, 2026 16:57

Update NoiseConditionedSFNO backwards-compat checkpoint for clip_late…

d3b4321

…nt_global_means

Merge branch 'main' into feature/clip-latent-global-means

b3f3ca4

Merge branch 'main' into feature/clip-latent-global-means

e578447

disable corrector in the beginning of training

8561348

elynnwu changed the title ~~disable corrector in the beginning of training~~ Disable corrector for the first N epochs Jun 10, 2026

Arcomano1234 mentioned this pull request Jun 10, 2026

Add clip_latent_global_means option to conditional SFNO #1230 (Merged, 2 tasks)

Base automatically changed from feature/clip-latent-global-means to main June 11, 2026 19:18

Merge branch 'main' into feature/disable-corrector-first-epochs

dbf0496

mcgibbon reviewed Jun 11, 2026

Comment thread fme/core/step/single_module.py Outdated

mcgibbon reviewed Jun 11, 2026

Comment thread fme/core/step/single_module.py Outdated

address PR comments

180c343

elynnwu mentioned this pull request Jun 11, 2026

Move corrector_disabled_epochs scheduling into corrector layer #1261 (Open)

Merge branch 'main' into feature/disable-corrector-first-epochs

e9d3f05

mcgibbon approved these changes Jun 12, 2026

elynnwu and others added 7 commits June 15, 2026 14:34

Merge branch 'main' into feature/disable-corrector-first-epochs

b4daf28

avoid having >2 levels of inheritance

bfc5731

Merge branch 'main' into feature/disable-corrector-first-epochs

a29848a

remove redundant tests

84387a3

Merge branch 'feature/disable-corrector-first-epochs' of github.com:a…

8ffd7c1

…i2cm/ace into feature/disable-corrector-first-epochs

Merge branch 'main' into feature/disable-corrector-first-epochs

021f6b6

elynnwu enabled auto-merge (squash) June 16, 2026 21:06

Merge branch 'main' into feature/disable-corrector-first-epochs

2c45561

elynnwu merged commit 5232d60 into main Jun 17, 2026
7 checks passed

elynnwu deleted the feature/disable-corrector-first-epochs branch June 17, 2026 21:53

elynnwu merged 17 commits into main from feature/disable-corrector-first-epochs

2 participants
