Disable corrector for the first N epochs#1260

Merged: elynnwu merged 17 commits into main from feature/disable-corrector-first-epochs on Jun 17, 2026

Conversation

@elynnwu elynnwu commented Jun 10, 2026

Adds a corrector_disabled_epochs option to corrector configs that skips the corrector for train-mode steps during the first N training epochs. The corrector is still always applied in eval mode (validation and inference), so only training is affected.

Some runs get frozen precipitation stuck at exactly zero (e.g. job): the force-positive clamp zeroes the sparse frozen precip channel whenever an early update makes the raw prediction negative, and gradients through the clamp are zero there, so it never recovers. Disabling the corrector for the first epoch lets the loss see the raw prediction so gradients can pull the channel positive before the clamp returns.
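
The failure mode above can be reproduced with a toy scalar example. This is only an illustrative sketch, not code from this PR: it assumes a single trainable raw value, a squared-error loss, and plain gradient descent, and every name in it is hypothetical.

```python
def grad_with_clamp(raw: float, target: float) -> float:
    # Derivative of (max(raw, 0) - target)**2 with respect to raw.
    # Where the clamp is active (raw <= 0 in this toy) the output is
    # constant, so no gradient reaches raw.
    if raw <= 0.0:
        return 0.0
    return 2.0 * (raw - target)


def grad_without_clamp(raw: float, target: float) -> float:
    # Derivative of (raw - target)**2 with respect to raw.
    return 2.0 * (raw - target)


def train(raw: float, target: float, clamp_from_step: int,
          steps: int = 50, lr: float = 0.1) -> float:
    # Gradient descent where the clamp is only applied from
    # `clamp_from_step` onward (0 means "always clamped").
    for step in range(steps):
        if step >= clamp_from_step:
            grad = grad_with_clamp(raw, target)
        else:
            grad = grad_without_clamp(raw, target)
        raw -= lr * grad
    return raw


# An early update has pushed the raw prediction negative.
stuck = train(raw=-0.5, target=0.2, clamp_from_step=0)
recovered = train(raw=-0.5, target=0.2, clamp_from_step=10)
```

With the clamp always on, the raw value never moves from -0.5. With the clamp off for the first 10 steps, it becomes positive before the clamp returns and then keeps converging toward the target.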

corrector_disabled_epochs is a field on CorrectorConfigABC. When the value is > 0, the config's @final get_corrector wraps the built corrector in a new EpochScheduledCorrector, which skips an arbitrary wrapped corrector in train mode during the disabled epochs. Concrete configs just implement _get_corrector and are unaware of scheduling. The flag is valid only on the corrector config itself: CorrectorSelector rejects it and delegates to the wrapped config, so there's one place to set it and no double-wrapping. The disabled flag is persisted in checkpoint state so mid-epoch resume keeps the interrupted epoch's behavior. The option defaults to 0, so existing configs and checkpoints are unaffected.
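
A minimal sketch of the wrapper behavior described above, for readers who want the idea without opening the diff. It is not the code in this PR: the method signatures, the 0-based epoch index, the initial value of the disabled flag, and the ClampCorrector stand-in are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, float]


class ClampCorrector:
    """Hypothetical stand-in for a real corrector: clamps named fields at zero."""

    def __init__(self, names: Iterable[str]):
        self.names = set(names)

    def __call__(self, state: State) -> State:
        return {k: (max(v, 0.0) if k in self.names else v) for k, v in state.items()}


class EpochScheduledCorrector:
    """Skips the wrapped corrector in train mode during the first N epochs."""

    def __init__(self, corrector: Callable[[State], State], disabled_epochs: int):
        self._corrector = corrector
        self._disabled_epochs = disabled_epochs
        self._training = True
        self._disabled = False  # assumption: only set_epoch/load_state change this

    def train(self, mode: bool = True) -> None:
        self._training = mode

    def set_epoch(self, epoch: int) -> None:
        # Assumes a 0-based epoch index, called at fresh-epoch boundaries.
        self._disabled = epoch < self._disabled_epochs

    def get_state(self) -> Dict[str, bool]:
        return {"disabled": self._disabled}

    def load_state(self, state: Dict[str, bool]) -> None:
        self._disabled = bool(state["disabled"])

    def __call__(self, state: State) -> State:
        if self._training and self._disabled:
            return state  # corrector skipped, loss sees the raw prediction
        return self._corrector(state)


raw = {"frozen_precipitation_rate": -0.1, "air_temperature": 280.0}
corrector = EpochScheduledCorrector(ClampCorrector(["frozen_precipitation_rate"]), 1)

corrector.set_epoch(0)
corrector.train(True)
train_out = corrector(raw)   # epoch 0, train mode: skipped
corrector.train(False)
eval_out = corrector(raw)    # eval mode: always applied

# Mid-epoch resume: a fresh wrapper restores the saved flag.
resumed = EpochScheduledCorrector(ClampCorrector(["frozen_precipitation_rate"]), 1)
resumed.load_state(corrector.get_state())
resumed_out = resumed(raw)

corrector.set_epoch(1)
corrector.train(True)
later_out = corrector(raw)   # epoch 1, train mode: applied again
```

Persisting the flag rather than recomputing it on load is what lets a run resumed mid-epoch keep the behavior of the interrupted epoch even though set_epoch is not called again until the next epoch boundary.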

Config

corrector:
  conserve_dry_air: true
  force_positive_names:
  - frozen_precipitation_rate
  corrector_disabled_epochs: 1   # inside `config:` for selector-based correctors

Changes

  • CorrectorConfigABC: corrector_disabled_epochs field + @final get_corrector wrapping via new abstract _get_corrector; CorrectorABC gains default no-op train/set_epoch/get_state/load_state; new EpochScheduledCorrector wrapper.
  • CorrectorSelector rejects corrector_disabled_epochs (must be set on the wrapped config).
  • Atmosphere/Ocean/Ice configs implement _get_corrector; step layer + Stepper forward set_epoch/train/get_state/load_state to the corrector.
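
The wrapping and rejection rules in the list above follow a template-method pattern, which might look roughly like the sketch below. The class names CorrectorConfigABC and CorrectorSelector come from this PR, but the constructors, the ScheduledWrapper marker class, and ForcePositiveConfig are hypothetical simplifications; the real configs may be structured quite differently.

```python
import abc
from typing import Callable, final

Corrector = Callable[[float], float]


class ScheduledWrapper:
    """Hypothetical marker standing in for EpochScheduledCorrector."""

    def __init__(self, corrector: Corrector, disabled_epochs: int):
        self.corrector = corrector
        self.disabled_epochs = disabled_epochs


class CorrectorConfigABC(abc.ABC):
    def __init__(self, corrector_disabled_epochs: int = 0):
        self.corrector_disabled_epochs = corrector_disabled_epochs

    @abc.abstractmethod
    def _get_corrector(self) -> Corrector:
        """Concrete configs build their corrector here, unaware of scheduling."""

    @final
    def get_corrector(self):
        # The single place where scheduling is applied.
        corrector = self._get_corrector()
        if self.corrector_disabled_epochs > 0:
            return ScheduledWrapper(corrector, self.corrector_disabled_epochs)
        return corrector


class ForcePositiveConfig(CorrectorConfigABC):
    """Hypothetical concrete config."""

    def _get_corrector(self) -> Corrector:
        return lambda x: max(x, 0.0)


class CorrectorSelector(CorrectorConfigABC):
    def __init__(self, config: CorrectorConfigABC, corrector_disabled_epochs: int = 0):
        if corrector_disabled_epochs != 0:
            raise ValueError(
                "corrector_disabled_epochs must be set on the wrapped config"
            )
        super().__init__(0)
        self.config = config

    def _get_corrector(self):
        # Delegates; the wrapped config does any wrapping exactly once.
        return self.config.get_corrector()


plain = ForcePositiveConfig().get_corrector()
wrapped = CorrectorSelector(ForcePositiveConfig(corrector_disabled_epochs=1)).get_corrector()

try:
    CorrectorSelector(ForcePositiveConfig(), corrector_disabled_epochs=1)
    rejected = False
except ValueError:
    rejected = True
```

Because the selector's own value is always 0, its get_corrector never adds a second wrapper around the one built by the wrapped config.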

mcgibbon and others added 6 commits June 5, 2026 16:57
Adds a TrainStepperABC.set_epoch(epoch) hook with a no-op default and
wires it from the trainer at fresh-epoch boundaries (mid-epoch resume
preserves in-module state so partial-epoch accumulators continue from
where they left off).

Stepper and CoupledStepper implement set_epoch by walking submodules
and invoking request_latent_global_mean_envelope_reset where present,
giving model components a way to reset per-epoch in-module statistics
without coupling the stepper to model internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
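
The submodule walk described in this commit message can be sketched as follows. This is an illustration only: the Module class is a hypothetical stand-in whose modules() method mirrors the behavior of torch.nn.Module.modules(), and the Stepper here is reduced to the one method under discussion.

```python
from typing import Iterable, Iterator


class Module:
    """Hypothetical stand-in for a module tree node."""

    def __init__(self, children: Iterable["Module"] = ()):
        self.children = list(children)

    def modules(self) -> Iterator["Module"]:
        # Yields this module and then all of its descendants.
        yield self
        for child in self.children:
            yield from child.modules()


class EnvelopeTracker(Module):
    """Hypothetical component that keeps per-epoch statistics."""

    def __init__(self):
        super().__init__()
        self.reset_requested = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        self.reset_requested = True


class Stepper:
    def __init__(self, module: Module):
        self.module = module

    def set_epoch(self, epoch: int) -> None:
        # Invoke the reset hook wherever it exists, without the stepper
        # knowing which concrete module types provide it.
        for m in self.module.modules():
            hook = getattr(m, "request_latent_global_mean_envelope_reset", None)
            if callable(hook):
                hook()


tracker = EnvelopeTracker()
stepper = Stepper(Module([Module(), Module([tracker])]))
before = tracker.reset_requested
stepper.set_epoch(0)
after = tracker.reset_requested
```
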
When enabled, the per-channel spatial mean of the post-encoder latent
is tracked during training and, in eval, the latent is shifted so that
mean falls within the observed envelope (no-op when the mean is
already inside it). This bounds the global mean of the latent that the
transformer blocks see at inference to the range observed in training.

The envelope is reset at the start of each training epoch (lazily, on
the next training-mode forward) via
request_latent_global_mean_envelope_reset, which the stepper invokes
through the TrainStepperABC.set_epoch hook.

Exposed as a single clip_latent_global_means: bool option on
SFNONetConfig and NoiseConditionedSFNOBuilder; defaults to False so
existing models are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
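
A small sketch of the envelope logic this commit message describes, written with plain Python lists so it runs without dependencies. It is not the SFNO implementation: the LatentMeanEnvelope class is hypothetical, the latent is a list of channels, and the spatial mean is an unweighted average, which the real model may well replace with an area-weighted one.

```python
from typing import List, Optional

Latent = List[List[float]]  # [channel][spatial point]


class LatentMeanEnvelope:
    def __init__(self, n_channels: int):
        self.n_channels = n_channels
        self.lo: List[Optional[float]] = [None] * n_channels
        self.hi: List[Optional[float]] = [None] * n_channels
        self.training = True
        self._reset_requested = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        # Lazy: the reset happens on the next training-mode forward.
        self._reset_requested = True

    def __call__(self, latent: Latent) -> Latent:
        means = [sum(ch) / len(ch) for ch in latent]
        if self.training:
            if self._reset_requested:
                self.lo = [None] * self.n_channels
                self.hi = [None] * self.n_channels
                self._reset_requested = False
            for i, m in enumerate(means):
                self.lo[i] = m if self.lo[i] is None else min(self.lo[i], m)
                self.hi[i] = m if self.hi[i] is None else max(self.hi[i], m)
            return latent
        out: Latent = []
        for ch, m, lo, hi in zip(latent, means, self.lo, self.hi):
            if lo is None or hi is None:
                out.append(list(ch))  # nothing observed yet: leave as-is
                continue
            shift = min(max(m, lo), hi) - m  # zero when the mean is inside
            out.append([v + shift for v in ch])
        return out


env = LatentMeanEnvelope(n_channels=1)
env(([[1.0, 3.0]]))   # training, channel mean 2.0
env(([[3.0, 5.0]]))   # training, channel mean 4.0

env.training = False
shifted = env([[9.0, 11.0]])    # mean 10.0 is above the envelope [2.0, 4.0]
untouched = env([[2.0, 4.0]])   # mean 3.0 is already inside

env.request_latent_global_mean_envelope_reset()
still_old = env([[9.0, 11.0]])  # eval forward does not trigger the reset
env.training = True
env([[0.0, 0.0]])               # first training forward resets, then observes 0.0
env.training = False
after_reset = env([[1.0, 3.0]])
```

Shifting the whole channel by a constant moves its mean onto the nearest envelope bound while leaving the spatial pattern around that mean unchanged.
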
@elynnwu elynnwu changed the title from "disable corrector in the beginning of training" to "Disable corrector for the first N epochs" on Jun 10, 2026
elynnwu commented Jun 11, 2026

These two jobs are identical except that the new job disables the corrector for the first epoch: wandb. The new job does not show the zero frozen precipitation issue. Note that I was able to reproduce the zero frozen precip behavior in 3 experiments, all with n384 (one with equal loss, two with ACE2 loss weighting on Beaker and Perlmutter).

Base automatically changed from feature/clip-latent-global-means to main June 11, 2026 19:18
Comment thread fme/core/step/single_module.py Outdated
Comment thread fme/core/step/single_module.py Outdated

@mcgibbon mcgibbon left a comment

LGTM, but consider doing the corrector refactor sooner rather than later. I can also act on it if you like; a Claude agent pointed at your issue and this PR should be able to handle it.

elynnwu and others added 7 commits June 15, 2026 14:34
These changes move `corrector_disabled_epochs` out of `SingleModuleStep`
and into the shared corrector configuration layer. Correctors are now
wrapped by `EpochScheduledCorrector`, which can disable any corrector
during the first N training epochs while still applying it during
eval/validation/inference.
This makes the behavior available to atmosphere, ocean, ice, and future
correctors without copying logic into each Step. Step train/eval, epoch,
and checkpoint state now forward through the corrector lifecycle so
mid-epoch resume preserves the scheduler state.

Resolves #1261
…i2cm/ace into feature/disable-corrector-first-epochs
@elynnwu elynnwu enabled auto-merge (squash) June 16, 2026 21:06
@elynnwu elynnwu merged commit 5232d60 into main Jun 17, 2026
7 checks passed
@elynnwu elynnwu deleted the feature/disable-corrector-first-epochs branch June 17, 2026 21:53