Disable corrector for the first N epochs #1260
Merged
Adds a TrainStepperABC.set_epoch(epoch) hook with a no-op default and wires it from the trainer at fresh-epoch boundaries (mid-epoch resume preserves in-module state so partial-epoch accumulators continue from where they left off). Stepper and CoupledStepper implement set_epoch by walking submodules and invoking request_latent_global_mean_envelope_reset where present, giving model components a way to reset per-epoch in-module statistics without coupling the stepper to model internals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
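A minimal sketch of the hook pattern described in this commit, under stated assumptions: the plain-Python `Module` tree stands in for a `torch.nn.Module` hierarchy, and `EnvelopeTracker` and `begin_epoch` are hypothetical names used only for illustration. Only `TrainStepperABC.set_epoch`, `Stepper`, and `request_latent_global_mean_envelope_reset` come from the commit message.

```python
from abc import ABC


class TrainStepperABC(ABC):
    """Stand-in for the stepper base class; set_epoch defaults to a no-op."""

    def set_epoch(self, epoch: int) -> None:
        return None


class Module:
    """Plain-Python stand-in for a module tree (assumption, not torch)."""

    def __init__(self, *children: "Module") -> None:
        self.children = list(children)

    def modules(self):
        # Yield this module and every descendant, depth first.
        yield self
        for child in self.children:
            yield from child.modules()


class EnvelopeTracker(Module):
    """Hypothetical component that opts in by defining the reset method."""

    def __init__(self) -> None:
        super().__init__()
        self.reset_requested = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        self.reset_requested = True


class Stepper(TrainStepperABC):
    def __init__(self, module: Module) -> None:
        self.module = module
        self.epoch = None

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch
        # Duck-typed walk: the stepper never imports model internals.
        for sub in self.module.modules():
            reset = getattr(sub, "request_latent_global_mean_envelope_reset", None)
            if callable(reset):
                reset()


def begin_epoch(stepper: TrainStepperABC, epoch: int, resumed_mid_epoch: bool) -> None:
    # Hypothetical trainer wiring: only fresh epochs trigger the hook,
    # so a mid-epoch resume keeps partial-epoch accumulators intact.
    if not resumed_mid_epoch:
        stepper.set_epoch(epoch)


tracker = EnvelopeTracker()
stepper = Stepper(Module(Module(), Module(tracker)))
begin_epoch(stepper, epoch=3, resumed_mid_epoch=True)
skipped_on_resume = not tracker.reset_requested
begin_epoch(stepper, epoch=4, resumed_mid_epoch=False)
```

The `getattr` lookup is what keeps the coupling loose in this sketch: components that do not define the method are simply skipped.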
When enabled, the per-channel spatial mean of the post-encoder latent is tracked during training and, in eval, the latent is shifted so that mean falls within the observed envelope (no-op when the mean is already inside it). Bounds the global-mean of the latent the transformer blocks see at inference to the range observed in training. The envelope is reset at the start of each training epoch (lazily, on the next training-mode forward) via request_latent_global_mean_envelope_reset, which the stepper invokes through the TrainStepperABC.set_epoch hook. Exposed as a single clip_latent_global_means: bool option on SFNONetConfig and NoiseConditionedSFNOBuilder; defaults to False so existing models are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
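The envelope behavior described in this commit can be illustrated roughly as follows. This is a pure-Python sketch, not the actual module: `LatentMeanEnvelope` is a hypothetical name, the latent is modeled as a list of channels with flat lists of spatial values rather than a tensor, and leaving the latent unchanged when no training statistics exist yet is an assumption.

```python
class LatentMeanEnvelope:
    """Tracks the per-channel min/max of the latent's spatial mean (sketch)."""

    def __init__(self, n_channels: int) -> None:
        self.n_channels = n_channels
        self.training = True
        self._reset_pending = False
        self._clear()

    def _clear(self) -> None:
        self.low = [float("inf")] * self.n_channels
        self.high = [float("-inf")] * self.n_channels

    def request_latent_global_mean_envelope_reset(self) -> None:
        # Lazy reset: takes effect on the next training-mode forward.
        self._reset_pending = True

    def forward(self, latent):
        means = [sum(ch) / len(ch) for ch in latent]
        if self.training:
            if self._reset_pending:
                self._clear()
                self._reset_pending = False
            # Widen the envelope to include this batch's means.
            self.low = [min(lo, m) for lo, m in zip(self.low, means)]
            self.high = [max(hi, m) for hi, m in zip(self.high, means)]
            return latent
        shifted = []
        for ch, m, lo, hi in zip(latent, means, self.low, self.high):
            if lo > hi:
                # Assumption: with an empty envelope, leave the channel as-is.
                shifted.append(list(ch))
                continue
            # Offset is zero when the mean already lies inside [lo, hi].
            offset = min(max(m, lo), hi) - m
            shifted.append([v + offset for v in ch])
        return shifted


env = LatentMeanEnvelope(n_channels=1)
env.forward([[0.0, 2.0]])   # training: channel mean 1.0
env.forward([[2.0, 4.0]])   # training: channel mean 3.0
env.training = False
inside = env.forward([[1.0, 3.0]])    # mean 2.0 is inside the envelope
outside = env.forward([[4.0, 6.0]])   # mean 5.0 is above the envelope
```

Only the mean is shifted; the spatial pattern within each channel is preserved because the same offset is added to every value.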
These two jobs are identical except that the new job disables the corrector for the first epoch: wandb. The new job does not have the same zero frozen precipitation issue. Note that I was able to reproduce the zero frozen precip behavior in 3 experiments, all with n384 (one with equal loss weighting, two with ACE2 loss weighting, on Beaker and Perlmutter).
mcgibbon
reviewed
Jun 11, 2026
mcgibbon
reviewed
Jun 11, 2026
mcgibbon
approved these changes
Jun 12, 2026
mcgibbon
left a comment
Contributor
LGTM, but consider doing the corrector refactor sooner rather than later. I can also act on it if you like; a Claude agent pointed at your issue and this PR should be able to handle it.
These changes move `corrector_disabled_epochs` out of `SingleModuleStep` and into the shared corrector configuration layer. Correctors are now wrapped by `EpochScheduledCorrector`, which can disable any corrector during the first N training epochs while still applying it during eval/validation/inference. This makes the behavior available to atmosphere, ocean, ice, and future correctors without copying logic into each Step. Step train/eval, epoch, and checkpoint state now forward through the corrector lifecycle so mid-epoch resume preserves the scheduler state. Resolves #1261
…i2cm/ace into feature/disable-corrector-first-epochs
Adds a `corrector_disabled_epochs` option to corrector configs that skips the corrector for train-mode steps during the first N training epochs. It is always applied in eval mode (validation and inference), so only training is affected.

Some runs get frozen precipitation stuck at exactly zero (e.g. job): the force-positive clamp zeroes the sparse frozen precip channel whenever an early update makes the raw prediction negative, and gradients through the clamp are zero there, so it never recovers. Disabling the corrector for the first epoch lets the loss see the raw prediction so gradients can pull the channel positive before the clamp returns.
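To make the stuck-at-zero mechanism concrete, here is a toy scalar sketch, not the actual training loop. A single weight `w` plays the role of the raw prediction, the clamp is `max(w, 0)`, and the loss is a squared error against a positive target. The function name `train`, the step counts, and the learning rate are all illustrative assumptions, with steps standing in for epochs.

```python
def train(steps: int, disabled_steps: int, w: float = -0.5,
          target: float = 1.0, lr: float = 0.1) -> float:
    """Gradient descent on (output - target)**2 for a scalar prediction w.

    The force-positive clamp is skipped for the first `disabled_steps` steps.
    """
    for step in range(steps):
        clamp_active = step >= disabled_steps
        output = max(w, 0.0) if clamp_active else w
        # The clamp's derivative is 0 wherever it replaces a negative value.
        d_output_dw = 0.0 if (clamp_active and w < 0.0) else 1.0
        grad = 2.0 * (output - target) * d_output_dw
        w -= lr * grad
    return w


stuck = train(steps=50, disabled_steps=0)       # clamp on from the start
recovered = train(steps=50, disabled_steps=10)  # clamp off for 10 steps
```

With the clamp active from the first step, the gradient is zero and `w` never leaves its negative starting value. With the clamp off for the first steps, `w` crosses zero, after which the clamp passes gradients through and training continues toward the target.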
`corrector_disabled_epochs` is a field on `CorrectorConfigABC`; its `@final` `get_corrector` wraps the built corrector in a new `EpochScheduledCorrector` (which skips an arbitrary wrapped corrector in train mode during the disabled epochs) when the value is > 0. Concrete configs just implement `_get_corrector` and are unaware of scheduling. The flag is valid only on the corrector config (`CorrectorSelector` rejects it and delegates), so there's one place to set it and no double-wrapping. The disabled flag is persisted in checkpoint state so mid-epoch resume keeps the interrupted epoch's behavior. Defaults to 0, so existing configs and checkpoints are unaffected.

Config
Changes
- `CorrectorConfigABC`: `corrector_disabled_epochs` field + `@final` `get_corrector` wrapping via new abstract `_get_corrector`; `CorrectorABC` gains default no-op `train`/`set_epoch`/`get_state`/`load_state`; new `EpochScheduledCorrector` wrapper.
- `CorrectorSelector` rejects `corrector_disabled_epochs` (must be set on the wrapped config).
- Concrete corrector configs implement `_get_corrector`; step layer + `Stepper` forward `set_epoch`/`train`/`get_state`/`load_state` to the corrector.
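The wrapping described in the changes above can be sketched roughly as follows. This is a simplified illustration, not the merged code: correctors act on a single float, `ForcePositive` and the module-level `get_corrector` helper are hypothetical, the state dictionary keys are invented, and epochs are assumed to count from 1.

```python
class CorrectorABC:
    """Sketch of the base class: lifecycle hooks default to no-ops."""

    def __call__(self, value: float) -> float:
        raise NotImplementedError

    def train(self, mode: bool = True) -> None:
        pass

    def set_epoch(self, epoch: int) -> None:
        pass

    def get_state(self) -> dict:
        return {}

    def load_state(self, state: dict) -> None:
        pass


class ForcePositive(CorrectorABC):
    """Hypothetical concrete corrector: clamps negative values to zero."""

    def __call__(self, value: float) -> float:
        return max(value, 0.0)


class EpochScheduledCorrector(CorrectorABC):
    """Skips the wrapped corrector in train mode during disabled epochs."""

    def __init__(self, corrector: CorrectorABC, disabled_epochs: int) -> None:
        self._corrector = corrector
        self._disabled_epochs = disabled_epochs
        self._training = False
        self._disabled = False

    def train(self, mode: bool = True) -> None:
        self._training = mode
        self._corrector.train(mode)

    def set_epoch(self, epoch: int) -> None:
        # Assumption: the first training epoch is numbered 1.
        self._disabled = epoch <= self._disabled_epochs
        self._corrector.set_epoch(epoch)

    def __call__(self, value: float) -> float:
        if self._training and self._disabled:
            return value  # raw prediction reaches the loss
        return self._corrector(value)

    def get_state(self) -> dict:
        return {"disabled": self._disabled, "inner": self._corrector.get_state()}

    def load_state(self, state: dict) -> None:
        self._disabled = state["disabled"]
        self._corrector.load_state(state["inner"])


def get_corrector(corrector_disabled_epochs: int = 0) -> CorrectorABC:
    built = ForcePositive()
    if corrector_disabled_epochs > 0:
        return EpochScheduledCorrector(built, corrector_disabled_epochs)
    return built  # default 0: no wrapper, behavior unchanged


corrector = get_corrector(corrector_disabled_epochs=1)
corrector.set_epoch(1)
corrector.train(True)
train_out = corrector(-0.3)        # skipped: train mode, epoch 1
saved = corrector.get_state()      # checkpoint taken mid-epoch
corrector.train(False)
eval_out = corrector(-0.3)         # applied: eval mode
corrector.set_epoch(2)
corrector.train(True)
later_out = corrector(-0.3)        # applied: past the disabled epochs

resumed = get_corrector(corrector_disabled_epochs=1)
resumed.load_state(saved)
resumed.train(True)
resumed_out = resumed(-0.3)        # still skipped after mid-epoch resume
```

Persisting the flag rather than recomputing it on load is what lets a mid-epoch resume behave like the interrupted epoch even if `set_epoch` is not called again until the next fresh epoch.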