Description & Motivation
After training a model on some number of GPUs (say, 8) for a while, it is difficult to load the checkpoint onto 16 GPUs with the optimizer and model states unchanged. DeepSpeed has developed a universal checkpointing strategy to solve this problem, but pytorch-lightning does not appear to have this feature.
Pitch
I would like pytorch-lightning to support this feature.
Alternatives
Try adding `universal_checkpoint` as a parameter of `DeepSpeedStrategy` and modifying the class, referring to https://www.deepspeed.ai/tutorials/universal-checkpointing/
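As a rough illustration of the alternative above, the new parameter could simply inject the corresponding flag into the DeepSpeed config that the strategy already builds. The helper name `enable_universal_checkpoint` is hypothetical, and the `"checkpoint": {"load_universal": ...}` config key is my reading of the DeepSpeed docs; a plain dict stands in for the strategy's config here:

```python
# Hypothetical sketch: how a `universal_checkpoint=True` parameter on
# DeepSpeedStrategy might translate into the DeepSpeed config.
# `enable_universal_checkpoint` is an assumed helper, not an existing API,
# and the "checkpoint"/"load_universal" key is based on the DeepSpeed
# universal-checkpointing tutorial.
def enable_universal_checkpoint(ds_config: dict) -> dict:
    """Return a copy of the DeepSpeed config with universal-checkpoint
    loading turned on, leaving all other entries untouched."""
    config = dict(ds_config)  # shallow copy so the caller's dict is unchanged
    config.setdefault("checkpoint", {})["load_universal"] = True
    return config
```

Inside a subclass of `DeepSpeedStrategy`, this transformation would be applied to `self.config` before the engine is initialized, so that reloading a universal checkpoint produced by DeepSpeed's `ds_to_universal.py` works when the GPU count changes (e.g. 8 to 16).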
Additional context
No response
cc @Borda @awaelchli