Is your feature request related to a problem? Please describe.
In large-scale video generation distillation training, we typically need to load three models simultaneously: teacher, student, and a fake net (DMD). On memory-constrained GPUs (e.g., H20), it’s often not possible to fit all three models and their optimizer states/activations in GPU memory, which blocks training or forces significant compromises (smaller batch/sequence, more checkpointing, etc.).
Tagging @mcore-oncall.
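To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of model-state memory only (activations excluded). It assumes a frozen bf16 teacher, and a student and fake net trained with Adam in mixed precision at roughly 16 bytes per parameter. The 14B parameter count is a hypothetical placeholder, not a measurement from our setup, and the function names are illustrative.

```python
# Rough per-GPU estimate of model-state memory (weights, grads, optimizer state).
# Activations are NOT included. All sizes are hypothetical, for illustration only.

BYTES_FROZEN = 2    # bf16 weights only (assumed: teacher is inference-only)
BYTES_TRAINED = 16  # bf16 weights + bf16 grads + fp32 master weights + Adam m and v


def state_gib(params_billion: float, trainable: bool, shard_degree: int = 1) -> float:
    """Model-state memory per GPU in GiB, sharded evenly over `shard_degree` GPUs."""
    bytes_per_param = BYTES_TRAINED if trainable else BYTES_FROZEN
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / shard_degree / 2**30


def distillation_state_gib(params_billion: float, shard_degree: int = 1) -> float:
    """Teacher (frozen) + student (trained) + fake net (trained), equal sizes assumed."""
    return (
        state_gib(params_billion, trainable=False, shard_degree=shard_degree)
        + state_gib(params_billion, trainable=True, shard_degree=shard_degree)
        + state_gib(params_billion, trainable=True, shard_degree=shard_degree)
    )


if __name__ == "__main__":
    # 14.0 is a hypothetical model size in billions of parameters.
    for n in (1, 8, 32):
        print(f"shard_degree={n:>2}: {distillation_state_gib(14.0, n):8.1f} GiB per GPU")
```

Sharding shrinks this term linearly, but it does nothing for activation memory at long video sequence lengths, which is where CP comes in.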
Describe the solution you'd like
Support for (or an officially recommended integration path for) combining FSDP2 (PyTorch fully_shard) with Context Parallel (CP) in video generation training, along with the distributed communication operators this requires, which are already common in LLM training (e.g., all-to-all (A2A) and related collectives).
Ideally this would enable training workflows where the teacher/student/fake models can be sharded and trained efficiently under the same distributed setup.
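As a sketch of what "the same distributed setup" could mean, the snippet below lays out ranks on a 2D (dp_shard, cp) grid and derives the two families of process groups. The row-major layout with CP as the innermost dimension is an assumption on my side, and `build_groups` is a hypothetical helper, not an existing Megatron or PyTorch API. In PyTorch this would presumably map onto a 2D device mesh created with `init_device_mesh` and passed to `fully_shard` via its `mesh` argument.

```python
from typing import List, Tuple


def build_groups(world_size: int, cp_size: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (cp_groups, shard_groups) for a row-major (dp_shard, cp) rank grid.

    Assumption: CP is the innermost (fastest-varying) dimension, so the ranks of
    one CP group are consecutive.
    """
    if world_size % cp_size != 0:
        raise ValueError("world_size must be divisible by cp_size")
    dp_size = world_size // cp_size
    # One CP group per row: these ranks split a single video sequence.
    cp_groups = [[d * cp_size + c for c in range(cp_size)] for d in range(dp_size)]
    # One shard group per column: these ranks share the same CP index.
    shard_groups = [[d * cp_size + c for d in range(dp_size)] for c in range(cp_size)]
    return cp_groups, shard_groups


if __name__ == "__main__":
    cp_groups, shard_groups = build_groups(world_size=8, cp_size=2)
    print("CP groups:   ", cp_groups)
    print("Shard groups:", shard_groups)
```

Since ranks within a CP group hold identical weights, another option is to shard parameters over all ranks rather than only over the columns. Which layout is recommended for the teacher/student/fake setup is part of the guidance being requested here.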
Describe alternatives you've considered
FSDP-only: While FSDP2 (fully_shard) effectively reduces parameter and optimizer memory, it is insufficient on its own for large-scale video distillation workloads. Without context parallelism, sequence lengths and batch sizes must be reduced to impractical levels, or training becomes unstable/slow.
Custom CP + FSDP integration: In practice, context parallelism is required in addition to FSDP, but CP is not part of a unified, officially supported architecture with FSDP. Users must manually implement CP process groups and distributed collectives (e.g., A2A) on top of sharding. This approach is brittle, hard to maintain, and difficult to keep aligned with upstream changes in PyTorch, Megatron, and MCore as CP semantics and distributed operators evolve.
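For reference, this is the kind of A2A layout change users currently hand-roll. The sketch simulates it in plain Python with no real communication, assuming a Ulysses-style exchange in which each CP rank starts with a slice of the sequence for all attention heads and ends with the full sequence for a slice of heads. All names are illustrative. A real implementation would operate on tensors, for example with `torch.distributed.all_to_all_single` over the CP process group.

```python
from typing import List, Tuple

Token = Tuple[int, int]  # (global sequence position, head index)


def make_local(world: int, seq_len: int, num_heads: int) -> List[List[List[Token]]]:
    """local[rank][s][h]: each rank owns a contiguous sequence chunk with all heads."""
    chunk = seq_len // world
    return [
        [[(s, h) for h in range(num_heads)] for s in range(r * chunk, (r + 1) * chunk)]
        for r in range(world)
    ]


def all_to_all(send: List[List[list]]) -> List[List[list]]:
    """Simulated all-to-all: recv[dst][src] is what rank src sent to rank dst."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


def seq_to_head_shard(local: List[List[List[Token]]], num_heads: int) -> List[List[List[Token]]]:
    """Sequence-sharded, all heads -> full sequence, head-sharded."""
    world = len(local)
    assert num_heads % world == 0, "heads must divide evenly across CP ranks"
    hp = num_heads // world
    # Each rank cuts its heads into `world` blocks, one destined for each peer.
    send = [
        [[row[dst * hp:(dst + 1) * hp] for row in local[src]] for dst in range(world)]
        for src in range(world)
    ]
    recv = all_to_all(send)
    # Concatenate received sequence chunks in source-rank order to restore token order.
    return [[row for chunk in recv[dst] for row in chunk] for dst in range(world)]


if __name__ == "__main__":
    out = seq_to_head_shard(make_local(world=2, seq_len=4, num_heads=4), num_heads=4)
    for rank, rows in enumerate(out):
        print(f"rank {rank}: {rows}")
```

Keeping this exchange, its inverse after attention, and the process groups consistent with fully_shard is the part that is hard to maintain across upstream changes.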
Additional context
In the RCM project, FSDP is implemented via PyTorch fully_shard (FSDP2-style). For context parallelism, Megatron provides CP process groups, while A2A and other ops are implemented via torch.distributed. I’d like to know if there is a roadmap/schedule to bring these pieces together in an officially supported way for video generation training (especially distillation).