Add forward all-gather overlap to experimental FSDP #5513

Open
wujingyue wants to merge 1 commit into NVIDIA:main from wujingyue:codex/fsdp-allgather-overlap

@wujingyue wujingyue commented Jun 26, 2026

Summary

  • Split the all-gather overlap portion out of #5124 (Add FSDP stream context delayed release) and apply it on top of current main.
  • Add an FsdpContext that lazily owns a CUDA all-gather stream for one FSDP subtree.
  • Run parameter unshard/all-gather on that context stream while compute remains on the default stream.
  • Delay release of unsharded parameter storage so child-unit forward all-gathers can overlap default-stream GEMMs under a shared root context.
  • Add context-sharing and memory-bound tests for nested FSDP units, plus a profiler test that asserts the forward child pipeline gives num_children - 1 all-gather/compute overlaps.
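
To make the lazy ownership and context sharing described above concrete, here is a minimal sketch. It is not the PR's `FsdpContext`: the names `ContextSketch`, `UnitSketch`, `FakeStream`, and `adopt` are hypothetical, and a plain Python object stands in for the CUDA stream so the example runs without a GPU.

```python
from typing import Callable, Optional


class FakeStream:
    """Stand-in for a CUDA stream (assumption: lets the sketch run on CPU)."""


class ContextSketch:
    """Hypothetical context that creates its all-gather stream on first use."""

    def __init__(self, stream_factory: Callable[[], object] = FakeStream):
        self._stream_factory = stream_factory
        self._stream: Optional[object] = None

    @property
    def all_gather_stream(self) -> object:
        # Lazy: no stream exists until something asks for it.
        if self._stream is None:
            self._stream = self._stream_factory()
        return self._stream


class UnitSketch:
    """Hypothetical FSDP unit; nested units adopt the context of their root."""

    def __init__(self, children=(), context: Optional[ContextSketch] = None):
        self.context = context or ContextSketch()
        self.children = list(children)
        for child in self.children:
            child.adopt(self.context)

    def adopt(self, context: ContextSketch) -> None:
        # Propagate the shared context down the whole subtree.
        self.context = context
        for child in self.children:
            child.adopt(context)


leaves = [UnitSketch() for _ in range(3)]
root = UnitSketch(children=leaves)

# Every unit in the subtree resolves to the same stream object.
streams = {id(unit.context.all_gather_stream) for unit in [root, *leaves]}
print(len(streams))  # 1
```

With PyTorch and a GPU available, passing `torch.cuda.Stream` as `stream_factory` would be the natural substitution. Whether the PR structures its context this way is not stated in the description.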

Scope

  • This PR intentionally covers forward all-gather overlap only.
  • Backward all-gathers are present in the profiled iteration, but they do not overlap compute in this PR because gradient reduction is still launched synchronously in post_backward before autograd reaches the next module’s pre_backward all-gather.
  • The delayed gradient-reduction follow-up PR will address backward overlap.
  • Reduce-scatter overlap is not included here.
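
The difference between the forward and backward cases can be illustrated with a toy two-stream timeline. This is a sketch under stated assumptions, not a model of the PR's code: launches cost no host time, the all-gather stream may run arbitrarily far ahead (the memory bound is ignored), and a synchronous gradient reduction is modelled as the host blocking until the unit's compute has finished. The function names and durations are made up for illustration.

```python
def simulate(num_children, sync_after_compute, ag_time=1.0, compute_time=2.0):
    """Return (all_gather_intervals, compute_intervals) for a toy timeline."""
    host = 0.0          # earliest time the host can issue the next operation
    ag_free = 0.0       # time at which the all-gather stream becomes idle
    compute_free = 0.0  # time at which the default stream becomes idle
    gathers, computes = [], []
    for _ in range(num_children):
        # The all-gather is launched asynchronously on the side stream.
        ag_start = max(host, ag_free)
        ag_end = ag_start + ag_time
        ag_free = ag_end
        gathers.append((ag_start, ag_end))

        # Compute runs on the default stream and waits for its own parameters.
        c_start = max(host, compute_free, ag_end)
        c_end = c_start + compute_time
        compute_free = c_end
        computes.append((c_start, c_end))

        if sync_after_compute:
            # Assumption: a blocking reduction stalls the host until this
            # unit is done (reduction time is folded into compute_time).
            host = c_end
    return gathers, computes


def count_overlaps(gathers, computes):
    """Count all-gathers that run concurrently with at least one compute."""
    return sum(
        1
        for g_start, g_end in gathers
        if any(g_start < c_end and c_start < g_end for c_start, c_end in computes)
    )


forward = count_overlaps(*simulate(4, sync_after_compute=False))
backward = count_overlaps(*simulate(4, sync_after_compute=True))
print(forward)   # 3, i.e. num_children - 1
print(backward)  # 0
```

In this model the first all-gather never overlaps because no compute has been issued yet, which is where the `num_children - 1` count comes from. Once the host is forced to wait after each unit, every later all-gather starts only after the previous compute has ended, so the count drops to zero.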

Testing

  • python -m torch.distributed.run --nproc-per-node 2 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_context.py tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py --tb=short --disable-warnings -rN
  • git diff --check

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wujingyue wujingyue force-pushed the codex/fsdp-allgather-overlap branch 29 times, most recently from 0ca3392 to 89490e3 on June 26, 2026 22:52
@wujingyue wujingyue force-pushed the codex/fsdp-allgather-overlap branch 8 times, most recently from 15d1fd8 to 418af45 on June 27, 2026 00:57
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
@wujingyue wujingyue changed the title from "Add FSDP all-gather stream overlap" to "Add forward all-gather overlap to experimental FSDP" June 27, 2026
@wujingyue wujingyue marked this pull request as ready for review June 27, 2026 01:00
@wujingyue wujingyue requested review from a team as code owners June 27, 2026 01:00
@wujingyue wujingyue force-pushed the codex/fsdp-allgather-overlap branch from 418af45 to 2231d90 on June 27, 2026 01:01
@wujingyue wujingyue marked this pull request as draft June 27, 2026 01:01
copy-pr-bot Bot commented Jun 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wujingyue wujingyue marked this pull request as ready for review June 27, 2026 01:01
@wujingyue

/ok to test
