perf(fsdp2): reduce per-step host/comm overhead #181
Merged
send_to_device defaulted to non_blocking=False, which made the H2D transfer a synchronous step even when the dataloader produced pinned tensors (the repo's default). With non_blocking=True the copy is submitted to the CUDA stream and overlaps the first kernels of the following training_step, eliminating the dedicated h2d wait. Adds a one-time warning at trainer init when dataloader_pin_memory is False, since the flag becomes a no-op on pageable memory.
…ther calculate_training_metrics previously issued five independent all_reduce calls (mfu/sum/avg/min/max) on tiny scalar tensors, each paying full collective latency and an extra GPU->CPU sync via .item(). Replace them with one all_gather_into_tensor over a 2-element per-rank tensor [mfu_local, seq_len_sum_local], reduce locally (mean/sum/min/max), and do a single batched .tolist() to pull all scalars at once. Also drops the redundant torch.tensor(flops, device=...) wrapper since the callee now accepts a Python float directly, removing one host->device roundtrip per step.
Summary
Two small follow-ups discussed in #179. Each is one commit, independently revertable.
1. `non_blocking=True` for host-to-device
`send_to_device` defaulted to `non_blocking=False`, so the H2D transfer was synchronous on every step even though the repo already defaults `dataloader_pin_memory=True`. Flipping the flag lets the copy overlap the first kernels of the following `training_step`.
Adds a one-time warning at trainer init when `dataloader_pin_memory=False`, since the flag becomes a no-op on pageable memory (mirrors the existing check in `train/hf/trainer.py`).
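For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the pattern. It is not the repo's code: this `send_to_device` is a simplified stand-in that recurses over dicts, lists, and tuples, and `warn_if_unpinned` is a hypothetical helper name for the init-time check.

```python
import warnings

import torch


def warn_if_unpinned(dataloader_pin_memory: bool) -> bool:
    """Hypothetical init-time check; returns True if a warning was emitted."""
    if not dataloader_pin_memory:
        warnings.warn(
            "dataloader_pin_memory=False: non_blocking=True has no effect on "
            "pageable memory, so H2D copies will not overlap with compute.",
            stacklevel=2,
        )
        return True
    return False


def send_to_device(batch, device, non_blocking: bool = True):
    """Simplified stand-in: move every tensor in a nested batch to `device`."""
    if isinstance(batch, torch.Tensor):
        # With a pinned source tensor and a CUDA target, this call returns
        # before the copy finishes; later kernels on the same stream are
        # ordered after it.
        return batch.to(device, non_blocking=non_blocking)
    if isinstance(batch, dict):
        return {k: send_to_device(v, device, non_blocking) for k, v in batch.items()}
    if isinstance(batch, list):
        return [send_to_device(v, device, non_blocking) for v in batch]
    if isinstance(batch, tuple):
        return tuple(send_to_device(v, device, non_blocking) for v in batch)
    return batch  # non-tensor leaves (ints, strings, ...) pass through unchanged
```

On a CPU-only machine the function still works, since `non_blocking` is simply ignored when there is nothing to overlap with.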
2. Five `all_reduce` collectives → one `all_gather`
`calculate_training_metrics` issued five separate `all_reduce` calls (`AVG` mfu, `SUM` total seq len, `AVG`/`MIN`/`MAX` seq len stats) on scalar tensors. Each one paid full collective latency and forced a `.item()` GPU→CPU sync.
Replaced with a single `all_gather_into_tensor` over a 2-element per-rank tensor `[mfu_local, seq_len_sum_local]`, then mean/sum/min/max are computed locally on the gathered tensor with one batched `.tolist()` to pull all five scalars at once.
Also dropped the `torch.tensor(flops, device=...)` wrapper at the caller — `calculate_training_metrics` now accepts the Python float from `estimate_flops` directly, removing one host→device roundtrip.
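The consolidation can be sketched roughly as follows. This is an illustration, not the merged code: the function names, the returned dict keys, and the assumption that the seq-len AVG/MIN/MAX are taken over per-rank sums are all mine. The single-process fallback exists only so the sketch runs without a process group.

```python
import torch
import torch.distributed as dist


def reduce_gathered(gathered: torch.Tensor) -> dict:
    """Reduce a (world_size, 2) tensor of [mfu_local, seq_len_sum_local] rows."""
    mfu, seq = gathered[:, 0], gathered[:, 1]
    stats = torch.stack([mfu.mean(), seq.sum(), seq.mean(), seq.min(), seq.max()])
    # One batched .tolist() is the only device-to-host sync in this function.
    mfu_avg, seq_total, seq_avg, seq_min, seq_max = stats.tolist()
    return {
        "mfu_avg": mfu_avg,
        "seq_len_total": seq_total,
        "seq_len_avg": seq_avg,
        "seq_len_min": seq_min,
        "seq_len_max": seq_max,
    }


def gather_training_metrics(mfu_local: float, seq_len_sum_local: float, device) -> dict:
    """Hypothetical wrapper: one collective instead of five all_reduce calls."""
    local = torch.tensor(
        [mfu_local, seq_len_sum_local], dtype=torch.float32, device=device
    )
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
        # Output is the concatenation of every rank's 2-element tensor.
        flat = torch.empty(world_size * 2, dtype=local.dtype, device=device)
        dist.all_gather_into_tensor(flat, local)
        gathered = flat.view(world_size, 2)
    else:
        # Single-process fallback so the sketch runs without a process group.
        gathered = local.unsqueeze(0)
    return reduce_gathered(gathered)
```

Because every rank receives the full gathered tensor, each rank computes identical statistics locally, which is what lets the five collectives collapse into one.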
Notes
Per-step wins here are small in absolute terms (`training_metrics` was already only a few ms), and neither change alters training behavior. Output metrics match the previous values up to float32 reduction order, which already varied across ranks.
Test
Verified via `cicd/run_traincicd.sh --model-name qwen3_vl --gpu-count 4` locally; loss curve and reported metric values match main.
Out of scope
Discussed but deferred: