Skip to content

perf(fsdp2): reduce per-step host/comm overhead#181

Merged
kcz358 merged 2 commits into
mainfrom
perf/metrics-and-h2d
May 28, 2026
Merged

perf(fsdp2): reduce per-step host/comm overhead#181
kcz358 merged 2 commits into
mainfrom
perf/metrics-and-h2d

Conversation

@kcz358

@kcz358 kcz358 commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

Two small follow-ups discussed in #179. Each is one commit, independently revertable.

1. non_blocking=True for host-to-device

`send_to_device` defaulted to `non_blocking=False`, so the H2D transfer was synchronous on every step even though the repo already defaults `dataloader_pin_memory=True`. Flipping the flag lets the copy overlap the first kernels of the following `training_step`.

Adds a one-time warning at trainer init when `dataloader_pin_memory=False`, since the flag becomes a no-op on pageable memory (mirrors the existing check in `train/hf/trainer.py`).

2. Five `all_reduce` collectives → one `all_gather`

`calculate_training_metrics` issued five separate `all_reduce` calls (`AVG` mfu, `SUM` total seq len, `AVG`/`MIN`/`MAX` seq len stats) on scalar tensors. Each one paid full collective latency and forced a `.item()` GPU→CPU sync.

Replaced with a single `all_gather_into_tensor` over a 2-element per-rank tensor `[mfu_local, seq_len_sum_local]`, then mean/sum/min/max are computed locally on the gathered tensor with one batched `.tolist()` to pull all five scalars at once.

Also dropped the `torch.tensor(flops, device=...)` wrapper at the caller — `calculate_training_metrics` now accepts the Python float from `estimate_flops` directly, removing one host→device roundtrip.

Notes

Per-step wins here are small in absolute terms — `training_metrics` was already only a few ms — but the changes are pure cleanup with no behavior change. Output metrics are bit-equivalent to before (modulo float32 reduction order, which already varied across ranks).

Test

Verified via `cicd/run_traincicd.sh --model-name qwen3_vl --gpu-count 4` locally; loss curve and reported metric values match main.

Out of scope

Discussed but deferred:

  • Skipping the metrics computation entirely on non-`logging_steps` steps (largest potential win, but changes user-visible logging behavior).
  • Removing the `seq_len.detach().cpu().tolist()` sync inside the metrics block. This would require changing `flops_counter.estimate_flops` to accept a tensor; cross-cutting and outside this PR.
  • Async-pipelined metrics (compute on step N, log on step N+1).

kcz358 added 2 commits May 27, 2026 18:32
send_to_device defaulted to non_blocking=False, which made the H2D
transfer a synchronous step even when the dataloader produced pinned
tensors (the repo's default). With non_blocking=True the copy is
submitted to the CUDA stream and overlaps the first kernels of the
following training_step, eliminating the dedicated h2d wait.

Adds a one-time warning at trainer init when dataloader_pin_memory is
False, since the flag becomes a no-op on pageable memory.
…ther

calculate_training_metrics previously issued five independent all_reduce
calls (mfu/sum/avg/min/max) on tiny scalar tensors, each paying full
collective latency and an extra GPU->CPU sync via .item(). Replace them
with one all_gather_into_tensor over a 2-element per-rank tensor
[mfu_local, seq_len_sum_local], reduce locally (mean/sum/min/max), and
do a single batched .tolist() to pull all scalars at once.

Also drops the redundant torch.tensor(flops, device=...) wrapper since
the callee now accepts a Python float directly, removing one host->device
roundtrip per step.
@kcz358 kcz358 merged commit 2fe89eb into main May 28, 2026
3 checks passed
@kcz358 kcz358 deleted the perf/metrics-and-h2d branch May 28, 2026 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant