perf(fsdp2): reduce per-step host/comm overhead by kcz358 · Pull Request #181 · EvolvingLMMs-Lab/lmms-engine

kcz358 · 2026-05-28T01:43:10Z

Summary

Two small follow-ups discussed in #179. Each is one commit, independently revertable.

1. `non_blocking=True` for host-to-device

`send_to_device` defaulted to `non_blocking=False`, so the H2D transfer was synchronous on every step even though the repo already defaults `dataloader_pin_memory=True`. Flipping the flag lets the copy overlap the first kernels of the following `training_step`.

Adds a one-time warning at trainer init when `dataloader_pin_memory=False`, since the flag becomes a no-op on pageable memory (mirrors the existing check in `train/hf/trainer.py`).

2. Five `all_reduce` collectives → one `all_gather`

`calculate_training_metrics` issued five separate `all_reduce` calls (`AVG` mfu, `SUM` total seq len, `AVG`/`MIN`/`MAX` seq len stats) on scalar tensors. Each one paid full collective latency and forced a `.item()` GPU→CPU sync.

Replaced with a single `all_gather_into_tensor` over a 2-element per-rank tensor `[mfu_local, seq_len_sum_local]`, then mean/sum/min/max are computed locally on the gathered tensor with one batched `.tolist()` to pull all five scalars at once.

Also dropped the `torch.tensor(flops, device=...)` wrapper at the caller — `calculate_training_metrics` now accepts the Python float from `estimate_flops` directly, removing one host→device roundtrip.

Notes

Per-step wins here are small in absolute terms — `training_metrics` was already only a few ms — but the changes are pure cleanup with no behavior change. Output metrics are bit-equivalent to before (modulo float32 reduction order, which already varied across ranks).

Test

Verified via `cicd/run_traincicd.sh --model-name qwen3_vl --gpu-count 4` locally; loss curve and reported metric values match main.

Out of scope

Discussed but deferred:

Skipping the metrics computation entirely on non-`logging_steps` steps (largest potential win, but changes user-visible logging behavior).
Removing the `seq_len.detach().cpu().tolist()` sync inside the metrics block. This would require changing `flops_counter.estimate_flops` to accept a tensor; cross-cutting and outside this PR.
Async-pipelined metrics (compute on step N, log on step N+1).

send_to_device defaulted to non_blocking=False, which made the H2D transfer a synchronous step even when the dataloader produced pinned tensors (the repo's default). With non_blocking=True the copy is submitted to the CUDA stream and overlaps the first kernels of the following training_step, eliminating the dedicated h2d wait. Adds a one-time warning at trainer init when dataloader_pin_memory is False, since the flag becomes a no-op on pageable memory.

…ther calculate_training_metrics previously issued five independent all_reduce calls (mfu/sum/avg/min/max) on tiny scalar tensors, each paying full collective latency and an extra GPU->CPU sync via .item(). Replace them with one all_gather_into_tensor over a 2-element per-rank tensor [mfu_local, seq_len_sum_local], reduce locally (mean/sum/min/max), and do a single batched .tolist() to pull all scalars at once. Also drops the redundant torch.tensor(flops, device=...) wrapper since the callee now accepts a Python float directly, removing one host->device roundtrip per step.

kcz358 added 2 commits May 27, 2026 18:32

kcz358 merged commit 2fe89eb into main May 28, 2026
3 checks passed

kcz358 deleted the perf/metrics-and-h2d branch May 28, 2026 01:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(fsdp2): reduce per-step host/comm overhead#181

perf(fsdp2): reduce per-step host/comm overhead#181
kcz358 merged 2 commits into
mainfrom
perf/metrics-and-h2d

kcz358 commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

kcz358 commented May 28, 2026

Summary

1. non_blocking=True for host-to-device

2. Five `all_reduce` collectives → one `all_gather`

Notes

Test

Out of scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `non_blocking=True` for host-to-device