Distributed Training

A five-post series on how distributed training actually works, not just how to call the API. Each post picks one technique, explains the mechanics from first principles, and uses finetuning Qwen3-4B on AMI meeting transcripts as a concrete application so the tradeoffs are measurable on the same hardware across posts.

The series is scoped to finetuning workflows that realistically fit on one or two consumer / workstation GPUs: DDP, FSDP, TP+SP, and PP. Techniques that only become relevant at pretraining scale (3D parallelism, expert parallelism for MoE) are intentionally out of scope.

Structure

assets/        images and scripts to generate them
blogs/         long-form writeups
  devlog/      run logs with hardware, config, and results
experiments/   runnable training scripts

The Series

Overview: From Single-GPU to Distributed Training

The framework post. What actually fills GPU memory during finetuning, the difference between a memory problem and a throughput problem, and a decision tree for picking DDP, FSDP, TP+SP, or PP. The foundation for the rest of the series.
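
As a rough illustration of the memory accounting this post covers, the sketch below estimates model-state memory (weights, gradients, optimizer state) under a common mixed-precision Adam accounting. The function name, the byte counts, the round 4e9 parameter count, and the 30e6 adapter size are illustrative assumptions, not measurements from the series; activations and framework overhead are excluded.

```python
def estimate_state_memory_gb(n_params, n_trainable,
                             weight_bytes=2.0, grad_bytes=2.0, optim_bytes=12.0):
    """Rough model-state memory in GB (1e9 bytes). Activations are excluded.

    Assumed defaults: BF16 weights and gradients (2 bytes each) and 12 bytes
    of optimizer state per trainable parameter (an FP32 master copy plus
    Adam's two FP32 moments).
    """
    weights = n_params * weight_bytes / 1e9
    grads = n_trainable * grad_bytes / 1e9
    optim = n_trainable * optim_bytes / 1e9
    return {"weights": weights, "grads": grads, "optimizer": optim,
            "total": weights + grads + optim}


N_PARAMS = 4e9     # round stand-in for a 4B-parameter model (assumption)
N_ADAPTER = 30e6   # hypothetical LoRA adapter size (assumption)

full_ft = estimate_state_memory_gb(N_PARAMS, N_PARAMS)
lora = estimate_state_memory_gb(N_PARAMS, N_ADAPTER)
# ~4-bit base weights, ignoring quantization constants (assumption)
qlora = estimate_state_memory_gb(N_PARAMS, N_ADAPTER, weight_bytes=0.5)

for name, est in [("full finetune", full_ft),
                  ("LoRA (BF16 base)", lora),
                  ("QLoRA (4-bit base)", qlora)]:
    print(f"{name:>20}: {est['total']:.2f} GB of model state")
```

Under these assumptions, full finetuning is dominated by optimizer state, while LoRA and QLoRA are dominated by the frozen base weights. That is the kind of breakdown that separates a memory problem from a throughput problem.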

Part 1: DistributedDataParallel (DDP)

The mechanics of DDP: process groups, Ring-AllReduce (ReduceScatter + AllGather), gradient buckets, the compute-communication overlap and the autograd hooks that create it, and no_sync() for gradient accumulation. Application: QLoRA finetuning of Qwen3-4B on meeting transcripts. Includes the common misconfigurations that silently collapse speedup from 2x to 1.3x.

Code: experiments/ddp-ft/train.py

uv sync
uv run torchrun --nproc_per_node=2 experiments/ddp-ft/train.py
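
To make the Ring-AllReduce decomposition named above concrete, here is a minimal pure-Python simulation of the two phases (ReduceScatter, then AllGather). It is a sketch for intuition only: it does not use torch.distributed or NCCL, the function name ring_allreduce is hypothetical, and it assumes the buffer length divides evenly by the number of ranks.

```python
def ring_allreduce(rank_data):
    """Simulate a sum all-reduce over per-rank vectors arranged in a ring."""
    n = len(rank_data)
    length = len(rank_data[0])
    assert length % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = length // n
    # Each rank holds its own copy of the buffer, viewed as n chunks.
    bufs = [[list(v[i * chunk:(i + 1) * chunk]) for i in range(n)] for v in rank_data]

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) % n to
    # its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [list(bufs[r][c]) for r, c in sends]  # snapshot: sends are simultaneous
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. At step s, rank r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its stale copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [list(bufs[r][c]) for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            bufs[(r + 1) % n][c] = payload

    # Flatten the chunks back into one vector per rank.
    return [[x for c in b for x in c] for b in bufs]


grads = [[1, 2, 3, 4, 5, 6],
         [10, 20, 30, 40, 50, 60],
         [100, 200, 300, 400, 500, 600]]
reduced = ring_allreduce(grads)
print(reduced[0])
print(all(r == reduced[0] for r in reduced))
```

Each rank sends 2(n - 1) chunks of size 1/n of the buffer, so per-rank traffic stays close to twice the buffer size regardless of the number of ranks.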

Part 2: Fully Sharded Data Parallel (FSDP)

What FSDP is actually doing: ZeRO stage mapping, AllGather + ReduceScatter per layer, auto_wrap_policy granularity, MixedPrecision's three separate dtype roles, backward prefetch, CPU offload, and why QLoRA is incompatible. Application: LoRA finetuning of Qwen3-4B in BF16.

Code: experiments/fsdp-ft/train.py

uv sync
uv run torchrun --nproc_per_node=2 experiments/fsdp-ft/train.py
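
The ZeRO stage mapping can be sketched as simple arithmetic. The snippet below estimates peak per-GPU model-state memory for the three FSDP ShardingStrategy names, using the usual rough correspondence (NO_SHARD behaves like DDP, SHARD_GRAD_OP roughly like ZeRO-2, FULL_SHARD roughly like ZeRO-3). The helper name, byte counts, and parameter count are assumptions for illustration; it models full finetuning and ignores activations and the layer parameters that FULL_SHARD temporarily gathers.

```python
def per_gpu_state_gb(strategy, n_params, world_size,
                     param_bytes=2.0, grad_bytes=2.0, optim_bytes=12.0):
    """Rough peak per-GPU model state in GB (1e9 bytes) for a sharding strategy."""
    # Divisors applied to (parameters, gradients, optimizer state).
    divisors = {
        "NO_SHARD": (1, 1, 1),                                # everything replicated
        "SHARD_GRAD_OP": (1, world_size, world_size),         # params whole during compute
        "FULL_SHARD": (world_size, world_size, world_size),   # everything sharded
    }
    p_div, g_div, o_div = divisors[strategy]
    per_param = param_bytes / p_div + grad_bytes / g_div + optim_bytes / o_div
    return n_params * per_param / 1e9


N_PARAMS = 4e9  # round stand-in for a 4B-parameter model (assumption)
for strategy in ("NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD"):
    gb = per_gpu_state_gb(strategy, N_PARAMS, world_size=2)
    print(f"{strategy:>14}: {gb:.1f} GB per GPU")
```

For LoRA, gradients and optimizer state exist only for the small adapter, so most of the saving comes from sharding the frozen base parameters, which only FULL_SHARD does in this accounting.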

Part 3: Tensor Parallelism + Sequence Parallelism (TP+SP)

Column- and row-parallel linears, the hidden activation-replication problem that standard TP leaves behind, and how SP fixes it. Why PyTorch's parallelize_module and PEFT collide on LoRA (a DTensor/Tensor mismatch), and a hand-rolled Megatron-style implementation that makes SP work. On PCIe-connected GPUs without NVLink, TP+SP substantially outperforms FSDP for the same model.

Code: experiments/tp-ft/train.py (PyTorch native, no SP) and experiments/tp-sp-ft/train.py (hand-rolled, SP works).

uv run torchrun --nproc_per_node=2 experiments/tp-ft/train.py
uv run torchrun --nproc_per_node=2 experiments/tp-sp-ft/train.py
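
The column/row-parallel pairing can be checked numerically without any GPU. The sketch below splits a two-layer MLP Megatron-style: the first weight by columns, the second by rows, with a single sum (standing in for the all-reduce) at the end. It is plain Python on nested lists, the helper names are hypothetical, ReLU stands in for the real activation, and it is not the code from either experiment.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_cols(m, parts):
    assert len(m[0]) % parts == 0
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    assert len(m) % parts == 0
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def tp_mlp(x, w1, w2, world_size):
    """Column-parallel w1, row-parallel w2, one all-reduce (sum) at the end."""
    w1_shards = split_cols(w1, world_size)   # each rank gets a slice of hidden units
    w2_shards = split_rows(w2, world_size)   # matching slice of the second layer
    # Each "rank" works only on its shards; the elementwise activation
    # needs no communication because hidden units are independent.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # All-reduce: elementwise sum of the per-rank partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, -2, 3], [0, 4, -1]]
w1 = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -3, 1, 1]]
w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]

reference = matmul(relu(matmul(x, w1)), w2)
print(tp_mlp(x, w1, w2, world_size=2) == reference)
```

Note that the input x and the final output are full-size on every rank in this scheme. Those replicated activations are what sequence parallelism shards.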

Part 4: Pipeline Parallelism (PP)

How PP splits the layer stack across GPUs, where the pipeline bubble comes from, and the sequence of techniques that shrink it: microbatching, GPipe, 1F1B, interleaved 1F1B. PP is overkill for 4B-scale finetuning; the experiment uses it at small scale to make the mechanics visible, and the post covers where it actually becomes necessary (100B+, multi-node).

Code: experiments/pp-ft/train.py

uv run torchrun --nproc_per_node=2 experiments/pp-ft/train.py
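
The effect of microbatching on the pipeline bubble can be sketched with the standard idealized estimate: with p stages and m microbatches, the idle fraction of a GPipe or 1F1B schedule is (p - 1) / (m + p - 1), and interleaving with v model chunks per device divides the bubble term by v. The function below assumes equal compute time per stage and no communication cost; its name and parameters are illustrative, not taken from the experiment.

```python
def bubble_fraction(stages, microbatches, virtual_chunks=1):
    """Idealized fraction of a training step that each device spends idle."""
    if stages < 1 or microbatches < 1 or virtual_chunks < 1:
        raise ValueError("stages, microbatches and virtual_chunks must be >= 1")
    # Bubble length in units of one microbatch's forward+backward on one device.
    bubble = (stages - 1) / virtual_chunks
    return bubble / (microbatches + bubble)


for m in (1, 2, 4, 8, 16):
    print(f"stages=2 microbatches={m:>2}: bubble = {bubble_fraction(2, m):.1%}")

# Interleaving (virtual_chunks > 1) shrinks the bubble at the same microbatch count.
print(f"stages=4 microbatches=8: {bubble_fraction(4, 8):.1%} plain, "
      f"{bubble_fraction(4, 8, virtual_chunks=2):.1%} interleaved")
```

In this model GPipe and 1F1B have the same bubble; 1F1B's advantage is lower peak activation memory, which the estimate does not capture.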

Devlogs

Single-GPU finetuning run: Qwen3-4B + QLoRA on a single RTX 4090. Hardware, config, issues, and results. Baseline for the series.
FSDP finetuning run: Qwen3-4B + LoRA (BF16) with FSDP FULL_SHARD on 2x RTX 4090. The silent failures: wrong decoder layer class, mixed dtypes, KV cache vs activation checkpointing.
TP+SP finetuning run: why the hand-rolled Megatron-style script had to exist, what parallelize_module was hiding, and what it bought at MAX_LENGTH=512.

System Requirements

OS: Linux (tested on Ubuntu 22.04). Other platforms are untested; bitsandbytes support on macOS and Windows is limited and depends on the installed version.
GPU: NVIDIA, Ampere or newer (RTX 30xx/40xx, A100, H100). BF16 and Flash Attention 2 require Ampere+.
- Single-GPU experiments: 1x 24 GB card (e.g. RTX 3090/4090).
- DDP / FSDP / TP / PP experiments: 2+ GPUs on the same node, 24 GB each.
CUDA: an NVIDIA driver compatible with CUDA 12.1 or newer. The pinned torch wheel ships its own CUDA runtime, so only the driver needs to be installed on the host.
Python: 3.10 or newer.
uv: install from https://docs.astral.sh/uv/getting-started/installation/
HF token: required only for gated models. Set it with export HF_TOKEN=<your-token>.
Disk: ~20 GB free for the base model weights, dataset cache, and adapter checkpoints.

Experiments

Dependencies are managed at the project level via pyproject.toml + uv.lock. Run uv sync once to create .venv/ with the locked deps, then invoke scripts with uv run (or source .venv/bin/activate first).

| Script | Description | Run |
| --- | --- | --- |
| `experiments/single-gpu-ft/train.py` | QLoRA finetune on single GPU | `uv run python experiments/single-gpu-ft/train.py` |
| `experiments/single-gpu-ft/infer.py` | Inference with saved adapter | `uv run python experiments/single-gpu-ft/infer.py` |
| `experiments/ddp-ft/train.py` | QLoRA finetune with DDP across N GPUs | `uv run torchrun --nproc_per_node=N experiments/ddp-ft/train.py` |
| `experiments/ddp-ft/infer.py` | Inference with DDP-trained adapter | `uv run python experiments/ddp-ft/infer.py` |
| `experiments/fsdp-ft/train.py` | LoRA finetune with FSDP across N GPUs | `uv run torchrun --nproc_per_node=N experiments/fsdp-ft/train.py` |
| `experiments/fsdp-ft/infer.py` | Inference with FSDP-trained adapter (BF16) | `uv run python experiments/fsdp-ft/infer.py` |
| `experiments/tp-ft/train.py` | LoRA finetune with Tensor Parallel (PyTorch native) across N GPUs | `uv run torchrun --nproc_per_node=N experiments/tp-ft/train.py` |
| `experiments/tp-ft/infer.py` | Inference with TP-trained adapter (reassembled) | `uv run python experiments/tp-ft/infer.py` |
| `experiments/tp-sp-ft/train.py` | Hand-rolled Megatron-style TP + Sequence Parallel + LoRA across N GPUs | `uv run torchrun --nproc_per_node=N experiments/tp-sp-ft/train.py` |
| `experiments/tp-sp-ft/infer.py` | Inference with TP+SP-trained adapter (reassembled and merged) | `uv run python experiments/tp-sp-ft/infer.py` |
| `experiments/pp-ft/train.py` | LoRA finetune with Pipeline Parallel across N stages | `uv run torchrun --nproc_per_node=N experiments/pp-ft/train.py` |
| `experiments/pp-ft/infer.py` | Inference with PP-trained adapter (layer indices remapped) | `uv run python experiments/pp-ft/infer.py` |
