A five-post series on how distributed training actually works, not just how to call the API. Each post picks one technique, explains the mechanics from first principles, and uses finetuning Qwen3-4B on AMI meeting transcripts as a concrete application so the tradeoffs are measurable on the same hardware across posts.
The series is scoped to finetuning workflows that realistically fit on one or two consumer / workstation GPUs: DDP, FSDP, TP+SP, and PP. Techniques that only become relevant at pretraining scale (3D parallelism, expert parallelism for MoE) are intentionally out of scope.
assets/ images and scripts to generate them
blogs/ long-form writeups
devlog/ run logs with hardware, config, and results
experiments/ runnable training scripts
The framework post. What actually fills GPU memory during finetuning, the difference between a memory problem and a throughput problem, and a decision tree for picking DDP, FSDP, TP+SP, or PP. The foundation for the rest of the series.
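As a rough illustration of what fills GPU memory, the static footprint can be estimated from parameter counts alone. This is a back-of-the-envelope sketch, not the accounting used in the post. The function name `static_memory_gb`, the byte sizes (BF16 weights and gradients, two FP32 Adam moments, about 0.5 bytes per 4-bit weight), and the round parameter counts are all assumptions chosen for illustration.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic easy to follow


def static_memory_gb(frozen_params, trainable_params, frozen_bytes,
                     trainable_bytes=2.0, grad_bytes=2.0, optim_bytes=8.0):
    """Rough static footprint in GB; activations and CUDA overhead are not counted.

    Assumed defaults: BF16 trainable weights and gradients (2 bytes each) and
    two FP32 Adam moments (8 bytes) per trainable parameter.
    """
    weights = frozen_params * frozen_bytes + trainable_params * trainable_bytes
    grads = trainable_params * grad_bytes
    optim = trainable_params * optim_bytes
    parts = {"weights": weights / GB, "grads": grads / GB, "optimizer": optim / GB}
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical round numbers: a 4e9-parameter base model, 3e7 adapter parameters.
full_ft = static_memory_gb(frozen_params=0, trainable_params=4e9, frozen_bytes=0.0)
lora_bf16 = static_memory_gb(frozen_params=4e9, trainable_params=3e7, frozen_bytes=2.0)
qlora_4bit = static_memory_gb(frozen_params=4e9, trainable_params=3e7, frozen_bytes=0.5)

for name, est in [("full", full_ft), ("LoRA BF16", lora_bf16), ("QLoRA 4-bit", qlora_4bit)]:
    print(f"{name:12s} ~{est['total']:.2f} GB static")
```

Activations, which this sketch leaves out, grow with batch size and sequence length, so the real peak is higher than any of these totals.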
The mechanics of DDP: process groups, Ring-AllReduce (ReduceScatter + AllGather), gradient buckets, the compute-communication overlap and the autograd hooks that create it, and no_sync() for gradient accumulation. Application: QLoRA finetuning of Qwen3-4B on meeting transcripts. Includes the common misconfigurations that silently collapse speedup from 2x to 1.3x.
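The ReduceScatter + AllGather decomposition of Ring-AllReduce can be sketched in plain Python, with lists standing in for per-rank gradient buffers. This is a single-process simulation for intuition only; it is not how NCCL or DDP implement the collective, and the function name `ring_allreduce` is made up for this example. It assumes the buffer length is divisible by the number of ranks.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) ranks."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    step = size // n
    # chunks[r][c] is rank r's local copy of chunk c.
    chunks = [[list(buf[c * step:(c + 1) * step]) for c in range(n)] for buf in buffers]

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum of chunk (r+1) % n.
    for s in range(n - 1):
        # Collect all messages first so every rank "sends" simultaneously.
        msgs = [((r - s) % n, chunks[r][(r - s) % n]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring, overwriting stale copies.
    for s in range(n - 1):
        msgs = [((r + 1 - s) % n, chunks[r][(r + 1 - s) % n]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            chunks[(r + 1) % n][c] = list(payload)

    return [[x for chunk in rank for x in chunk] for rank in chunks]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four gradient values each
print(ring_allreduce(grads))
```

Each rank sends one chunk per step in each phase, so the data sent per rank stays close to twice the buffer size regardless of the number of ranks. DDP averages gradients, so in practice the summed result is also divided by the world size.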
Code: experiments/ddp-ft/train.py
uv sync
uv run torchrun --nproc_per_node=2 experiments/ddp-ft/train.py

What FSDP is actually doing: ZeRO stage mapping, AllGather + ReduceScatter per layer, auto_wrap_policy granularity, MixedPrecision's three separate dtype roles, backward prefetch, CPU offload, and why QLoRA is incompatible. Application: LoRA finetuning of Qwen3-4B in BF16.
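The ZeRO stage mapping can be made concrete with a small calculation of which state each sharding strategy divides across ranks. The strategy names match PyTorch's `ShardingStrategy` enum members, but the helper `per_gpu_gb` and the example sizes are hypothetical. The sketch ignores activations and the temporarily unsharded parameters that exist while a layer is being computed.

```python
# Which pieces of training state each strategy shards across ranks.
# NO_SHARD behaves like DDP, SHARD_GRAD_OP roughly like ZeRO-2, FULL_SHARD like ZeRO-3.
SHARDED_STATE = {
    "NO_SHARD": set(),
    "SHARD_GRAD_OP": {"grads", "optim"},
    "FULL_SHARD": {"params", "grads", "optim"},
}


def per_gpu_gb(state_gb, world_size, strategy):
    """Steady-state memory per GPU for the given state sizes (in GB)."""
    sharded = SHARDED_STATE[strategy]
    return sum(
        gb / world_size if name in sharded else gb
        for name, gb in state_gb.items()
    )


# Hypothetical sizes for illustration only, not measurements from the series.
state = {"params": 8.0, "grads": 8.0, "optim": 32.0}
for strategy in SHARDED_STATE:
    print(f"{strategy:14s} {per_gpu_gb(state, world_size=2, strategy=strategy):.1f} GB per GPU")
```

The saving from sharding scales with how much of the state is trainable, which is why the picture looks different for LoRA, where gradients and optimizer state are already small.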
Code: experiments/fsdp-ft/train.py
uv sync
uv run torchrun --nproc_per_node=2 experiments/fsdp-ft/train.py

Column/row-parallel linears, the hidden activation-replication problem standard TP leaves behind, and how SP fixes it. Why PyTorch's parallelize_module + PEFT collide on LoRA (the DTensor/Tensor mismatch), and a hand-rolled Megatron-style version that makes SP work. On PCIe-connected GPUs without NVLink, TP+SP substantially outperforms FSDP for the same model.
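The column-parallel then row-parallel pairing can be checked numerically with nested lists in place of tensors. This is a toy sketch of the idea only; it does not reflect the code in either experiment script, it uses ReLU as a stand-in activation, and it does not cover SP. All helper names are made up for this example.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def tp_mlp(x, w_up, w_down, world_size):
    """Two-layer MLP with w_up column-parallel and w_down row-parallel."""
    up_shards = split_columns(w_up, world_size)   # each rank gets a slice of hidden units
    down_shards = split_rows(w_down, world_size)  # matching slice of the second weight
    # Each rank works on its own hidden slice; the activation is elementwise,
    # so no communication is needed between the two matmuls.
    partials = [matmul(relu(matmul(x, up)), down)
                for up, down in zip(up_shards, down_shards)]
    # A single all-reduce (sum) of the partial outputs finishes the layer.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, -2], [3, 4]]
w_up = [[1, -1, 2, 0], [0, 3, -1, 1]]
w_down = [[1, 0], [2, -1], [0, 1], [-1, 2]]

reference = matmul(relu(matmul(x, w_up)), w_down)
print(tp_mlp(x, w_up, w_down, world_size=2) == reference)
```

Note that every rank still holds the full input `x` and the full output. That replicated activation is what the SP part of the post addresses.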
Code: experiments/tp-ft/train.py (PyTorch native, no SP) and experiments/tp-sp-ft/train.py (hand-rolled, SP works).
uv run torchrun --nproc_per_node=2 experiments/tp-ft/train.py
uv run torchrun --nproc_per_node=2 experiments/tp-sp-ft/train.py

How PP splits the layer stack across GPUs, where the pipeline bubble comes from, and the sequence of techniques that shrink it: microbatching, GPipe, 1F1B, interleaved 1F1B. PP is overkill for 4B-scale finetuning; the experiment uses it at small scale to make the mechanics visible, and the post covers where it actually becomes necessary (100B+, multi-node).
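The effect of microbatching and interleaving on the bubble can be sketched with the usual idealized formula. It assumes every stage takes the same time per microbatch and that communication is free, which real runs do not satisfy. With p stages, m microbatches, and v virtual stages per GPU, the idle share of a step is ((p - 1) / v) / (m + (p - 1) / v). The function name `bubble_fraction` is made up for this example.

```python
from fractions import Fraction


def bubble_fraction(stages, microbatches, virtual_stages=1):
    """Idealized share of a training step that each GPU spends idle.

    virtual_stages=1 covers GPipe and plain 1F1B, which share the same bubble
    under these assumptions; virtual_stages > 1 models interleaved 1F1B.
    """
    idle = Fraction(stages - 1, virtual_stages)
    return idle / (microbatches + idle)


for m in (1, 4, 16):
    print(f"2 stages, {m:2d} microbatches: {float(bubble_fraction(2, m)):.1%} idle")
print(f"4 stages, 8 microbatches, 2 virtual stages: {float(bubble_fraction(4, 8, 2)):.1%} idle")
```

Under this model 1F1B does not shrink the bubble relative to GPipe; its benefit is lower peak activation memory, since fewer microbatches are in flight at once.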
Code: experiments/pp-ft/train.py
uv run torchrun --nproc_per_node=2 experiments/pp-ft/train.py

- Single-GPU finetuning run: Qwen3-4B + QLoRA on a single RTX 4090. Hardware, config, issues, and results. Baseline for the series.
- FSDP finetuning run: Qwen3-4B + LoRA (BF16) with FSDP FULL_SHARD on 2x RTX 4090. The silent failures: wrong decoder layer class, mixed dtypes, KV cache vs activation checkpointing.
- TP+SP finetuning run: why the hand-rolled Megatron-style script had to exist, what parallelize_module was hiding, and what it bought at MAX_LENGTH=512.
- OS: Linux (tested on Ubuntu 22.04). bitsandbytes does not officially support macOS or Windows.
- GPU: NVIDIA, Ampere or newer (RTX 30xx/40xx, A100, H100). BF16 and Flash Attention 2 require Ampere+.
- Single-GPU experiments: 1x 24 GB card (e.g. RTX 3090/4090).
- DDP / FSDP / TP / PP experiments: 2+ GPUs on the same node, 24 GB each.
- CUDA: 12.1+ with a matching NVIDIA driver. The pinned torch wheel ships its own CUDA runtime, so only the driver needs to be installed on the host.
- Python: 3.10 or newer.
- uv: install from https://docs.astral.sh/uv/getting-started/installation/
- HF token: required for gated models. Set it with export HF_TOKEN=<your-token>.
- Disk: ~20 GB free for the base model weights, dataset cache, and adapter checkpoints.
Dependencies are managed at the project level via pyproject.toml + uv.lock. Run uv sync once to create .venv/ with the locked deps, then invoke scripts with uv run (or source .venv/bin/activate first).
| Script | Description | Run |
|---|---|---|
| experiments/single-gpu-ft/train.py | QLoRA finetune on single GPU | uv run python experiments/single-gpu-ft/train.py |
| experiments/single-gpu-ft/infer.py | Inference with saved adapter | uv run python experiments/single-gpu-ft/infer.py |
| experiments/ddp-ft/train.py | QLoRA finetune with DDP across N GPUs | uv run torchrun --nproc_per_node=N experiments/ddp-ft/train.py |
| experiments/ddp-ft/infer.py | Inference with DDP-trained adapter | uv run python experiments/ddp-ft/infer.py |
| experiments/fsdp-ft/train.py | LoRA finetune with FSDP across N GPUs | uv run torchrun --nproc_per_node=N experiments/fsdp-ft/train.py |
| experiments/fsdp-ft/infer.py | Inference with FSDP-trained adapter (BF16) | uv run python experiments/fsdp-ft/infer.py |
| experiments/tp-ft/train.py | LoRA finetune with Tensor Parallel (PyTorch native) across N GPUs | uv run torchrun --nproc_per_node=N experiments/tp-ft/train.py |
| experiments/tp-ft/infer.py | Inference with TP-trained adapter (reassembled) | uv run python experiments/tp-ft/infer.py |
| experiments/tp-sp-ft/train.py | Hand-rolled Megatron-style TP + Sequence Parallel + LoRA across N GPUs | uv run torchrun --nproc_per_node=N experiments/tp-sp-ft/train.py |
| experiments/tp-sp-ft/infer.py | Inference with TP+SP-trained adapter (reassembled and merged) | uv run python experiments/tp-sp-ft/infer.py |
| experiments/pp-ft/train.py | LoRA finetune with Pipeline Parallel across N stages | uv run torchrun --nproc_per_node=N experiments/pp-ft/train.py |
| experiments/pp-ft/infer.py | Inference with PP-trained adapter (layer indices remapped) | uv run python experiments/pp-ft/infer.py |