
conscious-engines/distributed-training


Distributed Training

A five-post series on how distributed training actually works, not just how to call the API. Each post picks one technique, explains the mechanics from first principles, and uses finetuning Qwen3-4B on AMI meeting transcripts as a concrete application so the tradeoffs are measurable on the same hardware across posts.

The series is scoped to finetuning workflows that realistically fit on one or two consumer / workstation GPUs: DDP, FSDP, TP+SP, and PP. Techniques that only become relevant at pretraining scale (3D parallelism, expert parallelism for MoE) are intentionally out of scope.

Structure

assets/        images and scripts to generate them
blogs/         long-form writeups
  devlog/      run logs with hardware, config, and results
experiments/   runnable training scripts

The Series

The framework post. What actually fills GPU memory during finetuning, the difference between a memory problem and a throughput problem, and a decision tree for picking DDP, FSDP, TP+SP, or PP. The foundation for the rest of the series.
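As a rough illustration of the memory accounting that post describes, here is a back-of-the-envelope sketch. The byte counts are common assumptions rather than figures from the series: 2 bytes per BF16 weight or gradient, 8 bytes per trainable parameter for FP32 Adam moments, and 0.5 bytes per 4-bit quantized weight. The 4e9 parameter count and the 30M adapter size are illustrative round numbers, and `finetune_memory_gb` is a hypothetical helper. Activations, FP32 master weights, and quantization constants are left out.

```python
def finetune_memory_gb(total_params, trainable_params,
                       weight_bytes=2.0, grad_bytes=2.0, optim_bytes=8.0):
    """Rough static GPU memory (GB) for weights, gradients, and Adam states.

    Assumed byte sizes, not measurements. Activations, CUDA context, and
    allocator fragmentation are excluded.
    """
    weights = total_params * weight_bytes
    grads = trainable_params * grad_bytes      # only trainable params get grads
    optim = trainable_params * optim_bytes     # Adam: two FP32 moments per param
    gb = 1e9
    return {
        "weights": weights / gb,
        "grads": grads / gb,
        "optimizer": optim / gb,
        "total": (weights + grads + optim) / gb,
    }


# Illustrative numbers only: a nominal 4B-parameter model, 30M adapter params.
full = finetune_memory_gb(4e9, 4e9)                       # full finetuning
lora = finetune_memory_gb(4e9, 30e6)                      # BF16 base + LoRA
qlora = finetune_memory_gb(4e9, 30e6, weight_bytes=0.5)   # 4-bit base + LoRA

for name, est in [("full", full), ("lora", lora), ("qlora", qlora)]:
    print(f"{name:>5}: {est['total']:.1f} GB static")
```

Under these assumptions the optimizer states dominate full finetuning, while for LoRA and QLoRA the frozen base weights dominate, which is the kind of distinction that separates a memory problem from a throughput problem.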

The mechanics of DDP: process groups, Ring-AllReduce (ReduceScatter + AllGather), gradient buckets, the compute-communication overlap and the autograd hooks that create it, and no_sync() for gradient accumulation. Application: QLoRA finetuning of Qwen3-4B on meeting transcripts. Includes the common misconfigurations that silently collapse speedup from 2x to 1.3x.

Code: experiments/ddp-ft/train.py

uv sync
uv run torchrun --nproc_per_node=2 experiments/ddp-ft/train.py
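The Ring-AllReduce decomposition mentioned above (ReduceScatter followed by AllGather) can be sketched as a single-process simulation. This is a toy in plain Python, not the code in experiments/ddp-ft/train.py and not how NCCL implements it; `ring_all_reduce` is a hypothetical helper, and it sums rather than averages (DDP additionally divides by the world size).

```python
def ring_all_reduce(per_rank):
    """Simulate a ring all-reduce (sum) over one equal-length vector per rank."""
    n = len(per_rank)
    size = len(per_rank[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    chunk = size // n
    data = [list(v) for v in per_rank]  # copy so the inputs stay untouched

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so the step behaves as if simultaneous.
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            c, payload = sends[(r - 1) % n]  # receive from the left neighbour
            for i, v in zip(span(c), payload):
                data[r][i] += v

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            c, payload = sends[(r - 1) % n]
            for i, v in zip(span(c), payload):
                data[r][i] = v  # overwrite with the fully reduced chunk

    return data


two_gpu = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]])
three_gpu = ring_all_reduce([[1] * 6, [2] * 6, [3] * 6])
print(two_gpu[0], three_gpu[0])
```

Each rank sends one chunk per step for 2(n-1) steps, so per-rank traffic is about 2(n-1)/n times the vector size regardless of how many ranks participate.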

What FSDP is actually doing: ZeRO stage mapping, AllGather + ReduceScatter per layer, auto_wrap_policy granularity, MixedPrecision's three separate dtype roles, backward prefetch, CPU offload, and why QLoRA is incompatible. Application: LoRA finetuning of Qwen3-4B in BF16.

Code: experiments/fsdp-ft/train.py

uv sync
uv run torchrun --nproc_per_node=2 experiments/fsdp-ft/train.py
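To make the ZeRO stage mapping concrete, the sketch below estimates per-GPU static memory at each stage. It assumes, for clarity, that every parameter is trainable (unlike the LoRA application above), with the same assumed byte sizes as before: 2 bytes for BF16 weights and gradients, 8 bytes for FP32 Adam moments. `per_gpu_static_gb` is a hypothetical helper and the 4e9 parameter count is a round illustrative number. As a general rule of thumb, FSDP's FULL_SHARD corresponds roughly to ZeRO stage 3 and SHARD_GRAD_OP to stage 2.

```python
def per_gpu_static_gb(params, world_size, stage,
                      weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Per-GPU weights + grads + optimizer memory (GB) at a given ZeRO stage.

    stage 0: nothing sharded (DDP-like replication)
    stage 1: optimizer states sharded
    stage 2: optimizer states + gradients sharded
    stage 3: optimizer states + gradients + parameters sharded
    Activations and temporary all-gathered parameters are ignored.
    """
    def shard(nbytes, sharded):
        return nbytes / world_size if sharded else nbytes

    weights = shard(params * weight_bytes, stage >= 3)
    grads = shard(params * grad_bytes, stage >= 2)
    optim = shard(params * optim_bytes, stage >= 1)
    return (weights + grads + optim) / 1e9


# Illustrative: nominal 4B parameters, all trainable, on 2 GPUs.
by_stage = {stage: per_gpu_static_gb(4e9, 2, stage) for stage in range(4)}
for stage, gb in by_stage.items():
    print(f"ZeRO-{stage}: {gb:.0f} GB per GPU")
```

With only two GPUs the savings are capped at half of each sharded component, which is why the per-layer AllGather and ReduceScatter cost matters so much at this scale.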

Column- and row-parallel linears, the hidden activation-replication problem that standard TP leaves behind, and how SP fixes it. Why PyTorch's parallelize_module and PEFT collide on LoRA (the DTensor/Tensor mismatch), and a hand-rolled Megatron-style version that makes SP work. On PCIe-connected GPUs without NVLink, TP+SP substantially outperforms FSDP for the same model.

Code: experiments/tp-ft/train.py (PyTorch native, no SP) and experiments/tp-sp-ft/train.py (hand-rolled, SP works).

uv run torchrun --nproc_per_node=2 experiments/tp-ft/train.py
uv run torchrun --nproc_per_node=2 experiments/tp-sp-ft/train.py
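The column/row-parallel pairing can be checked numerically on a toy MLP. The sketch below uses plain Python lists instead of tensors, made-up matrices, and hypothetical helper names; it is not taken from either training script. The first weight is split by columns, the second by rows, and summing the per-shard outputs stands in for the all-reduce.

```python
def matmul(a, b):
    # Plain nested-list matrix multiply.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, x) for x in row] for row in m]

def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1, -2], [3, 4]]                       # activations (2 tokens, hidden 2)
A = [[1, 0, 2, -1], [0, 1, -1, 2]]          # up-projection, 2 x 4
B = [[1, 0], [0, 1], [1, 1], [2, 0]]        # down-projection, 4 x 2

# Single-device reference: relu(X @ A) @ B
reference = matmul(relu(matmul(X, A)), B)

# Two "ranks": column-parallel A, row-parallel B. The elementwise ReLU is
# applied shard-locally, so no communication is needed between the matmuls.
partials = [
    matmul(relu(matmul(X, A_i)), B_i)
    for A_i, B_i in zip(split_cols(A, 2), split_rows(B, 2))
]
tp_out = add(partials[0], partials[1])      # stands in for the all-reduce

print(reference, tp_out)
```

Note that every rank still holds the full input X and the full summed output, which is the replicated activation that sequence parallelism targets: in the Megatron formulation the all-reduce is replaced by a reduce-scatter along the sequence dimension, with a matching all-gather before the next column-parallel layer.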

How PP splits the layer stack across GPUs, where the pipeline bubble comes from, and the sequence of techniques that shrink it: microbatching, GPipe, 1F1B, interleaved 1F1B. PP is overkill for 4B-scale finetuning; the experiment uses it at small scale to make the mechanics visible, and the post covers where it actually becomes necessary (100B+, multi-node).

Code: experiments/pp-ft/train.py

uv run torchrun --nproc_per_node=2 experiments/pp-ft/train.py
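The effect of microbatching on the pipeline bubble can be sketched with the standard idealized formula. It assumes every stage takes the same time per microbatch and ignores communication; `bubble_fraction` is a hypothetical helper, not part of experiments/pp-ft/train.py. Under this model GPipe and 1F1B have the same bubble (1F1B mainly lowers peak activation memory), while interleaving with v virtual stages per device divides the idle time by v.

```python
def bubble_fraction(stages, microbatches, virtual_stages=1):
    """Idealized fraction of a training step that a pipeline stage sits idle.

    Assumes equal per-stage compute time and zero communication cost.
    With virtual_stages=1 this is (p - 1) / (m + p - 1).
    """
    idle = (stages - 1) / virtual_stages
    return idle / (microbatches + idle)


# Two stages, as in a 2-GPU run: more microbatches shrink the bubble.
for m in (1, 2, 4, 8):
    print(f"p=2, m={m}: {bubble_fraction(2, m):.1%} idle")

# Interleaving helps most when there are many stages.
plain = bubble_fraction(4, 8)
interleaved = bubble_fraction(4, 8, virtual_stages=2)
print(f"p=4, m=8: {plain:.1%} plain vs {interleaved:.1%} interleaved")
```

With a single microbatch on two stages, each GPU is idle half the time, which is the naive layer-split baseline the scheduling techniques improve on.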

Devlogs

  • Single-GPU finetuning run: Qwen3-4B + QLoRA on a single RTX 4090. Hardware, config, issues, and results. Baseline for the series.
  • FSDP finetuning run: Qwen3-4B + LoRA (BF16) with FSDP FULL_SHARD on 2x RTX 4090. The silent failures: wrong decoder layer class, mixed dtypes, KV cache vs activation checkpointing.
  • TP+SP finetuning run: why the hand-rolled Megatron-style script had to exist, what parallelize_module was hiding, and what it bought at MAX_LENGTH=512.

System Requirements

  • OS: Linux (tested on Ubuntu 22.04). bitsandbytes does not officially support macOS or Windows.
  • GPU: NVIDIA, Ampere or newer (RTX 30xx/40xx, A100, H100). BF16 and Flash Attention 2 require Ampere+.
    • Single-GPU experiments: 1x 24 GB card (e.g. RTX 3090/4090).
    • DDP / FSDP / TP / PP experiments: 2+ GPUs on the same node, 24 GB each.
  • CUDA: an NVIDIA driver that supports CUDA 12.1 or newer. The pinned torch wheel ships its own CUDA runtime, so no separate CUDA toolkit install is needed on the host.
  • Python: 3.10 or newer.
  • uv: install from https://docs.astral.sh/uv/getting-started/installation/
  • HF token: required for gated models. export HF_TOKEN=<your-token>.
  • Disk: ~20 GB free for the base model weights, dataset cache, and adapter checkpoints.

Experiments

Dependencies are managed at the project level via pyproject.toml + uv.lock. Run uv sync once to create .venv/ with the locked deps, then invoke scripts with uv run (or source .venv/bin/activate first).

| Script | Description | Run |
| --- | --- | --- |
| experiments/single-gpu-ft/train.py | QLoRA finetune on single GPU | uv run python experiments/single-gpu-ft/train.py |
| experiments/single-gpu-ft/infer.py | Inference with saved adapter | uv run python experiments/single-gpu-ft/infer.py |
| experiments/ddp-ft/train.py | QLoRA finetune with DDP across N GPUs | uv run torchrun --nproc_per_node=N experiments/ddp-ft/train.py |
| experiments/ddp-ft/infer.py | Inference with DDP-trained adapter | uv run python experiments/ddp-ft/infer.py |
| experiments/fsdp-ft/train.py | LoRA finetune with FSDP across N GPUs | uv run torchrun --nproc_per_node=N experiments/fsdp-ft/train.py |
| experiments/fsdp-ft/infer.py | Inference with FSDP-trained adapter (BF16) | uv run python experiments/fsdp-ft/infer.py |
| experiments/tp-ft/train.py | LoRA finetune with Tensor Parallel (PyTorch native) across N GPUs | uv run torchrun --nproc_per_node=N experiments/tp-ft/train.py |
| experiments/tp-ft/infer.py | Inference with TP-trained adapter (reassembled) | uv run python experiments/tp-ft/infer.py |
| experiments/tp-sp-ft/train.py | Hand-rolled Megatron-style TP + Sequence Parallel + LoRA across N GPUs | uv run torchrun --nproc_per_node=N experiments/tp-sp-ft/train.py |
| experiments/tp-sp-ft/infer.py | Inference with TP+SP-trained adapter (reassembled and merged) | uv run python experiments/tp-sp-ft/infer.py |
| experiments/pp-ft/train.py | LoRA finetune with Pipeline Parallel across N stages | uv run torchrun --nproc_per_node=N experiments/pp-ft/train.py |
| experiments/pp-ft/infer.py | Inference with PP-trained adapter (layer indices remapped) | uv run python experiments/pp-ft/infer.py |

About

Experiments and blog posts on distributed training
