Minimal PyTorch re-implementation of "Human Motion Diffusion Model" (Tevet et al., ICLR 2023), built from scratch for learning purposes.
The full official implementation is at GuyTevet/motion-diffusion-model.
Action 3 (jump), trained on HumanAct12Poses for 30,000 epochs on a single RTX 2080 Ti.
MDM generates human motion sequences (e.g., a person walks forward) using a diffusion model — a generative approach that learns to reverse a process which gradually corrupts data with noise.
This repository is a minimal, from-scratch implementation that isolates the core mechanics: the noise scheduler, the Transformer-based denoising model, and the training loop. It intentionally omits CLIP/BERT text encoding, large-scale datasets (HumanML3D, KIT), and evaluation pipelines so the essential ideas stay readable.
The three components and how they connect:
graph LR
PE["PositionalEncoding\n(model.py)"]
MDM["MDM\n(model.py)"]
NS["NoiseScheduler\n(scheduler.py)"]
TS["train_step()\n(train_step.py)"]
PE --> MDM
TS -->|"① add_noise(x₀, ε, t) → x_t"| NS
NS -->|"x_t"| TS
TS -->|"② forward(x_t, t, action) → pred_x₀"| MDM
MDM -->|"pred_x₀"| TS
graph LR
x0["x₀ · Clean motion\n[B, F, J×3]"]
eps["ε ~ N(0, I)\n[B, F, J×3]"]
xt["x_t · Noisy motion\n[B, F, J×3]"]
pred["pred_x₀\n[B, F, J×3]"]
loss["MSE Loss"]
optim["Adam update"]
x0 --> xt
eps --> xt
xt -->|"MDM.forward"| pred
pred --> loss
x0 -->|"target"| loss
loss --> optim
optim -->|"∇θ"| pred
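The training-step flow in the diagram can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in `examples/train_step.py`: `ToyDenoiser` is a hypothetical stand-in for the real model, and the tensor sizes, beta range, and learning rate are assumptions chosen only so the snippet runs quickly on CPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed sizes for illustration: batch, frames, joints, diffusion steps, action classes
B, FRAMES, JOINTS, T, NUM_ACTIONS = 4, 16, 24, 1000, 12
FEAT = JOINTS * 3


class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for MDM: same inputs and outputs, much smaller."""

    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_ACTIONS, 32)
        self.net = nn.Sequential(
            nn.Linear(FEAT + 32 + 1, 128), nn.SiLU(), nn.Linear(128, FEAT)
        )

    def forward(self, x_t, t, action):
        # Concatenate action embedding and normalized timestep: [B, 33]
        cond = torch.cat(
            [self.action_emb(action), (t.float() / T).unsqueeze(-1)], dim=-1
        )
        # Broadcast the conditioning vector to every frame: [B, F, 33]
        cond = cond.unsqueeze(1).expand(-1, x_t.shape[1], -1)
        return self.net(torch.cat([x_t, cond], dim=-1))  # [B, F, J*3]


# Linear beta schedule (assumed range) and its cumulative product alpha_bar
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)


def add_noise(x0, eps, t):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t].view(-1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps


model = ToyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)


def train_step(x0, action):
    t = torch.randint(0, T, (x0.shape[0],))   # one random timestep per sample
    eps = torch.randn_like(x0)                # Gaussian noise
    x_t = add_noise(x0, eps, t)               # step 1: corrupt the clean motion
    pred_x0 = model(x_t, t, action)           # step 2: predict the clean motion
    loss = F.mse_loss(pred_x0, x0)            # target is x0, not the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


x0 = torch.randn(B, FRAMES, FEAT)             # dummy "clean motion"
action = torch.randint(0, NUM_ACTIONS, (B,))
loss_value = train_step(x0, action)
print(f"loss: {loss_value:.4f}")
```

Note that the loss compares the prediction with the clean motion `x0`, which matches the `pred_x₀` target in the diagram.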
action_class ──► Embedding ──► [B, 1, 512] ─┐
t ──► Linear → SiLU → Linear ──► [B, 1, 512] ─┤ torch.cat ──► [B, F+2, 512]
x_t ──► Linear ──► [B, F, 512] ─┘
│
PositionalEncoding
│
TransformerEncoder (8 layers)
│
remove first 2 tokens ──► [B, F, 512]
│
Linear ──► [B, F, J×3]
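The token layout in the diagram can be checked with a shape-level sketch. `TinyMDM` below is a hypothetical class written for this illustration, not the `MDM` class in `model.py`. It assumes the timestep is fed to the MLP as a normalized scalar, uses a standard sinusoidal positional encoding, and defaults to a narrower, shallower Transformer than the 512-wide, 8-layer one in the diagram so it runs quickly.

```python
import math

import torch
import torch.nn as nn


class TinyMDM(nn.Module):
    """Shape-level sketch of the diagram; sizes and details are assumptions."""

    def __init__(self, feat_dim=72, num_actions=12, d_model=64,
                 n_layers=2, n_heads=4, num_steps=1000, max_len=512):
        super().__init__()
        self.num_steps = num_steps
        self.action_emb = nn.Embedding(num_actions, d_model)
        self.time_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        self.in_proj = nn.Linear(feat_dim, d_model)

        # Fixed sinusoidal positional encoding, stored as a non-trainable buffer
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))

        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, x_t, t, action):
        a_tok = self.action_emb(action).unsqueeze(1)           # [B, 1, D]
        t_norm = t.float().unsqueeze(-1) / self.num_steps      # [B, 1]
        t_tok = self.time_mlp(t_norm).unsqueeze(1)             # [B, 1, D]
        m_tok = self.in_proj(x_t)                              # [B, F, D]
        tokens = torch.cat([a_tok, t_tok, m_tok], dim=1)       # [B, F+2, D]
        tokens = tokens + self.pe[:, : tokens.shape[1]]        # add positions
        hidden = self.encoder(tokens)                          # [B, F+2, D]
        return self.out_proj(hidden[:, 2:])                    # drop 2 cond tokens


model = TinyMDM().eval()
x_t = torch.randn(2, 16, 72)              # [B, F, J*3] with J = 24 (assumed)
t = torch.randint(0, 1000, (2,))
action = torch.tensor([3, 3])             # action 3 for both samples
with torch.no_grad():
    pred_x0 = model(x_t, t, action)
print(pred_x0.shape)                      # torch.Size([2, 16, 72])
```

Prepending the two conditioning tokens lets every motion frame attend to the action and the timestep through ordinary self-attention; slicing them off afterwards restores the original frame count.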
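MDM is built on DDPM. The forward process adds Gaussian noise to clean motion $x_0$ in closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$ and $\epsilon \sim \mathcal{N}(0, I)$.

The model predicts the clean motion directly rather than the noise, and is trained with a mean-squared error against $x_0$:

$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\lVert x_0 - \hat{x}_0 \rVert^2\right]$$

where $\hat{x}_0$ is the output of MDM.forward(x_t, t, action).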
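To get a feel for how the cumulative product $\bar{\alpha}_t$ controls the mix of signal and noise, the sketch below evaluates a linear beta schedule using only the standard library. The number of steps and the beta range (1e-4 to 0.02) are assumptions taken from common DDPM defaults, not values read from `scheduler.py`, and `add_noise_scalar` is a hypothetical single-coordinate helper.

```python
import math

T = 1000                             # number of diffusion steps (assumed)
beta_start, beta_end = 1e-4, 0.02    # assumed linear schedule endpoints

# beta_s grows linearly from beta_start to beta_end
betas = [beta_start + (beta_end - beta_start) * s / (T - 1) for s in range(T)]

# alpha_bar_t is the running product of (1 - beta_s)
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def add_noise_scalar(x0, eps, t):
    """Noised value for one coordinate; t is a 0-based index into alpha_bar."""
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps


# Signal and noise coefficients at a few timesteps
for t in (0, 249, 499, 999):
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    print(f"step {t + 1:4d}: signal={signal:.4f} noise={noise:.4f}")
```

The squared coefficients always sum to one, so the early steps are almost pure signal and the last step is almost pure noise under these assumed values.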
mdm-scratch/
├── model.py # MDM model: Transformer + PositionalEncoding
├── scheduler.py # NoiseScheduler: linear beta schedule, add_noise(), step()
├── train.py # Full training loop on HumanAct12Poses
├── sample.py # Inference: load checkpoint and generate motion
├── visualise.py # 3D skeleton visualization → animated GIF (matplotlib)
├── README.md # This file (English)
├── README_ja.md # Japanese version
├── examples/
│ ├── train_step.py # Demo: single training step with dummy data
│ └── sample_step.py # Demo: single sampling pass (reverse diffusion)
├── tests/
│ ├── test_model.py # Unit tests for MDM
│ └── test_scheduler.py # Unit tests for NoiseScheduler
├── .github/workflows/
│ └── test.yml # GitHub Actions CI: runs pytest on push
├── docs/
│ ├── decisions.md # Architecture Decision Records (English)
│ └── decisions_ja.md # Architecture Decision Records (Japanese)
└── assets/
    └── demo_jump.gif # Generated motion demo (action 3: jump)
# 1. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 2. Install PyTorch (CPU is fine for this step)
pip install torch
# 3. Run one training step (smoke test)
python examples/train_step.py
# 4. Run full training on HumanAct12Poses
python train.py
# 5. Generate motion from a trained checkpoint
python sample.py --checkpoint checkpoints/mdm_final.pth --action_id 3
# 6. Visualize generated motion as an animated GIF
python visualise.py --input output/generated_action3_samples4.npy --output assets/demo.gif --title "MDM - Jump"
# 7. Run unit tests
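pytest tests/ -v
Expected output (train.py). The script logs in Japanese: the first line means "Training started" and the last line means "Training complete. Model saved to checkpoints/mdm_final.pth.":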
--- トレーニング開始 (device: cuda) ---
Epoch 1/30000, Loss: 0.5234 MSE: 0.0476 Vel: 0.0476
...
Epoch 5000/30000, Loss: 0.0312 MSE: 0.0028 Vel: 0.0028
-> checkpoint: checkpoints/mdm_epoch5000.pth
...
トレーニング完了。モデルを checkpoints/mdm_final.pth に保存しました。
This implementation covers the core pipeline only: training, sampling, and basic 3D skeleton visualization.
| Feature | This repo | reference/ |
|---|---|---|
| Transformer-based denoising model | ✅ | ✅ |
| Action-conditioned generation | ✅ | ✅ |
| Forward diffusion (add_noise) | ✅ | ✅ |
| Reverse diffusion (sampling loop) | ✅ | ✅ |
| Full training loop with real data | ✅ (HumanAct12Poses) | ✅ |
| Unit tests + CI (GitHub Actions) | ✅ | ❌ |
| Text conditioning (CLIP / BERT) | ❌ | ✅ |
| Large-scale datasets (HumanML3D, KIT) | ❌ | ✅ |
| Evaluation metrics (FID, R-Precision) | ❌ | ✅ |
| 3D skeleton visualization (matplotlib) | ✅ | ❌ |
| SMPL mesh rendering | ❌ | ✅ |
Design decisions and trade-offs are documented in docs/decisions.md. Feature comparison is against the official implementation.
@inproceedings{tevet2023human,
title = {Human Motion Diffusion Model},
author = {Guy Tevet and Sigal Raab and Brian Gordon and Yoni Shafir
and Daniel Cohen-Or and Amit Haim Bermano},
booktitle = {The Eleventh International Conference on Learning Representations},
year = {2023},
url = {https://openreview.net/forum?id=SJ1kSyO2jwu}
}