Said-Ay/mdm-scratch

MDM from Scratch

Minimal PyTorch re-implementation of "Human Motion Diffusion Model" (Tevet et al., ICLR 2023), built from scratch for learning purposes.

The full official implementation is at GuyTevet/motion-diffusion-model.


Demo

MDM jump demo

Action 3 (jump), trained on HumanAct12Poses for 30,000 epochs on a single RTX 2080 Ti.


What Is This?

MDM generates human motion sequences (e.g., a person walking forward) with a diffusion model, a generative approach that learns to reverse a process that gradually corrupts data with noise.

This repository is a minimal, from-scratch implementation that isolates the core mechanics: the noise scheduler, the Transformer-based denoising model, and the training loop. It intentionally omits CLIP/BERT text encoding, large-scale datasets (HumanML3D, KIT), and evaluation pipelines so the essential ideas stay readable.


Architecture

The three components and how they connect:

graph LR
    PE["PositionalEncoding\n(model.py)"]
    MDM["MDM\n(model.py)"]
    NS["NoiseScheduler\n(scheduler.py)"]
    TS["train_step()\n(train_step.py)"]

    PE --> MDM
    TS -->|"① add_noise(x₀, ε, t) → x_t"| NS
    NS -->|"x_t"| TS
    TS -->|"② forward(x_t, t, action) → pred_x₀"| MDM
    MDM -->|"pred_x₀"| TS

Data flow — one training step

graph LR
    x0["x₀ · Clean motion\n[B, F, J×3]"]
    eps["ε ~ N(0, I)\n[B, F, J×3]"]
    xt["x_t · Noisy motion\n[B, F, J×3]"]
    pred["pred_x₀\n[B, F, J×3]"]
    loss["MSE Loss"]
    optim["Adam update"]

    x0 --> xt
    eps --> xt
    xt -->|"MDM.forward"| pred
    pred --> loss
    x0 -->|"target"| loss
    loss --> optim
    optim -->|"∇θ"| pred

Inside MDM.forward()

action_class ──► Embedding          ──► [B, 1, 512] ─┐
t            ──► Linear → SiLU → Linear ──► [B, 1, 512] ─┤ torch.cat ──► [B, F+2, 512]
x_t          ──► Linear             ──► [B, F, 512] ─┘
                                                         │
                                              PositionalEncoding
                                                         │
                                         TransformerEncoder (8 layers)
                                                         │
                              remove first 2 tokens  ──► [B, F, 512]
                                                         │
                                              Linear  ──► [B, F, J×3]
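
The token layout in the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's `model.py`: the class name `ForwardSketch`, the scalar timestep input to the MLP, the sinusoidal positions, `nhead=4`, and the sizes `feat_dim=72` and `num_actions=12` are all assumptions made for the example. Only the layer order, the 512-dimensional tokens, and the 8 encoder layers come from the diagram.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_positions(length: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position table of shape [length, d_model] (d_model must be even)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(length, d_model)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table


class ForwardSketch(nn.Module):
    """Hypothetical stand-in that mirrors the diagram; NOT the repository's MDM class."""

    def __init__(self, feat_dim=72, num_actions=12, d_model=512, num_layers=8, nhead=4):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, d_model)
        # Assumption: the timestep enters the MLP as a single normalized scalar.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, x_t, t, action):
        a_tok = self.action_emb(action).unsqueeze(1)                        # [B, 1, d]
        t_tok = self.time_mlp(t.float().view(-1, 1) / 1000.0).unsqueeze(1)  # [B, 1, d]
        m_tok = self.in_proj(x_t)                                           # [B, F, d]
        tokens = torch.cat([a_tok, t_tok, m_tok], dim=1)                    # [B, F+2, d]
        tokens = tokens + sinusoidal_positions(tokens.shape[1], tokens.shape[2]).to(tokens.device)
        hidden = self.encoder(tokens)                                       # [B, F+2, d]
        return self.out_proj(hidden[:, 2:])                                 # drop 2 condition tokens


# Reduced sizes keep the demo light; the defaults follow the diagram (512-dim, 8 layers).
model = ForwardSketch(d_model=64, num_layers=2).eval()
x_t = torch.randn(2, 16, 72)
t = torch.randint(0, 1000, (2,))
action = torch.randint(0, 12, (2,))
with torch.no_grad():
    pred_x0 = model(x_t, t, action)
print(pred_x0.shape)  # torch.Size([2, 16, 72])
```

Prepending the action and timestep as two extra tokens lets every motion frame attend to the conditioning through ordinary self-attention, which is why the first two outputs are discarded before the final projection.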

Theory in Brief

MDM is built on DDPM. The forward process adds Gaussian noise to clean motion $x_0$ step by step. In closed form, the noisy motion at any timestep $t$ can be sampled directly:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \mathbf{I})$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$ and $\beta_s$ follows a linear schedule from $0.0001$ to $0.02$ over 1000 steps.
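
To make the schedule arithmetic concrete, here is a plain-Python sketch of the linear schedule and the closed-form forward sample for a single scalar coordinate. The helper names `linear_betas`, `cumulative_alpha_bars`, and `noisy_value` are hypothetical and chosen for this example; they are not the API of `scheduler.py`. It also assumes the 1000 betas are evenly spaced with both endpoints included.

```python
import math


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (list index 0 holds beta_1)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bars(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), built as a running product."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noisy_value(x0, eps, alpha_bar_t):
    """Closed-form forward sample x_t for one scalar coordinate."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps


betas = linear_betas()
alpha_bars = cumulative_alpha_bars(betas)

# Show how the signal and noise coefficients trade off over the schedule.
for t in (1, 250, 500, 1000):
    a_bar = alpha_bars[t - 1]  # timestep t is stored at list index t - 1
    print(f"t={t:4d}  signal={math.sqrt(a_bar):.4f}  noise={math.sqrt(1 - a_bar):.4f}")
```

By the last step the signal coefficient is close to zero, so $x_t$ is almost pure Gaussian noise, which is what lets sampling start from noise alone.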

The model $f_\theta$ is trained to recover the clean motion from the noisy input. The training objective is:

$$\mathcal{L} = \mathbb{E}_{x_0,\, t,\, \varepsilon}\!\left[\left\| x_0 - f_\theta(x_t, t, a) \right\|^2\right]$$

where $a$ is the action condition. See docs/decisions.md for why $x_0$-prediction was chosen over noise-prediction.
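
One optimisation step under this objective might look like the sketch below. It is an illustration under stated assumptions rather than the code in `train.py` or `examples/train_step.py`: `TinyDenoiser` is a hypothetical small MLP standing in for $f_\theta$, and the tensor sizes, the learning rate, and the uniform sampling of $t$ are assumed for the example. Timesteps are 0-indexed here, so index `t` holds $\bar{\alpha}_{t+1}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, F, D = 4, 16, 72                 # batch, frames, J*3 (24 joints assumed)
NUM_STEPS, NUM_ACTIONS = 1000, 12   # 12 action classes assumed

# Linear beta schedule and its cumulative product, as defined above.
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)


class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for f_theta; NOT the repository's MDM class."""

    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_ACTIONS, 32)
        self.net = nn.Sequential(nn.Linear(D + 32 + 1, 128), nn.SiLU(), nn.Linear(128, D))

    def forward(self, x_t, t, action):
        cond = self.action_emb(action)                    # [B, 32]
        t_feat = (t.float() / NUM_STEPS).unsqueeze(-1)    # [B, 1]
        ctx = torch.cat([cond, t_feat], dim=-1)           # [B, 33]
        ctx = ctx.unsqueeze(1).expand(-1, x_t.shape[1], -1)  # broadcast to every frame
        return self.net(torch.cat([x_t, ctx], dim=-1))    # [B, F, D]


def training_step(model, optimizer, x0, action):
    """Sample t and eps, build x_t in closed form, regress the prediction onto x0."""
    t = torch.randint(0, NUM_STEPS, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    pred_x0 = model(x_t, t, action)
    loss = nn.functional.mse_loss(pred_x0, x0)   # x0-prediction: the target is the clean motion
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x0 = torch.randn(B, F, D)                        # dummy "clean motion"
action = torch.randint(0, NUM_ACTIONS, (B,))
loss_value = training_step(model, optimizer, x0, action)
print(f"loss after one step: {loss_value:.4f}")
```

The only difference from noise-prediction training is the regression target: the loss compares the model output with `x0` instead of `eps`.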


File Structure

mdm-scratch/
├── model.py          # MDM model: Transformer + PositionalEncoding
├── scheduler.py      # NoiseScheduler: linear beta schedule, add_noise(), step()
├── train.py          # Full training loop on HumanAct12Poses
├── sample.py         # Inference: load checkpoint and generate motion
├── visualise.py      # 3D skeleton visualization → animated GIF (matplotlib)
├── README.md         # This file (English)
├── README_ja.md      # Japanese version
├── examples/
│   ├── train_step.py    # Demo: single training step with dummy data
│   └── sample_step.py   # Demo: single sampling pass (reverse diffusion)
├── tests/
│   ├── test_model.py      # Unit tests for MDM
│   └── test_scheduler.py  # Unit tests for NoiseScheduler
├── .github/workflows/
│   └── test.yml      # GitHub Actions CI: runs pytest on push
├── docs/
│   ├── decisions.md     # Architecture Decision Records (English)
│   └── decisions_ja.md  # Architecture Decision Records (Japanese)
└── assets/
    └── demo_jump.gif    # Generated motion demo (action 3: jump)

Quick Start

# 1. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 2. Install PyTorch (CPU is fine for this step)
pip install torch

# 3. Run one training step (smoke test)
python examples/train_step.py

# 4. Run full training on HumanAct12Poses
python train.py

# 5. Generate motion from a trained checkpoint
python sample.py --checkpoint checkpoints/mdm_final.pth --action_id 3

# 6. Visualize generated motion as an animated GIF
python visualise.py --input output/generated_action3_samples4.npy --output assets/demo.gif --title "MDM - Jump"

# 7. Run unit tests
pytest tests/ -v

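Expected output (train.py). The log messages are printed in Japanese: the first line reads "Training started (device: cuda)" and the last reads "Training complete. Model saved to checkpoints/mdm_final.pth.":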

--- トレーニング開始 (device: cuda) ---
Epoch 1/30000, Loss: 0.5234  MSE: 0.0476  Vel: 0.0476
...
Epoch 5000/30000, Loss: 0.0312  MSE: 0.0028  Vel: 0.0028
  -> checkpoint: checkpoints/mdm_epoch5000.pth
...
トレーニング完了。モデルを checkpoints/mdm_final.pth に保存しました。

Scope

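This implementation covers the core diffusion pipeline (training, sampling, and basic visualization) and leaves out text conditioning, large-scale datasets, and evaluation.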

| Feature | This repo | reference/ |
|---|---|---|
| Transformer-based denoising model | | |
| Action-conditioned generation | | |
| Forward diffusion (add_noise) | | |
| Reverse diffusion (sampling loop) | | |
| Full training loop with real data | ✅ (HumanAct12Poses) | |
| Unit tests + CI (GitHub Actions) | | |
| Text conditioning (CLIP / BERT) | | |
| Large-scale datasets (HumanML3D, KIT) | | |
| Evaluation metrics (FID, R-Precision) | | |
| 3D skeleton visualization (matplotlib) | | |
| SMPL mesh rendering | | |

Design decisions and trade-offs are documented in docs/decisions.md. Feature comparison is against the official implementation.


Reference

@inproceedings{tevet2023human,
  title     = {Human Motion Diffusion Model},
  author    = {Guy Tevet and Sigal Raab and Brian Gordon and Yoni Shafir
               and Daniel Cohen-Or and Amit Haim Bermano},
  booktitle = {The Eleventh International Conference on Learning Representations},
  year      = {2023},
  url       = {https://openreview.net/forum?id=SJ1kSyO2jwu}
}
