PostTrain-Lab

A research and educational repo for exploring, implementing, and eventually inventing post-training techniques for LLMs.

Everything runs on Tinker — a hosted training API that handles the GPU side, so experiments stay clean Python notebooks.

The Goal

Post-training is where a pre-trained model becomes actually useful. This lab exists to:

Master the foundation — implement every known post-training methodology from scratch, understand what each one does and when it works
Invent new pipelines — once the fundamentals are solid, combine and extend them in ways that haven't been tried

Post-Training Methodologies

There are four core approaches. Everything else (math RL, code RL, tool-use RL) is one of these with a different dataset or reward signal.

1. Supervised Fine-Tuning (SFT)

Standard next-token prediction on labeled data. The simplest and most common form of post-training. Teaches a model new behaviors by showing it examples.
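To make the objective concrete, here is a minimal sketch of an SFT loss in plain Python. It is not taken from the notebooks in this repo. It assumes per-token log-probabilities are already computed and that prompt tokens are masked out so only the response is trained on, which is a common convention rather than something this README specifies. The names `sft_loss`, `token_logprobs`, and `loss_mask` are hypothetical.

```python
def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the tokens selected by loss_mask.

    token_logprobs: log p(token_t | tokens_<t) for each target token.
    loss_mask: 1 for tokens to learn from (the response), 0 otherwise (the prompt).
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have the same length")
    n_trained = sum(loss_mask)
    if n_trained == 0:
        raise ValueError("loss_mask selects no tokens")
    # Sum the negative log-probabilities of the unmasked tokens only.
    total = -sum(lp * m for lp, m in zip(token_logprobs, loss_mask))
    return total / n_trained


# Hypothetical example: two prompt tokens (masked) and three response tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logprobs, mask)
```

Only the three response tokens contribute here, so a poorly predicted prompt does not affect the gradient.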

2. Reinforcement Learning (RL)

The model generates outputs, and a reward signal tells it what was good. The reward can come from anything — a verifier, a code executor, an LLM judge, a math checker. The model learns to maximize that reward.
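As a rough illustration of "a reward signal tells it what was good", the sketch below scores a group of sampled completions with an exact-match checker and converts the rewards into advantages by subtracting the group mean. Both the checker and the mean baseline are assumptions chosen for simplicity, not the method used in this repo's notebooks, and the function names are hypothetical.

```python
def exact_match_reward(completion: str, expected: str) -> float:
    """Verifier-style reward: 1.0 if the last line equals the expected answer."""
    lines = completion.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == expected.strip() else 0.0


def mean_baseline_advantages(rewards):
    """Subtract the group mean so above-average samples get positive weight."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Hypothetical group of four sampled completions for the prompt "What is 6 * 7?"
samples = ["6 * 7 = 42\n42", "I think it is\n41", "42", "The answer is 42"]
rewards = [exact_match_reward(s, "42") for s in samples]
advantages = mean_baseline_advantages(rewards)
```

A policy-gradient update would then raise the log-probability of completions with positive advantage and lower it for the rest. Note that the last sample gets zero reward even though it contains the right number, which shows how much the choice of checker shapes what the model learns.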

3. Preference Learning

Align the model with what humans (or AI) prefer, rather than with labeled examples alone.

DPO — learns directly from (chosen, rejected) pairs. No reward model needed.
RLHF — first trains a reward model on preferences, then runs RL against it.
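For the DPO item above, the per-pair loss can be sketched in a few lines. This assumes you already have the summed log-probability of each response under both the policy being trained and a frozen reference model. The function name, argument names, and the `beta=0.1` default are illustrative choices, not values from this repo.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical numbers: summed log-probs under the policy and the reference.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_better = dpo_loss(-9.0, -13.0, -10.0, -12.0)    # policy favors chosen more
```

When the policy equals the reference the loss is log 2, and it falls as the policy shifts probability toward the chosen response relative to the reference. No separate reward model appears anywhere in the computation.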

4. Distillation

Transfer knowledge from a stronger teacher model into a student model.

On-policy — student generates, teacher scores/corrects, student learns from that
Off-policy — student learns from pre-collected teacher outputs
SDFT (Self-Distillation) — no teacher needed; model distills from itself using forward KL
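Distillation objectives are usually written as a KL divergence between teacher and student next-token distributions, and the direction matters. The sketch below computes both directions for a single toy distribution. The three-token vocabulary and the probabilities are made up for illustration, and the helper assumes the second distribution has no zero entries.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

forward_kl = kl_divergence(teacher, student)  # expectation under the teacher
reverse_kl = kl_divergence(student, teacher)  # expectation under the student
```

Forward KL weights errors by where the teacher puts probability mass, while reverse KL weights them by where the student does. Since the two directions give different values, the choice between them is a design decision in any distillation setup.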

Roadmap

Phase 1 — Foundation

Implement every major methodology as a clean, runnable notebook. Each notebook should be self-contained: dataset, training, and a clear before/after comparison.

See PLAN.md for the full detailed plan.

| Method | Notebook | Status |
| --- | --- | --- |
| Supervised Fine-Tuning (SFT) | `methods/sft/sft_train.ipynb` | ✅ Done |
| RL with Verifiable Rewards (RLVR) | `methods/rlvr/rlvr_train.ipynb` | ✅ Done |
| Direct Preference Optimization (DPO) | `methods/dpo/dpo_train.ipynb` | 📋 Planned |
| Rubric-based RL | `methods/rubric_rl/rubric_rl_train.ipynb` | 📋 Planned |
| RLHF (reward model + RL) | `methods/rlhf/rlhf_train.ipynb` | 📋 Planned |
| On-policy Distillation | `methods/distillation/on_policy.ipynb` | 📋 Planned |
| Off-policy Distillation | `methods/distillation/off_policy.ipynb` | 📋 Planned |
| Self-Distillation (SDFT) | `methods/sdft/sdft_train.ipynb` | 📋 Planned |
| Multi-Agent RL (MARL) | `methods/marl/marl_train.ipynb` | 📋 Planned |

Phase 2 — Stacking & Invention

Methods don't just compete — some are designed to stack. Phase 2 runs these known combinations as controlled experiments:

| Stack | Idea |
| --- | --- |
| SFT → RL | warm-start RL with SFT instead of cold start |
| SFT → DPO | align to domain first, then refine preferences |
| SFT → RLHF | the classic 3-stage pipeline |
| Distill → RL | teacher sets the floor, RL removes the ceiling |
| SFT → Distill → RL | the full stack |
| RLVR → Rubric RL | correctness first, reasoning quality second |
| Iterative RLHF | loop: RL → collect outputs → retrain reward model → RL again |

Plus open-ended invention: multi-objective RL, curriculum training, cross-method benchmarks on the same task.

See PLAN.md for the full breakdown.
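One way to picture a stack from the Phase 2 table is as a chain of stages, each taking a checkpoint and returning a new one. The sketch below uses placeholder stages that only record which method ran. No training happens, and none of these names come from this repo or from the Tinker API.

```python
from typing import Callable, List

Stage = Callable[[str], str]  # takes a checkpoint name, returns a new one


def make_stage(name: str) -> Stage:
    """Placeholder stage: appends its name instead of training anything."""
    def run(checkpoint: str) -> str:
        return f"{checkpoint}+{name}"
    return run


def run_stack(base: str, stages: List[Stage]) -> str:
    """Feed each stage's output checkpoint into the next stage."""
    checkpoint = base
    for stage in stages:
        checkpoint = stage(checkpoint)
    return checkpoint


# The "SFT → Distill → RL" row, expressed as an ordered list of stages.
final = run_stack("base", [make_stage("sft"), make_stage("distill"), make_stage("rl")])
```

Keeping each stage behind the same checkpoint-in, checkpoint-out interface makes it easy to swap one method for another while holding the rest of the stack fixed, which is what a controlled experiment needs.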

Project Structure

PostTrain-Lab/
├── methods/                # One folder per methodology
│   ├── sft/               # Supervised Fine-Tuning
│   ├── rlvr/              # RL with Verifiable Rewards
│   ├── dpo/               # Direct Preference Optimization
│   ├── rlhf/              # RLHF pipeline
│   ├── distillation/      # Knowledge Distillation
│   ├── rubric_rl/         # Rubric-based RL
│   ├── sdft/              # Self-Distillation Fine-Tuning
│   └── multiagent_rl/     # Multi-Agent RL
│
└── tinker-cookbook/        # Reference implementations (from Thinking Machines)

Getting Started

git clone https://github.com/yourusername/PostTrain-Lab.git
cd PostTrain-Lab
pip install -r requirements.txt
export TINKER_API_KEY="<your-key>"

Start with SFT:

jupyter notebook methods/sft/sft_train.ipynb

Infrastructure

Built on Tinker — training loops run locally in Python, GPU work runs remotely via the Tinker API.

Acknowledgments

Thinking Machines for the $5k grant
The tinker-cookbook for reference implementations
