Turi-Labs/PostTrain-Lab


PostTrain-Lab

Python 3.11+ · License: MIT

A research and educational repo for exploring, implementing, and eventually inventing post-training techniques for LLMs.

Everything runs on Tinker — a hosted training API that handles the GPU side, so experiments stay clean Python notebooks.


The Goal

Post-training is where a pre-trained model becomes actually useful. This lab exists to:

  1. Master the foundation — implement every known post-training methodology from scratch, understand what each one does and when it works
  2. Invent new pipelines — once the fundamentals are solid, combine and extend them in ways that haven't been tried

Post-Training Methodologies

There are four core approaches. Everything else (math RL, code RL, tool-use RL) is one of these with a different dataset or reward signal.

1. Supervised Fine-Tuning (SFT)

Standard next-token prediction on labeled data. The simplest and most common form of post-training. Teaches a model new behaviors by showing it examples.
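As a rough illustration of the next-token objective described above, the sketch below computes the mean negative log-likelihood over a toy sequence in plain Python. The function names (`softmax`, `next_token_loss`) and the toy logits are assumptions made for this example; they are not taken from the repo's notebooks or from the Tinker API.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood of each token given the tokens before it.

    logits_per_position[t] holds the scores used to predict token_ids[t + 1],
    so there is one fewer prediction than there are tokens.
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, sequence [0, 2, 1].
toy_logits = [
    [0.1, 0.2, 3.0],  # after token 0 the model favors token 2 (the true next token)
    [0.0, 2.5, 0.3],  # after token 2 the model favors token 1 (the true next token)
]
loss = next_token_loss(toy_logits, [0, 2, 1])
```

Because the toy model puts most of its probability on the correct tokens, the loss comes out well below `math.log(3)`, the value a uniform guess over three tokens would give.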

2. Reinforcement Learning (RL)

The model generates outputs, and a reward signal tells it what was good. The reward can come from anything — a verifier, a code executor, an LLM judge, a math checker. The model learns to maximize that reward.
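The loop described above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: `math_checker_reward` is a hypothetical verifier, and centering rewards on the group mean is one common baseline choice assumed here for the example.

```python
def math_checker_reward(completion: str, expected_answer: str) -> float:
    """Hypothetical verifier: 1.0 if the last line equals the answer, else 0.0."""
    lines = completion.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == expected_answer else 0.0

def centered_advantages(rewards):
    """Subtract the group mean so above-average samples get positive weight.

    Assumption for this sketch: a simple mean baseline over one group of samples.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def policy_gradient_loss(sample_logprobs, advantages):
    """REINFORCE-style surrogate loss.

    Minimizing it raises the log-probability of samples with positive
    advantage and lowers it for samples with negative advantage.
    """
    weighted = [-(a * lp) for lp, a in zip(sample_logprobs, advantages)]
    return sum(weighted) / len(weighted)

# Four sampled completions for the prompt "What is 2 + 2?" (toy data).
completions = ["2 + 2\n4", "2 + 2\n5", "4", "I think it is\nfour"]
rewards = [math_checker_reward(c, "4") for c in completions]
advantages = centered_advantages(rewards)
```

Swapping `math_checker_reward` for a code executor or an LLM judge changes only the reward function; the rest of the loop stays the same.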

3. Preference Learning

Aligns the model with what humans (or an AI judge) prefer, rather than with labeled examples alone.

  • DPO — learns directly from (chosen, rejected) pairs. No reward model needed.
  • RLHF — first trains a reward model on preferences, then runs RL against it.
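The two objectives above can be written down compactly. The sketch below shows the standard DPO loss for a single (chosen, rejected) pair and the pairwise loss typically used to train an RLHF reward model. The function names and the default `beta=0.1` are assumptions for illustration, and the inputs are summed sequence log-probabilities that a real pipeline would obtain from the policy and a frozen reference model.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry) loss: push the chosen reward above the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)
```

When the policy equals the reference model, `dpo_loss` evaluates to `log(2)`; it drops below that once the policy raises the chosen response's likelihood relative to the rejected one.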

4. Distillation

Transfer knowledge from a stronger teacher model into the student.

  • On-policy — student generates, teacher scores/corrects, student learns from that
  • Off-policy — student learns from pre-collected teacher outputs
  • SDFT (Self-Distillation) — no teacher needed; model distills from itself using forward KL
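The variants above differ partly in which divergence is minimized and whose samples it is averaged over. As a small illustration, the sketch below computes both KL directions between a toy teacher and student next-token distribution. The distributions are made-up numbers, and which direction a given method uses is a design choice not settled by this example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

# Forward KL: expectation taken under the teacher's distribution.
forward_kl = kl_divergence(teacher, student)

# Reverse KL: expectation taken under the student's distribution.
reverse_kl = kl_divergence(student, teacher)
```

The two values differ because KL divergence is not symmetric. Forward KL penalizes the student for missing tokens the teacher considers likely, while reverse KL penalizes the student for placing mass where the teacher places little.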

Roadmap

Phase 1 — Foundation

Implement every major methodology as a clean, runnable notebook. Each notebook should be self-contained: dataset, training, and a clear before/after comparison.

See PLAN.md for the full detailed plan.

| Method | Notebook | Status |
| --- | --- | --- |
| Supervised Fine-Tuning (SFT) | methods/sft/sft_train.ipynb | ✅ Done |
| RL with Verifiable Rewards (RLVR) | methods/rlvr/rlvr_train.ipynb | ✅ Done |
| Direct Preference Optimization (DPO) | methods/dpo/dpo_train.ipynb | 📋 Planned |
| Rubric-based RL | methods/rubric_rl/rubric_rl_train.ipynb | 📋 Planned |
| RLHF (reward model + RL) | methods/rlhf/rlhf_train.ipynb | 📋 Planned |
| On-policy Distillation | methods/distillation/on_policy.ipynb | 📋 Planned |
| Off-policy Distillation | methods/distillation/off_policy.ipynb | 📋 Planned |
| Self-Distillation (SDFT) | methods/sdft/sdft_train.ipynb | 📋 Planned |
| Multi-Agent RL (MARL) | methods/marl/marl_train.ipynb | 📋 Planned |

Phase 2 — Stacking & Invention

Methods don't just compete — some are designed to stack. Phase 2 runs these known combinations as controlled experiments:

| Stack | Idea |
| --- | --- |
| SFT → RL | warm-start RL with SFT instead of cold start |
| SFT → DPO | align to domain first, then refine preferences |
| SFT → RLHF | the classic 3-stage pipeline |
| Distill → RL | teacher sets the floor, RL removes the ceiling |
| SFT → Distill → RL | the full stack |
| RLVR → Rubric RL | correctness first, reasoning quality second |
| Iterative RLHF | loop: RL → collect outputs → retrain reward model → RL again |
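One way to treat these stacks as controlled experiments is to describe each as an ordered list of stages and thread a checkpoint through them. The sketch below is purely illustrative: `make_stage`, `STACKS`, and `run_stack` are hypothetical names, and each stage only tags a checkpoint string instead of training anything.

```python
from typing import Callable, Dict, List

Stage = Callable[[str], str]  # takes a checkpoint name, returns a new one

def make_stage(name: str) -> Stage:
    """Hypothetical stand-in for a training stage; it only tags the checkpoint."""
    return lambda checkpoint: f"{checkpoint}+{name}"

# Each stack is an ordered list of stage names (subset of the combinations above).
STACKS: Dict[str, List[str]] = {
    "sft_rl": ["sft", "rl"],
    "sft_dpo": ["sft", "dpo"],
    "distill_rl": ["distill", "rl"],
    "sft_distill_rl": ["sft", "distill", "rl"],
    "rlvr_rubric_rl": ["rlvr", "rubric_rl"],
}

def run_stack(stack_name: str, base_checkpoint: str = "base") -> str:
    """Apply each stage in order, feeding every stage the previous checkpoint."""
    checkpoint = base_checkpoint
    for stage_name in STACKS[stack_name]:
        checkpoint = make_stage(stage_name)(checkpoint)
    return checkpoint

final_checkpoint = run_stack("sft_distill_rl")
```

Keeping the stage order in data makes it easy to run every stack from the same base checkpoint and compare results on a single task.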

Plus open-ended invention: multi-objective RL, curriculum training, cross-method benchmarks on the same task.

See PLAN.md for the full breakdown.


Project Structure

PostTrain-Lab/
├── methods/                # One folder per methodology
│   ├── sft/               # Supervised Fine-Tuning
│   ├── rlvr/              # RL with Verifiable Rewards
│   ├── dpo/               # Direct Preference Optimization
│   ├── rlhf/              # RLHF pipeline
│   ├── distillation/      # Knowledge Distillation
│   ├── rubric_rl/         # Rubric-based RL
│   ├── sdft/              # Self-Distillation Fine-Tuning
│   └── multiagent_rl/     # Multi-Agent RL
│
└── tinker-cookbook/        # Reference implementations (from Thinking Machines)

Getting Started

git clone https://github.com/Turi-Labs/PostTrain-Lab.git
cd PostTrain-Lab
pip install -r requirements.txt
export TINKER_API_KEY=<your-key>

Start with SFT:

jupyter notebook methods/sft/sft_train.ipynb

Infrastructure

Built on Tinker — training loops run locally in Python, GPU work runs remotely via the Tinker API.


Acknowledgments

About

PostTrain-Lab is a modular educational project designed to explore post-training methods for LLMs, from supervised fine-tuning to reinforcement learning and beyond.
