A research and educational repo for exploring, implementing, and eventually inventing post-training techniques for LLMs.
Everything runs on Tinker — a hosted training API that handles the GPU side, so experiments stay clean Python notebooks.
Post-training is where a pre-trained model becomes actually useful. This lab exists to:
- Master the foundation — implement every known post-training methodology from scratch, understand what each one does and when it works
- Invent new pipelines — once the fundamentals are solid, combine and extend them in ways that haven't been tried
There are 4 core approaches. Everything else (math RL, code RL, tool use RL) is just one of these with a different dataset or reward signal.
Standard next-token prediction on labeled data. The simplest and most common form of post-training. Teaches a model new behaviors by showing it examples.
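As a rough illustration of what "next-token prediction on labeled data" means as a loss, here is a minimal pure-Python sketch. It is not taken from the notebooks or from the Tinker API. The names `sft_loss` and `loss_mask` are hypothetical, and the assumption is that prompt tokens are masked out so only the response tokens contribute.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood of the target tokens.

    logits_per_position: one list of vocabulary logits per sequence position
    target_ids: the labeled next token at each position
    loss_mask: 1 where the position counts toward the loss (assumed: response
               tokens), 0 where it is ignored (assumed: prompt tokens)
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if not keep:
            continue
        probs = softmax(logits)
        total += -math.log(probs[target])
        count += 1
    return total / count if count else 0.0

# Two positions over a 4-token vocabulary; only the second position is trained on.
example_loss = sft_loss([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [1, 2], [0, 1])
```

With uniform logits the loss equals the log of the vocabulary size, which is a handy sanity check before training starts.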
The model generates outputs, and a reward signal tells it what was good. The reward can come from anything — a verifier, a code executor, an LLM judge, a math checker. The model learns to maximize that reward.
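One way to picture a reward signal is a small math checker. The sketch below is illustrative only: `math_reward` and `centered_advantages` are hypothetical helpers, not functions from this repo or from Tinker, and subtracting the group mean is just one common way to turn raw rewards into a learning signal.

```python
import re

def math_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last number in the completion equals the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0

def centered_advantages(rewards):
    """Subtract the group mean so above-average samples get a positive weight."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled completions for the same prompt, scored against the answer "42".
completions = ["The answer is 42", "I think 41.", "no idea", "So x = 42"]
rewards = [math_reward(c, "42") for c in completions]
advantages = centered_advantages(rewards)
```

Swapping `math_reward` for a code executor or an LLM judge changes the task, not the shape of the loop.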
Align the model with what humans (or an AI judge) prefer, rather than only with labeled examples.
- DPO — learns directly from (chosen, rejected) pairs. No reward model needed.
- RLHF — first trains a reward model on preferences, then runs RL against it.
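To make the "no reward model needed" point concrete, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available. The function name and the `beta=0.1` default are illustrative choices, not values taken from this repo.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the margin is zero.
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy separates chosen from rejected. RLHF reaches a similar goal indirectly, by fitting a reward model to the same pairs first.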
Transfer knowledge from a stronger teacher model into the student.
- On-policy — student generates, teacher scores/corrects, student learns from that
- Off-policy — student learns from pre-collected teacher outputs
- SDFT (Self-Distillation) — no teacher needed; model distills from itself using forward KL
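The variants above differ mainly in where the training sequences come from. The per-token objective can be sketched as a divergence between the teacher's and the student's next-token distributions. The example below shows forward KL, the direction named for SDFT above. It is a sketch under that assumption: `forward_kl` is a hypothetical helper, and other setups may use reverse KL or plain cross-entropy instead.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position.

    Forward KL weights the log-ratio by the teacher's probabilities, so the
    student is penalized for missing tokens the teacher considers likely.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give zero divergence; a mismatch gives a positive value.
same = forward_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = forward_kl([0.0, 0.0], [0.0, 5.0])
```

Averaging this quantity over positions gives a sequence-level loss. For self-distillation, the "teacher" logits would come from the same model under a different context or checkpoint, which is an assumption here rather than something the section spells out.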
Implement every major methodology as a clean, runnable notebook. Each notebook should be self-contained: dataset, training, and a clear before/after comparison.
See PLAN.md for the full detailed plan.
| Method | Notebook | Status |
|---|---|---|
| Supervised Fine-Tuning (SFT) | methods/sft/sft_train.ipynb | ✅ Done |
| RL with Verifiable Rewards (RLVR) | methods/rlvr/rlvr_train.ipynb | ✅ Done |
| Direct Preference Optimization (DPO) | methods/dpo/dpo_train.ipynb | 📋 Planned |
| Rubric-based RL | methods/rubric_rl/rubric_rl_train.ipynb | 📋 Planned |
| RLHF (reward model + RL) | methods/rlhf/rlhf_train.ipynb | 📋 Planned |
| On-policy Distillation | methods/distillation/on_policy.ipynb | 📋 Planned |
| Off-policy Distillation | methods/distillation/off_policy.ipynb | 📋 Planned |
| Self-Distillation (SDFT) | methods/sdft/sdft_train.ipynb | 📋 Planned |
| Multi-Agent RL (MARL) | methods/marl/marl_train.ipynb | 📋 Planned |
Methods don't just compete — some are designed to stack. Phase 2 runs these known combinations as controlled experiments:
| Stack | Idea |
|---|---|
| SFT → RL | warm-start RL with SFT instead of cold start |
| SFT → DPO | align to domain first, then refine preferences |
| SFT → RLHF | the classic 3-stage pipeline |
| Distill → RL | teacher sets the floor, RL removes the ceiling |
| SFT → Distill → RL | the full stack |
| RLVR → Rubric RL | correctness first, reasoning quality second |
| Iterative RLHF | loop: RL → collect outputs → retrain reward model → RL again |
Plus open-ended invention: multi-objective RL, curriculum training, cross-method benchmarks on the same task.
See PLAN.md for the full breakdown.
PostTrain-Lab/
├── methods/ # One folder per methodology
│ ├── sft/ # Supervised Fine-Tuning
│ ├── rlvr/ # RL with Verifiable Rewards
│ ├── dpo/ # Direct Preference Optimization
│ ├── rlhf/ # RLHF pipeline
│ ├── distillation/ # Knowledge Distillation
│ ├── rubric_rl/ # Rubric-based RL
│ ├── sdft/ # Self-Distillation Fine-Tuning
│ └── multiagent_rl/ # Multi-Agent RL
│
└── tinker-cookbook/ # Reference implementations (from Thinking Machines)
git clone https://github.com/yourusername/PostTrain-Lab.git
cd PostTrain-Lab
pip install -r requirements.txt
export TINKER_API_KEY=<your-key>
Start with SFT:
jupyter notebook methods/sft/sft_train.ipynb
Built on Tinker: training loops run locally in Python, GPU work runs remotely via the Tinker API.
- Thinking Machines for the $5k grant
- The tinker-cookbook for reference implementations