GRPO fine-tuning of a small open-weight LLM (Llama 3.2 3B) to play Taboo: giving a clue for a target word without using any of the forbidden words, against a frozen guesser.
The setup:
- Giver: Llama 3.2 3B Instruct (the model being trained)
- Guesser: Llama 3.1 8B Instruct (frozen)
- Verifier: rule-based string matching with morphological stemming
- Training: GRPO on Modal, 2× A100-40GB
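
To make the verifier bullet concrete, here is a minimal sketch of rule-based matching with stemming. It is an illustration only, not the project's code: `crude_stem`, `find_violations`, and the suffix list are hypothetical names and choices, and the hand-rolled suffix stripper stands in for whatever stemmer the real verifier uses. It only handles single-word forbidden terms.

```python
import re

# Assumption: a tiny suffix list standing in for a real morphological stemmer.
_SUFFIXES = ("ing", "ed", "es", "s", "ly", "er")


def crude_stem(word: str) -> str:
    """Repeatedly strip common English suffixes (hypothetical helper)."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in _SUFFIXES:
            # Keep at least three characters so short words are not destroyed.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                # Undo consonant doubling, e.g. "swimm" -> "swim".
                if word[-1] == word[-2] and word[-1] not in "aeiou":
                    word = word[:-1]
                changed = True
                break
    return word


def find_violations(clue: str, target: str, forbidden: list[str]) -> set[str]:
    """Return the banned words (target included) whose stem appears in the clue."""
    banned = {crude_stem(w): w for w in [target, *forbidden]}
    tokens = re.findall(r"[a-z]+", clue.lower())
    return {banned[crude_stem(t)] for t in tokens if crude_stem(t) in banned}


if __name__ == "__main__":
    print(find_violations("You go swimming here in summer", "pool", ["swim", "water"]))
```

Stemming both the banned words and the clue tokens with the same function is what lets "swimming" count as a use of "swim". A stripper this crude will over-match on some words, which is one reason a real stemmer is the better choice in practice.
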
With the help of Claude, I am compiling a living writeup of the project. I'll link to the chapters below as I add them:
- results/ch1/ch1.md introduces the project and describes the first round of training runs.
- results/ch2/ch2.md details ablations for improving win rate while keeping violation rate down, culminating in a checkpoint that outperformed much larger models.