jackdavidweber/taboo-rl

taboo-rl

GRPO fine-tuning of a small open-weight LLM (Llama 3.2 3B) to play Taboo: the model gives a clue for a target word without using any of the forbidden words, against a frozen guesser.
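For readers new to GRPO, the core idea is that several clues are sampled for the same target word and each clue's reward is scored relative to the rest of its group. The snippet below is a minimal sketch of that normalization step only, not this repo's training code. The function name, the `eps` constant, the use of the population standard deviation, and the 1/0 reward values in the usage example are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four clues for one card: 1.0 = guesser got the word, 0.0 = miss.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Clues that beat their group's average get a positive advantage and the rest get a negative one. If every clue in a group earns the same reward, all advantages are zero and that group contributes no learning signal.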

The setup:

  • Giver: Llama 3.2 3B Instruct (the model being trained)
  • Guesser: Llama 3.1 8B Instruct (frozen)
  • Verifier: rule-based string matching with morphological stemming
  • Training: GRPO on Modal, 2× A100-40GB
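To make the verifier bullet concrete, here is a rough sketch of what rule-based matching with stemming can look like. It is not the repo's verifier: `naive_stem`, `find_violations`, and the suffix list are hypothetical, and the crude suffix stripping stands in for whatever morphological stemmer the project actually uses. It also assumes single-word forbidden entries.

```python
import re

# Hypothetical suffix list; a real stemmer handles far more cases.
_SUFFIXES = ("ing", "ed", "es", "s", "ly", "er")


def naive_stem(word: str) -> str:
    """Lowercase a word and strip at most one common English suffix."""
    word = word.lower()
    for suffix in _SUFFIXES:
        # Keep at least three characters so short words are not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_violations(clue: str, target: str, forbidden: list[str]) -> set[str]:
    """Return the banned words (target included) whose stem appears in the clue."""
    banned = {naive_stem(w): w for w in [target, *forbidden]}
    clue_stems = {naive_stem(tok) for tok in re.findall(r"[a-z]+", clue.lower())}
    return {banned[stem] for stem in clue_stems if stem in banned}


# Hypothetical card: target "sea", forbidden words "ocean", "water", "salt".
print(find_violations("Large bodies of salty water, like oceans", "sea", ["ocean", "water", "salt"]))
```

In this example "oceans" is caught through its stem, while "salty" slips past the toy suffix list. That gap is why a fuller morphological stemmer matters for a Taboo verifier.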

Results

With the help of Claude, I am compiling a living writeup of the project. I'll link to the chapters below as I add them:

  • results/ch1/ch1.md introduces the project and describes the first round of training runs.
  • results/ch2/ch2.md details ablations aimed at improving win rate while keeping violation rate down, culminating in a checkpoint that outperformed much larger models.

About

RL fine-tune a small LLM to play Taboo. Baseline eval + GRPO training scaffolding.
