rolo

A minimal RL framework for LLMs.

RL fine-tuning of language models follows a straightforward loop: load problems, roll out conversations, score the results, compute advantages, and update the model weights. Each of these steps maps to a Python protocol — DataLoader, Env, Generator, RewardModel, Model — with minimal coupling between them. The two included examples (GSM8K math and MuSiQue multi-hop search) are self-contained single-file programs that wire these pieces together.

The loop

DataLoader  →  Env  →  Generator  →  RewardModel  →  GRPOTrainer  →  Model.update()
 (problems)   (rollouts)  (trajectories)  (scores)     (advantages)    (weight update)

One GRPO step:

DataLoader yields a batch of examples.
Generator builds an Env per example, samples from the Policy, and collects Trajectory groups.
RewardModel scores each trajectory.
GRPOTrainer mean-centers rewards within each group (the GRPO baseline) and calls Model.update().
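The baseline in the last step can be sketched in a few lines. This is a minimal illustration of mean-centering within a group, not the repo's `compute_advantages`; the function name `mean_center_groups` and the reward values are made up for the example, and it assumes every group has at least one trajectory.

```python
from statistics import fmean


def mean_center_groups(group_rewards: list[list[float]]) -> list[list[float]]:
    """Subtract each group's mean reward from every reward in that group."""
    advantages = []
    for rewards in group_rewards:
        baseline = fmean(rewards)  # per-group baseline (the group mean)
        advantages.append([r - baseline for r in rewards])
    return advantages


# Two prompts with four sampled trajectories each (illustrative rewards).
groups = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
advantages = mean_center_groups(groups)
# advantages[0] == [0.5, -0.5, -0.5, 0.5]
# advantages[1] == [-0.25, -0.25, -0.25, 0.75]
```

Because the baseline is computed per group, advantages always sum to zero within a group, and a group whose samples all earn the same reward contributes no learning signal.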

Three layers

The framework keeps three layers separate, each with its own types:

Messages — Message, ToolCall. The chat protocol.
RL — State, Action, Env, Step, Trajectory, Reward. Environments and rollouts.
Tokens — prompt ids, completion ids, logprobs. The Renderer bridges messages and tokens.

The Renderer is the seam. Envs work with messages. Training works with token ids. The renderer translates between the two so envs stay model-agnostic.
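To make the seam concrete, here is a toy renderer built on a whitespace vocabulary. Everything in it is hypothetical (`ToyMessage`, `ToyRenderer`, the `render` and `parse` method names, and the role markers); the real Renderer wraps a model tokenizer and chat template, so treat this only as a sketch of the messages-to-ids round trip.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMessage:
    role: str
    content: str


@dataclass
class ToyRenderer:
    """Maps chat messages to integer ids and back using a growing word vocabulary."""
    vocab: dict[str, int] = field(default_factory=dict)
    inverse: list[str] = field(default_factory=list)

    def _id(self, token: str) -> int:
        # Assign the next free id the first time a token is seen.
        if token not in self.vocab:
            self.vocab[token] = len(self.inverse)
            self.inverse.append(token)
        return self.vocab[token]

    def render(self, messages: list[ToyMessage]) -> list[int]:
        ids: list[int] = []
        for message in messages:
            ids.append(self._id(f"<{message.role}>"))  # role marker
            ids.extend(self._id(word) for word in message.content.split())
        ids.append(self._id("<assistant>"))  # cue the model to reply
        return ids

    def parse(self, completion_ids: list[int]) -> ToyMessage:
        text = " ".join(self.inverse[i] for i in completion_ids)
        return ToyMessage(role="assistant", content=text)


renderer = ToyRenderer()
prompt_ids = renderer.render([ToyMessage("user", "What is 2 + 2 ?")])
reply = renderer.parse([3, 4, 3])  # ids produced by a (pretend) policy
```

The env only ever sees `ToyMessage` objects and the trainer only ever sees `prompt_ids`, which is the separation described above.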

Protocols

Everything pluggable is a Python Protocol — no base classes, no registration:

Protocol	Does
`Env`	`start() → State`, `step(Action) → State | None`
`EnvBuilder`	`build(example) → Env`
`Renderer`	messages ↔ tokens
`RewardModel`	`score(example, Trajectory) → Reward`
`Policy`	`sample(prompt) → CompletionOutput`
`Model`	`policy() → Policy`, `update(batch) → UpdateResult`
`Generator`	`generate(examples, ...) → list[list[Trajectory]]`
`DataLoader`	`batches() → Iterator[list[Example]]`
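As an illustration of what "no base classes, no registration" means in practice, the sketch below defines an `Env`-shaped protocol and a class that satisfies it purely by having the right methods. The `State` and `Action` dataclasses are simplified stand-ins rather than the repo's types, and treating a `None` return from `step` as the end of the episode is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class State:
    """Hypothetical stand-in: the conversation so far."""
    messages: list[str]


@dataclass
class Action:
    """Hypothetical stand-in: one model reply."""
    text: str


@runtime_checkable
class Env(Protocol):
    def start(self) -> State: ...
    def step(self, action: Action) -> Optional[State]: ...


class SingleTurnEnv:
    """Satisfies Env structurally; note that it does not inherit from Env."""

    def __init__(self, question: str) -> None:
        self.question = question

    def start(self) -> State:
        return State(messages=[self.question])

    def step(self, action: Action) -> Optional[State]:
        # Assumption: returning None marks the end of the episode.
        return None


env = SingleTurnEnv("What is 2 + 2?")
first_state = env.start()
next_state = env.step(Action("4"))
conforms = isinstance(env, Env)  # checks only that both methods exist
```

A static type checker performs the same structural check at call sites, so a custom env can be passed wherever an `Env` is expected without importing anything from the framework.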

Package layout

rolo/
  message.py          # Message, ToolCall
  rl.py               # State, Action, Env, Step, Trajectory, Reward
  rendering.py        # Renderer, RenderedPrompt, ToolSpec
  generation.py       # Policy, Generator, RolloutGenerator
  rewards.py          # RewardModel
  model.py            # Model, TrainingBatch
  training.py         # GRPOTrainer, GRPOConfig, compute_advantages
  data.py             # DataLoader, HuggingFaceDataLoader
  logging.py          # MetricsLogger, TensorBoardLogger
  tinker_backend.py   # Tinker model + renderer adapters
  examples/
    gsm8k.py          # Single-turn math (GSM8K)
    musique_search.py # Multi-turn search agent (MuSiQue)

Examples

GSM8K — single-turn math

A one-shot prompt, a boxed-answer format, and math-verify for semantic grading. The simplest possible GRPO setup.

uv run python -m rolo.examples.gsm8k \
  --model-name meta-llama/Llama-3.2-1B \
  --batch-size 64 \
  --samples-per-prompt 8 \
  --max-steps 100 \
  --eval-on-start \
  --eval-every-steps 10 \
  --eval-limit 100 \
  --run-dir runs/gsm8k \
  --save-best-checkpoint

Results from a 100-step run on Llama-3.2-1B (base, not instruct):

Step	Eval accuracy	Pass@8
0	0.4%	3%
40	5.4%	26%
80	8.5%	37%

Format compliance (\boxed{}) saturates within the first few steps thanks to the one-shot example. Pass@8 (at least 1 of 8 samples correct) reaches 37% at step 80, so the model can already solve roughly a third of the problems on at least one attempt even though per-sample accuracy is only 8.5%.
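The gap between the two metrics is easier to see on a small worked example. The helper names and the grading results below are invented for illustration; the only thing taken from the text above is the definition of Pass@8 as "at least 1 of 8 samples correct".

```python
def per_sample_accuracy(results: list[list[bool]]) -> float:
    """Fraction of all samples, across all problems, that are correct."""
    total = sum(len(samples) for samples in results)
    correct = sum(sum(samples) for samples in results)
    return correct / total


def pass_at_k(results: list[list[bool]]) -> float:
    """Fraction of problems with at least one correct sample."""
    return sum(any(samples) for samples in results) / len(results)


# Four problems, eight graded samples each (made-up outcomes).
results = [
    [False] * 8,
    [True] + [False] * 7,
    [False] * 3 + [True] * 2 + [False] * 3,
    [False] * 8,
]
accuracy = per_sample_accuracy(results)  # 3 correct out of 32 samples
pass8 = pass_at_k(results)               # 2 of 4 problems solved at least once
```

Here only 3 of 32 samples are correct, yet half the problems are solved at least once, which is the same pattern as in the table.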

MuSiQue — multi-turn search agent

A local per-example knowledge base with one structured search tool, BM25 retrieval, and a reward model that scores answer correctness while penalizing tool-call loops. This is the smallest useful multi-turn agent task in the repo.
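One way such a reward could be shaped is sketched below: exact-match correctness minus a fixed penalty for each repeated search query. This is a hypothetical illustration, not the reward model shipped in `musique_search.py`; the function names, the normalization rule, and the `loop_penalty` value are all assumptions.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def score_episode(
    predicted: str,
    gold: str,
    tool_queries: list[str],
    loop_penalty: float = 0.25,  # assumed value, chosen only for the example
) -> float:
    """Return 1.0 for a correct answer, minus a penalty per repeated query."""
    correct = 1.0 if normalize(predicted) == normalize(gold) else 0.0
    normalized = [normalize(query) for query in tool_queries]
    repeats = len(normalized) - len(set(normalized))  # duplicates beyond the first
    return correct - loop_penalty * repeats


clean = score_episode("Paris", "paris", ["capital of France"])
looping = score_episode("Paris", "paris", ["capital of France"] * 3)
wrong = score_episode("Lyon", "paris", [])
```

With these inputs a correct answer after one search scores 1.0, the same answer after issuing the identical query three times scores 0.5, and a wrong answer scores 0.0, so two trajectories in the same group that both answer correctly are still separated by how efficiently they searched.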

Requires a model with native tool-calling support (e.g. Qwen 3.5):

uv run python -m rolo.examples.musique_search \
  --model-name Qwen/Qwen3.5-4B \
  --batch-size 4 \
  --samples-per-prompt 2 \
  --max-turns 2 \
  --learning-rate 8e-5 \
  --max-tokens 256 \
  --train-limit 32 \
  --eval-limit 16 \
  --run-dir runs/musique_search

Tinker backend

The concrete Model and Renderer implementations use the Tinker service for remote LoRA training and sampling. Local training is not yet implemented.

If set, TinkerModelConfig.project_id must be the id of an existing Tinker project.
TinkerRenderer uses tinker_cookbook tokenizer resolution and may fetch tokenizer files from the Hub on first use.
