A minimal RL framework for LLMs.
RL fine-tuning of language models follows a straightforward loop: load
problems, roll out conversations, score the results, compute advantages,
and update the model weights. Each of these steps maps to a Python
protocol — DataLoader, Env, Generator, RewardModel, Model —
with minimal coupling between them. The two included examples (GSM8K
math and MuSiQue multi-hop search) are self-contained single-file
programs that wire these pieces together.

    DataLoader → Env        → Generator      → RewardModel → GRPOTrainer  → Model.update()
    (problems)   (rollouts)   (trajectories)   (scores)      (advantages)   (weight update)

One GRPO step:
1. `DataLoader` yields a batch of examples.
2. `Generator` builds an `Env` per example, samples from the `Policy`, and collects `Trajectory` groups.
3. `RewardModel` scores each trajectory.
4. `GRPOTrainer` mean-centers rewards within each group (the GRPO baseline) and calls `Model.update()`.
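
The mean-centering in the last step can be sketched in a few lines. This is a minimal illustration, not the repo's `compute_advantages`: the function name `mean_center_group_rewards`, its signature, and the reward values are assumptions made for the example.

```python
from statistics import fmean


def mean_center_group_rewards(group_rewards: list[list[float]]) -> list[list[float]]:
    """Subtract each group's mean reward from every reward in that group.

    Hypothetical helper: one inner list per prompt, one float per sampled
    trajectory. Each group is assumed to be non-empty.
    """
    advantages = []
    for rewards in group_rewards:
        baseline = fmean(rewards)  # the group mean serves as the baseline
        advantages.append([r - baseline for r in rewards])
    return advantages


# Two prompts with four sampled completions each (made-up rewards).
rewards = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
advantages = mean_center_group_rewards(rewards)
print(advantages)
```

In this sketch, a group whose samples all receive the same reward ends up with zero advantage everywhere, so it contributes no learning signal for that step.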
The framework keeps three layers separate, each with its own types:
- Messages: `Message`, `ToolCall`. The chat protocol.
- RL: `State`, `Action`, `Env`, `Step`, `Trajectory`, `Reward`. Environments and rollouts.
- Tokens: prompt ids, completion ids, logprobs. The `Renderer` bridges messages and tokens.
The Renderer is the seam. Envs work with messages. Training works with token ids.
The renderer translates between the two so envs stay model-agnostic.
Everything pluggable is a Python Protocol — no base classes, no registration:
| Protocol | Does |
|---|---|
| `Env` | `start() → State`, `step(Action) → State \| None` |
| `EnvBuilder` | `build(example) → Env` |
| `Renderer` | messages ↔ tokens |
| `RewardModel` | `score(example, Trajectory) → Reward` |
| `Policy` | `sample(prompt) → CompletionOutput` |
| `Model` | `policy() → Policy`, `update(batch) → UpdateResult` |
| `Generator` | `generate(examples, ...) → list[list[Trajectory]]` |
| `DataLoader` | `batches() → Iterator[list[Example]]` |
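
To illustrate what "no base classes, no registration" means in practice, here is a small sketch using `typing.Protocol`. The method names follow the `Env` row of the table; the `State` and `Action` fields, the `EchoEnv` class, and the reading of `None` as "episode over" are assumptions for illustration and may not match the repo's definitions.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class State:
    """Hypothetical stand-in for the framework's State type."""
    messages: list[str]


@dataclass
class Action:
    """Hypothetical stand-in for the framework's Action type."""
    text: str


@runtime_checkable
class Env(Protocol):
    """Structural interface: any class with these two methods qualifies."""

    def start(self) -> State: ...

    def step(self, action: Action) -> Optional[State]: ...


class EchoEnv:
    """Single-turn toy env. Note that it does not inherit from Env."""

    def __init__(self, prompt: str) -> None:
        self.prompt = prompt

    def start(self) -> State:
        return State(messages=[self.prompt])

    def step(self, action: Action) -> Optional[State]:
        # Assumption: returning None means there is no next state.
        return None


env = EchoEnv("What is 2 + 2?")
first_state = env.start()
next_state = env.step(Action(text="4"))
is_env = isinstance(env, Env)  # checks method presence only, not signatures
```

Because the check is structural, a static type checker accepts `EchoEnv` wherever an `Env` is expected, without any import of a framework base class.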

    rolo/
      message.py          # Message, ToolCall
      rl.py               # State, Action, Env, Step, Trajectory, Reward
      rendering.py        # Renderer, RenderedPrompt, ToolSpec
      generation.py       # Policy, Generator, RolloutGenerator
      rewards.py          # RewardModel
      model.py            # Model, TrainingBatch
      training.py         # GRPOTrainer, GRPOConfig, compute_advantages
      data.py             # DataLoader, HuggingFaceDataLoader
      logging.py          # MetricsLogger, TensorBoardLogger
      tinker_backend.py   # Tinker model + renderer adapters
      examples/
        gsm8k.py          # Single-turn math (GSM8K)
        musique_search.py # Multi-turn search agent (MuSiQue)

The GSM8K example uses a one-shot prompt, a boxed-answer format, and math-verify for semantic grading. It is the simplest possible GRPO setup.

    uv run python -m rolo.examples.gsm8k \
      --model-name meta-llama/Llama-3.2-1B \
      --batch-size 64 \
      --samples-per-prompt 8 \
      --max-steps 100 \
      --eval-on-start \
      --eval-every-steps 10 \
      --eval-limit 100 \
      --run-dir runs/gsm8k \
      --save-best-checkpoint

Results from a 100-step run on Llama-3.2-1B (base, not instruct):

| Step | Eval accuracy | Pass@8 |
|---|---|---|
| 0 | 0.4% | 3% |
| 40 | 5.4% | 26% |
| 80 | 8.5% | 37% |
Format compliance (\boxed{}) saturates within the first few steps thanks to the
one-shot example. Pass@8 (at least 1 of 8 samples correct) shows the model has
latent capability on ~37% of problems despite low per-sample accuracy.
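
The gap between the two metrics is easy to see in a small sketch. The function names and the correctness grid below are made up for illustration, and the sketch assumes that eval accuracy is averaged over individual samples; the repo's evaluation code may compute it differently.

```python
def per_sample_accuracy(correct: list[list[bool]]) -> float:
    """Fraction of all sampled completions that are correct."""
    total = sum(len(group) for group in correct)
    return sum(sum(group) for group in correct) / total


def pass_at_k(correct: list[list[bool]]) -> float:
    """Fraction of problems with at least one correct sample in the group."""
    return sum(any(group) for group in correct) / len(correct)


# Made-up grid: 4 problems, 8 samples each, True means a correct answer.
results = [
    [False] * 8,
    [True] + [False] * 7,
    [False] * 7 + [True],
    [False] * 8,
]
acc = per_sample_accuracy(results)  # 2 correct out of 32 samples
p8 = pass_at_k(results)             # 2 of 4 problems solved at least once
print(f"accuracy={acc:.4f} pass@8={p8:.2f}")
```

Here only 2 of 32 samples are correct, yet half of the problems are solved at least once, which is the same pattern as in the table above.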
The MuSiQue example uses a local per-example knowledge base with one structured search tool, BM25 retrieval,
and a reward model that scores answer correctness while penalizing tool-call loops.
This is the smallest useful multi-turn agent task in the repo.
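
One way such a reward could be shaped is sketched below. Everything in it is an assumption for illustration: the exact-match grading, the `loop_penalty` value, the definition of a "loop" as a repeated identical query, and the function names. It is not the scoring logic used in `musique_search.py`.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def score_episode(
    answer: str,
    gold: str,
    queries: list[str],
    loop_penalty: float = 0.1,  # hypothetical penalty per repeated query
) -> float:
    """Correctness reward minus a penalty for re-issuing the same search."""
    correctness = 1.0 if normalize(answer) == normalize(gold) else 0.0
    normalized = [normalize(q) for q in queries]
    repeats = len(normalized) - len(set(normalized))  # duplicates beyond the first
    return correctness - loop_penalty * repeats


# A correct answer after one search, and the same answer after looping.
reward_clean = score_episode("Paris", "paris", ["capital of France"])
reward_loop = score_episode("Paris", "paris", ["capital of France"] * 3)
print(reward_clean, reward_loop)
```

Under these assumptions, a trajectory that reaches the right answer while repeating the same query scores lower than one that gets there directly, so the group-relative baseline favors the shorter path.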
The MuSiQue example requires a model with native tool-calling support (e.g. Qwen 3.5):

    uv run python -m rolo.examples.musique_search \
      --model-name Qwen/Qwen3.5-4B \
      --batch-size 4 \
      --samples-per-prompt 2 \
      --max-turns 2 \
      --learning-rate 8e-5 \
      --max-tokens 256 \
      --train-limit 32 \
      --eval-limit 16 \
      --run-dir runs/musique_search

The concrete Model and Renderer implementations use the
Tinker service for remote
LoRA training and sampling. Local training is not yet implemented.

- `TinkerModelConfig.project_id` must be an existing Tinker project id if set.
- `TinkerRenderer` uses `tinker_cookbook` tokenizer resolution and may fetch tokenizer files from the Hub on first use.