grpo-trainer is a toolkit for fine-tuning LLMs to generate high-quality OCaml code. It supports supervised fine-tuning (SFT), RLVR/GRPO training, evaluation, and monitoring with feedback from the OCaml compiler and test suite.
- Nix with flakes enabled for the development shell.
- CUDA-capable Linux server for GPU training.
- OCaml toolchain and llama.cpp are provided by the Nix shell.
git clone https://github.com/nilenso/grpo-trainer.git
cd grpo-trainerFor local development, enter the Nix shell:
make shellOn Linux, make shell enters the CUDA-enabled shell (nix develop --impure .#cuda). On other platforms, it enters the default shell.
On a fresh Linux training server, run:
scripts/bootstrap.shThe bootstrap script:
- refuses to run as root;
- installs Nix if
nixis not already available onPATH; - persists the Nix profile setup in
~/.bashrc; - enables Nix flakes and
nix-command; - links CUDA driver libraries into
.cuda-driver/; - enters the CUDA dev shell and installs CUDA-enabled Python dependencies.
If you are setting up manually instead, install CUDA-enabled Python dependencies inside the Nix shell:
uv sync --extra cudaThe Nix shell includes llama.cpp. To start a server from a Hugging Face GGUF model:
llama-server -hf unsloth/Qwen2.5-Coder-1.5B-Instruct-GGUF:F16 -c 4096 -ngl -1You can also use the Makefile helpers:
make llama-server-hf
make vllm-server model=<model_id>Training and evaluation configuration is read from .envrc via direnv. Keep local secrets and machine-specific settings out of source control.
Common settings include model IDs, dataset paths, output directories, LoRA parameters, batch sizes, and evaluation server configuration.
Train the base model with GRPO:
make rlvr-trainThis starts training with the default OCaml training dataset, writes logs to training.log, and stores run artifacts in grpo_runs/.
Training uses a graduated reward system with learning signals from multiple stages:
| Stage | Max Score | Description |
|---|---|---|
| Type check | 0.25 | Graduated credit based on error count; zero errors receives full credit. |
| Compilation | 0.10 | Full credit for successful compilation; partial credit for type-checked code that fails compilation. |
| Tests | 0.65 | Graduated credit: 0.65 × (passed_assertions / total_assertions). |
Correct but verbose solutions may receive a style penalty of up to 0.10:
- Extra code blocks:
0.02per block beyond the first. - Trailing prose after the final code fence:
0.03.
Completions with repetitive content, low code ratio, code block spam, or stub implementations are flagged as degenerate and penalized.
Training logs are written to training.log. Structured metrics are written to grpo_runs/:
| File | Description |
|---|---|
metrics.jsonl |
Step-level learning metrics: epoch, loss, grad norm, learning rate, and reward statistics. |
batch_metrics.jsonl |
Batch-level reward statistics. |
{reward_name}.jsonl |
Per-completion reward outcomes. |
For detailed metric descriptions, see doc/metrics.md.
| Service | URL | Description |
|---|---|---|
| Dashboard | http://localhost:8080 | Real-time GRPO training metrics. |
| Completions Viewer | http://localhost:8080/completions | Browse model completions. |
| TensorBoard | http://localhost:6006 | Loss curves, learning rate, and gradients. |
Pre-train the base model on OCaml code examples before RLVR:
make sft-trainSFT uses TRL's SFTTrainer with LoRA on the OCaml SFT dataset.
SFT uses completion-only training with TRL's prompt-completion dataset format:
- Prompt: problem description and function signature, masked from loss.
- Completion: OCaml code in markdown blocks, trained on.
When SFTTrainer receives prompt and completion columns, it masks prompt tokens from the loss (completion_only_loss=True by default). This teaches the model to generate OCaml code blocks without spending capacity on predicting prompt text.
SFT artifacts are written to:
sft_training.log— SFT training output.sft_runs/— SFT metrics and checkpoints.
| Service | URL | Description |
|---|---|---|
| SFT Dashboard | http://localhost:8080/sft | SFT training metrics. |
| TensorBoard | http://localhost:6006 | Loss curves, learning rate, and gradients. |
After training, merge the LoRA adapter into the base model to create a standalone model:
make merge-adapter path=<adapter-path>This requires BASE_MODEL_ID to be set in .envrc. The merged model is saved to merged_model/.
The Nix shell includes llama.cpp tools. To convert the merged model to GGUF:
# Enter the Nix shell if needed.
make shell
# Convert to GGUF. Requires llama.cpp Python dependencies.
pip install gguf
python -m llama_cpp.convert merged_model --outfile model.gguf
# Quantize for smaller inference artifacts.
llama-quantize model.gguf model-q4_k_m.gguf Q4_K_MRun evaluation against the configured test set:
make eval model=<model_name>Limit the number of examples when smoke testing:
make eval model=<model_name> limit=10| Command | Description |
|---|---|
make shell |
Enter the Nix development shell. |
make deps |
Install Python dependencies with uv sync --frozen. |
make rlvr-train |
Run RLVR/GRPO training. |
make sft-train |
Run supervised fine-tuning. |
make eval model=<name> |
Evaluate a model. |
make dashboard |
Start the dashboard server. |
make test |
Run the pytest suite. |
make lint |
Run Ruff on changed Python files. |
make fmt |
Format Python files with Ruff. |
grpo-trainer/
├── rlvr/ # RLVR/GRPO training module
│ ├── train.py # Main GRPO training script
│ ├── environment.py # Verifiers-compatible environment
│ ├── reward.py # Reward computation: type check, compile, tests
│ ├── config.py # Training configuration
│ ├── data.py # Dataset loading
│ └── logging.py # Metrics logging utilities
├── sft/ # Supervised fine-tuning module
│ ├── train.py # SFT training with TRL's SFTTrainer
│ ├── config.py # LoRA configuration
│ ├── data.py # Dataset loading from Hugging Face
│ └── logging.py # SFT metrics logging
├── eval/ # Evaluation module
│ ├── eval.py # Model evaluation script
│ ├── compare.py # Compare model outputs
│ └── metrics.py # Evaluation metrics
├── dashboard/ # Real-time training dashboard
│ ├── server.py # Dashboard backend
│ ├── index.html # GRPO metrics dashboard
│ ├── completions.html # Completions viewer
│ └── sft.html # SFT metrics dashboard
├── tests/ # Unit tests
│ ├── test_environment.py # Environment tests
│ ├── test_reward.py # Reward computation tests
│ └── test_partial_rewards.py
└── scripts/
├── run-sft.sh # SFT training launcher
├── run-rlvr-training.py # RLVR training launcher
├── run-eval.sh # Evaluation script
├── merge_adapter.py # Merge LoRA adapter into base model
├── start_vllm_server.sh # Start vLLM inference server
└── bootstrap.sh # Linux training server bootstrap script