Yan Sun¹, Jia Guo², Stanley Kok¹, Zihao Wang²,
Zujie Wen², Zhiqiang Zhang²
¹National University of Singapore,
²Ant Group
TL;DR: PREPO improves the data efficiency of RL with verifiable rewards by leveraging intrinsic data properties.
- [2025.09.24] Paper accepted to the NeurIPS 2025 Workshop on Efficient Reasoning.
- [2025.08.25] Blog post released: Stretching the Comfort Zone: Boost Data Efficiency for RL Training with PREPO!
Figure 1: Overview of PREPO. The PREPO objective integrates perplexity-based schedule learning and sequence-level entropy weighting into a unified optimization scheme. On Qwen2.5-Math-7B, PREPO achieves higher performance while requiring only 41.2% of the rollouts used by random selection, showing improved efficiency.
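The two quantities named in the caption, perplexity and sequence-level entropy, can be illustrated with their standard definitions. The sketch below is only that: the function names and toy inputs are hypothetical, and it does not reproduce how PREPO schedules prompts or weights sequences, which this README does not specify.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def sequence_entropy(token_distributions):
    """Mean Shannon entropy (in nats) over the per-token distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)  # skip zero-probability entries
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy inputs, made up for illustration (not taken from the PREPO experiments).
toy_logprobs = [math.log(0.5), math.log(0.25)]
toy_distributions = [[0.5, 0.5], [1.0, 0.0]]

print("perplexity:", perplexity(toy_logprobs))
print("sequence entropy:", sequence_entropy(toy_distributions))
```

In practice these values would come from the model's log-probabilities over a prompt or a rollout rather than from hand-written lists.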
Our implementation is based on volcengine/verl.
conda create -n prepo python==3.11
conda activate prepo
pip3 install -e .
pip3 install vllm==0.8.2
pip install tensordict==0.6.0
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install wandb IPython matplotlib ipdb latex2sympy2-extended math-verify torchdata pylatexenc
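After installation, it can help to confirm that the pinned versions were actually installed. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `check_pins` is hypothetical and not part of this repository, and the pins are copied from the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins copied from the install commands in this README.
PINNED = {
    "vllm": "0.8.2",
    "tensordict": "0.6.0",
    "flash-attn": "2.7.4.post1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINNED)
    for name, (expected, installed) in problems.items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated `prepo` environment. A reported mismatch is not necessarily fatal, since some builds append a local version suffix to the version string.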
You can download the dataset using the following command:
# cd into the project folder
conda activate prepo
export PYTHONPATH="$PYTHONPATH:$(pwd)"
bash scripts/generate_dataset.sh
Example: Train Qwen2.5-Math-7B with PREPO on 32 GPUs:
bash scripts/qwen_math_7b_prepo.sh