Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

Yan Sun¹, Jia Guo², Stanley Kok¹, Zihao Wang², Zujie Wen², Zhiqiang Zhang²
¹National University of Singapore, ²Ant Group

[Paper] | [Blog]

TL;DR: PREPO improves the data efficiency of RL with verifiable rewards by leveraging intrinsic data properties.

🗞️ News

[2025.09.24] Paper accepted to the NeurIPS 2025 Workshop on Efficient Reasoning.
[2025.08.25] Blog post released: Stretching the Comfort Zone: Boost Data Efficiency for RL Training with PREPO!

Figure 1: Overview of PREPO. The PREPO objective integrates perplexity-based schedule learning and sequence-level entropy weighting into a unified optimization scheme. On Qwen2.5-Math-7B, PREPO achieves higher performance while requiring only 41.2% of the rollouts used by random selection.
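For readers unfamiliar with the two signals named in the caption, the sketch below shows how perplexity and a sequence-level entropy are commonly computed from model outputs. It is an illustration only, not the PREPO implementation: the function names are hypothetical, and averaging per-token entropies over the sequence is an assumed aggregation. How PREPO turns these quantities into a schedule or a weight is described in the paper, not here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a token sequence: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over a sequence (assumed aggregation, see lead-in)."""
    if not step_probs:
        raise ValueError("need at least one decoding step")
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)
```

Under these definitions, a sequence whose every token has probability 0.5 has perplexity 2, and a fully deterministic decoding step contributes zero entropy.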

Getting Started

Our implementation is based on volcengine/verl.

1. Environment Setup

conda create -n prepo python==3.11
conda activate prepo

pip3 install -e .
pip3 install vllm==0.8.2
pip install tensordict==0.6.0
pip install flash-attn==2.7.4.post1 --no-build-isolation

pip install wandb IPython matplotlib ipdb latex2sympy2-extended math-verify torchdata pylatexenc
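Because the setup pins exact versions, it can help to confirm the environment matches before launching a long run. The following is a minimal sketch using only the standard library; `PINNED` and `find_mismatches` are hypothetical names introduced here and are not part of this repository.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {"vllm": "0.8.2", "tensordict": "0.6.0", "flash-attn": "2.7.4.post1"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated `prepo` environment; no output means all three pins are satisfied.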

2. Prepare Data

Download the dataset with the following commands:

# cd into the project folder
conda activate prepo
export PYTHONPATH="$PYTHONPATH:$(pwd)"

bash scripts/generate_dataset.sh

3. Training

Example: train Qwen2.5-Math-7B with PREPO on 32 GPUs:

bash scripts/qwen_math_7b_prepo.sh
