yan-sun-x/PREPO-preview


Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

Yan Sun1, Jia Guo2, Stanley Kok1, Zihao Wang2, Zujie Wen2, Zhiqiang Zhang2
1National University of Singapore, 2Ant Group

[Paper] | [Blog]

TL;DR: PREPO improves the data efficiency of reinforcement learning with verifiable rewards by leveraging intrinsic properties of the training data.

🗞️ News

PREPO Overview

Figure 1: Overview of PREPO. The PREPO objective integrates perplexity-based schedule learning and sequence-level entropy weighting into a unified optimization scheme. On Qwen2.5-Math-7B, PREPO achieves higher performance while requiring only 41.2% of the rollouts used by random selection.
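
The caption names two quantities, perplexity and sequence-level entropy, without defining them. For orientation, the standard textbook definitions are sketched below. Treating these as the exact quantities PREPO uses is an assumption, and the notation (token sequence x, sampled response y, policy π_θ, vocabulary V) is introduced here for illustration only; the paper gives the precise objective.

```latex
% Standard definitions, stated as assumptions; the paper's exact forms may differ.
% Perplexity of a token sequence x under the policy \pi_\theta:
\mathrm{PPL}_\theta(x) = \exp\!\Big(-\frac{1}{|x|}\sum_{t=1}^{|x|}\log \pi_\theta(x_t \mid x_{<t})\Big)

% Token-averaged (sequence-level) entropy of a response y given prompt x:
\bar{H}_\theta(y \mid x) = -\frac{1}{|y|}\sum_{t=1}^{|y|}\sum_{v \in V}
    \pi_\theta(v \mid x, y_{<t})\,\log \pi_\theta(v \mid x, y_{<t})
```

Under these definitions, low perplexity means the model already finds the sequence predictable, and high average entropy means the model was uncertain while generating the response.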

Getting Started

Our implementation is based on volcengine/verl.

1. Environment Setup

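# Create and activate a dedicated conda environment
conda create -n prepo python==3.11
conda activate prepo

# Install this repository in editable mode, then the pinned dependencies
pip install -e .
pip install vllm==0.8.2
pip install tensordict==0.6.0
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Logging, debugging, and math-verification utilities
pip install wandb IPython matplotlib ipdb latex2sympy2-extended math-verify torchdata pylatexenc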
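
Because several packages are pinned to exact versions, it can help to confirm the pins before launching a long run. The snippet below is a minimal sketch, not part of the repository: `check_pin` is a hypothetical helper name, and it assumes `pip` on your `PATH` belongs to the active `prepo` environment.

```shell
# Hypothetical helper (not part of this repo): compare an installed
# package version against the pin listed in the install commands.
check_pin() {
  pkg="$1"; want="$2"
  # `pip show` prints a "Version: X" line; it prints nothing on stdout
  # if the package is missing.
  have="$(pip show "$pkg" 2>/dev/null | awk '/^Version:/ {print $2}')"
  if [ -z "$have" ]; then
    echo "$pkg: not installed (want $want)"
  elif [ "$have" = "$want" ]; then
    echo "$pkg: ok ($have)"
  else
    echo "$pkg: found $have, want $want"
  fi
}

# Pins copied from the install commands in this section.
check_pin vllm 0.8.2
check_pin tensordict 0.6.0
check_pin flash-attn 2.7.4.post1
```

The helper only reports mismatches and never changes the environment, so it is safe to rerun at any time.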

2. Prepare Data

Download the dataset with the following commands:

# Run from the project root
conda activate prepo
export PYTHONPATH="$PYTHONPATH:$(pwd)"

bash scripts/generate_dataset.sh
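
The `export PYTHONPATH=...` line appends the project root every time it is run, so repeated setup in the same shell accumulates duplicate entries, and an unset `PYTHONPATH` leaves a leading colon. A possible idempotent variant is sketched below; `add_to_pythonpath` is a hypothetical helper name, not something this repository provides.

```shell
# Hypothetical helper (not part of this repo): append a directory to
# PYTHONPATH only if it is not already listed.
add_to_pythonpath() {
  dir="$1"
  case ":${PYTHONPATH:-}:" in
    *":$dir:"*)
      ;;  # already present, nothing to do
    *)
      # ${VAR:+...} expands to "VAR:" only when VAR is set and non-empty,
      # which avoids a leading colon.
      PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$dir"
      ;;
  esac
  export PYTHONPATH
}

# Same effect as the export line in this section, but safe to repeat.
add_to_pythonpath "$(pwd)"
```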

3. Training

Example: train Qwen2.5-Math-7B with PREPO on 32 GPUs:

bash scripts/qwen_math_7b_prepo.sh
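
For long runs it is often convenient to keep a copy of the console output. The wrapper below is a minimal sketch under stated assumptions: `run_logged` and the `LOG_DIR` variable (defaulting to `logs`) are hypothetical names, not options of the training script.

```shell
# Hypothetical wrapper (not part of this repo): run a command and save
# its combined stdout/stderr to a timestamped log file.
run_logged() {
  log_dir="${LOG_DIR:-logs}"   # assumed default location
  mkdir -p "$log_dir"
  log_file="$log_dir/$(date +%Y%m%d_%H%M%S).log"
  echo "logging to $log_file" >&2
  # Note: the pipeline's exit status is that of `tee`, not the command.
  "$@" 2>&1 | tee "$log_file"
}

# Usage (from the project root):
#   run_logged bash scripts/qwen_math_7b_prepo.sh
```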
