txy77/DEPO

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward (ICLR 2026)

😀 Contributions

  • 1️⃣ We are the first to integrate both offline and online data selection strategies to enhance data efficiency in RLVR training.
  • ⏰ In the offline phase, we employ a multi-dimensional data curation strategy based on diversity, influence, and difficulty. Then, during online training, we dynamically filter samples by their explorability and replay under-explored samples to further improve training efficiency.
  • 🧪 Extensive experiments across five reasoning datasets and three LLMs demonstrate the effectiveness and efficiency of our proposed method under both offline and online data selection scenarios.
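The offline curation step (diversity, influence, and difficulty) can be illustrated with a minimal sketch. This is not the repository's implementation. It assumes a cosine-similarity sample graph, power-iteration PageRank as the influence score, a naive greedy MAP approximation for the Determinantal Point Process, and Gaussian sampling weights over a difficulty score in [0, 1]. The names `pagerank`, `greedy_dpp`, and `select_offline`, and the parameters `mu` and `sigma`, are hypothetical.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank on a non-negative weighted adjacency matrix."""
    n = adj.shape[0]
    col_sums = adj.sum(axis=0)
    has_mass = col_sums > 0
    # Column-normalise; columns with no outgoing weight spread mass uniformly.
    trans = np.where(has_mass, adj / np.where(has_mass, col_sums, 1.0), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * (trans @ rank)
        converged = np.abs(new - rank).sum() < tol
        rank = new
        if converged:
            break
    return rank / rank.sum()

def greedy_dpp(kernel, k):
    """Naive greedy MAP: repeatedly add the item that maximises log det(L_S)."""
    selected, remaining = [], list(range(kernel.shape[0]))
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:  # no candidate keeps the determinant positive
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def select_offline(emb, difficulty, k, m, mu=0.5, sigma=0.2, seed=0):
    """Pick k diverse, influential samples, then draw m of them by difficulty."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                      # cosine similarity (PSD Gram matrix)
    adj = np.clip(sim, 0.0, None)          # assumed graph: non-negative similarities
    np.fill_diagonal(adj, 0.0)
    quality = pagerank(adj)
    quality = quality / quality.max()
    # Quality-weighted DPP kernel L = diag(q) S diag(q), with jitter for stability.
    kernel = quality[:, None] * sim * quality[None, :] + 1e-8 * np.eye(len(quality))
    subset = np.array(greedy_dpp(kernel, k))
    # Assumed interpretation: Gaussian-shaped sampling weights centred on mu.
    weights = np.exp(-0.5 * ((difficulty[subset] - mu) / sigma) ** 2)
    rng = np.random.default_rng(seed)
    m = min(m, len(subset))
    chosen = rng.choice(subset, size=m, replace=False, p=weights / weights.sum())
    return subset, chosen
```

The greedy loop recomputes a determinant for every candidate, which is fine for a small illustration but would need an incremental Cholesky update to scale to a full training set.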

🌟 Highlights

image

Overview of our approach, DEPO. (a) DEPO improves data efficiency in RLVR training through an end-to-end offline and online data selection strategy. (b) In the offline phase, we first construct a sample graph from the sample representations, then apply a PageRank-weighted Determinantal Point Process (DPP) to select a diverse and influential subset, and finally sample from this subset so that the difficulty of the selected samples follows a normal distribution. (c) In the online phase, we evaluate the explorability of each sample from its historical training dynamics, retain high-explorability samples for rollout, and actively replay under-explored samples so that every sample is trained sufficiently.
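The online phase can be sketched in the same spirit. The exact explorability metric is not given on this page, so the code below uses a stand-in: the average of `p * (1 - p)` over a sample's recent rollout pass rates `p`, which is zero for samples that are always or never solved. The class `ExplorabilityFilter`, its parameters (`window`, `keep_ratio`, `min_rollouts`), and the replay rule based on rollout counts are all hypothetical and only illustrate the filter-then-replay idea.

```python
from collections import defaultdict, deque

class ExplorabilityFilter:
    """Toy online selector: keep high-explorability samples, replay rare ones."""

    def __init__(self, window=3, keep_ratio=0.5, min_rollouts=2):
        # Per-sample history of recent rollout pass rates in [0, 1].
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.rollouts = defaultdict(int)
        self.keep_ratio = keep_ratio
        self.min_rollouts = min_rollouts

    def score(self, sample_id):
        """Assumed proxy: mean Bernoulli variance p * (1 - p) over the window."""
        hist = self.history.get(sample_id)
        if not hist:
            return 0.25  # unseen samples get the maximum possible score
        return sum(p * (1 - p) for p in hist) / len(hist)

    def select(self, batch_ids):
        """Return the top keep_ratio fraction of the batch for rollout."""
        n_keep = max(1, int(len(batch_ids) * self.keep_ratio))
        ranked = sorted(batch_ids, key=self.score, reverse=True)
        return ranked[:n_keep]

    def update(self, sample_id, pass_rate):
        """Record the pass rate observed after rolling out a sample."""
        self.history[sample_id].append(pass_rate)
        self.rollouts[sample_id] += 1

    def replay_candidates(self, all_ids):
        """Samples rolled out fewer than min_rollouts times are replayed."""
        return [s for s in all_ids if self.rollouts.get(s, 0) < self.min_rollouts]
```

In a training loop, `select` would be called on each candidate batch before rollout, `update` after rewards are computed, and `replay_candidates` periodically to re-queue samples that were filtered out too often.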

image

Performance comparison of various data selection methods. “Offline” and “Online” refer to the offline and online data selection methods, respectively. “Ratio”, “Time”, and “RN” denote the ratio of selected data, total training time, and total rollout numbers, respectively. We highlight the best performance across different data selection methods.

image

About

The official GitHub page for "Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward".