- 1️⃣ We are the first to integrate both offline and online data selection strategies to enhance data efficiency in RLVR training.
- ⏰ In the offline phase, we employ a multi-dimensional data curation strategy based on diversity, influence, and difficulty. Then, during online training, we dynamically filter samples by their explorability and replay under-explored samples to further improve training efficiency.
- 🧪 Extensive experiments across five reasoning datasets and three LLMs demonstrate the effectiveness and efficiency of our proposed method under both offline and online data selection scenarios.
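
To make the offline curation step more concrete, here is a minimal NumPy sketch of one way to combine influence (PageRank over a similarity graph), diversity (a greedy determinantal point process), and difficulty (Gaussian-weighted sampling). It is an illustration under stated assumptions, not the authors' implementation: the function names, the cosine-similarity graph, the `sim_threshold`, the kernel form `diag(q) · S · diag(q)`, and the target difficulty `mu`/`sigma` are all hypothetical choices.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on a weighted adjacency matrix."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    # Row-normalise; nodes without outgoing edges jump uniformly.
    trans = np.where(out > 0, adj / np.where(out > 0, out, 1.0), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * trans.T @ rank
    return rank / rank.sum()

def greedy_dpp(kernel, k):
    """Greedy MAP approximation: add the item that maximises log det each step."""
    selected, remaining = [], list(range(kernel.shape[0]))
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -np.inf
            if val > best_val:
                best, best_val = i, val
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def sample_by_difficulty(indices, difficulty, m, mu=0.5, sigma=0.2, seed=0):
    """Sample m items with weights given by a Gaussian over difficulty (assumed form)."""
    rng = np.random.default_rng(seed)
    d = difficulty[indices]
    w = np.exp(-0.5 * ((d - mu) / sigma) ** 2)
    return rng.choice(indices, size=m, replace=False, p=w / w.sum())

def select_offline(emb, difficulty, k, m, sim_threshold=0.5):
    """Graph -> PageRank-weighted DPP subset of size k -> difficulty sampling of size m."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    adj = np.where(sim > sim_threshold, sim, 0.0)  # hypothetical edge rule
    np.fill_diagonal(adj, 0.0)
    q = pagerank(adj)
    q = q / q.max()
    # Quality-weighted DPP kernel; small jitter keeps it positive definite.
    kernel = q[:, None] * sim * q[None, :] + 1e-6 * np.eye(len(q))
    subset = np.array(greedy_dpp(kernel, k))
    return sample_by_difficulty(subset, difficulty, m)
```

The exhaustive `slogdet` search is chosen for readability; at realistic dataset sizes an incremental Cholesky-based greedy update would be needed.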
Overview of our approach, DEPO. (a) Our approach improves data efficiency in RLVR training via an end-to-end offline and online data selection strategy. (b) In the offline phase, we first construct a sample graph from the sample representations, then apply a PageRank-weighted Determinantal Point Process (DPP) to select a diverse and influential subset, and finally sample from this subset so that sample difficulty follows a normal distribution. (c) In the online phase, we evaluate the explorability of each sample based on its historical training dynamics and retain only high-explorability samples for rollout, while actively replaying under-explored samples to ensure that all samples are sufficiently trained.
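
The online phase in panel (c) can be sketched as a per-step filter plus a replay rule. The source does not define the explorability metric here, so this sketch substitutes a stand-in: the variance-like score `4·p·(1−p)` of a sample's recent rollout accuracy `p`, which is highest when outcomes are mixed. `SampleState`, `explorability`, `select_for_rollout`, `window`, `keep_ratio`, and `max_gap` are all hypothetical names and parameters, not taken from the method itself.

```python
from dataclasses import dataclass, field

@dataclass
class SampleState:
    # Mean rollout accuracy recorded each time the sample was rolled out.
    acc_history: list = field(default_factory=list)
    # Training step at which the sample was last rolled out (-1 = never).
    last_rollout_step: int = -1

def explorability(state, window=3):
    """Assumed proxy: 4*p*(1-p) over recent accuracy; 1.0 if never rolled out."""
    if not state.acc_history:
        return 1.0
    recent = state.acc_history[-window:]
    p = sum(recent) / len(recent)
    return 4.0 * p * (1.0 - p)

def select_for_rollout(batch_ids, states, step, keep_ratio=0.5, max_gap=5):
    """Keep the most explorable samples, then replay ones skipped for too long."""
    ranked = sorted(batch_ids, key=lambda i: explorability(states[i]), reverse=True)
    n_keep = max(1, int(len(batch_ids) * keep_ratio))
    keep = ranked[:n_keep]
    # Replay: filtered-out samples that have gone more than max_gap steps without a rollout.
    replay = [i for i in ranked[n_keep:]
              if step - states[i].last_rollout_step > max_gap]
    chosen = keep + replay
    for i in chosen:
        states[i].last_rollout_step = step
    return chosen
```

For example, a sample that the policy always solves or always fails scores 0 under this proxy and is skipped for rollout, but it is brought back once it has gone more than `max_gap` steps without one.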
Performance comparison of various data selection methods. “Offline” and “Online” refer to the offline and online data selection methods, respectively. “Ratio”, “Time”, and “RN” denote the ratio of selected data, the total training time, and the total number of rollouts, respectively. We highlight the best performance across the different data selection methods.