RESD

Learning from Rare Success and Rich Feedback via Reflection-Enhanced Self-Distillation

RESD is an implementation of on-policy self-distillation built on veRL and SDPO.

Unlike the original SDPO, RESD maintains two persistent contexts: a playbook, inspired by ACE, that stores reusable lessons distilled from previous failures, and an optional solution buffer that caches successful trajectories when available. At each training step, RESD first updates these contexts using the outcome of the current rollout. The update either removes playbook entries based on their utility and staleness or adds new entries generated from reflections. The teacher model is then synchronized with the student model via an EMA update and conditioned on the enriched context to produce token-level supervision.
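As a rough illustration of the teacher synchronization step, the sketch below applies a standard exponential moving average to a dictionary of scalar weights. The function name `ema_update`, the plain-dict weights, and the decay value are assumptions made for illustration; the actual RESD code operates on model parameters and may use a different decay.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student by one EMA step.

    `teacher` and `student` map parameter names to floats (a stand-in
    for real tensors). `decay` is a hypothetical value, not RESD's default.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts 10% of the way toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 keeps the teacher stable across steps, while a smaller decay lets it track the student more closely.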

RESD allows the model to actively interpret the feedback instead of passively receiving it, which we found to be a key design axis for improving performance.

News

[2026.05.12] Code released.

Framework Comparison

Key Features

Fast Playbook Curation & Condensation

RESD reflects on failed trajectories and curates playbook entries based on those reflections. To keep the playbook within a maximum number of entries, it is condensed before curation based on entry utility and staleness. See selfevolve/resd/context_updater/playbook_context_updater.py.
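The condense-then-curate order described above can be sketched as follows. This is a minimal illustration, not the logic in playbook_context_updater.py: the `PlaybookEntry` fields, the linear scoring rule, and `staleness_weight` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlaybookEntry:
    text: str
    utility: float       # hypothetical: how much the lesson has helped so far
    last_used_step: int  # hypothetical: last training step the entry was used


def prune_playbook(entries, current_step, max_entries, staleness_weight=0.1):
    """Keep the `max_entries` entries with the best utility-minus-staleness score."""
    def score(entry):
        staleness = current_step - entry.last_used_step
        return entry.utility - staleness_weight * staleness

    return sorted(entries, key=score, reverse=True)[:max_entries]


def curate(entries, reflections, current_step, max_entries):
    """Condense first to make room, then append lessons from new reflections."""
    room = max(max_entries - len(reflections), 0)
    kept = prune_playbook(entries, current_step, room)
    new = [PlaybookEntry(r, 0.0, current_step) for r in reflections[:max_entries]]
    return kept + new


playbook = [
    PlaybookEntry("A", utility=3.0, last_used_step=10),
    PlaybookEntry("B", utility=1.0, last_used_step=0),
    PlaybookEntry("C", utility=2.0, last_used_step=9),
]
playbook = curate(playbook, ["D"], current_step=10, max_entries=3)
```

In this toy run the stale, low-utility entry "B" is dropped to make room for the new lesson "D".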
Interleaved Context Update & Model Update

RESD supports interleaving context updates with model updates. At each gradient step, the context is first updated based on student rollouts, and the model update follows. This design ensures the rollouts are always on-policy.
Stream Training

RESD can be used for streaming training, where the model makes a single pass over the training data and each training example is seen at most once. For every incoming batch, the trainer executes an inner loop of up to K update iterations on the same set of prompts. See selfevolve/resd/trainer/ppo/stream_trainer.py.
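The streaming loop, combined with the interleaved context and model updates, might look like the sketch below. The callback names (`rollout`, `update_context`, `update_model`), the `k_max` parameter, and the early exit once every rollout succeeds are assumptions for illustration and are not taken from stream_trainer.py.

```python
def stream_train(batches, rollout, update_context, update_model, k_max=2):
    """Single pass over `batches`, with up to `k_max` inner iterations per batch."""
    log = []
    for batch_id, prompts in enumerate(batches):
        for k in range(k_max):
            trajectories = rollout(prompts)      # fresh rollouts from the current student
            update_context(trajectories)         # context update comes first
            update_model(prompts, trajectories)  # gradient step follows
            log.append((batch_id, k))
            # Hypothetical stopping rule: leave the inner loop once all rollouts succeed.
            if all(t["success"] for t in trajectories):
                break
    return log


# Toy stubs: the "model" starts succeeding after its first update.
state = {"updates": 0}

def rollout(prompts):
    return [{"prompt": p, "success": state["updates"] >= 1} for p in prompts]

def update_context(trajectories):
    pass

def update_model(prompts, trajectories):
    state["updates"] += 1

log = stream_train([["a", "b"], ["c"]], rollout, update_context, update_model, k_max=3)
```

Because rollouts are regenerated inside the inner loop, each model update is computed on samples from the current student rather than on stale ones.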
Customize Feedback Format

RESD lets you customize the teacher prompt structure. See selfevolve/resd/context_updater/prompts.
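To give a sense of what a customized teacher prompt could contain, the sketch below assembles one from a question, playbook lessons, feedback, and an optional cached solution. The template wording, `TEACHER_TEMPLATE`, and `build_teacher_prompt` are hypothetical and do not reflect the files under selfevolve/resd/context_updater/prompts.

```python
# Hypothetical template; the section titles and their order are illustrative only.
TEACHER_TEMPLATE = """{question}

Lessons from earlier attempts:
{playbook}

Feedback on the current attempt:
{feedback}
"""


def build_teacher_prompt(question, playbook_entries, feedback, solution=None):
    """Fill the template; append a cached solution only when one is available."""
    playbook = "\n".join(f"- {entry}" for entry in playbook_entries) or "(none)"
    prompt = TEACHER_TEMPLATE.format(
        question=question, playbook=playbook, feedback=feedback
    )
    if solution is not None:
        prompt += f"\nA previously successful solution:\n{solution}\n"
    return prompt


prompt = build_teacher_prompt(
    "Write a program that accepts tapes starting with red.",
    ["Check the empty-tape case.", "Avoid unreachable nodes."],
    "Test 2 failed: expected accept, got reject.",
)
```

Keeping the solution section optional mirrors the optional solution buffer: on tasks with near-zero initial success, the prompt simply omits it.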

Datasets

We evaluate on four tasks spanning program synthesis, physical reasoning, and financial NER. All tasks provide rich execution feedback (e.g., per-test-case pass/fail) despite using sparse binary rewards.

Task	Source	Train	Test	Description
Manufactoria-Has	RL-Grok	742	132	Write DSL programs to check input tape patterns
BouncingSim-Easy	RL-Grok	640	100	Simulate 2-D multi-object bouncing dynamics (easy)
BouncingSim-Medium	RL-Grok	320	100	Simulate 2-D multi-object bouncing dynamics (medium)
FiNER	ACE	1000	500	Tag financial named entities in SEC filings

Key characteristics:

RL-Grok tasks (Manufactoria, BouncingSim): Near-zero initial success rates with per-test-case pass/fail feedback. A natural testbed for learning from failure feedback, requiring the model to synthesize executable programs from scratch.
FiNER: Higher initial success rate with per-entity correctness feedback. Assesses whether RESD benefits regimes where successful demonstrations are more accessible.

Results

⚠️ Note: Performance may vary between runs due to differences in rollout quality.

Comparison with SDPO

Comparison with GRPO

Installation

You can either install from the conda environment file or pull our pre-built Docker image.

Install via conda

conda env create -f environment.yml

Docker Environment

docker run --gpus all --shm-size=64g --rm -it --net=host \
 --entrypoint /usr/bin/bash \
 brandonzyw/resd:v2

Run Examples

We provide out-of-the-box scripts in the selfevolve/resd/ directory for training with different settings.

Before running, set your Weights & Biases API key:

export WANDB_API_KEY=<your-wandb-api-key>

SDPO

bash selfevolve/resd/run_manufactoria_has_sdpo_stream_qwen3_4b_fsdp.sh # Manufactoria-Has

bash selfevolve/resd/run_finer_sdpo_stream_qwen3_4b_fsdp.sh # FiNER

bash selfevolve/resd/run_bouncingsim_multiobj_easy_sdpo_stream_qwen3_4b_fsdp.sh # BouncingSim-Easy

bash selfevolve/resd/run_bouncingsim_multiobj_medium_sdpo_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

GRPO

bash selfevolve/resd/run_manufactoria_has_grpo_stream_qwen3_4b_fsdp.sh # Manufactoria-Has

bash selfevolve/resd/run_finer_grpo_stream_qwen3_4b_fsdp.sh # FiNER

bash selfevolve/resd/run_bouncingsim_multiobj_easy_grpo_stream_qwen3_4b_fsdp.sh # BouncingSim-Easy

bash selfevolve/resd/run_bouncingsim_multiobj_medium_grpo_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

RESD (Ours)

bash selfevolve/resd/run_manufactoria_has_resd_stream_qwen3_4b_fsdp.sh # Manufactoria-Has

bash selfevolve/resd/run_finer_resd_stream_qwen3_4b_fsdp.sh # FiNER

bash selfevolve/resd/run_bouncingsim_multiobj_easy_resd_stream_qwen3_4b_fsdp.sh # BouncingSim-Easy

bash selfevolve/resd/run_bouncingsim_multiobj_medium_resd_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

Run Logs

Dataset	W&B
Manufactoria-Has
BouncingSim-Easy
BouncingSim-Medium
FiNER

Acknowledgement

We gratefully acknowledge the contributions of the veRL team for providing a solid RL infrastructure.

Special thanks to the SDPO and ACE projects for their codebases, which inspired early design choices during the development of RESD.

We also thank the developers of RL-Grok for providing the data source.

Citation

If you find RESD useful in your research or applications, we would appreciate it if you could cite our work:

@misc{zhang2026learningraresuccessrich,
      title={Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation}, 
      author={Yuwei Zhang and Sha Li and Changlong Yu and Qin Lu and Shuowei Jin and Chengyu Dong and Haoran Liu and Ilgee Hong and Xintong Li and Zhenyu Shi and Bing Yin and Jingbo Shang},
      year={2026},
      eprint={2605.12741},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.12741}, 
}

We're excited to share our early results and welcome feedback from the community as we continue to refine and expand RESD’s capabilities. If you have any questions or feedback, please feel free to contact us at yuz163@ucsd.edu.
