horizon-llm/RESD

RESD

Learning from Rare Success and Rich Feedback via Reflection-Enhanced Self-Distillation

Notion Blog   arXiv Paper   GitHub Project   X Channel

RESD is an implementation of on-policy self-distillation built on veRL and SDPO.

Unlike the original SDPO, RESD maintains two persistent contexts: a playbook, inspired by the broader idea behind ACE, that stores reusable lessons distilled from previous failures, and an optional solution buffer that caches successful trajectories when available. At each training step, RESD first updates these contexts using the outcome of the current rollout: it removes playbook entries based on their utility and staleness and adds new entries generated from reflections. The teacher model is then synchronized with the student model via an exponential moving average (EMA) update and conditioned on the enriched context to produce token-level supervision.
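The playbook update described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the `PlaybookEntry` fields, the scoring rule (utility minus a staleness penalty), and the `max_entries` and `staleness_weight` parameters are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PlaybookEntry:
    """Hypothetical playbook entry; the field names are illustrative only."""
    lesson: str
    utility: float = 0.0      # assumed: how much the lesson has helped so far
    last_used_step: int = 0   # assumed: last step at which it was useful


def update_playbook(playbook, reflections, step, max_entries=4, staleness_weight=0.1):
    """Prune low-scoring entries, then append lessons from new reflections."""
    def score(entry):
        staleness = step - entry.last_used_step
        return entry.utility - staleness_weight * staleness

    # Leave room for the new lessons so the total never exceeds max_entries.
    new_lessons = reflections[:max_entries]
    keep = max(max_entries - len(new_lessons), 0)
    survivors = sorted(playbook, key=score, reverse=True)[:keep]
    added = [PlaybookEntry(lesson=text, last_used_step=step) for text in new_lessons]
    return survivors + added


playbook = [
    PlaybookEntry("Check the empty input first", utility=2.0, last_used_step=9),
    PlaybookEntry("Avoid unbounded loops", utility=0.5, last_used_step=1),
    PlaybookEntry("Handle odd-length input", utility=1.0, last_used_step=8),
]
playbook = update_playbook(playbook, ["Re-read the failing test case"], step=10, max_entries=3)
print([entry.lesson for entry in playbook])
# ['Check the empty input first', 'Handle odd-length input', 'Re-read the failing test case']
```

In this sketch the stale, low-utility entry is dropped to make room for the new lesson. The actual criteria used by RESD may differ.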

RESD lets the model actively interpret feedback instead of passively receiving it, which we found to be a key design axis for improving performance.
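For readers unfamiliar with EMA teacher synchronization, here is a minimal sketch of the standard update rule, `teacher = decay * teacher + (1 - decay) * student`. The `ema_update` helper, the dictionary-of-lists weight layout, and the `decay` value are assumptions for illustration. They are not taken from the RESD code, which operates on real model parameters.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy weights standing in for model parameters.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)  # each teacher weight moves 10% of the way toward the student
```

A decay close to 1 keeps the teacher stable, while a smaller decay makes it track the student more closely.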

News

  • [2026.05.12] Code released.

Framework Comparison

[Figure: framework comparison]

Key Features

  • Fast Playbook Curation & Condensation

    RESD reflects on failed trajectories and curates playbook entries based on those reflections. To cap the number of entries, the playbook is condensed by entry utility and staleness before curation. See selfevolve/resd/context_updater/playbook_context_updater.py.

  • Interleaved Context Update & Model Update

    RESD interleaves context updates with model updates. At each gradient step, the context is first updated based on student rollouts, and the model update follows. This design ensures the rollouts are always on-policy.

  • Stream Training

    RESD supports streaming training, in which the model makes a single pass over the training data and sees each training example at most once. For every incoming batch, the trainer executes an inner loop of up to K update iterations on the same set of prompts. See selfevolve/resd/trainer/ppo/stream_trainer.py.

  • Customize Feedback Format

    RESD lets you customize the teacher prompt structure. See selfevolve/resd/context_updater/prompts.
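The control flow behind the interleaved updates and stream training features can be sketched as below. This is a simplified illustration under stated assumptions, not the logic of stream_trainer.py: `stream_train`, its callback arguments, the trajectory dictionaries, and the early exit once every rollout succeeds are all hypothetical.

```python
def stream_train(batches, rollout, update_context, update_model, max_inner_iters=2):
    """Single pass over `batches`, with up to `max_inner_iters` updates per batch."""
    context = []
    log = []
    for batch_id, prompts in enumerate(batches):         # each batch is seen once
        for k in range(max_inner_iters):                 # inner loop on the same prompts
            trajectories = rollout(prompts, context)     # fresh rollouts from the student
            context = update_context(context, trajectories)  # context update first
            update_model(prompts, trajectories, context)     # model update afterwards
            log.append((batch_id, k))
            if all(t["success"] for t in trajectories):
                break  # assumed stop rule; "up to K" leaves the exact rule open
    return context, log


def fake_rollout(prompts, context):
    # Toy policy: a prompt succeeds once the context holds at least one lesson.
    return [{"prompt": p, "success": len(context) > 0} for p in prompts]


def fake_update_context(context, trajectories):
    failures = [t for t in trajectories if not t["success"]]
    return context + [f"lesson from {t['prompt']}" for t in failures]


def fake_update_model(prompts, trajectories, context):
    pass  # a real trainer would take a gradient step here


context, log = stream_train(
    [["p1"], ["p2"]], fake_rollout, fake_update_context, fake_update_model
)
print(log)      # [(0, 0), (0, 1), (1, 0)]
print(context)  # ['lesson from p1']
```

Because rollouts are regenerated at the start of every inner iteration, each model update in this sketch uses trajectories sampled from the current student.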

Datasets

We evaluate on four tasks spanning program synthesis, physical reasoning, and financial NER. All tasks provide rich execution feedback (e.g., per-test-case pass/fail) despite using sparse binary rewards.

| Task | Source | Train | Test | Description |
| --- | --- | --- | --- | --- |
| Manufactoria-Has | RL-Grok | 742 | 132 | Write DSL programs to check input tape patterns |
| BouncingSim-Easy | RL-Grok | 640 | 100 | Simulate 2-D multi-object bouncing dynamics (easy) |
| BouncingSim-Medium | RL-Grok | 320 | 100 | Simulate 2-D multi-object bouncing dynamics (medium) |
| FiNER | ACE | 1000 | 500 | Tag financial named entities in SEC filings |

Key characteristics:

  • RL-Grok tasks (Manufactoria, BouncingSim): Near-zero initial success rates with per-test-case pass/fail feedback. A natural testbed for learning from failure feedback, requiring the model to synthesize executable programs from scratch.
  • FiNER: Higher initial success rate with per-entity correctness feedback. Assesses whether RESD benefits regimes where successful demonstrations are more accessible.
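To make the distinction between sparse binary rewards and rich feedback concrete, here is a small sketch. The `score_submission` helper, the test-case names, and the rule that the reward is 1 only when every test case passes are assumptions for illustration. The actual reward and feedback formats are defined by each task.

```python
def score_submission(test_results):
    """Return (binary_reward, feedback_text) from per-test-case outcomes."""
    # Sparse signal: a single bit for the whole submission (assumed all-pass rule).
    reward = 1.0 if test_results and all(test_results.values()) else 0.0
    # Rich signal: one line per test case, usable as textual feedback.
    lines = [
        f"{name}: {'pass' if ok else 'fail'}"
        for name, ok in test_results.items()
    ]
    return reward, "\n".join(lines)


reward, feedback = score_submission(
    {"case_1": True, "case_2": False, "case_3": True}
)
print(reward)  # 0.0
print(feedback)
```

The reward alone says only that the submission failed, while the feedback text identifies which test case to reflect on.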

Results

⚠️ Note: Performance may vary between runs due to differences in rollout quality.

Comparison with SDPO

[Figure: comparison with SDPO]

Comparison with GRPO

[Figure: comparison with GRPO]

Installation

You can install from the conda environment file or pull our pre-built Docker image.

Install via conda

conda env create -f environment.yml

Docker Environment

docker run --gpus all --shm-size=64g --rm -it --net=host \
 --entrypoint /usr/bin/bash \
 brandonzyw/resd:v2

Run Examples

We provide out-of-the-box scripts in the selfevolve/resd/ directory for training with different settings.

Before running, set your Weights & Biases API key:

export WANDB_API_KEY=<your-wandb-api-key>

SDPO

bash selfevolve/resd/run_manufactoria_has_sdpo_stream_qwen3_4b_fsdp.sh # Manufactoria-Has
bash selfevolve/resd/run_finer_sdpo_stream_qwen3_4b_fsdp.sh # FiNER
bash selfevolve/resd/run_bouncingsim_multiobj_easy_sdpo_stream_qwen3_4b_fsdp.sh # BouncingSim-Easy
bash selfevolve/resd/run_bouncingsim_multiobj_medium_sdpo_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

GRPO

bash selfevolve/resd/run_manufactoria_has_grpo_stream_qwen3_4b_fsdp.sh # Manufactoria-Has
bash selfevolve/resd/run_finer_grpo_stream_qwen3_4b_fsdp.sh # FiNER
bash selfevolve/resd/run_bouncingsim_multiobj_easy_grpo_stream_qwen3_4b_fsdp.sh # BouncingSim-Easy
bash selfevolve/resd/run_bouncingsim_multiobj_medium_grpo_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

RESD (Ours)

bash selfevolve/resd/run_manufactoria_has_resd_stream_qwen3_4b_fsdp.sh # Manufactoria-Has
bash selfevolve/resd/run_finer_resd_stream_qwen3_4b_fsdp.sh # FiNER
bash selfevolve/resd/run_bouncingsim_multiobj_easy_resd_stream_qwen3_4b_fsdp.sh # BouncingSim-Easy
bash selfevolve/resd/run_bouncingsim_multiobj_medium_resd_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

Run Logs

| Dataset | W&B |
| --- | --- |
| Manufactoria-Has | wandb |
| BouncingSim-Easy | wandb |
| BouncingSim-Medium | wandb |
| FiNER | wandb |

Acknowledgement

We gratefully acknowledge the contributions of the veRL team for providing a solid RL infrastructure.

Special thanks to the SDPO and ACE projects for their codebases, which inspired early design choices during the development of RESD.

We also thank the developers of RL-Grok for providing the data source.

Citation

If you find RESD useful in your research or applications, we would appreciate it if you could cite our work:

@misc{zhang2026learningraresuccessrich,
      title={Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation}, 
      author={Yuwei Zhang and Sha Li and Changlong Yu and Qin Lu and Shuowei Jin and Chengyu Dong and Haoran Liu and Ilgee Hong and Xintong Li and Zhenyu Shi and Bing Yin and Jingbo Shang},
      year={2026},
      eprint={2605.12741},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.12741}, 
}

We're excited to share our early results and welcome feedback from the community as we continue to refine and expand RESD’s capabilities. If you have any questions or feedback, please feel free to contact us at yuz163@ucsd.edu.
