Learning from Rare Success and Rich Feedback via Reflection-Enhanced Self-Distillation
RESD is an implementation of on-policy self-distillation built on veRL and SDPO.
Unlike the original SDPO, RESD maintains two persistent contexts: a playbook, inspired by the broader idea behind ACE, that stores reusable lessons distilled from previous failures, and an optional solution buffer that caches successful trajectories when available. At each training step, RESD first updates these contexts using the outcome of the current rollout, by removing playbook entries based on their utility and staleness and by adding new entries generated from reflections. The teacher model is then synchronized with the student model via an EMA update and conditioned on the enriched context to produce token-level supervision.
RESD lets the model actively interpret the feedback instead of passively receiving it, which we found to be a key design axis for improving performance.
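The two update steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the RESD implementation: the `PlaybookEntry` fields, the scoring rule (utility minus a staleness penalty), `staleness_weight`, and the `decay` value are all assumptions made for the example, and model parameters are stood in for by a dict of floats.

```python
from dataclasses import dataclass


@dataclass
class PlaybookEntry:
    lesson: str          # reusable lesson distilled from a failed rollout
    utility: float       # hypothetical usefulness score
    last_used_step: int  # hypothetical: last step at which the entry helped


def prune_playbook(entries, step, max_entries, staleness_weight=0.1):
    """Keep the max_entries highest-scoring entries.

    Assumed score: utility minus a penalty that grows with staleness.
    """
    def score(entry):
        return entry.utility - staleness_weight * (step - entry.last_used_step)

    return sorted(entries, key=score, reverse=True)[:max_entries]


def ema_update(teacher, student, decay=0.99):
    """EMA sync: teacher <- decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


playbook = [
    PlaybookEntry("Check the empty tape first", utility=0.9, last_used_step=9),
    PlaybookEntry("Old hint", utility=0.5, last_used_step=0),
    PlaybookEntry("Handle repeated symbols", utility=0.6, last_used_step=8),
]
kept = prune_playbook(playbook, step=10, max_entries=2)
teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
```

In this toy run the stale "Old hint" entry is dropped even though its utility is not the lowest by much, which is the kind of trade-off a utility-plus-staleness rule is meant to capture.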
- [2026.05.12] Code released.
- Fast Playbook Curation & Concise: RESD reflects on failed trajectories and curates playbook entries based on those reflections. To cap the number of entries, the playbook is pruned by entry utility and staleness before curation. Check out `selfevolve/resd/context_updater/playbook_context_updater.py`.
- Interleaved Context Update & Model Update: RESD supports interleaved context and model updates. At each gradient step, the context is first updated from the student rollouts, and the model update follows. This design keeps the rollouts on-policy.
- Stream Training: RESD supports streaming training, where the model makes a single pass over the training data and each training example is seen at most once. For every incoming batch, the trainer runs an inner loop of up to K update iterations on the same set of prompts. Check out `selfevolve/resd/trainer/ppo/stream_trainer.py`.
- Customize Feedback Format: RESD lets you customize the teacher prompt structure. Check out `selfevolve/resd/context_updater/prompts`.
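The interleaved update and the streaming inner loop can be pictured with the sketch below. It is only an illustration under stated assumptions: `stream_train`, the three callbacks, the `k_max` name, and the early exit once every rollout succeeds are hypothetical and are not taken from `stream_trainer.py`.

```python
def stream_train(batches, rollout, update_context, update_model, k_max=4):
    """Single pass over the stream with up to k_max inner iterations per batch."""
    context = []
    for prompts in batches:  # each batch is visited once in the stream
        for _ in range(k_max):
            results = rollout(prompts, context)         # student rollouts
            context = update_context(context, results)  # context update first
            update_model(results, context)              # model update afterwards
            if all(r["success"] for r in results):      # assumed early exit
                break
    return context


# Toy stand-ins so the loop can be run end to end.
calls = {"rollouts": 0, "updates": 0}


def toy_rollout(prompts, context):
    calls["rollouts"] += 1
    # Toy rule: prompts "succeed" once the context holds at least two lessons.
    return [{"prompt": p, "success": len(context) >= 2} for p in prompts]


def toy_update_context(context, results):
    failed = [r["prompt"] for r in results if not r["success"]]
    if failed:
        return context + [f"lesson from {failed[0]}"]
    return context


def toy_update_model(results, context):
    calls["updates"] += 1


final_context = stream_train(
    [["p1", "p2"], ["p3"]], toy_rollout, toy_update_context, toy_update_model
)
```

Because the rollouts are regenerated at the start of every inner iteration, each model update in this sketch consumes samples drawn after the previous update.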
We evaluate on four tasks spanning program synthesis, physical reasoning, and financial NER. All tasks provide rich execution feedback (e.g., per-test-case pass/fail) despite using sparse binary rewards.
| Task | Source | Train | Test | Description |
|---|---|---|---|---|
| Manufactoria-Has | RL-Grok | 742 | 132 | Write DSL programs to check input tape patterns |
| BouncingSim-Easy | RL-Grok | 640 | 100 | Simulate 2-D multi-object bouncing dynamics (easy) |
| BouncingSim-Medium | RL-Grok | 320 | 100 | Simulate 2-D multi-object bouncing dynamics (medium) |
| FiNER | ACE | 1000 | 500 | Tag financial named entities in SEC filings |
Key characteristics:
- RL-Grok tasks (Manufactoria, BouncingSim): Near-zero initial success rates with per-test-case pass/fail feedback. A natural testbed for learning from failure feedback, requiring the model to synthesize executable programs from scratch.
- FiNER: Higher initial success rate with per-entity correctness feedback. Assesses whether RESD benefits regimes where successful demonstrations are more accessible.
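To make the distinction between the sparse reward and the rich feedback concrete, here is a small sketch. The `score_rollout` helper and the rule that the reward is 1 only when every test case passes are assumptions for illustration; the actual reward functions of the tasks may differ.

```python
def score_rollout(test_results):
    """Turn per-test-case outcomes into (binary reward, textual feedback).

    test_results: list of (test_name, passed) pairs.
    Assumption: reward is 1.0 only if there is at least one test and all pass.
    """
    all_passed = bool(test_results) and all(passed for _, passed in test_results)
    reward = 1.0 if all_passed else 0.0
    feedback = "\n".join(
        f"{name}: {'pass' if passed else 'fail'}" for name, passed in test_results
    )
    return reward, feedback


reward, feedback = score_rollout([("case_1", True), ("case_2", False)])
```

Here the reward collapses to a single 0.0, while the feedback string still records which test case failed; that per-case text is what a reflection step can work from.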
⚠️ Note: Performance may vary between runs due to differences in rollout quality.
You can install from the conda environment config file or simply pull our pre-built Docker image.

With conda:

    conda env create -f environment.yml

With Docker:

    docker run --gpus all --shm-size=64g --rm -it --net=host \
        --entrypoint /usr/bin/bash \
        brandonzyw/resd:v2

We provide out-of-the-box scripts in the `selfevolve/resd/` directory for training with different settings.
Before running, set your Weights & Biases API key:

    export WANDB_API_KEY=<your-wandb-api-key>

Then launch the script for the method and task you want:

    # SDPO
    bash selfevolve/resd/run_manufactoria_has_sdpo_stream_qwen3_4b_fsdp.sh             # Manufactoria-Has
    bash selfevolve/resd/run_finer_sdpo_stream_qwen3_4b_fsdp.sh                        # FiNER
    bash selfevolve/resd/run_bouncingsim_multiobj_easy_sdpo_stream_qwen3_4b_fsdp.sh    # BouncingSim-Easy
    bash selfevolve/resd/run_bouncingsim_multiobj_medium_sdpo_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

    # GRPO
    bash selfevolve/resd/run_manufactoria_has_grpo_stream_qwen3_4b_fsdp.sh             # Manufactoria-Has
    bash selfevolve/resd/run_finer_grpo_stream_qwen3_4b_fsdp.sh                        # FiNER
    bash selfevolve/resd/run_bouncingsim_multiobj_easy_grpo_stream_qwen3_4b_fsdp.sh    # BouncingSim-Easy
    bash selfevolve/resd/run_bouncingsim_multiobj_medium_grpo_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

    # RESD
    bash selfevolve/resd/run_manufactoria_has_resd_stream_qwen3_4b_fsdp.sh             # Manufactoria-Has
    bash selfevolve/resd/run_finer_resd_stream_qwen3_4b_fsdp.sh                        # FiNER
    bash selfevolve/resd/run_bouncingsim_multiobj_easy_resd_stream_qwen3_4b_fsdp.sh    # BouncingSim-Easy
    bash selfevolve/resd/run_bouncingsim_multiobj_medium_resd_stream_qwen3_30b_fsdp.sh # BouncingSim-Medium

| Dataset | W&B |
|---|---|
| Manufactoria-Has | |
| BouncingSim-Easy | |
| BouncingSim-Medium | |
| FiNER | |

We gratefully acknowledge the contributions of the veRL team for providing a solid RL infrastructure.
Special thanks to the SDPO and ACE projects for their codebases, which inspired early design choices during the development of RESD.
We also thank the developers of RL-Grok for providing the data source.
If you find RESD useful in your research or applications, we would appreciate it if you could cite our work:
@misc{zhang2026learningraresuccessrich,
title={Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation},
author={Yuwei Zhang and Sha Li and Changlong Yu and Qin Lu and Shuowei Jin and Chengyu Dong and Haoran Liu and Ilgee Hong and Xintong Li and Zhenyu Shi and Bing Yin and Jingbo Shang},
year={2026},
eprint={2605.12741},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.12741},
}
We're excited to share our early results and welcome feedback from the community as we continue to refine and expand RESD’s capabilities. If you have any questions or feedback, please feel free to contact us at yuz163@ucsd.edu.


