Reference implementation for the paper:
Self-Trained Verification for Training- and Test-Time Self-Improvement
Chen Henry Wu, Aditi Raghunathan
Preprint 2026
Project page: https://ar-forum.github.io/stv-webpage/
verl/ is a customized fork of verl; set it up
the same way you would set up upstream verl (PyTorch with a CUDA build, vLLM,
flash-attn, and pip install -r requirements.txt). Training assumes a single
8×GPU node.
The pipeline starts from DAPO math rollouts binned by the base generator's (post-trained Qwen-38B) pass@1. Point --data_dir at a directory of parquet files named
{split}_{bin}.parquet (default bins 0.0_0.0 = Hardest, 0.0_0.2 = Hard),
each row having:
prompt:list[{"role": ..., "content": ...}]reward_model:{"ground_truth": "<answer>"}data_source: the bin identifier
The data-prep scripts write their outputs to verifier/data/ by default. The
distill step (prepare_verifier_distill_data.py) reads oracle reference solutions
for the reference-conditioned teacher from --train_problems_dir /
--val_problems_dir as {split}_problems_{bin}.jsonl with a problem field and an
oracle-solution field.
# (a) DAPO rollouts -> RL verifier parquets
python verifier/prepare_verifier_data.py \
--data_dir <your_dapo_rollouts> --output_dir verifier/data \
--bins 0.0_0.0,0.0_0.2 --rollout_n 8
# (b) add the reference-conditioned teacher prompt (ref_prompt column)
python verifier/prepare_verifier_distill_data.py \
--rl_train_file verifier/data/dapo_rl_train.parquet \
--rl_val_file verifier/data/dapo_rl_val.parquet
# (c) hybrid OPD + RL verifier training
TRAIN_FILES="['verifier/data/dapo_distill_train.parquet']" \
VAL_FILES="['verifier/data/dapo_rl_val.parquet']" \
./verifier/run_opd_rl_verifier_hard.shThe student verifier sees problem + solution; the teacher reads a
privileged ref_prompt (problem + solution + verdict + reference solution). The
objective is OPD_COEF * opd_loss + PG_COEF * policy_gradient_loss.
# (a) build multi-turn [problem, attempt, verifier-feedback] prompts
python verifier/prepare_generator_feedback_data.py \
--data_dir <your_dapo_rollouts> --output_dir verifier/data \
--verifier_model <STV_verifier_checkpoint> --rollout_n 8
# (b) continue training the converged RLVR generator inside the V-R loop
# NOTE: this step assumes you have a generator trained with RLVR
TRAIN_FILES="['verifier/data/dapo_generator_feedback_train.parquet']" \
VAL_FILES="['verifier/data/dapo_generator_feedback_val.parquet']" \
./verifier/run_grpo_generator_feedback_hard_qwen3_8b.shpython verifier/run_experiment0.py \
--generator_model <generator_checkpoint> \
--verifier_mode self \
--max_rounds 20 --problems_per_bin 150 --chains_per_problem 32@article{wu2026stv,
title = {Self-Trained Verification for Training- and Test-Time Self-Improvement},
author = {Wu, Chen Henry and Raghunathan, Aditi},
journal = {arXiv preprint arXiv:2605.30290},
year = {2026},
}