(ICLR 2026 Oral) Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
This repository contains the official implementation of T3 as described in the paper Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents by Deyu Zou, Yongqiang Chen, Jianxiang Wang, Garry YANG, Mufei Li, Qing Da, James Cheng, Pan Li, Yu Gong, which has been selected as an ICLR 2026 Oral Presentation.
This repository contains code for the core T3 method, preprocessing, training/evaluation pipelines and scripts, and experiment setups from the paper. We have been continuously extending this repository to support more general, popular, and realistic agentic scenarios, so that T3 can be studied in broader interactive reasoning settings.
- In our new work, we identify a unique mechanism, information self-locking, under multi-turn agentic reasoning and propose AREW to fix that. The corresponding code will be merged into this repository in a future update.
- We have applied T3 and AREW to tau2-bench and released the code and results in this repository; see the section Applicability to General Agentic Scenarios. Results on the effectiveness of T3 and AREW in Deep-Research and SWE settings will be released.
- TODOs
- Environment Setup
- Data Preparation
- Training
- Evaluation and Reproduction
- Applicability to General Agentic Scenarios
- Extending the Repository
- Repository Structure
- Citation
The packaged code lives under verl/, so installation is done from that subdirectory.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ./verl
For the broader dependency set used by the bundled verl fork:
pip install -r verl/requirements.txt
Notes: the versions of the key packages in the environment we currently use are:
- Python 3.11
- PyTorch 2.7.1
- vLLM 0.10.1
- Ray 2.10.0.19
- Transformers 4.55.4
- flash-attn 2.8.0.post2
- accelerate 1.10.1
The released preprocessing scripts write parquet files in the format expected by main_ppo.py. The datasets (in parquet format) can be found here.
Default raw file location:
<workspace>/CircuitDecoding/cd_raw_2_circuits_<cand>_cand.jsonl
Conversion:
python3 verl/preprocess/data_process/cd.py \
--local_dir /path/to/workspace
This produces:
- CircuitDecoding/train__cand_20.parquet
- CircuitDecoding/val__cand_20.parquet
The script also supports --input_file, --train_size, --val_size, --train_output, and --val_output.
Default raw file location:
<workspace>/MovieRec/mr_seen_10_un_10_attr_8.jsonl
Conversion:
python3 verl/preprocess/data_process/mr.py \
--local_dir /path/to/workspace
This produces:
- MovieRec/train_seen_10_un_10_attr_8_variant.parquet
- MovieRec/val_seen_10_un_10_attr_8_variant.parquet
The script also supports overriding the input path, output path, split sizes, data_source, and controller variant.
Tau2Bench data is generated from the task definitions under verl/search_r1/tau2_adapter/.
Example:
python3 verl/preprocess/data_process/tau2.py \
--local_dir /path/to/workspace \
--domain telecom \
--train_split train \
--val_split test \
--enable_think \
--think_mode short
This writes parquet files under:
/path/to/workspace/Tau2Bench/telecom/
If you need filenames aligned with a specific training wrapper, set --train_output and --val_output, or override TRAIN_FILE and VAL_FILE when launching training.
The canonical entry point is:
python3 -m verl.trainer.main_ppo ...
The example scripts under verl/cmd/ are thin wrappers around this command.
bash verl/cmd/cd/ppo.sh
Key environment overrides:
- PROJECT_ROOT
- DATA_DIR
- BASE_MODEL
- OUTPUT_ROOT
- NUM_GPUS
- TRAIN_FILE
- VAL_FILE
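One way to apply these overrides from Python is to build the environment first and then hand it to the wrapper script. This is a minimal sketch: the variable names come from the list above, while the helper `launch_env` and every value shown are hypothetical placeholders. How each wrapper interprets the variables should be checked in the script itself.

```python
import os

# Override names taken from the list above; values are supplied by the caller.
KNOWN_OVERRIDES = (
    "PROJECT_ROOT", "DATA_DIR", "BASE_MODEL", "OUTPUT_ROOT",
    "NUM_GPUS", "TRAIN_FILE", "VAL_FILE",
)


def launch_env(overrides, base=None):
    """Return a copy of the environment with the given overrides applied.

    `launch_env` is a hypothetical helper, not part of the repository.
    """
    unknown = sorted(set(overrides) - set(KNOWN_OVERRIDES))
    if unknown:
        raise KeyError(f"unknown override(s): {unknown}")
    env = dict(os.environ if base is None else base)
    # Environment values must be strings.
    env.update({key: str(value) for key, value in overrides.items()})
    return env


# Placeholder values; replace with real paths and GPU counts.
env = launch_env({"NUM_GPUS": 4, "DATA_DIR": "/path/to/workspace"})
# To launch (requires `import subprocess`):
# subprocess.run(["bash", "verl/cmd/cd/ppo.sh"], env=env, check=True)
```

Rejecting unknown names catches typos such as a misspelled variable, which a shell wrapper would otherwise silently ignore in favor of its default.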
bash verl/cmd/mrv/ppo.sh
The structure is the same as CircuitDecoding, with MovieRec-specific default parquet names.
bash verl/cmd/tau2/ppo1.1.sh
- verl/cmd/tau2/ppo.sh for a PPO-style baseline
- verl/cmd/tau2/ppo1.1.sh and related variants for T3-enabled settings
- verl/cmd/tau2/ppo1.2.sh and related variants for AREW-enabled settings
Evaluation is run through the same PPO entry point in validation-only mode, e.g.,
bash verl/cmd/cd/eval.sh
The script merges FSDP checkpoints to Hugging Face format before validation.
T3 is intended to be applicable beyond a single benchmark or environment family.
We evaluate on Tau2Bench-Telecom, a multi-turn tool-use benchmark where the agent must resolve realistic customer-service tickets by interacting with an environment through API-like tools. In our experiments, we use the solo mode setting, i.e., we disable the LLM-simulated user and let the policy interact directly with the environment/tool interface.
For this setting, we derive simple step-level signals directly from the online interaction trace: a step is labeled positive if it increases the number of matched expected actions in the benchmark evaluator, negative if it corresponds to an obvious failure such as a tool error, invalid or malformed action, repeated action, or a write that has no effect, and neutral otherwise. AREW uses these labels to perform within-trajectory advantage redistribution. T3 uses the same signals for trajectory truncation; in our current Tau2Bench setup we use a conservative soft truncation policy with trunc_strength = 8 and set the hard truncation threshold to 999, which effectively disables hard truncation. See details in verl/search_r1/tau2_adapter.
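The labeling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the `Step` record and all of its field names are hypothetical, and checking the positive condition before the failure conditions is an assumed tie-break. The actual logic lives in verl/search_r1/tau2_adapter.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # All fields are hypothetical stand-ins for what the interaction trace records.
    matched_before: int        # matched expected actions before this step
    matched_after: int         # matched expected actions after this step
    tool_error: bool = False   # the tool call returned an error
    malformed: bool = False    # invalid or malformed action
    repeated: bool = False     # same action as an earlier step
    noop_write: bool = False   # a write that had no effect


def label_step(step: Step) -> int:
    """Return +1 (positive), -1 (negative), or 0 (neutral) for one step."""
    # Positive: the step increased the number of matched expected actions.
    if step.matched_after > step.matched_before:
        return 1
    # Negative: any of the obvious failure modes.
    if step.tool_error or step.malformed or step.repeated or step.noop_write:
        return -1
    # Neutral: everything else.
    return 0


trace = [Step(0, 1), Step(1, 1, repeated=True), Step(1, 1)]
labels = [label_step(s) for s in trace]
```

The resulting per-step labels are the kind of signal that AREW would redistribute advantages over and that T3 would consult for truncation; neither of those two mechanisms is reproduced here.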
Comparing vanilla PPO with PPO equipped with T3
Comparing vanilla PPO with PPO equipped with AREW
The default data format consumed by create_rl_dataset() in verl/verl/trainer/main_ppo.py expects records with fields such as:
- prompt
- answer
- data_source
- ability
- reward_model
- extra_info
If the task also needs custom environment metadata or reward-time controller information, include a controller field as done by the released T3 datasets.
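Before writing a new dataset to parquet, it can help to check each record against the field names above. The sketch below is a hypothetical helper, not code from the repository: `missing_fields` is an invented name, the example values are placeholders, and the source only says "fields such as", so the real schema may include more than is listed here.

```python
# Field names taken from the list above; the schema may contain more.
REQUIRED_FIELDS = (
    "prompt", "answer", "data_source", "ability", "reward_model", "extra_info",
)


def missing_fields(record, needs_controller=False):
    """Return the expected field names that are absent from `record`."""
    required = REQUIRED_FIELDS
    if needs_controller:
        # Tasks with environment metadata also carry a `controller` field.
        required = required + ("controller",)
    return [name for name in required if name not in record]


# Placeholder record; every value here is an assumption for illustration.
example = {
    "prompt": "placeholder prompt",
    "answer": "placeholder answer",
    "data_source": "my_task",
    "ability": "active_reasoning",
    "reward_model": {},
    "extra_info": {},
}

print(missing_fields(example))
print(missing_fields(example, needs_controller=True))
```

The check only confirms that the keys exist. The value types each field must have are determined by the existing preprocessing scripts, which remain the reference.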
For non-standard loading logic, you can either:
- emit the same parquet schema used by the existing preprocessing scripts
- provide a custom dataset through data.custom_cls in the Hydra config
For example, Tau2-style tasks are organized under verl/search_r1/tau2_adapter/. The main extension points are:
- add task data under verl/search_r1/tau2_adapter/data/domains/<domain>/
- implement domain environments and tools under verl/search_r1/tau2_adapter/domains/<domain>/
- register the environment in verl/search_r1/tau2_adapter/loader/registry.py
- keep the rollout contract compatible with Tau2SoloSpace in verl/search_r1/tau2_adapter/space.py
.
├── 8182_Reducing_Belief_Deviation.pdf
├── README.md
└── verl/
    ├── cmd/          # training and evaluation wrappers
    ├── preprocess/   # data conversion scripts
    ├── search_r1/    # interactive environments and rollout helpers
    └── verl/
        └── trainer/
            ├── main_ppo.py
            └── ppo/ray_trainer.py
If you use this repository, please cite the T3 paper. If your use also depends on the underlying framework components, please additionally cite verl.
@inproceedings{zoureducing,
title={Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents},
author={Zou, Deyu and Chen, Yongqiang and Wang, Jianxiang and YANG, Garry and Li, Mufei and Da, Qing and Cheng, James and Li, Pan and Gong, Yu},
booktitle={The Fourteenth International Conference on Learning Representations}
}

