Skip to content

LFhase/T3

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 

Repository files navigation

(ICLR 2026 Oral) Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

This repository contains the official implementation of T3 as described in the paper Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents by Deyu Zou, Yongqiang Chen, Jianxiang Wang, Garry YANG, Mufei Li, Qing Da, James Cheng, Pan Li, Yu Gong, which has been selected as ICLR 2026 Oral Presentation.

This repository contains code for the core T3 method, preprocessing, training/evaluation pipelines and scripts, and experiment setups from the paper. We have been continuously extending this repository to support more general, popular, and realistic agentic scenarios, so that T3 can be studied in broader interactive reasoning settings.

TODOs

  • In our new work, we identify a unique mechanism, information self-locking, under multi-turn agentic reasoning and propose AREW to fix that. The corresponding code will be merged into this repository in a future update.
  • We have applied T3 and AREW to tau2-bench and release the code and results in this repo. Refer to this section: Applicability to General Agentic Scenarios. Results on the effectiveness of T3 and AREW over Deep-Research and SWE settings will be released.

Table of Contents

Environment Setup

The packaged code lives under verl/, so installation is done from that subdirectory.

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ./verl

For the broader dependency set used by the bundled verl fork:

pip install -r verl/requirements.txt

Notes: the following is the version of key packages in the environment we are currently using:

- Python 3.11
- PyTorch 2.7.1
- vLLM 0.10.1
- Ray 2.10.0.19
- Transformers 4.55.4
- flash-attn 2.8.0.post2
- accelerate 1.10.1

Data Preparation

The released preprocessing scripts write parquet files in the format expected by main_ppo.py. The dataset (parquet formats) can be found here.

CircuitDecoding

Default raw file location:

<workspace>/CircuitDecoding/cd_raw_2_circuits_<cand>_cand.jsonl

Conversion:

python3 verl/preprocess/data_process/cd.py \
  --local_dir /path/to/workspace

This produces:

  • CircuitDecoding/train__cand_20.parquet
  • CircuitDecoding/val__cand_20.parquet

The script also supports --input_file, --train_size, --val_size, --train_output, and --val_output.

MovieRec

Default raw file location:

<workspace>/MovieRec/mr_seen_10_un_10_attr_8.jsonl

Conversion:

python3 verl/preprocess/data_process/mr.py \
  --local_dir /path/to/workspace

This produces:

  • MovieRec/train_seen_10_un_10_attr_8_variant.parquet
  • MovieRec/val_seen_10_un_10_attr_8_variant.parquet

The script also supports overriding the input path, output path, split sizes, data_source, and controller variant.

Tau2Bench

Tau2Bench data is generated from the task definitions under verl/search_r1/tau2_adapter/.

Example:

python3 verl/preprocess/data_process/tau2.py \
  --local_dir /path/to/workspace \
  --domain telecom \
  --train_split train \
  --val_split test \
  --enable_think \
  --think_mode short

This writes parquet files under:

/path/to/workspace/Tau2Bench/telecom/

If you need filenames aligned with a specific training wrapper, set --train_output and --val_output, or override TRAIN_FILE and VAL_FILE when launching training.

Training

Core Entry Point

The canonical entry point is:

python3 -m verl.trainer.main_ppo ...

The example scripts under verl/cmd/ are thin wrappers around this command.

CircuitDecoding

bash verl/cmd/cd/ppo.sh

Key environment overrides:

  • PROJECT_ROOT
  • DATA_DIR
  • BASE_MODEL
  • OUTPUT_ROOT
  • NUM_GPUS
  • TRAIN_FILE
  • VAL_FILE

MovieRec

bash verl/cmd/mrv/ppo.sh

The structure is the same as CircuitDecoding, with MovieRec-specific default parquet names.

Tau2Bench

bash verl/cmd/tau2/ppo1.1.sh
  • verl/cmd/tau2/ppo.sh for a PPO-style baseline
  • verl/cmd/tau2/ppo1.1.sh and related variants for T3-enabled settings
  • verl/cmd/tau2/ppo1.2.sh and related variants for AREW-enabled settings

Evaluation and Reproduction

Evaluation is run through the same PPO entry point in validation-only mode, eg,

bash verl/cmd/cd/eval.sh

The script merges FSDP checkpoints to Hugging Face format before validation.

Applicability to General Agentic Scenarios

T3 is intended to be applicable beyond a single benchmark or environment family.

1. Tau2Bench

We evaluate on Tau2Bench-Telecom, a multi-turn tool-use benchmark where the agent must resolve realistic customer-service tickets by interacting with an environment through API-like tools. In our experiments, we use the solo mode setting, i.e., we disable the LLM-simulated user and let the policy interact directly with the environment/tool interface.

For this setting, we derive simple step-level signals directly from the online interaction trace: a step is labeled positive if it increases the number of matched expected actions in the benchmark evaluator, negative if it corresponds to an obvious failure such as a tool error, invalid or malformed action, repeated action, or a write that has no effect, and neutral otherwise. AREW uses these labels to perform within-trajectory advantage redistribution. T3 uses the same signals for trajectory truncation; in our current Tau2Bench setup we use a conservative soft truncation policy with trunc_strength = 8 and set the hard truncation threshold to 999, which effectively disables hard truncation. See details in verl/search_r1/tau2_adapter.

Comparing vanilla PPO with PPO equipped with T3

paper image

Comparing vanilla PPO with PPO equipped with AREW

paper image

Extending the Repository

Adding a New Dataset

The default data format consumed by create_rl_dataset() in verl/verl/trainer/main_ppo.py expects records with fields such as:

  • prompt
  • answer
  • data_source
  • ability
  • reward_model
  • extra_info

If the task also needs custom environment metadata or reward-time controller information, include a controller field as done by the released T3 datasets.

For non-standard loading logic, you can either:

  • emit the same parquet schema used by the existing preprocessing scripts
  • provide a custom dataset through data.custom_cls in the Hydra config

Adding a New Interactive Scenario

For example, Tau2-style tasks are organized under verl/search_r1/tau2_adapter/. The main extension points are:

  • add task data under verl/search_r1/tau2_adapter/data/domains/<domain>/
  • implement domain environments and tools under verl/search_r1/tau2_adapter/domains/<domain>/
  • register the environment in verl/search_r1/tau2_adapter/loader/registry.py
  • keep the rollout contract compatible with Tau2SoloSpace in verl/search_r1/tau2_adapter/space.py

Repository Structure

.
├── 8182_Reducing_Belief_Deviation.pdf
├── README.md
└── verl/
    ├── cmd/                     # training and evaluation wrappers
    ├── preprocess/              # data conversion scripts
    ├── search_r1/               # interactive environments and rollout helpers
    └── verl/
        └── trainer/
            ├── main_ppo.py
            └── ppo/ray_trainer.py

Citation

If you use this repository, please cite the T3 paper. If your use also depends on the underlying framework components, please additionally cite verl.

@inproceedings{zoureducing,
  title={Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents},
  author={Zou, Deyu and Chen, Yongqiang and Wang, Jianxiang and YANG, Garry and Li, Mufei and Da, Qing and Cheng, James and Li, Pan and Gong, Yu},
  booktitle={The Fourteenth International Conference on Learning Representations}
}

About

(ICLR 2026 Oral) Code for the paper: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 94.0%
  • Shell 6.0%