GitHub - hu-my/RedundancyBench: A benchmark for redundant trajectory detection

Important

If you find this project helpful, please consider giving us a ⭐️!

🧐 Overview

This repository provides the implementation of "Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories", which introduces the task of redundant step detection in LLM agent trajectories. Given a successful trajectory, the goal is to automatically identify steps that consume computational resources but contribute little to task completion.

Redundant step detection offers several key advantages:

Enables fine-grained evaluation: quantifies step quality in agent trajectories beyond coarse-grained efficiency metrics (e.g., number of steps or total cost).
Supports trajectory optimization: detects low-value steps so that agent behavior can be optimized, yielding more compact and effective trajectories.
Improves agent efficiency: provides a foundation for further research on efficient agent design.

🔧 RedundancyBench:

A benchmark for agent trajectory redundancy detection:

200 annotated successful trajectories collected from tau2-bench across 3 realistic domains:
- ✈️ Airline — Flight booking, cancellation, and modification tasks.
- 🛒 Retail — Product ordering, return, and customer service tasks.
- 📞 Telecom — Mobile plan inquiry, troubleshooting, and account management tasks.
Fine-grained annotations for each trajectory, including:
- The specific redundant step indices,
- The redundancy type for each flagged step,
- Confidence scores and natural-language explanations from human experts.
4 redundancy categories: Abnormal Step, Duplicated Step, Incorrect Step, and Exploratory Step.

The benchmark is constructed through a three-round annotation procedure involving heuristic pre-annotation, expert refinement, and cross-validation to ensure high-quality labels. It serves as a foundational resource for developing and evaluating methods that aim to automatically detect redundancy in complex agent trajectories.

The dataset and annotations are available in data/domain/. Check out the paper for more details on the annotation guidelines.

💡 Methods and Evaluation

We develop and evaluate 3 representative LLM-as-a-Judge strategies for redundant step detection, each with a different receptive field. Both inference and evaluation are unified into a single script with command-line arguments.

Strategy	Description
One-to-One	Examines a single step without cross-step context.
Window-to-One	Examines a target step with a local window (±3 neighboring steps).
All-to-All	Examines the full trajectory to predict all redundant steps at once.
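The difference between the three receptive fields can be illustrated with a small sketch. This is not the code in LLM_judge/judge.py; the function name `build_context` and the representation of steps as plain strings are assumptions made for illustration. Only the ±3 window size and the strategy names come from this README.

```python
from typing import List


def build_context(steps: List[str], target: int, strategy: str, window: int = 3) -> List[str]:
    """Return the steps a judge would see under each strategy (hypothetical helper)."""
    if strategy == "one_one":
        # Only the target step, with no cross-step context.
        return [steps[target]]
    if strategy == "window_one":
        # The target step plus up to `window` neighbors on each side,
        # clipped at the trajectory boundaries.
        lo = max(0, target - window)
        hi = min(len(steps), target + window + 1)
        return steps[lo:hi]
    if strategy == "all_all":
        # The full trajectory; `target` is ignored because all steps
        # are judged in a single pass.
        return list(steps)
    raise ValueError(f"unknown strategy: {strategy}")


steps = [f"step {i}" for i in range(10)]
print(build_context(steps, 5, "one_one"))
print(build_context(steps, 5, "window_one"))
print(len(build_context(steps, 5, "all_all")))
```

Near the start or end of a trajectory the window simply shrinks, so the first step sees only itself and the three steps that follow it.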

Requirements

To install requirements:

pip install -r requirements.txt

Note: The main dependency is openai. Ensure you have a compatible LLM API endpoint configured.

Inference

Configure your LLM API credentials in LLM_judge/judge.py before running:

base_url = "YOUR_BASE_URL"
api_key = "YOUR_API_KEY"
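If you prefer not to hard-code credentials, one option is to read them from environment variables and assign the results to `base_url` and `api_key`. The sketch below is an assumption about how you might adapt the script, not something the repository ships; the variable names `JUDGE_BASE_URL` and `JUDGE_API_KEY` and the helper `load_credentials` are hypothetical.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the endpoint and key from the environment (hypothetical variable names)."""
    base_url = env.get("JUDGE_BASE_URL", "")
    api_key = env.get("JUDGE_API_KEY", "")
    if not base_url or not api_key:
        # Fail early rather than sending requests with placeholder values.
        raise RuntimeError("Set JUDGE_BASE_URL and JUDGE_API_KEY before running judge.py")
    return base_url, api_key


# Example with an explicit mapping, so the sketch runs without real credentials.
base_url, api_key = load_credentials(
    {"JUDGE_BASE_URL": "https://example.invalid/v1", "JUDGE_API_KEY": "sk-placeholder"}
)
print(base_url)
```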

Run

cd LLM_judge

# All-to-All strategy
python judge.py --strategy all_all --domain airline --model gpt-4o

# Window-to-One strategy
python judge.py --strategy window_one --domain telecom --model gpt-4o

# One-to-One strategy
python judge.py --strategy one_one --domain all --model gpt-4o

Supported arguments:

Argument	Options	Description
`--strategy`	`all_all`, `one_one`, `window_one`	Judging strategy to use.
`--domain`	`airline`, `retail`, `telecom`, `all`	Domain to evaluate; `all` runs all three sequentially.
`--model`	any model name	Model name used for judging and output filename.

Output files follow the naming convention: {model}_{strategy}_{domain}_results.json.

Evaluation

After generating predictions, evaluate the results against ground-truth annotations:

cd LLM_judge

# Evaluate All-to-All predictions
python evaluate.py --strategy all_all --domain airline --model gpt-4o

# Evaluate One-to-One predictions
python evaluate.py --strategy one_one --domain all --model gpt-4o

# Evaluate Window-to-One predictions
python evaluate.py --strategy window_one --domain telecom --model gpt-4o

Supported arguments:

Argument	Options	Description
`--strategy`	`all_all`, `one_one`, `window_one`	Strategy of the predictions to evaluate.
`--domain`	`airline`, `retail`, `telecom`, `all`	Domain to evaluate; `all` evaluates all three sequentially.
`--model`	model name used during inference	Model name prefix of the prediction filename.

Evaluation results are saved as evaluation_{model}_{strategy}_{domain}.json with per-task details and summary metrics.
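To run several strategy and domain combinations, it can help to generate the commands and the expected filenames up front. The following is a minimal sketch under the naming conventions stated in this README; the helper `plan_runs` is hypothetical, and whether `--domain all` writes one file or three is not specified here, so the sketch enumerates the three domains explicitly.

```python
from itertools import product
from typing import Dict, List

STRATEGIES = ["one_one", "window_one", "all_all"]
DOMAINS = ["airline", "retail", "telecom"]


def plan_runs(model: str,
              strategies: List[str] = STRATEGIES,
              domains: List[str] = DOMAINS) -> List[Dict[str, object]]:
    """List inference/evaluation commands and the files they should produce."""
    plan = []
    for strategy, domain in product(strategies, domains):
        args = ["--strategy", strategy, "--domain", domain, "--model", model]
        plan.append({
            "judge_cmd": ["python", "judge.py", *args],
            "eval_cmd": ["python", "evaluate.py", *args],
            # Filenames follow the conventions documented in this README.
            "predictions": f"{model}_{strategy}_{domain}_results.json",
            "evaluation": f"evaluation_{model}_{strategy}_{domain}.json",
        })
    return plan


for run in plan_runs("gpt-4o", strategies=["all_all"], domains=["airline"]):
    print(" ".join(run["judge_cmd"]))
    print(run["predictions"], run["evaluation"])
```

Each command list can be passed to `subprocess.run(cmd, check=True, cwd="LLM_judge")` to execute the steps in order.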

Metrics

We adopt 2 complementary evaluation metrics:

Trajectory-Level Score: Binary classification accuracy. A trajectory counts as redundant if it contains at least one redundant step. This measures how well a method identifies whether any redundancy exists.
Step-Level Score: F1 score over individual steps. It balances precision (lowered by non-redundant steps that are incorrectly flagged) and recall (lowered by redundant steps that are missed).
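The two metrics can be sketched as follows. This is an illustration, not the logic in evaluate.py: it assumes predictions and annotations are available as sets of redundant step indices per trajectory, and it pools step counts across trajectories (micro-averaging) before computing F1. The actual script may aggregate differently.

```python
from typing import Dict, Set, Tuple


def trajectory_accuracy(gold: Dict[str, Set[int]], pred: Dict[str, Set[int]]) -> float:
    """Fraction of trajectories whose redundant / not-redundant label is predicted correctly."""
    correct = sum(bool(gold[t]) == bool(pred.get(t, set())) for t in gold)
    return correct / len(gold)


def step_prf(gold: Dict[str, Set[int]], pred: Dict[str, Set[int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over individual steps."""
    tp = fp = fn = 0
    for t, g in gold.items():
        p = pred.get(t, set())
        tp += len(g & p)   # redundant steps correctly flagged
        fp += len(p - g)   # non-redundant steps incorrectly flagged
        fn += len(g - p)   # redundant steps that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data with hypothetical trajectory ids and step indices.
gold = {"t1": {2, 5}, "t2": set(), "t3": {1}}
pred = {"t1": {2, 3}, "t2": {4}, "t3": {1}}
print(trajectory_accuracy(gold, pred))
print(step_prf(gold, pred))
```

In the toy data, trajectory `t2` is wrongly labeled redundant, so two of three trajectories are classified correctly, while the step-level counts are 2 true positives, 2 false positives, and 1 false negative.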

🧪 Experimental Results

Our experiments reveal that redundant step detection remains highly challenging for current LLMs:

Context is crucial: The One-to-One strategy (no context) significantly underperforms, indicating that single-step inspection is insufficient.
Inverted-U relationship for context size: Moderate context (Window-to-One) often strikes the best balance between informative signals and noise, outperforming both minimal and full-context approaches.
Coarse vs. fine-grained gap: LLMs achieve reasonable trajectory-level detection (~70%), but step-level F1 remains low (~25% at best), highlighting the difficulty of fine-grained redundancy identification.

Strategy	Trajectory-Level (Avg.)	Step-Level (Avg.)
One-to-One	~47%	~8%
Window-to-One	~68%	~20%
All-to-All	~59%	~18%

More results can be found in the paper.

📚 Reference

@article{redundancybench2025,
  title={Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories},
  year={2025}
}

License

This project is licensed under the MIT License. See LICENSE for details.
