hu-my/RedundancyBench

RedundancyBench


Important

If you find this project helpful, please consider giving us a ⭐️!

🧐 Overview

This repository provides the implementation of "Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories", which introduces the task of redundant step detection in LLM agent trajectories. Given a successful trajectory, the goal is to automatically identify steps that consume computational resources but contribute little to task completion.

Redundant step detection offers several key advantages:

  • Enables fine-grained evaluation: quantifies step quality in agent trajectories beyond coarse-grained efficiency metrics (e.g., trajectory length or total cost).
  • Supports trajectory optimization: detects low-value steps so that agent behavior can be optimized and trajectories made more compact and effective.
  • Improves agent efficiency: provides a solid foundation for further research on efficient agent design.

🔧 RedundancyBench:

A benchmark for agent trajectory redundancy detection:

  • 200 annotated successful trajectories collected from tau2-bench across 3 realistic domains:
    • ✈️ Airline — Flight booking, cancellation, and modification tasks.
    • 🛒 Retail — Product ordering, return, and customer service tasks.
    • 📞 Telecom — Mobile plan inquiry, troubleshooting, and account management tasks.
  • Fine-grained annotations for each trajectory, including:
    • The specific redundant step indices,
    • The redundancy type for each flagged step,
    • Confidence scores and natural-language explanations from human experts.
  • 4 redundancy categories: Abnormal Step, Duplicated Step, Incorrect Step, and Exploratory Step.

The benchmark is constructed through a three-round annotation procedure involving heuristic pre-annotation, expert refinement, and cross-validation to ensure high-quality labels. It serves as a foundational resource for developing and evaluating methods that aim to automatically detect redundancy in complex agent trajectories.

The dataset and annotations are available in data/domain/. Check out the paper for more details on the annotation guidelines.
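
As a rough illustration of the annotation fields listed above, one trajectory's labels could be modeled as follows. This is a minimal sketch, not the repository's actual schema: the class names, field names, and the example record are all hypothetical, and the real files under data/ may be organized differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class RedundancyType(Enum):
    """The four redundancy categories named in the benchmark description."""
    ABNORMAL = "Abnormal Step"
    DUPLICATED = "Duplicated Step"
    INCORRECT = "Incorrect Step"
    EXPLORATORY = "Exploratory Step"


@dataclass
class StepAnnotation:
    """One flagged step (hypothetical field names)."""
    step_index: int
    redundancy_type: RedundancyType
    confidence: float
    explanation: str


@dataclass
class TrajectoryAnnotation:
    """All flagged steps of one successful trajectory (hypothetical layout)."""
    task_id: str
    domain: str
    annotations: list[StepAnnotation] = field(default_factory=list)

    def redundant_indices(self) -> set[int]:
        """Indices of every step flagged as redundant."""
        return {a.step_index for a in self.annotations}

    def is_redundant(self) -> bool:
        """True if at least one step is flagged."""
        return bool(self.annotations)


# Made-up record for illustration only; it is not taken from the dataset.
example = TrajectoryAnnotation(
    task_id="demo-0",
    domain="airline",
    annotations=[
        StepAnnotation(4, RedundancyType.DUPLICATED, 0.9,
                       "Repeats the lookup already done in step 2."),
    ],
)
```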

💡 Methods and Evaluation

We develop and evaluate 3 representative LLM-as-a-Judge strategies for redundant step detection, each with a different receptive field. Inference and evaluation are each run through a single script configured with command-line arguments.

| Strategy | Description |
| --- | --- |
| One-to-One | Examines a single step without cross-step context. |
| Window-to-One | Examines a target step with a local window (±3 neighboring steps). |
| All-to-All | Examines the full trajectory to predict all redundant steps at once. |
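
The difference in receptive field between the three strategies can be sketched as a function that selects which steps the judge sees. This is an illustrative sketch only: `build_context` is a hypothetical helper, not a function from this repository, and clipping the window at the trajectory boundaries is an assumption.

```python
def build_context(steps, strategy, target=None, window=3):
    """Return the steps visible to the judge under a given strategy.

    `steps` is a list of trajectory steps. `target` is the index of the
    step being judged (ignored for all_all). Hypothetical helper.
    """
    if strategy == "all_all":
        # The full trajectory is judged in one pass.
        return list(steps)
    if target is None or not 0 <= target < len(steps):
        raise ValueError("a valid target index is required for this strategy")
    if strategy == "one_one":
        # Only the target step, with no cross-step context.
        return [steps[target]]
    if strategy == "window_one":
        # Target step plus up to `window` neighbors on each side,
        # clipped at the start and end of the trajectory (assumption).
        lo = max(0, target - window)
        hi = min(len(steps), target + window + 1)
        return list(steps[lo:hi])
    raise ValueError(f"unknown strategy: {strategy}")
```

For a 10-step trajectory and target index 5, `window_one` would expose steps 2 through 8, while `one_one` would expose step 5 alone.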

Requirements

To install requirements:

pip install -r requirements.txt

Note: The main dependency is openai. Ensure you have a compatible LLM API endpoint configured.

Inference

Configure your LLM API credentials in LLM_judge/judge.py before running:

base_url = "YOUR_BASE_URL"
api_key = "YOUR_API_KEY"

Run

cd LLM_judge

# All-to-All strategy
python judge.py --strategy all_all --domain airline --model gpt-4o

# Window-to-One strategy
python judge.py --strategy window_one --domain telecom --model gpt-4o

# One-to-One strategy
python judge.py --strategy one_one --domain all --model gpt-4o

Supported arguments:

| Argument | Options | Description |
| --- | --- | --- |
| --strategy | all_all, one_one, window_one | Judging strategy to use. |
| --domain | airline, retail, telecom, all | Domain to evaluate; all runs all three sequentially. |
| --model | any model name | Model name used for judging and output filename. |

Output files follow the naming convention: {model}_{strategy}_{domain}_results.json.
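
The argument handling and naming convention above could look roughly like the following. This is a sketch under assumptions, not the contents of judge.py: the function names are hypothetical, marking every argument as required is a guess, and writing one result file per domain when `all` is selected is also an assumption.

```python
import argparse

STRATEGIES = ("all_all", "one_one", "window_one")
DOMAINS = ("airline", "retail", "telecom")


def parse_args(argv=None):
    """Parse the three documented arguments (hypothetical re-implementation)."""
    parser = argparse.ArgumentParser(description="LLM-as-a-Judge redundancy detection")
    parser.add_argument("--strategy", choices=STRATEGIES, required=True)
    parser.add_argument("--domain", choices=DOMAINS + ("all",), required=True)
    parser.add_argument("--model", required=True)
    return parser.parse_args(argv)


def expand_domains(domain):
    """Map 'all' to the three domains, processed in sequence."""
    return list(DOMAINS) if domain == "all" else [domain]


def result_filename(model, strategy, domain):
    """Build the documented output name: {model}_{strategy}_{domain}_results.json."""
    return f"{model}_{strategy}_{domain}_results.json"


args = parse_args(["--strategy", "all_all", "--domain", "airline", "--model", "gpt-4o"])
names = [result_filename(args.model, args.strategy, d) for d in expand_domains(args.domain)]
```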

Evaluation

After generating predictions, evaluate the results against ground-truth annotations:

cd LLM_judge

# Evaluate All-to-All predictions
python evaluate.py --strategy all_all --domain airline --model gpt-4o

# Evaluate One-to-One predictions
python evaluate.py --strategy one_one --domain all --model gpt-4o

# Evaluate Window-to-One predictions
python evaluate.py --strategy window_one --domain telecom --model gpt-4o

Supported arguments:

| Argument | Options | Description |
| --- | --- | --- |
| --strategy | all_all, one_one, window_one | Strategy of the predictions to evaluate. |
| --domain | airline, retail, telecom, all | Domain to evaluate; all evaluates all three sequentially. |
| --model | model name used during inference | Model name prefix of the prediction filename. |

Evaluation results are saved as evaluation_{model}_{strategy}_{domain}.json with per-task details and summary metrics.

Metrics

We adopt 2 complementary evaluation metrics:

  • Trajectory-Level Score: Binary classification accuracy. A trajectory is labeled redundant if it contains at least one redundant step. This measures whether a method correctly identifies that redundancy exists.
  • Step-Level Score: F1 score over individual steps. Balances precision (lowered when non-redundant steps are incorrectly flagged) and recall (lowered when redundant steps are missed).
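
The two metrics can be sketched as follows, with each trajectory represented by the set of step indices marked redundant. This is an illustrative sketch rather than the logic of evaluate.py: the function names and input layout are hypothetical, and pooling step counts across all trajectories (micro-averaging) is an assumption; the repository may aggregate per trajectory instead.

```python
def trajectory_level_accuracy(gold, pred):
    """Fraction of trajectories whose redundant / non-redundant label matches.

    `gold` and `pred` map a task id to the set of redundant step indices.
    A trajectory counts as redundant if its set is non-empty.
    """
    if not gold:
        raise ValueError("gold annotations are empty")
    correct = sum(bool(gold[t]) == bool(pred.get(t, set())) for t in gold)
    return correct / len(gold)


def step_level_f1(gold, pred):
    """F1 over individual steps, pooled across trajectories (assumption)."""
    tp = fp = fn = 0
    for task_id, gold_steps in gold.items():
        pred_steps = pred.get(task_id, set())
        tp += len(gold_steps & pred_steps)   # redundant steps correctly flagged
        fp += len(pred_steps - gold_steps)   # non-redundant steps flagged
        fn += len(gold_steps - pred_steps)   # redundant steps missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A method can score well on the first metric while scoring poorly on the second: flagging any one step in a redundant trajectory earns trajectory-level credit even if it is the wrong step.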

🧪 Experimental Results

Our experiments reveal that redundant step detection remains highly challenging for current LLMs:

  • Context is crucial: The One-to-One strategy (no context) significantly underperforms, indicating that single-step inspection is insufficient.
  • Inverted-U relationship for context size: Moderate context (Window-to-One) often strikes the best balance between informative signals and noise, outperforming both minimal and full-context approaches.
  • Coarse vs. fine-grained gap: LLMs achieve reasonable trajectory-level detection (~70%), but step-level F1 remains low (~25% at best), highlighting the difficulty of fine-grained redundancy identification.

| Strategy | Trajectory-Level (Avg.) | Step-Level (Avg.) |
| --- | --- | --- |
| One-to-One | ~47% | ~8% |
| Window-to-One | ~68% | ~20% |
| All-to-All | ~59% | ~18% |

More results can be found in the paper.

📚 Reference

@article{redundancybench2025,
  title={Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories},
  year={2025}
}

License

This project is licensed under the MIT License. See LICENSE for details.
