Important
If you find this project helpful, please consider giving us a ⭐️!
This repository provides the implementation of "Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories", which introduces the task of redundant step detection in LLM agent trajectories. Given a successful trajectory, the goal is to automatically identify steps that consume computational resources but contribute little to task completion.
Redundant step detection offers several key advantages:
- Enables fine-grained evaluation: quantifies step quality in agent trajectories beyond coarse-grained efficiency metrics (e.g., number of steps or total cost).
- Supports trajectory optimization: detects low-value steps so that agent behavior can be optimized, yielding more compact and effective trajectories.
- Improves agent efficiency: provides a foundation for further research on efficient agent design.
A benchmark for agent trajectory redundancy detection:
- 200 annotated successful trajectories collected from tau2-bench across 3 realistic domains:
  - ✈️ Airline: Flight booking, cancellation, and modification tasks.
  - 🛒 Retail: Product ordering, return, and customer service tasks.
  - 📞 Telecom: Mobile plan inquiry, troubleshooting, and account management tasks.
- Fine-grained annotations for each trajectory, including:
  - The specific redundant step indices,
  - The redundancy type for each flagged step,
  - Confidence scores and natural-language explanations from human experts.
- 4 redundancy categories: Abnormal Step, Duplicated Step, Incorrect Step, and Exploratory Step.
The benchmark is constructed through a three-round annotation procedure involving heuristic pre-annotation, expert refinement, and cross-validation to ensure high-quality labels. It serves as a foundational resource for developing and evaluating methods that aim to automatically detect redundancy in complex agent trajectories.
The dataset and annotations are available in `data/domain/`. Check out the paper for more details on the annotation guidelines.
We develop and evaluate 3 representative LLM-as-a-Judge strategies for redundant step detection, each with a different receptive field. Inference and evaluation are each driven by a single script with command-line arguments, so all three strategies share the same entry points.
| Strategy | Description |
|---|---|
| One-to-One | Examines a single step without cross-step context. |
| Window-to-One | Examines a target step with a local window (±3 neighboring steps). |
| All-to-All | Examines the full trajectory to predict all redundant steps at once. |
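
The difference in receptive field can be illustrated with a minimal sketch. It is not taken from `judge.py`: the function name `build_context`, the list-of-strings trajectory, and the `window` parameter are assumptions made for illustration, with the default of 3 mirroring the ±3 window in the table above.

```python
from typing import List


def build_context(steps: List[str], target: int, strategy: str, window: int = 3) -> List[str]:
    """Return the steps a judge would see when assessing steps[target].

    Hypothetical helper: the strategy names match the CLI options, but the
    real prompt construction in the repository may differ.
    """
    if strategy == "one_one":
        # Only the target step, with no cross-step context.
        return [steps[target]]
    if strategy == "window_one":
        # The target step plus up to `window` neighbors on each side,
        # clipped at the trajectory boundaries.
        lo = max(0, target - window)
        hi = min(len(steps), target + window + 1)
        return steps[lo:hi]
    if strategy == "all_all":
        # The full trajectory; `target` is ignored because all steps
        # are judged at once.
        return list(steps)
    raise ValueError(f"unknown strategy: {strategy}")


steps = [f"step {i}" for i in range(10)]
print(build_context(steps, 5, "one_one"))
print(build_context(steps, 5, "window_one"))
print(len(build_context(steps, 5, "all_all")))
```

Near the start or end of a trajectory the window simply shrinks; whether the actual implementation pads or clips is not stated here, so clipping is an assumption.
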
To install requirements:

    pip install -r requirements.txt

Note: The main dependency is `openai`. Ensure you have a compatible LLM API endpoint configured.
Configure your LLM API credentials in `LLM_judge/judge.py` before running:

    base_url = "YOUR_BASE_URL"
    api_key = "YOUR_API_KEY"

Then run the judge with the desired strategy:

    cd LLM_judge
    # All-to-All strategy
    python judge.py --strategy all_all --domain airline --model gpt-4o
    # Window-to-One strategy
    python judge.py --strategy window_one --domain telecom --model gpt-4o
    # One-to-One strategy
    python judge.py --strategy one_one --domain all --model gpt-4o

Supported arguments:
| Argument | Options | Description |
|---|---|---|
| `--strategy` | `all_all`, `one_one`, `window_one` | Judging strategy to use. |
| `--domain` | `airline`, `retail`, `telecom`, `all` | Domain to evaluate; `all` runs all three sequentially. |
| `--model` | any model name | Model name used for judging and output filename. |
Output files follow the naming convention: {model}_{strategy}_{domain}_results.json.
After generating predictions, evaluate the results against ground-truth annotations:

    cd LLM_judge
    # Evaluate All-to-All predictions
    python evaluate.py --strategy all_all --domain airline --model gpt-4o
    # Evaluate One-to-One predictions
    python evaluate.py --strategy one_one --domain all --model gpt-4o
    # Evaluate Window-to-One predictions
    python evaluate.py --strategy window_one --domain telecom --model gpt-4o

Supported arguments:
| Argument | Options | Description |
|---|---|---|
| `--strategy` | `all_all`, `one_one`, `window_one` | Strategy of the predictions to evaluate. |
| `--domain` | `airline`, `retail`, `telecom`, `all` | Domain to evaluate; `all` evaluates all three sequentially. |
| `--model` | model name used during inference | Model name prefix of the prediction filename. |
Evaluation results are saved as evaluation_{model}_{strategy}_{domain}.json with per-task details and summary metrics.
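
When scripting around the two tools, it can help to derive the expected filenames from the same arguments. The sketch below only restates the two naming conventions given above; the helper names `prediction_filename` and `evaluation_filename` are hypothetical and do not come from the repository.

```python
def prediction_filename(model: str, strategy: str, domain: str) -> str:
    """Filename written by inference, per the stated convention."""
    return f"{model}_{strategy}_{domain}_results.json"


def evaluation_filename(model: str, strategy: str, domain: str) -> str:
    """Filename written by evaluation, per the stated convention."""
    return f"evaluation_{model}_{strategy}_{domain}.json"


print(prediction_filename("gpt-4o", "all_all", "airline"))
print(evaluation_filename("gpt-4o", "all_all", "airline"))
```

How `--domain all` maps to output files (one combined file or one per domain) is not specified here, so the helpers make no assumption about it.
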
We adopt 2 complementary evaluation metrics:
- Trajectory-Level Score: Binary classification accuracy. A trajectory is considered redundant if it contains at least one redundant step. Measures how well a method identifies whether any redundancy exists.
- Step-Level Score: F1 score over individual steps. Balances precision (lowered by non-redundant steps that are incorrectly flagged) and recall (lowered by redundant steps that are missed).
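
The two metrics can be sketched as follows. This is an illustrative reading of the definitions above, not the code in `evaluate.py`: representing each trajectory as a set of redundant step indices and pooling step counts across trajectories (micro-averaging) are both assumptions, and the function names are hypothetical.

```python
from typing import List, Set


def trajectory_accuracy(gold: List[Set[int]], pred: List[Set[int]]) -> float:
    """Fraction of trajectories where 'contains any redundant step' is predicted correctly."""
    correct = sum((len(g) > 0) == (len(p) > 0) for g, p in zip(gold, pred))
    return correct / len(gold)


def step_f1(gold: List[Set[int]], pred: List[Set[int]]) -> float:
    """Step-level F1 with counts pooled over all trajectories (assumed micro-average)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly flagged steps
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # non-redundant steps flagged
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # redundant steps missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: three trajectories with made-up redundant step indices.
gold = [{2, 5}, set(), {1}]
pred = [{2}, {3}, set()]
print(trajectory_accuracy(gold, pred))  # 1 of 3 trajectories classified correctly
print(step_f1(gold, pred))              # precision 1/2, recall 1/3
```

In the toy example only the first trajectory is classified correctly at the trajectory level, and one of the three truly redundant steps is recovered, which shows how a method can detect that redundancy exists while still locating few of the redundant steps.
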
Our experiments reveal that redundant step detection remains highly challenging for current LLMs:
- Context is crucial: The One-to-One strategy (no context) significantly underperforms, indicating that single-step inspection is insufficient.
- Inverted-U relationship for context size: Moderate context (Window-to-One) often strikes the best balance between informative signals and noise, outperforming both minimal and full-context approaches.
- Coarse vs. fine-grained gap: LLMs achieve reasonable trajectory-level detection (~70%), but step-level F1 remains low (~25% at best), highlighting the difficulty of fine-grained redundancy identification.
| Strategy | Trajectory-Level (Avg.) | Step-Level (Avg.) |
|---|---|---|
| One-to-One | ~47% | ~8% |
| Window-to-One | ~68% | ~20% |
| All-to-All | ~59% | ~18% |
More results can be found in the paper.

    @article{redundancybench2025,
      title={Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories},
      year={2025}
    }

This project is licensed under the MIT License. See LICENSE for details.
