Benchmarking on TRAIL

TRAIL (Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. The traces come from real-world software engineering and information retrieval tasks. The benchmark challenges even state-of-the-art LLMs: the best model achieves only 11% accuracy, which highlights how difficult trace debugging is for complex agent workflows.

Installation

Create and activate a virtual environment, then install the required packages:

pip install -r requirements.txt

Usage

python run_eval.py --model=[your_litellm_compatible_model_id] --data_dir="data/" --output_dir="results/" --max_workers=[integer_number_of_workers] --split=["GAIA"|"SWE Bench"]

This writes the evaluation results to the results/ directory. You can then run:

python calculate_scores.py --results_dir="results/"

This will create and store a .txt file in the same results directory with the calculated scores for each model.
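
The two steps above can also be scripted to cover both splits in one go. The following is a minimal sketch, not part of the repository: the helper names (build_eval_command, build_score_command, run_all) are hypothetical, the default of 4 workers is an arbitrary assumption, and the flags and split names are taken as-is from the usage line above.

```python
import subprocess
import sys

# Split names exactly as they appear in the usage line above.
SPLITS = ["GAIA", "SWE Bench"]


def build_eval_command(model, split, data_dir="data/", output_dir="results/", max_workers=4):
    """Build the argument list for one run_eval.py invocation (hypothetical helper).

    max_workers=4 is an assumed default; pick a value that suits your rate limits.
    """
    return [
        sys.executable,
        "run_eval.py",
        f"--model={model}",
        f"--data_dir={data_dir}",
        f"--output_dir={output_dir}",
        f"--max_workers={max_workers}",
        f"--split={split}",
    ]


def build_score_command(results_dir="results/"):
    """Build the argument list for calculate_scores.py (hypothetical helper)."""
    return [sys.executable, "calculate_scores.py", f"--results_dir={results_dir}"]


def run_all(model, runner=subprocess.run):
    """Evaluate every split, then compute scores once at the end.

    `runner` is injectable so the sequence can be checked without running anything.
    """
    for split in SPLITS:
        runner(build_eval_command(model, split), check=True)
    runner(build_score_command(), check=True)
```

Calling run_all("your_litellm_compatible_model_id") from the repository root would run the evaluation for each split and then the scoring step. Passing the arguments as a list rather than a shell string means the space in "SWE Bench" needs no extra quoting.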

Citation

If you use this code or the dataset, please cite the following paper:

@misc{deshpande2025trail,
      title={TRAIL: Trace Reasoning and Agentic Issue Localization},
      author={Darshan Deshpande and Varun Gangal and Hersh Mehta and Jitin Krishnan and Anand Kannappan and Rebecca Qian},
      year={2025},
      eprint={2505.08638},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.08638},
}
