CoDA-Bench



Code and data for [ICML 2026] CoDA-Bench: Can Code Agents Handle Data-Intensive Tasks?

Authors: Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang*, Xiaoyong Du

Renmin University of China

👋 Overview

CoDA-Bench (Code and Data-intensive Benchmark) is a benchmark for evaluating AI agents on data-intensive analytical tasks. Given a natural language question and access to a Linux sandbox containing hundreds of data files, an agent must discover relevant data, write code, and produce the correct answer.

▶️ Demo video: demo_video_compressed.mp4. If it does not play inline, download the file to watch it.

Unlike existing benchmarks that provide oracle data directly, CoDA-Bench requires agents to:

  • 🔍 Discover relevant data among hundreds of semantically similar files
  • 🗂️ Navigate complex file hierarchies in a Linux sandbox
  • 🔗 Integrate information from multiple heterogeneous data sources
  • 💻 Generate correct code for data-driven analytical tasks

📰 News

  • [Jun. 8, 2026]: CoDA-Bench v1.0 released with Docker evaluation!
  • [Jun. 1, 2026]: CoDA-Bench paper accepted at ICML 2026!

📊 Dataset Statistics

| Metric | Value |
| --- | --- |
| Total Tasks | 1,009 (full) / 119 (hard subset) |
| Communities | 31 (full) / 15 (hard subset) |
| Source Datasets | 199 Kaggle datasets |
| Avg Files per Task | ~980 |
| Total Size | ~43 GB (compressed) |

🚀 Quick Start

Installation

git clone https://github.com/ruc-datalab/CoDA-Bench.git
cd CoDA-Bench
pip install -e .

Download Dataset

# One-command setup: downloads and extracts everything
python scripts/setup_dataset.py --data-dir ./datasets

The script:

  • Downloads the benchmark files (coda_bench.json, coda_bench_hard.json)
  • Downloads the community data archives (31 communities, ~43 GB)
  • Automatically extracts all archives

Run Evaluation (Docker Mode)

Step 1: Build Docker Image

cd docker
./build_all.sh
cd ..

Step 2: Set API Credentials

export LLM_API_KEY="your-api-key"
export LLM_BASE_URL="https://api.openai.com/v1"  # Optional
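
If you drive the evaluation from your own Python code, the two variables above can be read as in the following minimal sketch. The `LLMConfig` type, the `load_llm_config` helper, and the fallback URL are illustrative assumptions, not part of the CoDA-Bench package; the README only states that `LLM_BASE_URL` is optional.

```python
import os
from typing import Mapping, NamedTuple


class LLMConfig(NamedTuple):
    """Hypothetical container for the two settings exported above."""
    api_key: str
    base_url: str


# Assumed fallback: the README shows this URL only as an example value.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read LLM_API_KEY (required) and LLM_BASE_URL (optional) from env."""
    api_key = env.get("LLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set; export it before running.")
    base_url = env.get("LLM_BASE_URL", "").strip() or DEFAULT_BASE_URL
    # Drop a trailing slash so later path joins stay predictable.
    return LLMConfig(api_key=api_key, base_url=base_url.rstrip("/"))
```

Failing early on a missing key avoids starting a long evaluation run that would only fail at the first model call.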

Step 3: Run Evaluation

# Quick test (4 instances)
python scripts/run_evaluation.py \
    --model gpt-5.5 \
    --instances 0 1 2 3 \
    --output results/test

# Full evaluation
python scripts/run_evaluation.py \
    --model gpt-5.5 \
    --output results/full \
    --workers 8

Step 4: Evaluate Results

python -m coda_bench.cli evaluate \
    --pred results/test/predictions.jsonl \
    --gold datasets/coda_bench.json \
    --out results/test/scores.json

Why Docker?

Docker mode provides secure isolation:

  • ✅ Agents cannot access benchmark answers
  • ✅ Network-restricted environment (only LLM API accessible)
  • ✅ Resource limits (memory, CPU, timeout)
  • ✅ Reproducible across different machines

See docker/README.md for detailed Docker documentation.

💽 Manual Evaluation

If you prefer manual control:

1. Load Dataset

import json

with open('datasets/coda_bench.json') as f:
    tasks = json.load(f)

# Each task contains:
# - instance_id: unique identifier
# - question: natural language question
# - answer: expected answer
# - release_community: community data directory
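
Before running an agent, it can help to confirm that the file has the fields listed above and to see how tasks are spread across communities. The sketch below assumes the benchmark file is a JSON list of task objects; `load_tasks` and `tasks_per_community` are hypothetical helper names, not functions from the CoDA-Bench package.

```python
import json
from collections import Counter

# Field names taken from the task description above.
REQUIRED_FIELDS = ("instance_id", "question", "answer", "release_community")


def load_tasks(path):
    """Load the benchmark file and check that every task has the listed fields."""
    with open(path, encoding="utf-8") as f:
        tasks = json.load(f)  # assumed to be a list of dicts
    for task in tasks:
        missing = [key for key in REQUIRED_FIELDS if key not in task]
        if missing:
            raise ValueError(
                f"task {task.get('instance_id')} lacks fields: {missing}"
            )
    return tasks


def tasks_per_community(tasks):
    """Count how many tasks point at each community directory."""
    return Counter(task["release_community"] for task in tasks)
```

For example, `tasks_per_community(load_tasks("datasets/coda_bench.json")).most_common(5)` would list the five communities with the most tasks.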

2. Run Your Agent

# Your agent should:
# 1. Access community data at: datasets/communities/{release_community}/full_community
# 2. Explore files and write code to answer the question
# 3. Output final answer
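
The three steps above can be wired into a simple driver loop, sketched below. `answer_question` is a hypothetical placeholder for your own agent, and `run_agent` is an illustrative name; neither is provided by CoDA-Bench. The directory layout follows the path pattern in step 1.

```python
from pathlib import Path


def answer_question(question: str, data_dir: Path) -> str:
    """Hypothetical agent entry point: replace with your own agent.

    It should explore data_dir, write and run code, and return the answer.
    """
    raise NotImplementedError("plug in your agent here")


def run_agent(tasks, agent=answer_question, data_root="datasets/communities"):
    """Run the agent on each task and collect predictions.

    Only the question and data directory are passed to the agent; the gold
    'answer' field is deliberately withheld.
    """
    predictions = []
    for task in tasks:
        data_dir = Path(data_root) / task["release_community"] / "full_community"
        try:
            prediction = str(agent(task["question"], data_dir))
        except Exception as exc:  # keep going if one task fails
            prediction = ""
            print(f"instance {task['instance_id']} failed: {exc}")
        predictions.append(
            {"instance_id": task["instance_id"], "prediction": prediction}
        )
    return predictions
```

Note that this loop runs without the isolation that Docker mode provides, so nothing here prevents an agent from reading the benchmark file directly.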

3. Evaluate Predictions

python -m coda_bench.cli evaluate \
    --pred predictions.jsonl \
    --gold datasets/coda_bench.json \
    --out results.json

Prediction format (JSONL):

{"instance_id": 0, "prediction": "38%"}
{"instance_id": 1, "prediction": "150"}
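
A minimal sketch for producing and checking a file in this format is given below. `write_predictions` and `read_predictions` are hypothetical helper names; the only facts taken from the README are the two keys, `instance_id` and `prediction`, and the one-object-per-line layout.

```python
import json


def write_predictions(predictions, path):
    """Write one JSON object per line, matching the format shown above."""
    with open(path, "w", encoding="utf-8") as f:
        for row in predictions:
            record = {
                "instance_id": row["instance_id"],
                "prediction": str(row["prediction"]),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_predictions(path):
    """Read a JSONL file back, reporting the line number of any bad record."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            if "instance_id" not in row or "prediction" not in row:
                raise ValueError(
                    f"line {line_no}: missing instance_id or prediction"
                )
            rows.append(row)
    return rows
```

Reading the file back before calling the evaluator is a cheap way to catch truncated or malformed lines from an interrupted run.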

🏆 Leaderboard

Current state-of-the-art results (as of paper publication):

| System | Model | EA (Full) | EA (Hard) | DA (Full) |
| --- | --- | --- | --- | --- |
| Mini-SWE-Agent | GPT-5.5 | 61.1% | 49.6% | 52.3% |
| Codex CLI | GPT-5.5 | 60.3% | 47.9% | 51.7% |
| OpenHands | GPT-5.5 | 59.7% | 44.5% | 48.9% |
| Claude Code | Sonnet-4.6 | 53.8% | 42.9% | 47.2% |

EA = Execution Accuracy, DA = Discovery Accuracy
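
This README does not spell out how EA is scored; official numbers come from `python -m coda_bench.cli evaluate`. As a rough local sanity check only, the sketch below computes a plain exact-match rate after light normalization. Both the matching rule and the `normalize` and `exact_match_accuracy` names are assumptions and may differ from the official scorer.

```python
def normalize(answer) -> str:
    """Lowercase and collapse whitespace; an assumed, simplistic normalization."""
    return " ".join(str(answer).lower().split())


def exact_match_accuracy(predictions, tasks) -> float:
    """Fraction of tasks whose prediction matches the gold answer exactly.

    Tasks with no prediction count as wrong. This is NOT the official
    metric, only an illustrative approximation.
    """
    if not tasks:
        return 0.0
    predicted = {row["instance_id"]: row["prediction"] for row in predictions}
    correct = sum(
        1
        for task in tasks
        if task["instance_id"] in predicted
        and normalize(predicted[task["instance_id"]]) == normalize(task["answer"])
    )
    return correct / len(tasks)
```

An exact-match rule of this kind is strict: "38 %" and "38%" would count as different answers, so treat the result as a lower bound at best.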

💡 Example Task

{
  "instance_id": 0,
  "question": "What is the percentage of missing values in the RBC feature?",
  "answer": "38%",
  "release_community": "community_26"
}

The agent needs to:

  1. Navigate to datasets/communities/community_26/full_community/
  2. Find ckdisease/source/kidney_disease.csv among ~980 files
  3. Load and analyze the data
  4. Calculate missing value percentage for RBC column
  5. Format answer as "38%"
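
Steps 2 to 5 can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the reference solution: a real agent does not know the file name in advance and has to discover it, the column name passed to `missing_percentage` must match the actual CSV header, and the set of tokens treated as missing is a guess.

```python
import csv
from pathlib import Path


def find_file(root, filename):
    """Return the first path under root whose name equals filename, or None."""
    return next(Path(root).rglob(filename), None)


def missing_percentage(csv_path, column, missing_tokens=("", "?", "nan", "na")):
    """Percentage of rows whose value in `column` is missing.

    missing_tokens is an assumed list of markers for missing values.
    """
    total = missing = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            value = (row.get(column) or "").strip().lower()
            if value in missing_tokens:
                missing += 1
    return 100.0 * missing / total if total else 0.0


def format_percent(value):
    """Format a percentage as a whole number followed by '%'."""
    return f"{round(value)}%"
```

A call such as `format_percent(missing_percentage(path, "rbc"))` would then yield a string in the expected answer format, where `"rbc"` is an assumed header name.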

📚 Documentation

🗺️ Roadmap

v1.0 (Current - June 2026)

  • ✅ OpenHands agent with Docker isolation
  • ✅ Full dataset (1,009 tasks) + hard subset (119 tasks)
  • ✅ Secure evaluation environment
  • ✅ Complete documentation

Coming Soon

  • 🚧 Direct mode: quick testing without Docker (simple, no isolation)
  • 🚧 Additional agents: Docker support for Claude Code, Codex, and Mini-SWE-Agent
  • 🚧 Better logging: real-time progress tracking
  • 🚧 Performance optimizations: faster evaluation

💫 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Ways to contribute:

  • 🐛 Report bugs or issues
  • 💡 Suggest new features
  • 📝 Improve documentation
  • 🧪 Add agent implementations
  • 📊 Share evaluation results

✍️ Citation

@inproceedings{zhang2026codabench,
  title={CoDA-BENCH: Can Code Agents Handle Data-Intensive Tasks?},
  author={Zhang, Yuxin and Fan, Ju and Fan, Meihao and Zhang, Shaolei and Du, Xiaoyong},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026},
  organization={PMLR}
}

📄 License

MIT License. See LICENSE for details.

Individual Kaggle datasets may have their own licenses.

📧 Contact

💬 WeChat Group

Join the CoDA-Bench WeChat group to chat and share ideas with others!

[Image: CoDA-Bench WeChat]

If you like CoDA-Bench, give it a GitHub Star ⭐.
