GitHub - ruc-datalab/CoDA-Bench: CoDA-Bench is a benchmark for code agents on data-intensive tasks. 🎈 Can code agents handle data-intensive tasks?

Code and data for [ICML 2026] CoDA-Bench: Can Code Agents Handle Data-Intensive Tasks?

Authors: Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang*, Xiaoyong Du

Renmin University of China

👋 Overview

CoDA-Bench (Code and Data-intensive Benchmark) is a benchmark for evaluating AI agents on data-intensive analytical tasks. Given a natural language question and access to a Linux sandbox containing hundreds of data files, an agent must discover relevant data, write code, and produce the correct answer.

Unlike existing benchmarks that provide oracle data directly, CoDA-Bench requires agents to:

🔍 Discover relevant data among hundreds of semantically similar files
🗂️ Navigate complex file hierarchies in a Linux sandbox
🔗 Integrate information from multiple heterogeneous data sources
💻 Generate correct code for data-driven analytical tasks
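
The discovery step can be illustrated with a small sketch. It is not part of CoDA-Bench or any baseline agent; `find_csvs_with_column` is a hypothetical helper that walks a directory tree and keeps only the CSV files whose header row mentions a given column name. Real agents will likely need richer signals than a header match.

```python
import csv
from pathlib import Path


def find_csvs_with_column(root, column):
    """Yield CSV paths under `root` whose header row contains `column`.

    Hypothetical helper for illustration; the match is case-insensitive
    and only looks at the first row of each file.
    """
    wanted = column.strip().lower()
    for path in sorted(Path(root).rglob("*.csv")):
        try:
            with path.open(newline="", encoding="utf-8", errors="replace") as f:
                header = next(csv.reader(f), [])
        except (OSError, csv.Error):
            # Unreadable or malformed files are skipped rather than aborting the scan.
            continue
        if wanted in (name.strip().lower() for name in header):
            yield path
```

Reading only the header keeps the scan cheap when a community holds hundreds of files, at the cost of missing files whose column names differ from the wording in the question.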

📰 News

[Jun. 8, 2026]: CoDA-Bench v1.0 released with Docker evaluation!
[Jun. 1, 2026]: CoDA-Bench paper accepted at ICML 2026!

📊 Dataset Statistics

Metric	Value
Total Tasks	1,009 (full) / 119 (hard subset)
Communities	31 (full) / 15 (hard subset)
Source Datasets	199 Kaggle datasets
Avg Files per Task	~980 files
Total Size	~43 GB (compressed)

🚀 Quick Start

Installation

git clone https://github.com/ruc-datalab/CoDA-Bench.git
cd CoDA-Bench
pip install -e .

Download Dataset

# One-command setup: downloads and extracts everything
python scripts/setup_dataset.py --data-dir ./datasets

This script:

Downloads the benchmark files (coda_bench.json, coda_bench_hard.json)
Downloads the community data archives (31 communities, ~43 GB)
Extracts all archives automatically
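
After the download finishes, a quick sanity check can confirm that every community referenced by the benchmark file was extracted. The sketch below is not shipped with the repository; `missing_community_dirs` is a hypothetical helper, and it assumes the layout used elsewhere in this README (`coda_bench.json` at the top of the data directory, community data under `communities/{release_community}/full_community`).

```python
import json
from pathlib import Path


def missing_community_dirs(data_dir="datasets", benchmark="coda_bench.json"):
    """Return the sorted names of communities whose data directory is absent.

    Assumes the benchmark file is a JSON list of task objects, each with a
    `release_community` field.
    """
    data_dir = Path(data_dir)
    with (data_dir / benchmark).open(encoding="utf-8") as f:
        tasks = json.load(f)
    communities = {task["release_community"] for task in tasks}
    return sorted(
        name
        for name in communities
        if not (data_dir / "communities" / name / "full_community").is_dir()
    )
```

An empty list means every referenced community is present; any names returned point at archives that failed to download or extract.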

Run Evaluation (Docker Mode)

Step 1: Build Docker Image

cd docker
./build_all.sh
cd ..

Step 2: Set API Credentials

export LLM_API_KEY="your-api-key"
export LLM_BASE_URL="https://api.openai.com/v1"  # Optional
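
If you call the LLM from your own Python code, it can help to fail early when the key is missing. The following is a minimal sketch, not the repository's own loader: `load_llm_settings` is a hypothetical name, and falling back to the OpenAI URL when `LLM_BASE_URL` is unset is an assumption based on the example value above.

```python
import os

# Assumption: the URL shown in Step 2 is used when LLM_BASE_URL is not set.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_llm_settings(env=None):
    """Return (api_key, base_url) read from environment-style variables."""
    env = os.environ if env is None else env
    api_key = env.get("LLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set; export it before running the evaluation")
    base_url = env.get("LLM_BASE_URL", "").strip() or DEFAULT_BASE_URL
    # Drop a trailing slash so later path joins do not produce "//".
    return api_key, base_url.rstrip("/")
```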

Step 3: Run Evaluation

# Quick test (4 instances)
python scripts/run_evaluation.py \
    --model gpt-5.5 \
    --instances 0 1 2 3 \
    --output results/test

# Full evaluation
python scripts/run_evaluation.py \
    --model gpt-5.5 \
    --output results/full \
    --workers 8

Step 4: Evaluate Results

python -m coda_bench.cli evaluate \
    --pred results/test/predictions.jsonl \
    --gold datasets/coda_bench.json \
    --out results/test/scores.json

Why Docker?

Docker mode provides secure isolation:

✅ Agents cannot access benchmark answers
✅ Network-restricted environment (only LLM API accessible)
✅ Resource limits (memory, CPU, timeout)
✅ Reproducible across different machines

See docker/README.md for detailed Docker documentation.

💽 Manual Evaluation

If you prefer manual control:

1. Load Dataset

import json

with open('datasets/coda_bench.json') as f:
    tasks = json.load(f)

# Each task contains:
# - instance_id: unique identifier
# - question: natural language question
# - answer: expected answer
# - release_community: community data directory
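
Two small helpers are often useful once the tasks are loaded. This is only a sketch with hypothetical names (`public_fields`, `tasks_per_community`); it relies solely on the four fields listed above. The first strips the gold answer so it never reaches the agent, and the second counts tasks per community.

```python
from collections import Counter


def public_fields(task):
    """Return a copy of the task without the gold answer, safe to show an agent."""
    return {key: value for key, value in task.items() if key != "answer"}


def tasks_per_community(tasks):
    """Return (community, task_count) pairs, most frequent community first."""
    return Counter(task["release_community"] for task in tasks).most_common()
```

For example, `public_fields(tasks[0])` keeps `instance_id`, `question`, and `release_community` and drops `answer`.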

2. Run Your Agent

# Your agent should:
# 1. Access community data at: datasets/communities/{release_community}/full_community
# 2. Explore files and write code to answer the question
# 3. Output final answer
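
The three steps above can be wired together as follows. This is a sketch under assumptions, not the benchmark's runner: `run_agent` and `answer_fn` are hypothetical names, and `answer_fn(question, community_dir)` stands in for whatever agent you plug in. The gold answer is never passed to the agent, and a failing task yields an empty prediction so the rest of the run survives.

```python
import json
from pathlib import Path


def run_agent(tasks, answer_fn, out_path, data_dir="datasets"):
    """Call `answer_fn` once per task and write predictions as JSONL."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for task in tasks:
            # Directory layout taken from step 1 of the comment block above.
            community_dir = (
                Path(data_dir) / "communities" / task["release_community"] / "full_community"
            )
            try:
                prediction = str(answer_fn(task["question"], community_dir))
            except Exception:
                # Record an empty answer instead of aborting the whole run.
                prediction = ""
            record = {"instance_id": task["instance_id"], "prediction": prediction}
            f.write(json.dumps(record) + "\n")
    return out_path
```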

3. Evaluate Predictions

python -m coda_bench.cli evaluate \
    --pred predictions.jsonl \
    --gold datasets/coda_bench.json \
    --out results.json

Prediction format (JSONL):

{"instance_id": 0, "prediction": "38%"}
{"instance_id": 1, "prediction": "150"}
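
Before scoring, it may be worth checking that a predictions file follows this format. The checker below is a hypothetical helper, not part of `coda_bench.cli`; it only verifies what the two example lines show (one JSON object per line with `instance_id` and `prediction`) and additionally flags duplicate IDs, which is an assumption about what the evaluator expects.

```python
import json


def check_predictions(lines):
    """Return a list of human-readable problems found in JSONL prediction lines."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict) or "instance_id" not in record or "prediction" not in record:
            problems.append(f"line {number}: needs 'instance_id' and 'prediction'")
            continue
        if record["instance_id"] in seen:
            problems.append(f"line {number}: duplicate instance_id {record['instance_id']}")
        seen.add(record["instance_id"])
    return problems
```

An open file object can be passed directly, for example `check_predictions(open("predictions.jsonl", encoding="utf-8"))`.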

🏆 Leaderboard

Current state-of-the-art results (as of paper publication):

System	Model	EA (Full)	EA (Hard)	DA (Full)
Mini-SWE-Agent	GPT-5.5	61.1%	49.6%	52.3%
Codex CLI	GPT-5.5	60.3%	47.9%	51.7%
OpenHands	GPT-5.5	59.7%	44.5%	48.9%
Claude Code	Sonnet-4.6	53.8%	42.9%	47.2%

EA = Execution Accuracy, DA = Discovery Accuracy

💡 Example Task

{
  "instance_id": 0,
  "question": "What is the percentage of missing values in the RBC feature?",
  "answer": "38%",
  "release_community": "community_26"
}

The agent needs to:

Navigate to datasets/communities/community_26/full_community/
Find ckdisease/source/kidney_disease.csv among ~980 files
Load and analyze the data
Calculate missing value percentage for RBC column
Format answer as "38%"
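
The last three steps can be sketched with the standard library alone. Everything here is illustrative: the in-memory sample is synthetic (it does not reproduce the 38% figure), the column name `rbc` and the set of tokens treated as missing are assumptions, and `missing_percentage` and `format_answer` are hypothetical helpers.

```python
import csv
import io

# Assumption: these tokens (after lowercasing and trimming) count as missing.
MISSING_TOKENS = {"", "?", "na", "nan", "null"}


def missing_percentage(csv_file, column):
    """Return the percentage of rows whose `column` value is missing."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header")
    total = missing = 0
    for row in reader:
        total += 1
        value = (row.get(column) or "").strip().lower()
        if value in MISSING_TOKENS:
            missing += 1
    if total == 0:
        raise ValueError("file has a header but no data rows")
    return 100.0 * missing / total


def format_answer(percentage):
    """Format a percentage as a whole number followed by '%'."""
    return f"{percentage:.0f}%"


# Synthetic four-row sample: two of the four `rbc` values are empty.
sample = io.StringIO("id,rbc\n0,normal\n1,\n2,abnormal\n3,\n")
print(format_answer(missing_percentage(sample, "rbc")))  # prints 50%
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in place of `sample`.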

📚 Documentation

Docker Evaluation Guide - Secure Docker-based evaluation
Evaluation Protocol - Scoring and metrics
Data Format - Dataset schema
Baseline Agents - Reference implementations

🗺️ Roadmap

v1.0 (Current - June 2026)

✅ OpenHands agent with Docker isolation
✅ Full dataset (1,009 tasks) + hard subset (119 tasks)
✅ Secure evaluation environment
✅ Complete documentation

Coming Soon

🚧 Direct mode - Quick testing without Docker (simple, no isolation)
🚧 Additional agents - Claude Code, Codex, Mini-SWE-Agent Docker support
🚧 Better logging - Real-time progress tracking
🚧 Performance optimizations - Faster evaluation

💫 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Ways to contribute:

🐛 Report bugs or issues
💡 Suggest new features
📝 Improve documentation
🧪 Add agent implementations
📊 Share evaluation results

✍️ Citation

@inproceedings{zhang2026codabench,
  title={CoDA-BENCH: Can Code Agents Handle Data-Intensive Tasks?},
  author={Zhang, Yuxin and Fan, Ju and Fan, Meihao and Zhang, Shaolei and Du, Xiaoyong},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026},
  organization={PMLR}
}

📄 License

MIT License. See LICENSE for details.

Individual Kaggle datasets may have their own licenses.

📧 Contact

GitHub Issues: github.com/ruc-datalab/CoDA-Bench/issues
Email: yuxin.zhang@ruc.edu.cn

💬 WeChat Group

Join the CoDA-Bench WeChat group to chat and share ideas with other users!

If you like CoDA-Bench, give it a GitHub Star ⭐.
