Code and data for [ICML 2026] CoDA-Bench: Can Code Agents Handle Data-Intensive Tasks?
Authors: Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang*, Xiaoyong Du
Renmin University of China
CoDA-Bench (Code and Data-intensive Benchmark) is a benchmark for evaluating AI agents on data-intensive analytical tasks. Given a natural language question and access to a Linux sandbox containing hundreds of data files, an agent must discover relevant data, write code, and produce the correct answer.
Demo video: `demo_video_compressed.mp4`. If it does not play inline, download the file to watch it.
Unlike existing benchmarks that provide oracle data directly, CoDA-Bench requires agents to:
- Discover relevant data among hundreds of semantically similar files
- Navigate complex file hierarchies in a Linux sandbox
- Integrate information from multiple heterogeneous data sources
- Generate correct code for data-driven analytical tasks
- [Jun. 8, 2026]: CoDA-Bench v1.0 released with Docker evaluation!
- [Jun. 1, 2026]: CoDA-Bench paper accepted at ICML 2026!
| Metric | Value |
|---|---|
| Total Tasks | 1,009 (full) / 119 (hard subset) |
| Communities | 31 (full) / 15 (hard subset) |
| Source Datasets | 199 Kaggle datasets |
| Avg Files per Task | ~980 files |
| Total Size | ~43 GB (compressed) |
    git clone https://github.com/ruc-datalab/CoDA-Bench.git
    cd CoDA-Bench
    pip install -e .

    # One-command setup: downloads and extracts everything
    python scripts/setup_dataset.py --data-dir ./datasets

This downloads:
- Benchmark files (`coda_bench.json`, `coda_bench_hard.json`)
- Community data archives (31 communities, ~43 GB)
- Automatically extracts all archives
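
After setup finishes, a quick layout check can catch an interrupted download before a long evaluation run. The sketch below is illustrative only: the function name `missing_dataset_paths` is hypothetical, and it assumes both JSON files sit directly under the data directory next to a `communities` subdirectory, which matches the paths used elsewhere in this README but is not guaranteed by the setup script.

```python
from pathlib import Path

# File names come from the setup description above. Their location directly
# under the data directory, and the `communities` subdirectory, are assumptions.
EXPECTED_FILES = ("coda_bench.json", "coda_bench_hard.json")


def missing_dataset_paths(data_dir):
    """Return the expected paths that do not exist under data_dir."""
    root = Path(data_dir)
    missing = [root / name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not (root / "communities").is_dir():
        missing.append(root / "communities")
    return missing


if __name__ == "__main__":
    problems = missing_dataset_paths("./datasets")
    if problems:
        print("Missing:", ", ".join(str(p) for p in problems))
    else:
        print("Dataset layout looks complete.")
```

An empty list means the three expected entries are present; it does not verify that every community archive was extracted completely.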
Step 1: Build Docker Image

    cd docker
    ./build_all.sh
    cd ..

Step 2: Set API Credentials

    export LLM_API_KEY="your-api-key"
    export LLM_BASE_URL="https://api.openai.com/v1"  # Optional

Step 3: Run Evaluation

    # Quick test (4 instances)
    python scripts/run_evaluation.py \
        --model gpt-5.5 \
        --instances 0 1 2 3 \
        --output results/test

    # Full evaluation
    python scripts/run_evaluation.py \
        --model gpt-5.5 \
        --output results/full \
        --workers 8

Step 4: Evaluate Results

    python -m coda_bench.cli evaluate \
        --pred results/test/predictions.jsonl \
        --gold datasets/coda_bench.json \
        --out results/test/scores.json

Docker mode provides secure isolation:
- Agents cannot access benchmark answers
- Network-restricted environment (only LLM API accessible)
- Resource limits (memory, CPU, timeout)
- Reproducible across different machines
See docker/README.md for detailed Docker documentation.
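
When evaluation runs are launched from a script rather than by hand, it may help to check credentials and assemble the command in one place. The following is a minimal sketch, not part of the repository: `build_eval_command` and `check_credentials` are hypothetical helpers, and they use only the flags and environment variables shown in Steps 2 and 3.

```python
import os
import sys


def build_eval_command(model, output, instances=None, workers=None):
    """Assemble the argv list for scripts/run_evaluation.py.

    Only the flags shown in the steps above are used.
    """
    cmd = [sys.executable, "scripts/run_evaluation.py",
           "--model", model, "--output", output]
    if instances:
        cmd += ["--instances", *[str(i) for i in instances]]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


def check_credentials(env=None):
    """Raise if LLM_API_KEY is unset; return LLM_BASE_URL, or None if unset."""
    env = os.environ if env is None else env
    if not env.get("LLM_API_KEY"):
        raise RuntimeError("LLM_API_KEY is not set")
    return env.get("LLM_BASE_URL")


if __name__ == "__main__":
    # Prints the quick-test command from Step 3 without running it.
    cmd = build_eval_command("gpt-5.5", "results/test", instances=[0, 1, 2, 3])
    print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; building a list instead of a shell string avoids quoting problems in model names or paths.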
If you prefer manual control:
1. Load Dataset

    import json

    with open('datasets/coda_bench.json') as f:
        tasks = json.load(f)

    # Each task contains:
    # - instance_id: unique identifier
    # - question: natural language question
    # - answer: expected answer
    # - release_community: community data directory

2. Run Your Agent

    # Your agent should:
    # 1. Access community data at: datasets/communities/{release_community}/full_community
    # 2. Explore files and write code to answer the question
    # 3. Output final answer

3. Evaluate Predictions

    python -m coda_bench.cli evaluate \
        --pred predictions.jsonl \
        --gold datasets/coda_bench.json \
        --out results.json

Prediction format (JSONL):

    {"instance_id": 0, "prediction": "38%"}
    {"instance_id": 1, "prediction": "150"}

Current state-of-the-art results (as of paper publication):
| System | Model | EA (Full) | EA (Hard) | DA (Full) |
|---|---|---|---|---|
| Mini-SWE-Agent | GPT-5.5 | 61.1% | 49.6% | 52.3% |
| Codex CLI | GPT-5.5 | 60.3% | 47.9% | 51.7% |
| OpenHands | GPT-5.5 | 59.7% | 44.5% | 48.9% |
| Claude Code | Sonnet-4.6 | 53.8% | 42.9% | 47.2% |
EA = Execution Accuracy, DA = Discovery Accuracy
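
To make the EA column concrete, here is a rough sketch of how such a score could be computed from a predictions file and the gold tasks. It assumes EA is the share of tasks whose final answer matches the gold answer, and it assumes exact match after lowercasing and whitespace collapsing. Both are simplifying assumptions for illustration; the repository's evaluation protocol is authoritative and may apply different matching rules.

```python
def normalize(answer):
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(str(answer).lower().split())


def execution_accuracy(predictions, gold):
    """Fraction of gold tasks whose prediction matches after normalization.

    predictions: dict mapping instance_id -> predicted answer
    gold: list of task dicts with "instance_id" and "answer" keys
    Tasks with no prediction count as incorrect.
    """
    if not gold:
        return 0.0
    correct = sum(
        1 for task in gold
        if task["instance_id"] in predictions
        and normalize(predictions[task["instance_id"]]) == normalize(task["answer"])
    )
    return correct / len(gold)


# Toy data in the README's task and prediction formats (values are made up).
gold = [
    {"instance_id": 0, "answer": "38%"},
    {"instance_id": 1, "answer": "150"},
]
predictions = {0: "38%", 1: "151"}
print(f"EA = {execution_accuracy(predictions, gold):.1%}")  # EA = 50.0%
```

For real scoring, use `python -m coda_bench.cli evaluate`, since numeric tolerance or answer-format handling is not modeled here.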
    {
      "instance_id": 0,
      "question": "What is the percentage of missing values in the RBC feature?",
      "answer": "38%",
      "release_community": "community_26"
    }

The agent needs to:
- Navigate to `datasets/communities/community_26/full_community/`
- Find `ckdisease/source/kidney_disease.csv` among ~980 files
- Load and analyze the data
- Calculate missing value percentage for RBC column
- Format answer as "38%"
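
The steps above can be sketched with the standard library alone. This is an illustrative outline rather than a reference agent: the helper names are hypothetical, the column name `rbc` is an assumption about the CSV header, and the set of strings treated as missing is a guess that should be checked against the actual file.

```python
import csv
import json
from pathlib import Path

# Strings treated as "missing" are an assumption; real files may use others.
MISSING_MARKERS = {"", "na", "nan", "null", "?"}


def find_file(community_root, filename):
    """Return the first file (in sorted order) with this name, or None."""
    matches = sorted(Path(community_root).rglob(filename))
    return matches[0] if matches else None


def missing_percentage(csv_path, column):
    """Percentage of data rows whose value in `column` is a missing marker."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise KeyError(column)
        rows = list(reader)
    if not rows:
        raise ValueError("no data rows")
    missing = sum(
        1 for row in rows
        if (row.get(column) or "").strip().lower() in MISSING_MARKERS
    )
    return 100.0 * missing / len(rows)


def prediction_line(instance_id, percentage):
    """Format one JSONL prediction, rounding the percentage to an integer."""
    return json.dumps({"instance_id": instance_id,
                       "prediction": f"{round(percentage)}%"})


if __name__ == "__main__":
    root = "datasets/communities/community_26/full_community"
    path = find_file(root, "kidney_disease.csv")
    if path is None:
        print("kidney_disease.csv not found under", root)
    else:
        # "rbc" is an assumed column name; inspect the header before relying on it.
        print(prediction_line(0, missing_percentage(path, "rbc")))
```

In the benchmark itself the file name is not given to the agent, so the lookup by name stands in for the harder discovery step of choosing among many similar files.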
- Docker Evaluation Guide - Secure Docker-based evaluation
- Evaluation Protocol - Scoring and metrics
- Data Format - Dataset schema
- Baseline Agents - Reference implementations
v1.0 (Current - June 2026)
- OpenHands agent with Docker isolation
- Full dataset (1,009 tasks) + hard subset (119 tasks)
- Secure evaluation environment
- Complete documentation
Coming Soon
- Direct mode - Quick testing without Docker (simple, no isolation)
- Additional agents - Claude Code, Codex, Mini-SWE-Agent Docker support
- Better logging - Real-time progress tracking
- Performance optimizations - Faster evaluation
We welcome contributions! See CONTRIBUTING.md for guidelines.
Ways to contribute:
- Report bugs or issues
- Suggest new features
- Improve documentation
- Add agent implementations
- Share evaluation results
    @inproceedings{zhang2026codabench,
      title={CoDA-BENCH: Can Code Agents Handle Data-Intensive Tasks?},
      author={Zhang, Yuxin and Fan, Ju and Fan, Meihao and Zhang, Shaolei and Du, Xiaoyong},
      booktitle={Proceedings of the 43rd International Conference on Machine Learning},
      year={2026},
      organization={PMLR}
    }

MIT License. See LICENSE for details.
Individual Kaggle datasets may have their own licenses.
- GitHub Issues: github.com/ruc-datalab/CoDA-Bench/issues
- Email: yuxin.zhang@ruc.edu.cn
You are welcome to join the CoDA-Bench WeChat group to chat and share ideas with others.
If you like CoDA-Bench, give it a GitHub star.

