DataExplainer: Comprehensible Agentic Data Science

This repository contains the complete source code for the paper: "DataExplainer: Comprehensible Agentic Data Science with Consumer-Grade Hardware." We provide this code to ensure transparency, allow for thorough auditing by peers, and facilitate the replication of our experimental results. DataExplainer adapts the DataInterpreter framework to enable data science tasks on accessible hardware while improving comprehensibility.

📂 Repository Structure

The project is organized into four primary functional areas:

| Component | Directory / File | Description |
| --- | --- | --- |
| Framework | `MetaGPT-DataExplainer/` | An adapted version of MetaGPT featuring our custom agentic logic. |
| Data Pipeline | `Data/Scripts/` | Utilities for preprocessing, metadata extraction, and post-processing. |
| Generated Notebooks | `Generation Examples/` | Examples of Jupyter notebooks generated by DataExplainer, with identifiable information removed. |
| Execution | Root Directory (`.py`, `.sh`) | Main entry points for running experiments, plus evaluation wrappers. |

Key Files in Root

run_experiments-explainer-base.py: Main execution script for the Explainer agent.
run_experiments-interpreter-base.py: Main execution script for the Interpreter agent.
current-comp.json & solved_competitions.json: Tracking files for experiment state.

🚀 Getting Started (Linux)

1. Prerequisites

Python: 3.9 <= version < 3.12
Kaggle API: Ensure your kaggle.json is configured in ~/.kaggle/.
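
The two prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of the repository: the helper names `python_version_ok` and `kaggle_credentials_present` are hypothetical, and the only facts taken from this README are the version range and the `~/.kaggle/kaggle.json` location.

```python
import sys
from pathlib import Path

# Version bounds taken from the Prerequisites above: 3.9 <= version < 3.12.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def python_version_ok(version_info=None):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper; defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


def kaggle_credentials_present(home=None):
    """Return True if kaggle.json exists under <home>/.kaggle/.

    Hypothetical helper; defaults to the current user's home directory.
    """
    home = Path(home) if home is not None else Path.home()
    return (home / ".kaggle" / "kaggle.json").is_file()


print("Python version supported:", python_version_ok())
print("kaggle.json found:", kaggle_credentials_present())
```

If either line prints `False`, fix that prerequisite before moving on to the data download.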

2. Data Acquisition

Download the necessary competition data and leaderboard benchmarks:

cd Data/Scripts/
python3 download_competitions.py
python3 download_leaderboards.py
cd ../../

3. Framework Setup

Install the adapted MetaGPT framework:

cd MetaGPT-DataExplainer/
pip install --upgrade -e .
metagpt --init-config
cd ..

You can learn more about the installation process in MetaGPT's repository.

4. LLM Configuration

DataExplainer requires a connection to a Large Language Model.

Configure API Keys: Edit ~/.metagpt/config2.yaml with your provider details (OpenAI, Anthropic, or local endpoints like Ollama).
Set Model Variables: Update the CURRENT_MODEL string inside the experiment scripts:

run_experiments-explainer-base.py
run_experiments-interpreter-base.py
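
A quick sanity check of the configuration file can catch a missing or empty config before a long experiment run. The sketch below assumes the file follows the upstream MetaGPT layout, in which provider settings live under a top-level `llm:` key; this adapted framework may differ. The function names `has_llm_section` and `check_config` are hypothetical, and the check is a plain text scan, not a full YAML parse.

```python
from pathlib import Path


def has_llm_section(config_text):
    """Return True if the text contains a top-level `llm:` key.

    Assumption: provider settings sit under `llm:` as in upstream MetaGPT.
    Only lines starting at column 0 count as top-level keys.
    """
    for line in config_text.splitlines():
        # Drop any trailing comment, then trailing whitespace.
        if line.split("#", 1)[0].rstrip() == "llm:":
            return True
    return False


def check_config(path=None):
    """Return True if the config file exists and has a top-level `llm:` key."""
    if path is None:
        path = Path.home() / ".metagpt" / "config2.yaml"
    path = Path(path)
    if not path.is_file():
        return False
    return has_llm_section(path.read_text(encoding="utf-8"))


print("config2.yaml looks usable:", check_config())
```

This only confirms that the section exists; it does not validate the API key or the model name.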

🧪 Running Experiments

Execute the evaluations for both the Explainer and Interpreter agents:

# Run the Explainer Base experiments
python3 run_experiments-explainer-base.py

# Run the Interpreter Base experiments
python3 run_experiments-interpreter-base.py

Note: These scripts use the wrappers (run_wrapper.sh) to manage environment state between iterations.
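
To run both evaluations unattended, the two commands above can be chained from a small driver. This is a sketch under stated assumptions, not a script shipped with the repository: `build_commands` and `run_all` are hypothetical names, and the driver is assumed to be started from the repository root so the script names resolve.

```python
import subprocess
import sys

# Script names as listed in this README; paths are relative to the repo root.
SCRIPTS = [
    "run_experiments-explainer-base.py",
    "run_experiments-interpreter-base.py",
]


def build_commands(python=sys.executable, scripts=SCRIPTS):
    """Return one argv list per experiment script."""
    return [[python, script] for script in scripts]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order and stop at the first non-zero exit code.

    Returns a list of (command, returncode) pairs for the commands that ran.
    `runner` is injectable so the logic can be exercised without real scripts.
    """
    results = []
    for command in commands:
        completed = runner(command)
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break
    return results
```

Calling `run_all(build_commands())` from the repository root runs the Explainer experiments first and starts the Interpreter experiments only if the first script exits cleanly.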

📊 Results and Submission

After inference is complete, use the following pipeline to evaluate performance:

Go to the Scripts directory: cd Data/Scripts
Submit to Kaggle: python3 submit_to_kaggle.py
Retrieve Scores: python3 get_scores.py
Finalize Data: python3 correct_scores.py

The final consolidated results will be available at: 📂 /Submissions/submission_scores.csv
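
Once the results file exists, a short script can confirm that it was written and is not empty. The sketch below makes no assumption about the column names, which this README does not document; it only reports the header row and the number of data rows. `summarize_scores` is a hypothetical helper, and the path is supplied by the caller.

```python
import csv
from pathlib import Path


def summarize_scores(path):
    """Return (header, row_count) for a CSV file.

    The header is returned as-is because the column names of
    submission_scores.csv are not specified in this README.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_scores("Submissions/submission_scores.csv")` returns the column names and the number of scored rows, assuming the file sits in a `Submissions` directory relative to where the script is run.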

📝 Citation

If you use this code or our findings in your research, please cite:

TBD
