joaopaulo7/DataExplainer-paper

DataExplainer: Comprehensible Agentic Data Science

This repository contains the complete source code for the paper: "DataExplainer: Comprehensible Agentic Data Science with Consumer-Grade Hardware." We provide this code to ensure transparency, allow for thorough auditing by peers, and facilitate the replication of our experimental results. DataExplainer adapts the DataInterpreter framework to enable data science tasks on accessible hardware while improving comprehensibility.

📂 Repository Structure

The project is organized into four primary functional areas:

| Component | Directory / File | Description |
| --- | --- | --- |
| Framework | MetaGPT-DataExplainer/ | An adapted version of MetaGPT featuring our custom agentic logic. |
| Data Pipeline | Data/Scripts/ | Utilities for preprocessing, metadata extraction, and post-processing. |
| Generated Notebooks | Generation\ Examples/ | Examples of Jupyter notebooks generated by DataExplainer with identifiable information removed. |
| Execution | Root Directory (.py, .sh) | Main entry points for running experiments and evaluation wrappers. |

Key Files in Root

  • run_experiments-explainer-base.py: Main execution script for the Explainer agent.
  • run_experiments-interpreter-base.py: Main execution script for the Interpreter agent.
  • current-comp.json & solved_competitions.json: Tracking files for experiment state.

🚀 Getting Started (Linux)

1. Prerequisites

  • Python: 3.9 <= version < 3.12
  • Kaggle API: Ensure your kaggle.json is configured in ~/.kaggle/.
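Before moving on, it may help to confirm both prerequisites from Python. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it only checks the version range and the kaggle.json location stated above.

```python
import sys
from pathlib import Path

# Supported range taken from the prerequisites above: 3.9 <= version < 3.12.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def python_version_ok(version=None):
    """Return True if (major, minor) lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


def kaggle_credentials_path(home=None):
    """Return the expected location of kaggle.json under ~/.kaggle/."""
    return (Path(home) if home else Path.home()) / ".kaggle" / "kaggle.json"


if __name__ == "__main__":
    # Report only; nothing is modified and the script never exits with an error.
    print("Python version supported:", python_version_ok())
    print("kaggle.json found:", kaggle_credentials_path().is_file())
```

If either line prints False, fix that prerequisite before downloading the competition data.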

2. Data Acquisition

Download the necessary competition data and leaderboard benchmarks:

cd Data/Scripts/
python3 download_competitions.py
python3 download_leaderboards.py
cd ../../

3. Framework Setup

Install the adapted MetaGPT framework:

cd MetaGPT-DataExplainer/
pip install --upgrade -e .
metagpt --init-config
cd ..

You can learn more about the installation process in MetaGPT's repository.

4. LLM Configuration

DataExplainer requires a connection to a Large Language Model.

  1. Configure API Keys: Edit ~/.metagpt/config2.yaml with your provider details (OpenAI, Anthropic, or local endpoints like Ollama).
  2. Set Model Variables: Update the CURRENT_MODEL string inside the experiment scripts:
  • run_experiments-explainer-base.py
  • run_experiments-interpreter-base.py
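A quick sanity check of the configuration file can catch a missing key before a long experiment run. The sketch below is an illustration only. It assumes that ~/.metagpt/config2.yaml has a top-level `llm:` section, as in upstream MetaGPT's sample configuration; the adapted framework in this repository may differ. The helper name `llm_keys` is hypothetical, and the scan is a plain line reader, not a YAML parser.

```python
from pathlib import Path

# Location named in the configuration step above.
CONFIG_PATH = Path.home() / ".metagpt" / "config2.yaml"


def llm_keys(config_text):
    """Collect the keys nested under a top-level `llm:` section.

    Assumption: the config uses an `llm:` block. This is a simple
    line scan for a sanity check, not a full YAML parse.
    """
    keys = set()
    in_llm = False
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        if line[0] not in " \t":
            # A new top-level key starts or ends the llm section.
            in_llm = line.strip() == "llm:"
        elif in_llm and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


if __name__ == "__main__":
    if CONFIG_PATH.is_file():
        print("Keys found under llm:", sorted(llm_keys(CONFIG_PATH.read_text())))
    else:
        print(f"{CONFIG_PATH} not found; run `metagpt --init-config` first.")
```

An empty result suggests the file has not been edited yet or uses a different layout than assumed here.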

🧪 Running Experiments

Execute the evaluations for both the Explainer and Interpreter agents:

# Run the Explainer Base experiments
python3 run_experiments-explainer-base.py

# Run the Interpreter Base experiments
python3 run_experiments-interpreter-base.py

Note: These scripts use the wrappers (run_wrapper.sh) to manage environment state between iterations.


📊 Results and Submission

After inference is complete, use the following pipeline to evaluate performance:

  1. Go to the Scripts directory: cd Data/Scripts
  2. Submit to Kaggle: python3 submit_to_kaggle.py
  3. Retrieve Scores: python3 get_scores.py
  4. Finalize Data: python3 correct_scores.py

The final consolidated results will be available at: 📂 /Submissions/submission_scores.csv
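The submission, score retrieval, and correction steps can also be chained from a single script so that a failure stops the pipeline early. This is a minimal sketch under stated assumptions: it is run from the repository root, the three scripts take no arguments (as in the steps listed), and the helper names `build_commands` and `run_pipeline` are hypothetical rather than part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Directory and script order taken from the numbered steps above.
SCRIPTS_DIR = Path("Data") / "Scripts"
PIPELINE = ["submit_to_kaggle.py", "get_scores.py", "correct_scores.py"]


def build_commands(python=sys.executable):
    """Return one command per pipeline step, in execution order."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(scripts_dir=SCRIPTS_DIR):
    """Run each step inside scripts_dir, stopping at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, cwd=scripts_dir, check=True)


if __name__ == "__main__":
    if SCRIPTS_DIR.is_dir():
        run_pipeline()
    else:
        print(f"{SCRIPTS_DIR} not found; run this from the repository root.")
```

Running the steps with `cwd` set to the Scripts directory mirrors the manual `cd Data/Scripts` step, in case the scripts rely on relative paths.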


📝 Citation

If you use this code or our findings in your research, please cite:

TBD
