This repository contains the complete source code for the paper: "DataExplainer: Comprehensible Agentic Data Science with Consumer-Grade Hardware." We provide this code to ensure transparency, allow for thorough auditing by peers, and facilitate the replication of our experimental results. DataExplainer adapts the DataInterpreter framework to enable data science tasks on accessible hardware while improving comprehensibility.
The project is organized into four primary functional areas:
| Component | Directory / File | Description |
|---|---|---|
| Framework | MetaGPT-DataExplainer/ | An adapted version of MetaGPT featuring our custom agentic logic. |
| Data Pipeline | Data/Scripts/ | Utilities for preprocessing, metadata extraction, and post-processing. |
| Generated Notebooks | Generation\ Examples/ | Examples of Jupyter notebooks generated by DataExplainer with identifiable information removed. |
| Execution | Root Directory (.py, .sh) | Main entry points for running experiments and evaluation wrappers. |
- `run_experiments-explainer-base.py`: Main execution script for the Explainer agent.
- `run_experiments-interpreter-base.py`: Main execution script for the Interpreter agent.
- `current-comp.json` and `solved_competitions.json`: Tracking files for experiment state.
- Python: 3.9 <= version < 3.12
- Kaggle API: Ensure your `kaggle.json` is configured in `~/.kaggle/`.
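
The two prerequisites above can be checked before downloading any data. The following is a minimal sketch, not part of the repository: the function names `python_version_ok` and `kaggle_credentials_present` are hypothetical, and the check only confirms that `kaggle.json` exists, not that its credentials are valid.

```python
import sys
from pathlib import Path


def python_version_ok(version=None):
    """Return True if the interpreter satisfies 3.9 <= version < 3.12."""
    # Compare only (major, minor); patch releases do not matter here.
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 12)


def kaggle_credentials_present(home=None):
    """Return True if <home>/.kaggle/kaggle.json exists as a regular file."""
    home = Path(home) if home is not None else Path.home()
    return (home / ".kaggle" / "kaggle.json").is_file()


if __name__ == "__main__":
    print("Python version supported:", python_version_ok())
    print("kaggle.json found:", kaggle_credentials_present())
```

Both functions accept an optional argument so they can be exercised against a specific version tuple or a temporary home directory.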
Download the necessary competition data and leaderboard benchmarks:
cd Data/Scripts/
python3 download_competitions.py
python3 download_leaderboards.py
cd ../../
Install the adapted MetaGPT framework:
cd MetaGPT-DataExplainer/
pip install --upgrade -e .
metagpt --init-config
cd ..

You can learn more about the installation process in MetaGPT's repository.
DataExplainer requires a connection to a Large Language Model.
- Configure API Keys: Edit `~/.metagpt/config2.yaml` with your provider details (OpenAI, Anthropic, or local endpoints like Ollama).
- Set Model Variables: Update the `CURRENT_MODEL` string inside the experiment scripts:
  - `run_experiments-explainer-base.py`
  - `run_experiments-interpreter-base.py`
Execute the evaluations for both the Explainer and Interpreter agents:
# Run the Explainer Base experiments
python3 run_experiments-explainer-base.py
# Run the Interpreter Base experiments
python3 run_experiments-interpreter-base.py
Note: These scripts use the wrappers (`run_wrapper.sh`) to manage environment state between iterations.
After inference is complete, use the following pipeline to evaluate performance:
- Go to the Scripts directory: `cd Data/Scripts`
- Submit to Kaggle: `python3 submit_to_kaggle.py`
- Retrieve Scores: `python3 get_scores.py`
- Finalize Data: `python3 correct_scores.py`
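
Because each step depends on the output of the previous one, the three scripts can also be chained so that a failure stops the pipeline early. The sketch below is an illustration only and is not shipped with the repository: `run_pipeline` is a hypothetical helper, and it assumes the scripts take no command-line arguments and signal failure through a non-zero exit code.

```python
import subprocess
import sys

# The three evaluation steps listed above, in the order they must run.
PIPELINE = ("submit_to_kaggle.py", "get_scores.py", "correct_scores.py")


def run_pipeline(scripts=PIPELINE, cwd="Data/Scripts", run=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    The `run` parameter exists so the function can be tested without
    launching real processes.
    """
    completed = []
    for script in scripts:
        # Use the current interpreter so the active environment is reused.
        result = run([sys.executable, script], cwd=cwd)
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root runs all three steps inside `Data/Scripts`; a returned list shorter than three entries indicates which step failed.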
The final consolidated results will be available at:
📂 /Submissions/submission_scores.csv
If you use this code or our findings in your research, please cite:
TBD