A graph-based system for analyzing the impact of code changes in software repositories.
It extracts structural information from the code, builds a dependency graph, and identifies components affected by a given change.
The system integrates machine learning to classify and rank impacted components, helping developers focus testing efforts and reduce regression risks.
- Fetch commits from a GitHub repository
- Select and compare two commits
- Analyze code structure using AST
- Build dependency graph of functions/modules
- Predict impact severity using ML
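The AST pass and the dependency graph are the core of the pipeline. The snippet below is a minimal sketch of the idea rather than this project's actual code: it uses Python's `ast` module to record simple function calls, builds a `networkx` digraph, and walks reverse dependencies to list functions that could be affected by a change. The names `build_call_graph` and `impacted_by` are illustrative, not functions from this repository.

```python
# Minimal sketch (not the project's implementation) of AST-based
# call-graph construction and impact propagation with networkx.
import ast
import networkx as nx

def build_call_graph(source: str) -> nx.DiGraph:
    """Add an edge caller -> callee for every simple function call."""
    tree = ast.parse(source)
    graph = nx.DiGraph()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph.add_node(node.name)
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    graph.add_edge(node.name, inner.func.id)
    return graph

def impacted_by(graph: nx.DiGraph, changed: str) -> set[str]:
    """Every function that (transitively) calls the changed one."""
    return nx.ancestors(graph, changed) if changed in graph else set()

code = """
def load(path): ...
def parse(path): return load(path)
def report(path): return parse(path)
"""
g = build_call_graph(code)
print(impacted_by(g, "load"))  # {'parse', 'report'}
```

In the real system the graph also spans modules, and the ML model then classifies and ranks the impacted candidates by severity.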
- Clone the repository
- Create a virtual environment:

  ```
  python -m venv venv
  ```

- Activate it:
  - Linux/macOS:

    ```
    source venv/bin/activate
    ```

  - Windows:

    ```
    venv\Scripts\activate
    ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Create a `.env` file:

  ```
  FLASK_DEBUG=1
  MONGO_URI=<MongoDB connection string>
  GITHUB_TOKEN=<GitHub personal access token>
  HF_TOKEN=<HuggingFace access token>
  GEMINI_API_KEY=<Gemini API key>
  GEMINI_MODEL=gemini-2.5-flash-lite
  ```
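This README does not show how these values are loaded at startup; a typical Flask-style pattern, assuming the project uses python-dotenv, would be:

```python
# Illustrative only: one common way a Flask app reads these settings.
# The project's actual startup code may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

MONGO_URI = os.environ["MONGO_URI"]           # required
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]     # required for the GitHub API
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # optional: AI summary is skipped without it
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.5-flash-lite")
```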
Optional:

```
# Disable all ML tagging entirely.
DISABLE_ML_TAGGER=1

# Or disable only the local model path.
# With USE_LOCAL_MODEL=0, the app can still use Hugging Face Spaces.
# If the Space is unavailable, it will not fall back to the local model.
DISABLE_ML_TAGGER=local_only

# Defaults to local model inference.
# Set to 0/false to use the Hugging Face Spaces Gradio client instead.
USE_LOCAL_MODEL=1

# Optional when USE_LOCAL_MODEL=0
HF_SPACE_ID=VantaTree/MLCodeTagger
```
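The exact precedence of these flags is only described by the comments above; the sketch below is a guess at the resulting selection order (fully disabled, then Hugging Face Space, then local model), using a hypothetical helper name:

```python
# Hypothetical selection logic inferred from the flag comments above;
# the real code in this repository may order these checks differently.
import os

def choose_tagger_backend() -> str | None:
    mode = os.getenv("DISABLE_ML_TAGGER", "")
    if mode == "1":
        return None                      # ML tagging fully disabled
    use_local = os.getenv("USE_LOCAL_MODEL", "1").lower() not in ("0", "false")
    if mode == "local_only" or not use_local:
        return "hf_space"                # Gradio client against HF_SPACE_ID
    return "local"                       # default: local model inference
```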
Build the dataset:

```
python ml_tagger/build_raw_dataset.py
python ml_tagger/dataset_builder.py
```

Configure `accelerate`:

```
accelerate config
```

Answers for GPU training:

- This machine
- No distributed training
- Do you want to run on CPU only? → NO
- torch dynamo → NO
- DeepSpeed → NO
- GPU ids → 0
- NUMA efficiency → NO
- Mixed precision → fp16

Answers for CPU-only training:

- This machine
- No distributed training
- Run on CPU only → YES
Train the model:

```
python ml_tagger/train.py
```

Model and dataset are stored locally and are ignored by git.
Run the app:

```
python app.py
```

Open: http://127.0.0.1:5000
- `ml_tagger/data/` and `model/` are excluded from git
- Dataset is generated locally for reproducibility
- Only the final trained model is required for inference
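If the trained model is saved in the standard Hugging Face format, local inference can be as small as the sketch below. The `model/` path and the text-classification task are assumptions based on this README, not details confirmed by `ml_tagger`:

```python
# Assumed usage: load the locally trained model for inference.
# The task type and the model/ directory layout are guesses based on
# this README, not verified against ml_tagger's code.
from transformers import pipeline

tagger = pipeline("text-classification", model="model/")
print(tagger("def save_user(user): db.users.insert_one(user)"))
```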
The analysis page can now generate an optional AI summary on top of the deterministic graph analysis.
How it works:
- `services/analyzer.py` builds the normal impact result
- `services/ai_summary.py` converts that result into a compact LLM prompt
- If `GEMINI_API_KEY` is configured, the app requests a structured summary from the Gemini API
- The summary is rendered in the AI Summary section of the analysis page
If AI config is missing or the API call fails, the rest of the analysis still works normally.
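As a rough sketch of that flow (not the actual code in `services/ai_summary.py`), the Gemini call with the graceful fallback described above could look like this with the `google-generativeai` client:

```python
# Illustrative sketch only; services/ai_summary.py may structure this differently.
import json
import os
import google.generativeai as genai

def summarize(impact_result: dict) -> str | None:
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        return None  # no AI config: the deterministic analysis is still shown
    try:
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel(os.getenv("GEMINI_MODEL", "gemini-2.5-flash-lite"))
        prompt = (
            "Summarize the change impact for a developer in a few bullet points:\n"
            + json.dumps(impact_result, indent=2)
        )
        return model.generate_content(prompt).text
    except Exception:
        return None  # API failure: fall back to the graph analysis alone
```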