A supervised machine learning classification pipeline designed to detect fake Instagram accounts based on profile attributes and behavioral patterns.
Fake, spam, or automated bot accounts on Instagram lead to misinformation, artificial engagement inflation, and increased security risks. This project aims to accurately classify accounts into Genuine (0) or Fake (1) using profile features, providing a production-ready model for spam detection.
The model is trained on a balanced dataset of 576 user profiles (288 genuine, 288 fake).
- Categorical / Binary Features:
  - `profile pic` (Presence of profile picture)
  - `private` (Account privacy setting)
  - `external URL` (Presence of an external URL in the bio)
- Textual Features:
  - `description length` (Length of biography in characters)
  - `fullname words` (Word count of the account holder's name)
- Numeric Features:
  - `#followers` (Number of followers)
  - `#follows` (Number of accounts followed)
  - `#posts` (Number of media posts)
- Target Variable:
  - `fake` (0 = genuine profile, 1 = fake/spam profile)
- Analyzed profile metrics using distribution plots, countplots, and log-scaled box plots.
- Constructed a correlation heatmap to analyze multicollinearity.
- Engineered a custom feature, `follower_following_ratio`, to quantify profile interactions.
- Applied standardization using `StandardScaler` to handle magnitude discrepancies in numeric attributes (e.g., follower counts vs. word counts).
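The feature engineering and scaling steps above can be sketched roughly as follows. The exact formula for `follower_following_ratio` is not stated here, so the definition below (followers divided by follows plus one) is an assumption, and the sample values are made up purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample using the dataset's column names (values are illustrative only).
df = pd.DataFrame({
    "#followers": [1000, 12, 350, 0],
    "#follows": [150, 900, 300, 40],
    "#posts": [120, 0, 45, 1],
    "description length": [80, 0, 35, 0],
    "fullname words": [2, 1, 2, 0],
})

# Assumed definition: followers / (follows + 1); the +1 guards against division by zero.
df["follower_following_ratio"] = df["#followers"] / (df["#follows"] + 1)

numeric_cols = [
    "#followers", "#follows", "#posts",
    "description length", "fullname words", "follower_following_ratio",
]

# StandardScaler rescales each column to zero mean and unit variance,
# so follower counts and word counts end up on comparable scales.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)
print(scaled.round(3))
```

In a real pipeline the scaler would be fitted on the training split only and then reused, unchanged, on test data and at prediction time.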
Multiple classification algorithms were trained and cross-validated:
- Decision Tree Classifier
- Random Forest Classifier (Selected as the final model due to optimal generalization)
| Model | Training Accuracy | Test Accuracy |
|---|---|---|
| Decision Tree | 91.2% | - |
| Random Forest | 93.1% | 95.0% |
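A comparison like the one in the table could be produced along these lines. This is a minimal sketch on synthetic data from `make_classification`, not the project's dataset; the feature count, split ratio, fold count, and hyperparameters are all assumptions, so the printed numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 576-profile dataset (the feature count is arbitrary).
X, y = make_classification(
    n_samples=576, n_features=8, n_informative=5, random_state=0
)

# Hold out a stratified test split so class balance is preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical hyperparameters; the project's actual settings are not given here.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the training split.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    # Refit on the full training split, then score once on the held-out test split.
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_acc, test_acc)
    print(f"{name}: CV accuracy = {cv_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Reporting cross-validated accuracy next to held-out test accuracy makes it easier to see whether a model generalizes or merely fits the training split.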
The project is structured into 6 step-wise Jupyter Notebooks:
- 01_Data_Understanding.ipynb
  - Initial dataset ingestion, attribute exploration, and schema verification.
- 02_EDA_Feature_Insight.ipynb
  - In-depth statistical plotting, distribution analysis, and correlation profiling.
- 03_Feature_Engineering_and_Scaling.ipynb
  - Derivation of the interaction ratio and scaling attributes.
- 04_Model_Training_and_Evaluation.ipynb
  - Model construction, hyperparameter configuration, training, evaluation, and serialization.
- 05_Streamlit_Data_Preparation.ipynb
  - Structuring final processed datasets for interactive web dashboards (like Streamlit).
- 06_Test_Evaluation_and_Prediction.ipynb
  - Model evaluation on unseen test partitions and outputting predictions.
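The serialization and prediction steps in notebooks 04 and 06 might look roughly like the sketch below. It assumes the artifacts are written with `joblib` (which appears in the dependency list), uses random stand-in data, and saves to a temporary directory rather than the project's `models/` folder; only the two file names are taken from the project.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Random stand-in data; the real notebooks would use the processed profile features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_new = rng.normal(size=(5, 4))

# Fit the scaler and model together so both can be persisted as a matching pair.
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_rf_case2.pkl"
    scaler_path = Path(tmp) / "scaler_case_2.pkl"

    # Persist both artifacts, then reload them as a later notebook or app would.
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New data must go through the same fitted scaler before prediction.
predictions = loaded_model.predict(loaded_scaler.transform(X_new))
expected = model.predict(scaler.transform(X_new))
print(predictions)  # 0 = genuine, 1 = fake
```

Saving the scaler next to the model matters because a model trained on standardized inputs gives unreliable predictions on raw, unscaled features.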
INSTA-PROJECT/
│
├── data/
│ ├── train.csv # Raw training dataset
│ ├── processed_train.csv # Scaled training features
│ ├── test.csv # Unseen raw test dataset
│ ├── test_predictions.csv # Predictions on test data (model output)
│ └── dashboard_data.csv # Prepped data structure for Streamlit dashboard
│
├── models/
│ ├── best_rf_case2.pkl # Serialized Random Forest model
│ └── scaler_case_2.pkl # Serialized StandardScaler fitted parameters
│
├── notebooks/ # Jupyter notebooks (01 to 06)
│
├── .gitignore # Version control file exclusions
├── app.py # Interactive Streamlit Web Dashboard & Classifier
└── README.md # Project documentation
Install the required dependencies:

    pip install numpy pandas scikit-learn jupyter joblib matplotlib seaborn streamlit plotly

To run the interactive Streamlit dashboard locally:

    streamlit run app.py

Navigate to the directory and run Jupyter:

    jupyter notebook

Execute the notebooks sequentially from 01_Data_Understanding.ipynb to 06_Test_Evaluation_and_Prediction.ipynb to reproduce the training pipeline and prediction outputs.