📌 Instagram Fake Account Detection

A supervised machine learning classification pipeline designed to detect fake Instagram accounts based on profile attributes and behavioral patterns.

📊 Problem Statement

Fake, spam, and automated bot accounts on Instagram spread misinformation, inflate engagement artificially, and increase security risks. This project classifies accounts as Genuine (0) or Fake (1) from profile features, with the aim of providing a production-ready model for spam detection.

📁 Dataset Overview

The model is trained on a balanced dataset of 576 user profiles (288 genuine, 288 fake).

Profile Features Analyzed:

Categorical / Binary Features:
- profile pic (Presence of profile picture)
- private (Account privacy setting)
- external URL (Presence of an external URL in the bio)
Textual Features:
- description length (Length of biography in characters)
- fullname words (Word count of the account holder's name)
Numeric Features:
- #followers (Number of followers)
- #follows (Number of accounts followed)
- #posts (Number of media posts)
Target Variable:
- fake (0 = genuine profile, 1 = fake/spam profile)
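As a rough illustration of the schema above, the sketch below checks that a DataFrame carries the listed columns and counts the two classes. The column names are assumed to match the feature list exactly, the helper `summarize_profiles` is hypothetical, and the four-row sample is hand-made stand-in data, not rows from `data/train.csv`.

```python
import pandas as pd

# Column names assumed to match the feature list above.
EXPECTED_COLUMNS = [
    "profile pic", "private", "external URL",
    "description length", "fullname words",
    "#followers", "#follows", "#posts", "fake",
]


def summarize_profiles(df: pd.DataFrame) -> pd.Series:
    """Verify the expected columns exist and return the row count per class."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df["fake"].value_counts().sort_index()


# Hand-made sample standing in for the real dataset (illustrative values only).
sample = pd.DataFrame({
    "profile pic": [1, 0, 1, 0],
    "private": [0, 1, 0, 0],
    "external URL": [1, 0, 0, 0],
    "description length": [53, 0, 120, 0],
    "fullname words": [2, 1, 2, 0],
    "#followers": [1000, 12, 540, 3],
    "#follows": [350, 2100, 410, 1500],
    "#posts": [32, 0, 87, 1],
    "fake": [0, 1, 0, 1],
})

class_counts = summarize_profiles(sample)
print(class_counts)
```

On the real training file, the same check would be expected to report an even split between the two classes if the columns are named as assumed.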

🧠 Machine Learning Pipeline

1. Data Understanding & Exploratory Data Analysis (EDA)

Analyzed profile metrics using distribution plots, countplots, and log-scaled box plots.
Constructed a correlation heatmap to analyze multicollinearity.
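The two EDA ideas above (log scaling of heavy-tailed counts and a correlation matrix) can be sketched as follows. This is a minimal example on synthetic data, not the notebook's actual code; the log-normal distributions and their parameters are assumptions chosen only to mimic skewed count data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, heavy-tailed counts standing in for the real profile metrics.
df = pd.DataFrame({
    "#followers": rng.lognormal(mean=5, sigma=2, size=200).astype(int),
    "#follows": rng.lognormal(mean=5, sigma=1, size=200).astype(int),
    "#posts": rng.lognormal(mean=3, sigma=1.5, size=200).astype(int),
})

# log1p compresses the long right tail and is well defined for zero counts.
log_df = np.log1p(df)

# Pairwise Pearson correlations; large off-diagonal values hint at multicollinearity.
corr = log_df.corr()
print(corr.round(2))
```

The resulting matrix is what a plotting call such as `seaborn.heatmap(corr, annot=True)` would take as input.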

2. Feature Engineering & Scaling

Engineered a custom feature, follower_following_ratio, which relates an account's follower count to the number of accounts it follows.
Applied standardization using StandardScaler to handle magnitude discrepancies in numeric attributes (e.g., follower counts vs. word counts).
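A minimal sketch of this step is shown below. The exact formula behind `follower_following_ratio` is not given here, so the definition used (followers divided by follows plus one, to avoid division by zero) is an assumption, as are the sample values.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative values only; not rows from the real dataset.
df = pd.DataFrame({
    "#followers": [1000, 12, 540, 3],
    "#follows": [350, 2100, 410, 1500],
    "fullname words": [2, 1, 2, 0],
})

# Assumed definition: followers / (follows + 1); the +1 guards against division by zero.
df["follower_following_ratio"] = df["#followers"] / (df["#follows"] + 1)

# StandardScaler centers each column to mean 0 and rescales it to unit variance,
# so follower counts in the thousands no longer dwarf single-digit word counts.
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.round(3))
```

The fitted `scaler` should be reused unchanged on test or live data, which is presumably why the project serializes it alongside the model.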

3. Model Training & Evaluation

Multiple classification algorithms were trained and cross-validated:

Decision Tree Classifier
Random Forest Classifier (selected as the final model because it generalized better than the single decision tree)

Model Performance:

| Model | Training Accuracy | Test Accuracy |
|---|---|---|
| Decision Tree | 91.2% | - |
| Random Forest | 93.1% | 95.0% |
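The comparison above can be sketched as follows. This is not the project's training code: the data is synthetic (generated with `make_classification` at the same size as the dataset), and the hyperparameters, the 80/20 split, and the 5-fold setting are all assumptions, so the printed scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 576 rows and 9 features (sizes chosen to mirror the dataset).
X, y = make_classification(
    n_samples=576, n_features=9, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters here are placeholders, not the tuned values from the notebooks.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the training split only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    model.fit(X_train, y_train)
    results[name] = {
        "cv_mean": cv_scores.mean(),
        "test_accuracy": model.score(X_test, y_test),
    }
    print(f"{name}: cv={results[name]['cv_mean']:.3f} "
          f"test={results[name]['test_accuracy']:.3f}")
```

Keeping the test split out of cross-validation means the test accuracy remains an estimate on data the model has not seen during model selection.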

🛠️ Step-by-Step Implementation Pipeline

The project is structured as six Jupyter notebooks, intended to be run in order:

01_Data_Understanding.ipynb
- Initial dataset ingestion, attribute exploration, and schema verification.
02_EDA_Feature_Insight.ipynb
- In-depth statistical plotting, distribution analysis, and correlation profiling.
03_Feature_Engineering_and_Scaling.ipynb
- Derivation of the interaction ratio and scaling attributes.
04_Model_Training_and_Evaluation.ipynb
- Model construction, hyperparameter configuration, training, evaluation, and serialization.
05_Streamlit_Data_Preparation.ipynb
- Structuring the final processed datasets for an interactive web dashboard (Streamlit).
06_Test_Evaluation_and_Prediction.ipynb
- Model evaluation on unseen test partitions and outputting predictions.
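The serialization step in notebook 04 and the prediction step in notebook 06 might look roughly like the sketch below. It assumes the artifacts are written with `joblib` (which is listed among the dependencies); the file names `model.pkl` and `scaler.pkl`, the temporary directory, and the synthetic data are all hypothetical stand-ins for the project's own files.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed training data.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; the repository keeps its artifacts under models/.
    model_path = Path(tmp) / "model.pkl"
    scaler_path = Path(tmp) / "scaler.pkl"
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    # Reload both artifacts, as a separate prediction notebook or app would.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must go through the same fitted scaler before prediction.
new_rows = X[:5]
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
print(predictions)  # array of 0 (genuine) / 1 (fake) labels
```

Saving the scaler together with the model keeps training-time and prediction-time preprocessing identical.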

✅ Deliverables & Directory Structure

INSTA-PROJECT/
│
├── data/
│   ├── train.csv                # Raw training dataset
│   ├── processed_train.csv      # Scaled training features
│   ├── test.csv                 # Unseen raw test dataset
│   ├── test_predictions.csv     # Predictions on test data (model output)
│   └── dashboard_data.csv       # Prepped data structure for Streamlit dashboard
│
├── models/
│   ├── best_rf_case2.pkl        # Serialized Random Forest model
│   └── scaler_case_2.pkl        # Serialized StandardScaler fitted parameters
│
├── notebooks/                   # Jupyter notebooks (01 to 06)
│
├── .gitignore                   # Version control file exclusions
├── app.py                       # Interactive Streamlit Web Dashboard & Classifier
└── README.md                    # Project documentation

🚀 Getting Started

Prerequisites

Install the required dependencies:

pip install numpy pandas scikit-learn jupyter joblib matplotlib seaborn streamlit plotly

Running the App Dashboard

To run the interactive Streamlit dashboard locally:

streamlit run app.py

Running the Notebook Pipeline

Navigate to the project directory and start Jupyter:

jupyter notebook

Execute the notebooks sequentially from 01_Data_Understanding.ipynb to 06_Test_Evaluation_and_Prediction.ipynb to reproduce the training pipeline and prediction outputs.
