rahul256812/InstaVerify

📌 Instagram Fake Account Detection

A supervised machine learning classification pipeline designed to detect fake Instagram accounts based on profile attributes and behavioral patterns.


📊 Problem Statement

Fake, spam, or automated bot accounts on Instagram lead to misinformation, artificial engagement inflation, and increased security risks. This project aims to accurately classify accounts into Genuine (0) or Fake (1) using profile features, providing a production-ready model for spam detection.


📁 Dataset Overview

The model is trained on a balanced dataset of 576 user profiles (288 genuine, 288 fake).

Profile Features Analyzed:

  • Categorical / Binary Features:
    • profile pic (Presence of profile picture)
    • private (Account privacy setting)
    • external URL (Presence of an external URL in the bio)
  • Textual Features:
    • description length (Length of biography in characters)
    • fullname words (Word count of the account holder's name)
  • Numeric Features:
    • #followers (Number of followers)
    • #follows (Number of accounts followed)
    • #posts (Number of media posts)
  • Target Variable:
    • fake (0 = genuine profile, 1 = fake/spam profile)
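A minimal sketch of a schema and class-balance check for the columns listed above. The helper name `check_schema` and the tiny in-memory frame are illustrative assumptions, and the exact header spelling in the CSV files is assumed to match the list. In the project, the frame would presumably be read from `data/train.csv`.

```python
import pandas as pd

# Column names as listed above; the exact CSV header spelling is an assumption.
EXPECTED_COLUMNS = [
    "profile pic", "private", "external URL",
    "description length", "fullname words",
    "#followers", "#follows", "#posts", "fake",
]


def check_schema(df: pd.DataFrame) -> pd.Series:
    """Raise if expected columns are missing; return row counts per class."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not df["fake"].isin([0, 1]).all():
        raise ValueError("target 'fake' must contain only 0 or 1")
    return df["fake"].value_counts().sort_index()


# Hand-made rows for illustration only (not taken from the real dataset).
demo = pd.DataFrame({
    "profile pic": [1, 0, 1, 0],
    "private": [0, 0, 1, 0],
    "external URL": [1, 0, 0, 0],
    "description length": [82, 0, 40, 3],
    "fullname words": [2, 1, 2, 0],
    "#followers": [1200, 15, 340, 4],
    "#follows": [300, 2100, 280, 900],
    "#posts": [95, 0, 31, 1],
    "fake": [0, 1, 0, 1],
})
counts = check_schema(demo)
print(counts.to_dict())
```

On the full training set, a check like this should report equal counts for both classes if the data is balanced as described.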

🧠 Machine Learning Pipeline

1. Data Understanding & Exploratory Data Analysis (EDA)

  • Analyzed profile metrics using distribution plots, countplots, and log-scaled box plots.
  • Constructed a correlation heatmap to analyze multicollinearity.
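The correlation step can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the function name `correlation_table` is hypothetical, the demo rows are made up, and applying `log1p` to the count columns mirrors the log-scaled plots mentioned above rather than a confirmed preprocessing step.

```python
import numpy as np
import pandas as pd


def correlation_table(df: pd.DataFrame, count_cols: list) -> pd.DataFrame:
    """Pearson correlations after log1p-compressing heavy-tailed count columns."""
    out = df.copy()
    for col in count_cols:
        # log1p keeps zero counts finite while compressing large values.
        out[col] = np.log1p(out[col].astype(float))
    return out.corr()


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "#followers": [12, 4500, 80, 3, 950, 0],
    "#follows": [900, 300, 120, 2500, 410, 75],
    "#posts": [0, 210, 14, 1, 67, 0],
    "fake": [1, 0, 0, 1, 0, 1],
})
corr = correlation_table(demo, ["#followers", "#follows", "#posts"])
print(corr.round(2))
```

The resulting frame can be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)` to render the heatmap.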

2. Feature Engineering & Scaling

  • Engineered a custom feature, follower_following_ratio, which relates an account's follower count to the number of accounts it follows.
  • Applied standardization using StandardScaler to handle magnitude discrepancies in numeric attributes (e.g., follower counts vs. word counts).
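A hedged sketch of the two steps above. The exact formula behind `follower_following_ratio` is not stated here, so the definition below (followers divided by follows plus one, to avoid division by zero) is an assumption, as are the helper name `add_ratio` and the demo rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def add_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an assumed follower/following ratio column."""
    out = df.copy()
    # Assumed definition; the +1 guards against accounts that follow nobody.
    out["follower_following_ratio"] = out["#followers"] / (out["#follows"] + 1)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "#followers": [10, 2000, 35, 0],
    "#follows": [500, 150, 0, 99],
    "fullname words": [1, 2, 0, 3],
})
features = add_ratio(demo)

# StandardScaler centers each column to mean 0 and scales it to unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
print(scaled.round(2))
```

As a general practice, the scaler is fitted on training data only and the same fitted instance is reused to transform test data, so that test statistics do not leak into training.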

3. Model Training & Evaluation

Multiple classification algorithms were trained and cross-validated:

  • Decision Tree Classifier
  • Random Forest Classifier (selected as the final model because it generalized best; see the accuracy figures below)

Model Performance:

| Model         | Training Accuracy | Test Accuracy |
|---------------|-------------------|---------------|
| Decision Tree | 91.2%             | -             |
| Random Forest | 93.1%             | 95.0%         |
🛠️ Step-by-Step Implementation Pipeline

The project is organized as six Jupyter notebooks, intended to be run in order:

  1. 01_Data_Understanding.ipynb
    • Initial dataset ingestion, attribute exploration, and schema verification.
  2. 02_EDA_Feature_Insight.ipynb
    • In-depth statistical plotting, distribution analysis, and correlation profiling.
  3. 03_Feature_Engineering_and_Scaling.ipynb
    • Derivation of the interaction ratio and scaling attributes.
  4. 04_Model_Training_and_Evaluation.ipynb
    • Model construction, hyperparameter configuration, training, evaluation, and serialization.
  5. 05_Streamlit_Data_Preparation.ipynb
    • Structuring the final processed datasets for interactive web dashboards such as Streamlit.
  6. 06_Test_Evaluation_and_Prediction.ipynb
    • Model evaluation on unseen test partitions and outputting predictions.

✅ Deliverables & Directory Structure

INSTA-PROJECT/
│
├── data/
│   ├── train.csv                # Raw training dataset
│   ├── processed_train.csv      # Scaled training features
│   ├── test.csv                 # Unseen raw test dataset
│   ├── test_predictions.csv     # Predictions on test data (model output)
│   └── dashboard_data.csv       # Prepped data structure for Streamlit dashboard
│
├── models/
│   ├── best_rf_case2.pkl        # Serialized Random Forest model
│   └── scaler_case_2.pkl        # Serialized StandardScaler fitted parameters
│
├── notebooks/                   # Jupyter notebooks (01 to 06)
│
├── .gitignore                   # Version control file exclusions
├── app.py                       # Interactive Streamlit Web Dashboard & Classifier
└── README.md                    # Project documentation
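A hedged sketch of how the two serialized artifacts in `models/` might be used together. It assumes the files were written with `joblib` (which appears in the dependency list) and that the model was trained on scaled features; neither is confirmed here. The helper names are hypothetical, and a temporary directory with a small synthetic model stands in for the real `models/` folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# File names taken from the directory tree above.
MODEL_FILE = "best_rf_case2.pkl"
SCALER_FILE = "scaler_case_2.pkl"


def save_artifacts(model, scaler, model_dir: Path) -> None:
    """Write the model and scaler next to each other (hypothetical helper)."""
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_dir / MODEL_FILE)
    joblib.dump(scaler, model_dir / SCALER_FILE)


def load_and_predict(model_dir: Path, X_raw) -> np.ndarray:
    """Load both artifacts, scale raw features, and return 0/1 predictions."""
    scaler = joblib.load(model_dir / SCALER_FILE)
    model = joblib.load(model_dir / MODEL_FILE)
    return model.predict(scaler.transform(X_raw))


# Synthetic stand-in data and model, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models"
    save_artifacts(model, scaler, model_dir)
    preds = load_and_predict(model_dir, X[:5])

print(preds)
```

Loading the scaler alongside the model matters because predictions are only meaningful when new inputs are transformed with the same fitted parameters used during training.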

🚀 Getting Started

Prerequisites

Install the required dependencies:

pip install numpy pandas scikit-learn jupyter joblib matplotlib seaborn streamlit plotly

Running the App Dashboard

To run the interactive Streamlit dashboard locally:

streamlit run app.py

Running the Notebook Pipeline

From the project directory, start Jupyter:

jupyter notebook

Execute the notebooks sequentially from 01_Data_Understanding.ipynb to 06_Test_Evaluation_and_Prediction.ipynb to reproduce the training pipeline and prediction outputs.
