Skip to content

CANDRA2006/simple_spam_detector

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

11 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Simple Spam Detector

An AI-powered email and message classification system built with Machine Learning. This project uses Natural Language Processing (NLP) and Naive Bayes classifier to detect spam messages with high accuracy.

Python Flask scikit-learn License

Features

  • πŸ€– Machine Learning Powered: Uses Multinomial Naive Bayes with TF-IDF vectorization
  • 🎯 High Accuracy: Achieves excellent performance on spam detection
  • 🌐 Web Interface: Beautiful and interactive web UI built with Flask
  • πŸ“Š Detailed Analytics: Provides confidence scores and probability distributions
  • πŸš€ Easy to Use: Simple command-line interface and web application
  • πŸ“ˆ Model Evaluation: Comprehensive evaluation with confusion matrix and ROC curve
  • πŸ”„ Batch Prediction: Support for multiple message predictions

Project Structure

simple_spam_detector/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ static/
β”‚   β”‚   β”œβ”€β”€ css/
β”‚   β”‚   β”‚   └── style.css
β”‚   β”‚   └── js/
β”‚   β”‚       └── animations.js
β”‚   β”œβ”€β”€ templates/
β”‚   β”‚   └── index.html
β”‚   └── app.py
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/
β”‚   β”‚   └── spam.csv (not included - see Dataset section)
β”‚   └── processed/
β”‚       β”œβ”€β”€ X_train.pkl
β”‚       β”œβ”€β”€ X_test.pkl
β”‚       β”œβ”€β”€ y_train.pkl
β”‚       └── y_test.pkl
β”œβ”€β”€ model/
β”‚   β”œβ”€β”€ spam_model.pkl
β”‚   └── vectorizer.pkl
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ preprocess.py
β”‚   β”œβ”€β”€ train.py
β”‚   β”œβ”€β”€ evaluate.py
β”‚   └── predict.py
β”œβ”€β”€ .gitignore
β”œβ”€β”€ LICENSE
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
└── run_all.py

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)

Setup Steps

  1. Clone the repository

    git clone https://github.com/candra2006/simple_spam_detector.git
    cd simple_spam_detector
  2. Create a virtual environment (recommended)

    python -m venv venv
    
    # On Windows
    venv\Scripts\activate
    
    # On macOS/Linux
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK data (automatic on first run)

    python -c "import nltk; nltk.download('stopwords')"

Dataset

This project requires a spam dataset in CSV format. The dataset should have two columns:

  • Category: Label (spam/ham)
  • Message: The message text

Dataset Format Example

Category,Message
spam,"Congratulations! You've won $1000. Click here now!"
ham,"Hey, are we still meeting for lunch tomorrow?"
spam,"URGENT: Your account has been compromised!"
ham,"Thanks for dinner last night!"

Where to Get the Dataset

You can use publicly available spam datasets such as:

Setup Dataset

  1. Download your preferred spam dataset
  2. Place the CSV file in the data/raw/ folder
  3. Rename it to spam.csv or update the path in src/preprocess.py

Usage

Quick Start (Recommended)

Use the automated script to run the complete pipeline:

python run_all.py

This will show a menu with options:

  1. Run FULL PIPELINE (Preprocess β†’ Train β†’ Evaluate)
  2. Run PREPROCESS only
  3. Run TRAIN only
  4. Run EVALUATE only
  5. Run WEB APP only
  6. Run TEST PREDICTION

Manual Step-by-Step

1. Data Preprocessing

Clean and prepare the dataset:

python src/preprocess.py

This will:

  • Load the raw dataset
  • Clean text (remove URLs, special characters, etc.)
  • Apply stemming and remove stopwords
  • Split into train/test sets (80/20)
  • Save processed data to data/processed/

2. Model Training

Train the spam detection model:

python src/train.py

This will:

  • Load preprocessed data
  • Convert text to TF-IDF features
  • Train Naive Bayes classifier
  • Save model and vectorizer to model/

3. Model Evaluation

Evaluate model performance:

python src/evaluate.py

This will:

  • Load test data and trained model
  • Calculate metrics (accuracy, precision, recall, F1)
  • Generate confusion matrix
  • Plot ROC curve
  • Save visualizations as PNG files

4. Run Web Application

Start the Flask web server:

cd app
python app.py

Then open your browser and navigate to:

http://localhost:5000

5. Command-Line Prediction

Test predictions from the command line:

python src/predict.py

Model Details

Text Preprocessing

  • Lowercasing: Convert all text to lowercase
  • URL Removal: Remove HTTP/HTTPS URLs
  • Email Removal: Remove email addresses
  • Special Characters: Remove non-alphabetic characters
  • Stopword Removal: Remove common English stopwords
  • Stemming: Apply Porter Stemmer algorithm

Feature Extraction

  • TF-IDF Vectorization
    • Max features: 3000
    • N-gram range: (1, 2)
    • Min document frequency: 2
    • Max document frequency: 0.95

Classification Algorithm

  • Multinomial Naive Bayes
    • Alpha (smoothing parameter): 0.1
    • Suitable for text classification
    • Fast training and prediction
    • Probabilistic output

API Endpoints

/predict (POST)

Classify a single message.

Request:

{
  "text": "Congratulations! You've won a prize!"
}

Response:

{
  "success": true,
  "prediction": "Spam",
  "is_spam": true,
  "confidence": 95.67,
  "spam_probability": 95.67,
  "ham_probability": 4.33
}

/batch-predict (POST)

Classify multiple messages.

Request:

{
  "texts": [
    "Message 1",
    "Message 2"
  ]
}

Response:

{
  "success": true,
  "results": [
    {
      "prediction": "Spam",
      "confidence": 92.5,
      ...
    },
    ...
  ]
}

Performance Metrics

Expected performance on standard spam datasets:

  • Accuracy: ~96-98%
  • Precision: ~95-97%
  • Recall: ~94-96%
  • F1 Score: ~95-97%
  • ROC AUC: ~98-99%

Note: Actual performance depends on the dataset used.

Web Interface Features

  • 🎨 Modern and responsive design
  • ⚑ Real-time prediction
  • πŸ“Š Confidence meters and probability bars
  • πŸ’‘ Example messages for testing
  • ⚠️ Warning indicators for spam messages
  • 🎭 Animated UI elements
  • πŸ“± Mobile-friendly

Customization

Change Machine Learning Model

Edit src/train.py and modify the model_type parameter:

trainer.train_pipeline(model_type='logistic_regression')
# Options: 'naive_bayes', 'logistic_regression', 'svm', 'random_forest'

Adjust TF-IDF Parameters

Modify vectorizer settings in src/train.py:

self.vectorizer = TfidfVectorizer(
    max_features=5000,  # Increase feature count
    ngram_range=(1, 3),  # Use trigrams
    min_df=3,
    max_df=0.90
)

Modify Train/Test Split

Change the split ratio in src/preprocess.py:

test_size=0.3  # 30% for testing, 70% for training

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/improvement)
  3. Make your changes
  4. Commit your changes (git commit -am 'Add new feature')
  5. Push to the branch (git push origin feature/improvement)
  6. Create a Pull Request

Acknowledgments

  • Dataset providers and Kaggle community
  • scikit-learn for machine learning tools
  • Flask for web framework
  • NLTK for natural language processing

Support

If you find this project helpful, please give it a ⭐️!

For issues and questions, please open an issue on GitHub.

Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'flask'

  • Solution: Install dependencies with pip install -r requirements.txt

Issue: FileNotFoundError: spam.csv not found

  • Solution: Place your dataset in data/raw/spam.csv

Issue: NLTK stopwords not found

  • Solution: Run python -c "import nltk; nltk.download('stopwords')"

Issue: Port 5000 already in use

  • Solution: Change port in app/app.py: app.run(port=5001)

Author

Candra

About

Machine Learning spam detection. Uses Naive Bayes, TF-IDF, and NLTK for accurate email/SMS classification. Built with Python, Flask, and scikit-learn. Confidence scores, and interactive UI. Great for ML beginners.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors