GitHub - CANDRA2006/simple_spam_detector: Machine Learning spam detection. Uses Naive Bayes, TF-IDF, and NLTK for accurate email/SMS classification. Built with Python, Flask, and scikit-learn. Confidence scores, and interactive UI. Great for ML beginners.

Simple Spam Detector

A machine-learning system for classifying emails and SMS messages as spam or ham. It combines Natural Language Processing (NLP) preprocessing with a Multinomial Naive Bayes classifier trained on TF-IDF features.

Features

🤖 Machine Learning Powered: Uses Multinomial Naive Bayes with TF-IDF vectorization
🎯 High Accuracy: Expected accuracy of roughly 96-98% on standard spam datasets (see Performance Metrics)
🌐 Web Interface: Beautiful and interactive web UI built with Flask
📊 Detailed Analytics: Provides confidence scores and probability distributions
🚀 Easy to Use: Simple command-line interface and web application
📈 Model Evaluation: Comprehensive evaluation with confusion matrix and ROC curve
🔄 Batch Prediction: Support for multiple message predictions

Project Structure

simple_spam_detector/
├── app/
│   ├── static/
│   │   ├── css/
│   │   │   └── style.css
│   │   └── js/
│   │       └── animations.js
│   ├── templates/
│   │   └── index.html
│   └── app.py
├── data/
│   ├── raw/
│   │   └── spam.csv (not included - see Dataset section)
│   └── processed/
│       ├── X_train.pkl
│       ├── X_test.pkl
│       ├── y_train.pkl
│       └── y_test.pkl
├── model/
│   ├── spam_model.pkl
│   └── vectorizer.pkl
├── src/
│   ├── preprocess.py
│   ├── train.py
│   ├── evaluate.py
│   └── predict.py
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── run_all.py

Installation

Prerequisites

Python 3.8 or higher
pip (Python package installer)

Setup Steps

Clone the repository

git clone https://github.com/candra2006/simple_spam_detector.git
cd simple_spam_detector

Create a virtual environment (recommended)

python -m venv venv

# On Windows
venv\Scripts\activate

# On macOS/Linux
source venv/bin/activate

Install dependencies
```
pip install -r requirements.txt
```

Download NLTK data (automatic on first run)

python -c "import nltk; nltk.download('stopwords')"

Dataset

This project requires a spam dataset in CSV format. The dataset should have two columns:

Category: Label (spam/ham)
Message: The message text

Dataset Format Example

Category,Message
spam,"Congratulations! You've won $1000. Click here now!"
ham,"Hey, are we still meeting for lunch tomorrow?"
spam,"URGENT: Your account has been compromised!"
ham,"Thanks for dinner last night!"
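
Before running the pipeline it can help to check that a CSV file matches this layout. The following is a minimal sketch using only the standard library; the function name `load_messages` and its validation rules are illustrative assumptions, not code from this repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("Category", "Message")
VALID_LABELS = {"spam", "ham"}


def load_messages(csv_file):
    """Read (label, text) pairs from an open CSV file object.

    Hypothetical helper: it checks for the two columns described above and
    rejects labels other than spam/ham.
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = row["Category"].strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unexpected label {label!r}")
        rows.append((label, row["Message"]))
    return rows


# In-memory sample using two rows from the format example above.
sample = io.StringIO(
    "Category,Message\n"
    'spam,"Congratulations! You\'ve won $1000. Click here now!"\n'
    'ham,"Hey, are we still meeting for lunch tomorrow?"\n'
)
messages = load_messages(sample)
print(messages[1])
```

For a real file, pass `open("data/raw/spam.csv", newline="", encoding="utf-8")` instead of the in-memory sample; the encoding is an assumption and may differ for your dataset.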

Where to Get the Dataset

You can use any publicly available spam dataset, as long as it matches the two-column format above.

Setup Dataset

Download your preferred spam dataset
Place the CSV file in the data/raw/ folder
Rename it to spam.csv or update the path in src/preprocess.py

Usage

Quick Start (Recommended)

Use the automated script to run the complete pipeline:

python run_all.py

This will show a menu with options:

Run FULL PIPELINE (Preprocess → Train → Evaluate)
Run PREPROCESS only
Run TRAIN only
Run EVALUATE only
Run WEB APP only
Run TEST PREDICTION

Manual Step-by-Step

1. Data Preprocessing

Clean and prepare the dataset:

python src/preprocess.py

This will:

Load the raw dataset
Clean text (remove URLs, special characters, etc.)
Apply stemming and remove stopwords
Split into train/test sets (80/20)
Save processed data to data/processed/

2. Model Training

Train the spam detection model:

python src/train.py

This will:

Load preprocessed data
Convert text to TF-IDF features
Train Naive Bayes classifier
Save model and vectorizer to model/

3. Model Evaluation

Evaluate model performance:

python src/evaluate.py

This will:

Load test data and trained model
Calculate metrics (accuracy, precision, recall, F1)
Generate confusion matrix
Plot ROC curve
Save visualizations as PNG files
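
To make the reported metrics concrete, here is a small sketch that derives accuracy, precision, recall, and F1 from confusion-matrix counts in plain Python. It is not the code in src/evaluate.py; the function name `binary_metrics` and the toy labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion-matrix counts and the four headline metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: one spam message is missed and one ham message is flagged.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam, how much was caught"; for a spam filter, low precision means legitimate mail gets blocked.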

4. Run Web Application

Start the Flask web server:

cd app
python app.py

Then open your browser and navigate to:

http://localhost:5000

5. Command-Line Prediction

Test predictions from the command line:

python src/predict.py

Model Details

Text Preprocessing

Lowercasing: Convert all text to lowercase
URL Removal: Remove HTTP/HTTPS URLs
Email Removal: Remove email addresses
Special Characters: Remove non-alphabetic characters
Stopword Removal: Remove common English stopwords
Stemming: Apply Porter Stemmer algorithm
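
The steps above can be sketched as a single cleaning function. This is an approximation rather than the code in src/preprocess.py: the regular expressions, the tiny `STOPWORDS` set (a stand-in for NLTK's English list), and the optional `stem` hook are all assumptions chosen so the example runs without downloads.

```python
import re

# Tiny illustrative stopword set; the project uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "are", "at", "for", "is", "or", "the", "to", "you", "your"}

URL_RE = re.compile(r"https?://\S+")    # HTTP/HTTPS URLs
EMAIL_RE = re.compile(r"\S+@\S+")       # rough email-address pattern
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything that is not a letter or whitespace


def clean_text(text, stem=None):
    """Apply the preprocessing steps in the order listed above."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    if stem is not None:  # optional stemming hook, e.g. a Porter stemmer
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


cleaned = clean_text("Claim your FREE prize at https://example.com or mail win@example.com!!!")
print(cleaned)
```

To add Porter stemming, pass `stem=nltk.stem.PorterStemmer().stem`. URLs and email addresses are removed before the non-alphabetic filter so that their fragments do not survive as stray tokens.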

Feature Extraction

TF-IDF Vectorization
- Max features: 3000
- N-gram range: (1, 2)
- Min document frequency: 2
- Max document frequency: 0.95

Classification Algorithm

Multinomial Naive Bayes
- Alpha (smoothing parameter): 0.1
- Suitable for text classification
- Fast training and prediction
- Probabilistic output
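
The feature-extraction and classifier settings listed above can be combined as follows. This is a minimal sketch on a toy corpus, assuming scikit-learn is installed; the corpus, the `classify` helper, and the variable names are illustrative and are not taken from src/train.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (assumption): words repeat across messages so min_df=2 keeps them.
texts = [
    "win free prize now",
    "free prize claim now",
    "claim free cash now",
    "win cash prize",
    "lunch meeting tomorrow",
    "meeting notes tomorrow",
    "lunch with team tomorrow",
    "team meeting notes",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Vectorizer parameters as listed under Feature Extraction.
vectorizer = TfidfVectorizer(max_features=3000, ngram_range=(1, 2), min_df=2, max_df=0.95)
X = vectorizer.fit_transform(texts)

# Classifier parameters as listed under Classification Algorithm.
model = MultinomialNB(alpha=0.1)
model.fit(X, labels)


def classify(message):
    """Return class probabilities for one message as a plain dict."""
    proba = model.predict_proba(vectorizer.transform([message]))[0]
    return {str(label): float(p) for label, p in zip(model.classes_, proba)}


scores = classify("free prize now")
print(scores)
```

The same fitted `vectorizer` must be used at prediction time, which is why the project saves both model/spam_model.pkl and model/vectorizer.pkl.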

API Endpoints

`/predict` (POST)

Classify a single message.

Request:

{
  "text": "Congratulations! You've won a prize!"
}

Response:

{
  "success": true,
  "prediction": "Spam",
  "is_spam": true,
  "confidence": 95.67,
  "spam_probability": 95.67,
  "ham_probability": 4.33
}
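
A client call to this endpoint might look like the sketch below, which uses only the standard library. It assumes the app is running at http://localhost:5000 and that it reads a JSON body sent with a `Content-Type: application/json` header; the helper names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address from the "Run Web Application" step


def build_predict_request(text, base_url=BASE_URL):
    """Build the POST request for /predict without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed to be required
        method="POST",
    )


def predict(text, base_url=BASE_URL, timeout=5):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request needs no running server; predict() does.
request = build_predict_request("Congratulations! You've won a prize!")
print(request.get_method(), request.full_url)
```

With the Flask app running, `predict("Congratulations! You've won a prize!")` should return a dictionary shaped like the response above, so `result["is_spam"]` and `result["confidence"]` can be read directly.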

`/batch-predict` (POST)

Classify multiple messages.

Request:

{
  "texts": [
    "Message 1",
    "Message 2"
  ]
}

Response:

{
  "success": true,
  "results": [
    {
      "prediction": "Spam",
      "confidence": 92.5,
      ...
    },
    ...
  ]
}
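
The sketch below shows one way to send a batch and pick out the spam messages from the response. It relies on several assumptions that the response example does not confirm: that `results` are returned in the same order as `texts`, that non-spam items carry a label other than "Spam" (the sample uses "Ham"), and that the server listens on http://localhost:5000. The helper names are hypothetical.

```python
import json
import urllib.request


def batch_predict(texts, base_url="http://localhost:5000", timeout=10):
    """POST a list of messages to /batch-predict and return the JSON response."""
    request = urllib.request.Request(
        f"{base_url}/batch-predict",
        data=json.dumps({"texts": list(texts)}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def spam_indices(response):
    """Return the positions of results whose prediction is "Spam"."""
    if not response.get("success"):
        raise RuntimeError("batch prediction failed")
    return [i for i, item in enumerate(response["results"]) if item["prediction"] == "Spam"]


# Hand-written sample response; "Ham" as the non-spam label is an assumption.
sample_response = {
    "success": True,
    "results": [
        {"prediction": "Spam", "confidence": 92.5},
        {"prediction": "Ham", "confidence": 88.0},
        {"prediction": "Spam", "confidence": 71.3},
    ],
}
flagged = spam_indices(sample_response)
print(flagged)
```

If result order turns out not to match input order, the indices returned here cannot be mapped back to the original messages, so verify this against app/app.py before relying on it.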

Performance Metrics

Expected performance on standard spam datasets:

Accuracy: ~96-98%
Precision: ~95-97%
Recall: ~94-96%
F1 Score: ~95-97%
ROC AUC: ~98-99%

Note: Actual performance depends on the dataset used.

Web Interface Features

🎨 Modern and responsive design
⚡ Real-time prediction
📊 Confidence meters and probability bars
💡 Example messages for testing
⚠️ Warning indicators for spam messages
🎭 Animated UI elements
📱 Mobile-friendly

Customization

Change Machine Learning Model

Edit src/train.py and modify the model_type parameter:

trainer.train_pipeline(model_type='logistic_regression')
# Options: 'naive_bayes', 'logistic_regression', 'svm', 'random_forest'
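
How `train_pipeline` maps these names to estimators is not shown here. One plausible arrangement, given purely as an assumption, is a lookup table of scikit-learn classifiers; the `make_model` function and the hyperparameters below (other than alpha=0.1) are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical mapping from model_type names to estimator factories.
MODEL_FACTORIES = {
    "naive_bayes": lambda: MultinomialNB(alpha=0.1),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "svm": lambda: LinearSVC(),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
}


def make_model(model_type):
    """Return a fresh, unfitted estimator for the given model_type."""
    try:
        return MODEL_FACTORIES[model_type]()
    except KeyError:
        options = ", ".join(sorted(MODEL_FACTORIES))
        raise ValueError(f"unknown model_type {model_type!r}; choose from: {options}") from None


model = make_model("logistic_regression")
print(type(model).__name__)
```

One design point to keep in mind: `LinearSVC` has no `predict_proba` method, so a setup like this would need `SVC(probability=True)` or `CalibratedClassifierCV` to keep producing the confidence scores shown in the API responses.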

Adjust TF-IDF Parameters

Modify vectorizer settings in src/train.py:

self.vectorizer = TfidfVectorizer(
    max_features=5000,  # Increase feature count
    ngram_range=(1, 3),  # Use trigrams
    min_df=3,
    max_df=0.90
)

Modify Train/Test Split

Change the split ratio in src/preprocess.py:

test_size=0.3  # 30% for testing, 70% for training
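
For context, a `test_size` argument like this is typically passed to scikit-learn's `train_test_split`. The sketch below assumes that is what src/preprocess.py uses; the toy data, `random_state=42`, and `stratify=labels` are illustrative choices, not settings confirmed by the project.

```python
from sklearn.model_selection import train_test_split

# Toy data (assumption): ten messages with an even spam/ham balance.
messages = [f"message {i}" for i in range(10)]
labels = ["spam", "ham"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    messages,
    labels,
    test_size=0.3,     # 30% for testing, 70% for training
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep the spam/ham ratio similar in both sets
)
print(len(X_train), len(X_test))
```

Stratifying matters for spam data because real datasets are usually imbalanced; without it, a small test set can end up with very few spam examples and unstable metrics.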

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a new branch (git checkout -b feature/improvement)
Make your changes
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature/improvement)
Create a Pull Request

Acknowledgments

Dataset providers and Kaggle community
scikit-learn for machine learning tools
Flask for web framework
NLTK for natural language processing

Support

If you find this project helpful, please give it a ⭐️!

For issues and questions, please open an issue on GitHub.

Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'flask'

Solution: Install dependencies with pip install -r requirements.txt

Issue: FileNotFoundError: spam.csv not found

Solution: Place your dataset in data/raw/spam.csv

Issue: NLTK stopwords not found

Solution: Run python -c "import nltk; nltk.download('stopwords')"

Issue: Port 5000 already in use

Solution: Change port in app/app.py: app.run(port=5001)

Author

Candra
