An AI-powered email and message classification system built with machine learning. This project uses Natural Language Processing (NLP) and a Multinomial Naive Bayes classifier to detect spam messages, with an expected accuracy of roughly 96-98% on standard spam datasets.
- Machine Learning Powered: Uses Multinomial Naive Bayes with TF-IDF vectorization
- High Accuracy: Achieves excellent performance on spam detection
- Web Interface: Beautiful and interactive web UI built with Flask
- Detailed Analytics: Provides confidence scores and probability distributions
- Easy to Use: Simple command-line interface and web application
- Model Evaluation: Comprehensive evaluation with confusion matrix and ROC curve
- Batch Prediction: Support for multiple message predictions
simple_spam_detector/
├── app/
│   ├── static/
│   │   ├── css/
│   │   │   └── style.css
│   │   └── js/
│   │       └── animations.js
│   ├── templates/
│   │   └── index.html
│   └── app.py
├── data/
│   ├── raw/
│   │   └── spam.csv (not included - see Dataset section)
│   └── processed/
│       ├── X_train.pkl
│       ├── X_test.pkl
│       ├── y_train.pkl
│       └── y_test.pkl
├── model/
│   ├── spam_model.pkl
│   └── vectorizer.pkl
├── src/
│   ├── preprocess.py
│   ├── train.py
│   ├── evaluate.py
│   └── predict.py
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── run_all.py
- Python 3.8 or higher
- pip (Python package installer)
- Clone the repository
    git clone https://github.com/candra2006/simple_spam_detector.git
    cd simple_spam_detector
- Create a virtual environment (recommended)
    python -m venv venv
    # On Windows
    venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
- Install dependencies
    pip install -r requirements.txt
- Download NLTK data (automatic on first run)
    python -c "import nltk; nltk.download('stopwords')"
This project requires a spam dataset in CSV format. The dataset should have two columns:
- Category: Label (spam/ham)
- Message: The message text
Category,Message
spam,"Congratulations! You've won $1000. Click here now!"
ham,"Hey, are we still meeting for lunch tomorrow?"
spam,"URGENT: Your account has been compromised!"
ham,"Thanks for dinner last night!"
You can use any publicly available spam dataset that follows this format:
- Download your preferred spam dataset
- Place the CSV file in the data/raw/ folder
- Rename it to spam.csv or update the path in src/preprocess.py
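Before running the pipeline, it can help to confirm that the CSV matches the expected two-column layout. The following is a minimal sketch using only the standard library; `validate_rows` is a hypothetical helper, not part of the project, and it assumes the header is exactly `Category,Message` and the labels are `spam` or `ham`.

```python
import csv
import io

EXPECTED_COLUMNS = ["Category", "Message"]
VALID_LABELS = {"spam", "ham"}


def validate_rows(csv_file):
    """Return (label, message) pairs, raising ValueError on malformed input.

    Hypothetical helper for illustration; not part of the project's code.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(
            f"expected columns {EXPECTED_COLUMNS}, got {reader.fieldnames}"
        )
    rows = []
    for row_no, row in enumerate(reader, start=1):
        label = (row["Category"] or "").strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"data row {row_no}: unknown label {row['Category']!r}")
        rows.append((label, row["Message"]))
    return rows


# In-memory sample mirroring the format shown above.
sample = io.StringIO(
    "Category,Message\n"
    'spam,"Congratulations! You\'ve won $1000. Click here now!"\n'
    'ham,"Hey, are we still meeting for lunch tomorrow?"\n'
)
rows = validate_rows(sample)
print(len(rows), rows[0][0], rows[1][0])
```

To check a real file, pass an open file object instead, for example one opened with `open("data/raw/spam.csv", newline="", encoding="utf-8")`. The encoding is an assumption and may need adjusting for your dataset.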
Use the automated script to run the complete pipeline:
python run_all.py
This will show a menu with options:
- Run FULL PIPELINE (Preprocess → Train → Evaluate)
- Run PREPROCESS only
- Run TRAIN only
- Run EVALUATE only
- Run WEB APP only
- Run TEST PREDICTION
Clean and prepare the dataset:
python src/preprocess.py
This will:
- Load the raw dataset
- Clean text (remove URLs, special characters, etc.)
- Apply stemming and remove stopwords
- Split into train/test sets (80/20)
- Save processed data to data/processed/
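The 80/20 split in the steps above can be illustrated with a small standard-library sketch. This is not the project's implementation (which may rely on a library routine); `split_train_test` and the seed value are assumptions made for illustration, with `test_size` named to match the setting in src/preprocess.py.

```python
import random


def split_train_test(texts, labels, test_size=0.2, seed=42):
    """Shuffle paired samples and split them into train and test portions.

    Hypothetical helper; the seed only makes the shuffle reproducible.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [texts[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


texts = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 2 else "ham" for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(texts, labels)
print(len(X_train), len(X_test))  # 8 2
```

Unlike a stratified split, this sketch does not preserve the spam/ham ratio in each portion, which matters for imbalanced datasets.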
Train the spam detection model:
python src/train.py
This will:
- Load preprocessed data
- Convert text to TF-IDF features
- Train Naive Bayes classifier
- Save model and vectorizer to model/
Evaluate model performance:
python src/evaluate.py
This will:
- Load test data and trained model
- Calculate metrics (accuracy, precision, recall, F1)
- Generate confusion matrix
- Plot ROC curve
- Save visualizations as PNG files
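To make the reported metrics concrete, here is a minimal sketch that derives accuracy, precision, recall, and F1 from the confusion-matrix counts, treating `spam` as the positive class. The function name and the sample labels are hypothetical; in practice scikit-learn's `sklearn.metrics` module provides equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion-matrix counts and derived metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
    }


# Made-up labels for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For spam filtering, precision is often the metric to watch most closely, since a false positive means a legitimate message is flagged as spam.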
Start the Flask web server:
cd app
python app.py
Then open your browser and navigate to:
http://localhost:5000
Test predictions from the command line:
python src/predict.py
The text preprocessing pipeline applies the following steps:
- Lowercasing: Convert all text to lowercase
- URL Removal: Remove HTTP/HTTPS URLs
- Email Removal: Remove email addresses
- Special Characters: Remove non-alphabetic characters
- Stopword Removal: Remove common English stopwords
- Stemming: Apply Porter Stemmer algorithm
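The cleaning steps listed above can be sketched with regular expressions. This is an illustrative approximation, not the code in src/preprocess.py: the regex patterns and the tiny stopword set are assumptions (the project uses NLTK's English stopword list), and the Porter stemming step is omitted to keep the sketch dependency-free.

```python
import re

# Tiny illustrative stopword set; the project uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "or", "to", "for", "you", "your", "we"}

URL_RE = re.compile(r"https?://\S+")      # HTTP/HTTPS URLs
EMAIL_RE = re.compile(r"\S+@\S+")         # anything that looks like an address
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # digits, punctuation, symbols


def clean_text(text):
    """Lowercase, strip URLs/emails/non-letters, and drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_text(
    "URGENT: Visit http://example.com or mail win@example.com to claim your prize!"
)
print(cleaned)  # urgent visit mail claim prize
```

The order matters: URLs and email addresses are removed before the non-alphabetic filter, otherwise their fragments would survive as ordinary tokens.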
- TF-IDF Vectorization
  - Max features: 3000
  - N-gram range: (1, 2)
  - Min document frequency: 2
  - Max document frequency: 0.95
- Multinomial Naive Bayes
  - Alpha (smoothing parameter): 0.1
  - Suitable for text classification
  - Fast training and prediction
  - Probabilistic output
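The hyperparameters listed above map directly onto scikit-learn's `TfidfVectorizer` and `MultinomialNB`. The sketch below shows that configuration on a toy corpus; the corpus and variable names are made up for illustration, and the project's actual training code in src/train.py may be organized differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus for illustration only; real training uses the processed dataset.
texts = [
    "win free prize now",
    "free prize claim now",
    "claim free cash prize",
    "lunch meeting tomorrow",
    "thanks for dinner",
    "meeting tomorrow at noon",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectorizer settings taken from the configuration listed above.
vectorizer = TfidfVectorizer(
    max_features=3000,
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=2,            # ignore terms seen in fewer than 2 documents
    max_df=0.95,         # ignore terms seen in more than 95% of documents
)
X = vectorizer.fit_transform(texts)

model = MultinomialNB(alpha=0.1)  # alpha is the smoothing parameter
model.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction.
new_X = vectorizer.transform(["claim your free prize"])
prediction = model.predict(new_X)[0]
probabilities = dict(zip(model.classes_, model.predict_proba(new_X)[0]))
print(prediction, probabilities)
```

Because the classifier only understands the vocabulary learned during fitting, the vectorizer has to be saved alongside the model, which is why the model/ folder holds both files.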
Classify a single message.
Request:
{
  "text": "Congratulations! You've won a prize!"
}
Response:
{
  "success": true,
  "prediction": "Spam",
  "is_spam": true,
  "confidence": 95.67,
  "spam_probability": 95.67,
  "ham_probability": 4.33
}
Classify multiple messages.
Request:
{
  "texts": [
    "Message 1",
    "Message 2"
  ]
}
Response:
{
  "success": true,
  "results": [
    {
      "prediction": "Spam",
      "confidence": 92.5,
      ...
    },
    ...
  ]
}
Expected performance on standard spam datasets:
- Accuracy: ~96-98%
- Precision: ~95-97%
- Recall: ~94-96%
- F1 Score: ~95-97%
- ROC AUC: ~98-99%
Note: Actual performance depends on the dataset used.
- Modern and responsive design
- Real-time prediction
- Confidence meters and probability bars
- Example messages for testing
- Warning indicators for spam messages
- Animated UI elements
- Mobile-friendly
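The response fields documented in the API examples (`confidence`, `spam_probability`, `ham_probability`) can all be derived from a single spam probability. The sketch below shows one way to do that. `build_response` is a hypothetical helper, and it assumes that `confidence` is the percentage for the predicted class, that the decision threshold is 0.5, and that non-spam predictions are labeled "Ham"; the project's actual logic may differ.

```python
def build_response(spam_probability):
    """Build a response dict in the documented shape from P(spam) in [0, 1].

    Assumptions (not confirmed by the project): confidence is the percentage
    of the predicted class, the threshold is 0.5, and the ham label is "Ham".
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    is_spam = spam_probability >= 0.5
    spam_pct = round(spam_probability * 100, 2)
    ham_pct = round(100 - spam_pct, 2)
    return {
        "success": True,
        "prediction": "Spam" if is_spam else "Ham",
        "is_spam": is_spam,
        "confidence": spam_pct if is_spam else ham_pct,
        "spam_probability": spam_pct,
        "ham_probability": ham_pct,
    }


response = build_response(0.9567)
print(response)
```

Computing the ham percentage as the complement of the rounded spam percentage keeps the two values summing to 100, which avoids probability bars that visibly overshoot or undershoot.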
Edit src/train.py and modify the model_type parameter:
trainer.train_pipeline(model_type='logistic_regression')
# Options: 'naive_bayes', 'logistic_regression', 'svm', 'random_forest'
Modify vectorizer settings in src/train.py:
self.vectorizer = TfidfVectorizer(
    max_features=5000,   # Increase feature count
    ngram_range=(1, 3),  # Include unigrams through trigrams
    min_df=3,
    max_df=0.90
)
Change the split ratio in src/preprocess.py:
test_size=0.3  # 30% for testing, 70% for training
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a new branch (git checkout -b feature/improvement)
- Make your changes
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/improvement)
- Create a Pull Request
- Dataset providers and Kaggle community
- scikit-learn for machine learning tools
- Flask for web framework
- NLTK for natural language processing
If you find this project helpful, please give it a star!
For issues and questions, please open an issue on GitHub.
Issue: ModuleNotFoundError: No module named 'flask'
- Solution: Install dependencies with pip install -r requirements.txt
Issue: FileNotFoundError: spam.csv not found
- Solution: Place your dataset in data/raw/spam.csv
Issue: NLTK stopwords not found
- Solution: Run python -c "import nltk; nltk.download('stopwords')"
Issue: Port 5000 already in use
- Solution: Change the port in app/app.py: app.run(port=5001)
Candra