🎬 IMDB Sentiment Analysis

Classify movie-review sentiment (positive / negative) — full NLP pipeline, classical ML vs deep learning, and a type-your-own-review Streamlit demo.

Best model — TF-IDF + Logistic Regression · 90.9% accuracy · F1 0.910 · ROC-AUC 0.970

🎯 Overview

Given the text of a movie review, predict whether the sentiment is positive or negative — the canonical NLP text-classification task. Built on the IMDB Large Movie Review dataset: 50,000 reviews, perfectly balanced (25k positive / 25k negative).

TL;DR: After light text cleaning and TF-IDF (unigrams + bigrams), a Logistic Regression classifier reaches 90.9% accuracy, narrowly ahead of Linear SVM (90.7%) and clearly ahead of the embedding neural network (89.6%) and Multinomial Naive Bayes (87.9%). This is the classic result: a well-tuned linear model on bag-of-words features is very hard to beat on IMDB.
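
The TF-IDF + Logistic Regression setup described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `build_sentiment_pipeline`, the `max_iter` value and the four toy reviews are assumptions made for the example, while the bigram range and the 30k feature cap come from the methodology notes in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_sentiment_pipeline(max_features=30_000):
    """TF-IDF (unigrams + bigrams) feeding a Logistic Regression classifier."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=max_features),
        # max_iter is raised above the default as a precaution (assumed value).
        LogisticRegression(max_iter=1000),
    )


# Toy data for illustration only; the project itself trains on the IMDB reviews.
train_texts = [
    "a wonderful and excellent film",
    "great acting and a great story",
    "awful plot and boring pacing",
    "the worst movie a complete waste",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

pipeline = build_sentiment_pipeline()
pipeline.fit(train_texts, train_labels)

# Column 1 of predict_proba is the probability of the positive class.
p_positive = pipeline.predict_proba(["an excellent and wonderful story"])[0, 1]
```

Wrapping both steps in one pipeline keeps the vectoriser and the classifier fitted on the same training split, which avoids leaking test-set vocabulary into the features.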

🏆 Results

Four models, one shared balanced 80/20 hold-out, ranked by accuracy:

Model	Accuracy	F1	ROC-AUC
TF-IDF + Logistic Regression ⭐	0.909	0.910	0.970
TF-IDF + Linear SVM	0.907	0.908	0.970
Neural Net (word embeddings)	0.896	0.897	0.959
TF-IDF + Multinomial NB	0.879	0.881	0.950
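
To make the three reported columns concrete, here is a small pure-Python sketch of how accuracy, F1 and ROC-AUC are defined. The function names and the six-example toy data are illustrative assumptions; the notebook most likely relies on a library such as scikit-learn (for example `sklearn.metrics.roc_auc_score`) rather than hand-written versions like these.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three positive and three negative reviews with model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy and F1 depend on the 0.5 threshold, whereas ROC-AUC only depends on how the scores rank the reviews, which is why two models can tie on ROC-AUC while differing slightly on accuracy. The pairwise loop is quadratic and only suitable for small examples.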

🎮 Live Demo

A Streamlit app classifies any review you type, with a confidence score.

pip install -r requirements.txt
streamlit run app/app.py

Deploy it free on Streamlit Community Cloud (see STEPS.md).

▶️ Live demo: https://sentiment-analysis-mrvnvm7o4iwbdv6zjgabg2.streamlit.app/

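Sample predictions (the score in parentheses appears to be the predicted probability of the positive class, so values below 0.5 are labelled Negative):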

Review	Prediction
"An absolute masterpiece. Stunning performances…"	😊 Positive (0.99)
"A complete waste of time. The plot made no sense…"	😞 Negative (0.00)
"Gorgeous visuals but the pacing dragged and the ending fell flat."	😞 Negative (0.36)

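A minimal sketch of how a score could be turned into the labels shown in the sample table. It assumes the score is the positive-class probability and that the cut-off is 0.5; neither is confirmed by the app code, and `format_prediction` is a hypothetical helper name.

```python
def format_prediction(p_positive, threshold=0.5):
    """Map a positive-class probability to the label format used in the table."""
    if p_positive >= threshold:
        return f"😊 Positive ({p_positive:.2f})"
    return f"😞 Negative ({p_positive:.2f})"


mixed_review = format_prediction(0.36)
```

Under this reading, the mixed review scoring 0.36 is a fairly uncertain negative, while 0.00 and 0.99 are near-certain calls.
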
📊 Exploratory Data Analysis

The word clouds and distinctive-term analysis are intuitive — great, excellent, wonderful dominate positive reviews; worst, waste, awful, boring dominate negative ones.

The classes are balanced (so accuracy is a meaningful metric), and review length barely differs by sentiment.

🧠 Methodology highlights

Cleaning: strip HTML/<br> tags, lowercase, remove non-letters, de-duplicate (≈418 duplicate reviews removed).
Classical: TF-IDF with unigrams + bigrams (30k features) → Logistic Regression / Linear SVM / Multinomial NB.
Deep learning: a Keras word-embedding network (Embedding → GlobalAveragePooling1D → Dense).
Honest evaluation: accuracy, precision, recall, F1 and ROC-AUC on a shared split, plus a confusion matrix and an error analysis of the model's most confident mistakes (mostly sarcasm and mixed-sentiment reviews).
Deployment: the TF-IDF + Logistic Regression model is serialised with joblib and served via Streamlit.
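
The cleaning step listed above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the notebook's code: the function names are hypothetical, non-letter runs are replaced by a single space (so "don't" becomes "don t"), and duplicates are detected on the cleaned text, which may differ from how the notebook counts them.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <br />
_NON_LETTER_RE = re.compile(r"[^a-z]+")  # any run of characters outside a-z


def clean_review(text):
    """Strip HTML tags, lowercase, and replace non-letter runs with one space."""
    text = _TAG_RE.sub(" ", text)
    text = text.lower()
    text = _NON_LETTER_RE.sub(" ", text)
    return text.strip()


def deduplicate(rows):
    """Keep the first occurrence of each cleaned review, preserving order."""
    seen = set()
    unique = []
    for text, label in rows:
        cleaned = clean_review(text)
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append((cleaned, label))
    return unique


rows = [
    ("Great<br />movie!! 10/10", "positive"),
    ("great movie", "positive"),  # duplicate once cleaned
    ("Awful.", "negative"),
]
cleaned_rows = deduplicate(rows)
```

Lowercasing before the non-letter filter matters here, because the pattern only keeps `a-z` and would otherwise delete capital letters.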

🗂️ Repository structure

Ankit_Saxena_Sentiment_Analysis/
├── Sentiment_Analysis.ipynb       # Full notebook: cleaning → EDA → 4 models → error analysis
├── data/
│   └── IMDB-Dataset.csv.gz        # 50k labelled reviews (gzip; pandas reads it directly)
├── app/
│   ├── app.py                     # Streamlit demo
│   ├── tfidf_vectorizer.joblib    # Fitted TF-IDF vectoriser
│   ├── sentiment_model.joblib     # Trained Logistic Regression
│   └── model_meta.json
├── assets/                        # Figures (word clouds, comparison, …)
├── reports/                       # Written report (DOCX + PDF)
└── requirements.txt · STEPS.md · LICENSE · README.md

🚀 Run it yourself

git clone https://github.com/AnkitSaxena-AI/sentiment-analysis.git
cd sentiment-analysis
pip install -r requirements.txt
jupyter notebook Sentiment_Analysis.ipynb     # the analysis
streamlit run app/app.py                      # the demo

👤 Author

Ankit Saxena — @AnkitSaxena-AI

Dataset: IMDB Large Movie Review (Maas et al., 2011). For educational use.
