📚 Amazon Book Recommender

A production-grade hybrid book recommendation system trained on the Amazon Books dataset. Combines collaborative filtering (SVD) with NLP content similarity (TF-IDF + Sentence Embeddings) to deliver personalized recommendations that are both user-relevant and thematically consistent.

The hybrid model achieves a +85.1% NDCG@5 improvement over collaborative filtering alone.

How It Works

User ID
   │
   ▼
┌─────────────────────────────┐
│  Collaborative Filtering    │  SVD model predicts ratings for all unread books
│  (SVD — Surprise library)   │  → returns top 50 candidates
└─────────────────────────────┘
   │
   ▼
┌─────────────────────────────┐
│  NLP Content Model          │  Scores each candidate by similarity to
│  (TF-IDF / Embeddings)      │  user's top-5 most liked books
└─────────────────────────────┘
   │
   ▼
┌─────────────────────────────┐
│  Hybrid Combiner            │  Dynamic weighting based on user history size
│  CF × weight + NLP × weight │  → returns top 5 final recommendations
└─────────────────────────────┘
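
The content-scoring step in the middle box can be sketched as follows. This is only an illustration: it assumes item vectors (TF-IDF rows or sentence embeddings) are already available as a NumPy array. The function name `content_scores` and the choice to average cosine similarity over the liked books are assumptions, not taken from the project code.

```python
import numpy as np

def content_scores(item_vectors, liked_idx, candidate_idx):
    """Mean cosine similarity of each candidate to the user's liked books.

    item_vectors: (n_items, dim) array, one row per book (hypothetical input).
    liked_idx: row indices of the user's most liked books (e.g. top 5).
    candidate_idx: row indices of the CF candidates (e.g. top 50).
    """
    vectors = np.asarray(item_vectors, dtype=float)
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit[candidate_idx] @ unit[liked_idx].T  # (n_candidates, n_liked)
    # Assumption: aggregate by mean; a max over liked books is another option.
    return sims.mean(axis=1)

# Toy 2-D vectors: books 0 and 1 are liked, books 2 and 3 are candidates.
vecs = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = content_scores(vecs, liked_idx=[0, 1], candidate_idx=[2, 3])
print(scores)
```

Book 2 points in the same direction as both liked books and book 3 is orthogonal to them, so book 2 receives the higher score.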

Dynamic Weighting

User History	CF Weight	NLP Weight
< 5 liked books	80%	20%
5–19 liked books	60%	40%
20+ liked books	50%	50%
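
The weighting table above maps directly onto a small lookup function. The sketch below is a minimal illustration, not the project's code: the names `hybrid_weights`, `hybrid_score`, and `top_n` are hypothetical, and it assumes both the CF and NLP scores have already been scaled to the same [0, 1] range before they are combined.

```python
def hybrid_weights(n_liked):
    """Return (cf_weight, nlp_weight) for a user with n_liked liked books."""
    if n_liked < 5:
        return 0.8, 0.2
    if n_liked < 20:
        return 0.6, 0.4
    return 0.5, 0.5

def hybrid_score(cf_score, nlp_score, n_liked):
    """Weighted sum of the two scores (assumed to share a [0, 1] scale)."""
    cf_w, nlp_w = hybrid_weights(n_liked)
    return cf_w * cf_score + nlp_w * nlp_score

def top_n(candidates, n_liked, n=5):
    """Rank candidates given as {book_id: (cf_score, nlp_score)}."""
    ranked = sorted(
        candidates,
        key=lambda book: hybrid_score(*candidates[book], n_liked),
        reverse=True,
    )
    return ranked[:n]

# Toy candidates: A is strong on CF, B is strong on content similarity.
candidates = {"A": (0.9, 0.1), "B": (0.6, 0.9), "C": (0.2, 0.2)}
print(top_n(candidates, n_liked=3))   # short history: CF dominates
print(top_n(candidates, n_liked=25))  # long history: equal weights
```

With a short history the CF-heavy weights rank A first; with 20 or more liked books the equal weights let B overtake it.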

Results

Evaluated on 190 users with a holdout strategy: each user's last 3 rated books are hidden, and we check whether the model recommends them.

Metric	CF Only	Hybrid	Improvement
Precision@5	0.0021	0.0032	+50.0%
Recall@5	0.0044	0.0061	+40.0%
NDCG@5	0.0036	0.0067	+85.1%

Absolute values are low because the model recommends only 5 books out of a catalog of 11,839. Even so, the hybrid is 8× better than random chance.
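
For reference, the three metrics can be computed per user as in the sketch below and then averaged over all evaluated users. This assumes binary relevance (a recommended book either is or is not one of the hidden books) and the standard log2-discounted NDCG. Whether 06_evaluation.py uses exactly these definitions is an assumption, and the function name is hypothetical.

```python
import math

def precision_recall_ndcg_at_k(recommended, held_out, k=5):
    """Per-user Precision@k, Recall@k and NDCG@k with binary relevance."""
    top_k = recommended[:k]
    relevant = set(held_out)
    hits = [1 if book in relevant else 0 for book in top_k]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0

    # DCG discounts each hit by log2(position + 1), positions starting at 1.
    dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits))
    # Ideal DCG: all relevant books (up to k) placed at the top of the list.
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# One toy user: 3 hidden books, one of which appears at rank 2.
recommended = ["b1", "b2", "b3", "b4", "b5"]
held_out = ["b2", "x", "y"]
precision, recall, ndcg = precision_recall_ndcg_at_k(recommended, held_out)
print(precision, recall, ndcg)
```

With 3 hidden books and 5 recommendations per user, a single hit already gives that user a Precision@5 of 0.2, so averages in the 0.002 to 0.006 range mean hits are rare across the 190 users.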

Project Structure

amazon-recommender/
│
├── notebooks/
│   ├── 01_eda.py                      ← Exploratory data analysis
│   ├── 02_preprocess.py               ← Data cleaning & filtering
│   ├── 03_collaborative_filtering.py  ← SVD model training
│   ├── 04_nlp_model.py                ← TF-IDF NLP model
│   ├── 04b_embeddings_model.py        ← Sentence embeddings model
│   ├── 05_hybrid_model.py             ← Hybrid recommender
│   └── 06_evaluation.py               ← Metrics & comparison
│
├── src/
│   ├── data_loader.py
│   └── preprocessor.py
│
├── models/                            ← Saved trained models (not tracked in git)
│   ├── svd_model.pkl
│   ├── nlp_model.pkl
│   ├── embeddings_model.pkl
│   └── hybrid_model.pkl
│
├── data/processed/                    ← Processed data (not tracked in git)
│   ├── ratings_only.csv
│   └── train_data.csv
│
└── requirements.txt

Setup & Installation

1. Clone the repo

git clone https://github.com/Sehastrajit/Amazon-Recommendation.git
cd Amazon-Recommendation

2. Create and activate virtual environment

python3 -m venv venv
source venv/bin/activate        # Mac/Linux
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Download the dataset

Download the Amazon Books dataset and place the files in data/raw/:

reviews_Books.json.gz
meta_Books.json.gz

5. Run the pipeline in order

cd notebooks
python 01_eda.py
python 02_preprocess.py
python 03_collaborative_filtering.py
python 04_nlp_model.py
python 04b_embeddings_model.py
python 05_hybrid_model.py
python 06_evaluation.py

Dataset

Source: Amazon Books Reviews (public dataset)
Raw size: 500,000 reviews
After filtering: 197,074 ratings across 16,203 users and 11,839 books
Filter applied: Users with fewer than 5 ratings removed (reduces sparsity)

Metric	Before Filter	After Filter
Users	214,572	16,203
Avg ratings/user	1.3	12.2
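
The activity filter can be expressed in a few lines of pandas. The sketch below is illustrative only: the column name `user_id` and the helper `filter_active_users` are hypothetical, and the actual logic in 02_preprocess.py may differ.

```python
import pandas as pd

def filter_active_users(ratings, min_ratings=5, user_col="user_id"):
    """Keep only rows belonging to users with at least min_ratings ratings."""
    # Map each row to the total number of ratings made by its user.
    per_row_count = ratings[user_col].map(ratings[user_col].value_counts())
    return ratings[per_row_count >= min_ratings].reset_index(drop=True)

# Toy data: u1 has 5 ratings (kept), u2 has 2 ratings (dropped).
ratings = pd.DataFrame({
    "user_id": ["u1"] * 5 + ["u2"] * 2,
    "book_id": ["b1", "b2", "b3", "b4", "b5", "b1", "b2"],
    "rating": [5, 4, 3, 5, 2, 4, 1],
})
filtered = filter_active_users(ratings)
print(filtered["user_id"].value_counts())
```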

Models

Collaborative Filtering (SVD)

Library: scikit-surprise
50 latent factors, 20 epochs
RMSE: 0.91 | MAE: 0.70

NLP — TF-IDF

10,000 features, bigrams, English stopwords removed
Matrix: 11,839 × 10,000 (93.2% sparse)
Best for: series detection, shared vocabulary
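
The vectorizer settings above correspond to the following scikit-learn configuration. This is a sketch on made-up descriptions; which text fields the project actually vectorizes (title, description, or both) is not stated here, so the input is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up book descriptions for illustration only.
descriptions = [
    "A young wizard attends a school of magic",
    "The young wizard returns to the school of magic for a second year",
    "A practical guide to vegetable gardening in small spaces",
]

vectorizer = TfidfVectorizer(
    max_features=10_000,   # cap the vocabulary at 10,000 features
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",  # drop English stopwords
)
tfidf = vectorizer.fit_transform(descriptions)  # sparse (n_books, n_features)
sims = cosine_similarity(tfidf)                 # dense (n_books, n_books)
print(sims.round(2))
```

The first two descriptions share words and bigrams such as "young wizard" and "school magic", so they score as similar, while the gardening book shares no terms with either. This is the shared-vocabulary behaviour that makes TF-IDF useful for series detection.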

NLP — Sentence Embeddings

Model: all-MiniLM-L6-v2
384-dimensional dense vectors
Best for: cross-author thematic similarity

Tech Stack

Layer	Technology
Data processing	pandas, numpy
Collaborative filtering	scikit-surprise
NLP vectorization	scikit-learn (TF-IDF)
Semantic embeddings	sentence-transformers
Similarity	scikit-learn (cosine similarity)
API	FastAPI
Containerization	Docker
Deployment	Hugging Face Spaces
