Credit Card Fraud Detection API

XGBoost · SMOTE · SHAP · FastAPI deployment

The Problem

I started this project thinking fraud detection was a straightforward classification problem. It isn't.

The dataset has 284,807 transactions. 492 are fraud. That's 0.17%. A model that predicts "legit" for literally every transaction scores 99.83% accuracy - and catches zero frauds. Accuracy is useless here. The whole project is really about dealing with that one uncomfortable fact.
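The arithmetic behind that claim is easy to check. A minimal sketch using only the two counts quoted above (the function name is made up for illustration):

```python
# Counts quoted in the paragraph above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def always_legit_metrics(total: int, fraud: int) -> dict:
    """Metrics for a classifier that labels every transaction as legit."""
    correct = total - fraud  # every legit transaction is labelled correctly
    return {
        "fraud_rate": fraud / total,
        "accuracy": correct / total,
        "fraud_recall": 0 / fraud,  # no fraud is ever flagged
    }


metrics = always_legit_metrics(TOTAL_TRANSACTIONS, FRAUD_TRANSACTIONS)
print(f"fraud rate:   {metrics['fraud_rate']:.2%}")
print(f"accuracy:     {metrics['accuracy']:.2%}")
print(f"fraud recall: {metrics['fraud_recall']:.0%}")
```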

What I Built

A full pipeline from raw imbalanced data to a deployed REST API:

Handled class imbalance using SMOTE (not naive oversampling by duplicating rows - it generates synthetic fraud examples by interpolating between real ones)
Trained XGBoost on ~455K SMOTE-balanced training rows (454,902)
Used SHAP to understand why the model flags specific transactions, not just that it does
Wrapped everything in a FastAPI app so the model is actually callable, not just a notebook

Results

Metric	Score
ROC-AUC	0.9776
PR-AUC	0.8663
Fraud Recall	90%
Fraud Precision	51%

PR-AUC is the right metric here, not accuracy. A random classifier on this dataset scores ~0.0017 PR-AUC, because its precision is just the fraud rate at every recall level. Getting to 0.8663 on genuinely imbalanced real-world data is the actual benchmark.
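To see where the random baseline comes from, here is a rough sketch that scores the dataset's label mix with purely random numbers. It uses average precision as the PR-AUC estimate, which is an assumption: the number reported above may have been computed differently (for example with a trapezoidal area). The helper is hand-rolled and ignores tied scores.

```python
import random


def average_precision(labels, scores):
    """Mean of precision@k over the ranks k where a positive (fraud) sits."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
total, fraud = 284_807, 492
labels = [1] * fraud + [0] * (total - fraud)
random_scores = [rng.random() for _ in labels]  # scores carry no information

prevalence = fraud / total
random_ap = average_precision(labels, random_scores)
print(f"prevalence:           {prevalence:.4f}")
print(f"random-classifier AP: {random_ap:.4f}")
```

The random score lands close to the prevalence rather than exactly on it, since only 492 positives are being ranked.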

The 51% precision means roughly half the fraud alerts are false alarms - which sounds bad until you realize that missing a fraud costs ~₹122 versus ₹5 for a false alarm. I ran an explicit business cost optimization to find the threshold that minimizes total expected loss. It landed at 0.50, which suggests something useful: the model's probability scores are well-separated enough that moving the threshold buys nothing, so no threshold tricks are needed.
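A cost-based threshold search of this kind can be sketched in a few lines. The two costs come from the paragraph above; everything else (function names, the candidate grid, and especially the toy labels and scores) is hypothetical and only there to make the sketch runnable. It is not the notebook's code.

```python
FN_COST = 122  # cost of a missed fraud, from the text above
FP_COST = 5    # cost of a false alarm, from the text above


def total_cost(labels, scores, threshold):
    """Total loss when every score >= threshold is flagged as fraud."""
    missed = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    false_alarms = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return missed * FN_COST + false_alarms * FP_COST


def best_threshold(labels, scores, candidates):
    """Candidate threshold with the lowest total cost (first one on ties)."""
    return min(candidates, key=lambda t: total_cost(labels, scores, t))


# Hypothetical toy data, NOT the project's predictions.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.01, 0.02, 0.05, 0.10, 0.30, 0.45, 0.55, 0.90, 0.97, 0.60]
candidates = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95

chosen = best_threshold(labels, scores, candidates)
print(chosen, total_cost(labels, scores, chosen))
```

For perfectly calibrated probabilities the break-even point would be 5 / (5 + 122), about 0.04. A model trained on SMOTE-balanced data is not calibrated to the real fraud rate, which is one reason to search for the threshold empirically instead of plugging in that formula.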

Live API

Three tiers: HIGH (≥0.80) blocks immediately, MEDIUM (0.40 to <0.80) triggers OTP, LOW (<0.40) approves.
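The tier logic amounts to two comparisons. A minimal sketch, assuming the boundaries quoted above; the function name and the action strings are illustrative, not taken from main.py:

```python
def risk_tier(probability: float) -> tuple:
    """Map a fraud probability to a (tier, action) pair."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    if probability >= 0.80:
        return ("HIGH", "block")
    if probability >= 0.40:
        return ("MEDIUM", "otp")
    return ("LOW", "approve")


for p in (0.95, 0.80, 0.79, 0.40, 0.39, 0.0):
    print(p, risk_tier(p))
```

Comparing against the lower bound of each tier, from highest to lowest, avoids leaving a gap between 0.79 and 0.80.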

Handling the Imbalance

The naive fix is to just duplicate fraud rows. SMOTE is better - it creates new synthetic fraud examples by picking two real fraud transactions and generating a point somewhere between them in feature space. Less memorization, more generalization.
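The interpolation idea can be shown without any library. This is a simplified sketch of one SMOTE step, not imbalanced-learn's implementation: it picks a minority point, chooses one of its k nearest minority neighbours, and places a new point at a random position on the line between them. The toy points and the function name are made up.

```python
import math
import random


def smote_sample(minority, k=5, rng=random):
    """Create one synthetic point between a minority point and a near neighbour."""
    base = rng.choice(minority)
    # k nearest other minority points by Euclidean distance.
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical 2-D "fraud" points, just to exercise the function.
fraud_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)
synthetic = [smote_sample(fraud_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a segment between two real minority points, which is why it adds variety that plain duplication cannot.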

One thing I was careful about: SMOTE only on training data. Never touch the test set. If you apply SMOTE before splitting, synthetic examples from the same neighbourhood end up in both train and test - that's data leakage and your evaluation numbers are lies.

Split	Rows	Fraud %
Train before SMOTE	227,845	0.17%
Train after SMOTE	454,902	50.0%
Test (real world)	56,962	0.17%
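The table is internally consistent, and the per-split fraud counts can be backed out from it. A small check, assuming SMOTE balanced the training set to exactly 1:1 (the table shows 50.0%), so half of the post-SMOTE rows are the original legit transactions:

```python
total_rows, total_fraud = 284_807, 492
train_rows, test_rows = 227_845, 56_962
train_rows_after_smote = 454_902

# The two splits account for every transaction.
assert train_rows + test_rows == total_rows

# Assumption: exact 1:1 balance, so legit rows are half of the resampled set.
legit_train = train_rows_after_smote // 2
fraud_train = train_rows - legit_train   # real frauds in the training split
fraud_test = total_fraud - fraud_train   # real frauds left for the test split
synthetic_frauds = legit_train - fraud_train

print(f"train frauds: {fraud_train} ({fraud_train / train_rows:.2%})")
print(f"test frauds:  {fraud_test} ({fraud_test / test_rows:.2%})")
print(f"synthetic frauds added by SMOTE: {synthetic_frauds}")
```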

Model Performance

[Figure: Confusion Matrix - HAL NS XGBoost Prediction]

90% recall means the model catches 9 out of 10 real frauds. The 10% it misses are the expensive ones - that's where an ensemble approach (adding an Isolation Forest for anomaly detection) would help in a production setup.
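For a sense of scale, the reported rates can be turned into approximate counts. This is a back-of-envelope sketch, not the actual confusion matrix: it assumes about 98 real frauds in the test set (0.17% of 56,962 rows is roughly that) and uses the rounded 90% recall and 51% precision, so the results are only approximate.

```python
test_frauds = 98   # assumption: inferred from the split sizes, not reported
recall = 0.90      # reported fraud recall
precision = 0.51   # reported fraud precision

caught = round(recall * test_frauds)   # true positives
missed = test_frauds - caught          # false negatives
# precision = caught / (caught + false_alarms), solved for false_alarms
false_alarms = round(caught * (1 - precision) / precision)

print(f"caught: {caught}, missed: {missed}, false alarms: {false_alarms}")
```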

SHAP - Understanding What the Model Actually Learned

This was the most interesting part. You can train a model and report metrics, but if you can't explain a specific decision you can't deploy it anywhere that matters. Regulators in India (RBI guidelines) and Europe (GDPR) both push for explainability in automated financial decisions.

V4 and V14 completely dominate. Their mean SHAP values are nearly 3× the next feature. Everything else is noise by comparison.

The beeswarm shows the direction. For V14: low values (blue dots) push hard toward fraud, high values push toward legit. It's an inverted relationship - which in the real world probably corresponds to something like "low transaction approval history" or "unusual merchant type," but the bank anonymized the raw features so we can't know for sure.

V17 has the same pattern. The model's primary fraud fingerprint is: V14 very low + V17 very low + V10 very low, all at the same time. Any one of them alone isn't enough. The combination is what triggers it.

For transaction 77348 specifically (model confidence 99.99%):

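Baseline: −0.039 in log-odds (almost exactly neutral - makes sense, the model was trained on the 50/50 SMOTE-balanced set; a baseline reflecting the real 0.17% fraud rate would sit near −6.4)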
V14 alone: +4.91 push toward fraud
V17: +1.64
V10: +1.27
Final score: 8.836 (log-odds; the listed features account for most of it, the remaining features add the rest) → fraud

That's the kind of breakdown a risk officer can actually act on.
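SHAP values are additive: the baseline plus every feature's contribution equals the model's raw output. A minimal sketch of that bookkeeping for the numbers above, assuming they are in log-odds (the usual output for TreeExplainer on an XGBoost classifier, but not confirmed here) and that a sigmoid turns the final score into a probability:

```python
import math

baseline = -0.039
listed_contributions = {"V14": 4.91, "V17": 1.64, "V10": 1.27}
final_log_odds = 8.836

# Part of the score explained by the three listed features.
listed_total = baseline + sum(listed_contributions.values())
# Whatever is left must come from the features not listed above.
remainder = final_log_odds - listed_total

# Sigmoid converts log-odds into a probability.
probability = 1.0 / (1.0 + math.exp(-final_log_odds))

print(f"baseline + listed features: {listed_total:.3f}")
print(f"all other features:         {remainder:.3f}")
print(f"fraud probability:          {probability:.2%}")
```

The three named features explain most of the score, and the sigmoid of 8.836 matches the 99.99% confidence quoted for this transaction.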

The API Internals

JSON transaction arrives
    → Pydantic validates all 30 fields (auto-rejects malformed requests)
    → StandardScaler normalizes Amount + Time (same scaler fitted on training data)
    → Features aligned to exact training column order (wrong order = garbage predictions)
    → XGBoost.predict_proba() → fraud probability
    → Business rule layer maps probability to verdict + action
    → JSON response

The scaler and feature order are both saved as artifacts alongside the model. Without them, inference is broken even if the model weights are correct.
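The scaling and alignment steps are where silent bugs tend to hide, so here is a rough stand-alone sketch of them. It is not the code in main.py: the real app uses Pydantic and the saved StandardScaler, while this version uses plain dicts. The column order (Time, V1..V28, Amount) and the mean/std values are assumptions for illustration; the real ones would come from feature_names.json and scaler.pkl.

```python
# Assumed training column order; the real one is stored in feature_names.json.
FEATURE_ORDER = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

# Hypothetical (mean, std) pairs standing in for the fitted StandardScaler.
SCALER_STATS = {"Time": (100.0, 50.0), "Amount": (10.0, 5.0)}


def prepare_row(payload: dict, feature_order: list, scaler_stats: dict) -> list:
    """Validate, scale, and order one transaction for the model."""
    missing = [name for name in feature_order if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    row = []
    for name in feature_order:  # iterate in training order, not payload order
        value = float(payload[name])
        if name in scaler_stats:
            mean, std = scaler_stats[name]
            value = (value - mean) / std  # same transform as at training time
        row.append(value)
    return row


# Keys arrive in arbitrary order; the output follows FEATURE_ORDER regardless.
payload = {f"V{i}": float(i) for i in range(28, 0, -1)}
payload.update({"Amount": 20.0, "Time": 200.0})
row = prepare_row(payload, FEATURE_ORDER, SCALER_STATS)
print(len(row), row[0], row[1], row[-1])
```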

Tech Stack

Data: Credit Card Fraud Detection dataset (ULB Machine Learning Group, hosted on Kaggle), 284,807 transactions, 30 features
Imbalance handling: SMOTE (imbalanced-learn)
Model: XGBoost Classifier
Explainability: SHAP (TreeExplainer)
API: FastAPI + uvicorn
Deployment (demo): ngrok tunnel, Google Colab

Project Structure

fraud-detection-api/
├── fraud_detection.ipynb    # Full pipeline notebook
├── main.py                  # FastAPI app
├── fraud_model.pkl          # Trained model
├── feature_names.json       # Column order for inference
├── scaler.pkl               # Fitted scaler
└── requirements.txt

Limitations & Next Steps

Dataset is from 2013 European cardholders - fraud patterns shift over time (concept drift), so a production model would need monthly, not annual, retraining
V1-V28 are PCA-anonymized - impossible to give feature names real business meaning without access to the original variables
ngrok URL is ephemeral - production deployment would use Docker + GCP Cloud Run or AWS Lambda
Logical next steps: walk-forward retraining pipeline, Isolation Forest ensemble for unseen fraud patterns, MLflow for experiment tracking, Docker for reproducibility

Data Source

Credit Card Fraud Detection dataset, released by the ULB Machine Learning Group and hosted on Kaggle. Published by Dal Pozzolo et al., IEEE Symposium on Computational Intelligence, 2015. Features V1-V28 are PCA-transformed; raw features are confidential.
