API URL
https://fraud-detection-api-t7f4.onrender.com
Interactive API Documentation
https://fraud-detection-api-t7f4.onrender.com/docs
Financial fraud causes significant monetary losses for digital payment platforms. The objective of this project is to build a machine learning system capable of identifying potentially fraudulent transactions in real time while minimizing false positives.
The dataset consists of synthetic mobile payment transactions generated by the PaySim simulator, which models financial activities of a mobile money service over a 30-day period.
The dataset contains transaction information such as:
- Transaction type
- Transaction amount
- Origin account balances
- Destination account balances
- Fraud labels
Fraud detection is a highly imbalanced classification problem.
Fraudulent transactions account for approximately 0.13% of all transactions, while legitimate transactions account for more than 99% of the data.
This imbalance makes traditional accuracy metrics misleading and requires careful model evaluation and threshold selection.
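To see why accuracy is misleading at this class ratio, consider a baseline that labels every transaction as legitimate. The sketch below is illustrative only: the sample size of 100,000 is a hypothetical figure chosen to match the roughly 0.13% fraud rate quoted above, not a count taken from the dataset.

```python
# Hypothetical sample sized to match the ~0.13% fraud rate quoted above.
n_total = 100_000
n_fraud = 130
n_legit = n_total - n_fraud

# Baseline "model": predict legitimate (0) for every transaction.
true_negatives = n_legit    # every legitimate transaction is labelled correctly
false_negatives = n_fraud   # every fraudulent transaction is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9987
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```

The baseline scores 99.87% accuracy while catching no fraud at all, which is why precision and recall on the fraud class are more informative here.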
To capture suspicious transaction patterns beyond the raw dataset, several domain-inspired features were engineered:
- Destination balance consistency: measures whether the destination account balances are consistent once the transferred amount is taken into account.
- Origin balance change: measures the expected balance change of the origin account after the transaction.
- Amount-to-balance ratio: represents the proportion of the account balance being transferred. Large ratios may indicate suspicious behavior.
- Zero-balance flag: flags transactions where the origin account balance is zero, which may provide useful fraud signals.
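The four engineered features described above could be computed roughly as follows. This is a minimal sketch, not the project's actual code: the input keys follow the standard PaySim column names, `amount_to_balance_ratio` and `balancediff_Org_including_amount` are names that appear elsewhere in this document, and the remaining two feature names, all four formulas, and the choice of the pre-transaction balance for the zero flag are assumptions made for illustration.

```python
def engineer_features(tx: dict) -> dict:
    """Return a copy of one transaction with illustrative engineered features added."""
    amount = tx["amount"]
    old_orig, new_orig = tx["oldbalanceOrg"], tx["newbalanceOrig"]
    old_dest, new_dest = tx["oldbalanceDest"], tx["newbalanceDest"]

    features = dict(tx)
    # Destination consistency: 0 when the destination balance grew by exactly `amount`.
    # (hypothetical feature name, assumed formula)
    features["balancediff_dest_including_amount"] = old_dest + amount - new_dest
    # Origin consistency: 0 when the origin balance fell by exactly `amount`.
    # (name from this document, assumed formula)
    features["balancediff_Org_including_amount"] = old_orig - amount - new_orig
    # Share of the origin balance being moved; guard against division by zero.
    # (name from this document, assumed formula)
    features["amount_to_balance_ratio"] = amount / old_orig if old_orig > 0 else 0.0
    # 1 when the origin account held nothing before the transaction.
    # (hypothetical feature name)
    features["orig_balance_is_zero"] = int(old_orig == 0)
    return features


# Made-up transaction: most of the origin balance leaves, but the
# destination balance does not move.
example = engineer_features({
    "amount": 9000.0,
    "oldbalanceOrg": 10000.0, "newbalanceOrig": 1000.0,
    "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
})
print(example["amount_to_balance_ratio"])            # 0.9
print(example["balancediff_dest_including_amount"])  # 9000.0
```

In a pandas-based pipeline the same arithmetic would normally be applied column-wise to a DataFrame rather than one record at a time.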
Multiple machine learning algorithms were evaluated:
- Logistic Regression
- Random Forest
- XGBoost
Model performance was compared using classification metrics suitable for imbalanced datasets.
XGBoost achieved the best overall performance and was selected as the final production model.
Binary classifiers typically apply a default decision threshold of 0.5 to the predicted probability.
To improve fraud detection performance, threshold analysis was performed to evaluate the trade-off between precision and recall.
The optimal threshold was selected based on validation performance and business requirements, resulting in improved fraud identification compared to the default threshold.
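The kind of threshold analysis described above can be sketched as a sweep over candidate thresholds, computing precision and recall at each one. The labels and scores below are made up for illustration and are not the project's validation data; `precision_recall_at` is a hypothetical helper name.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Compute (precision, recall) when scores >= threshold are flagged as fraud."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if flagged and label == 1:
            tp += 1          # fraud correctly flagged
        elif flagged and label == 0:
            fp += 1          # legitimate transaction flagged (false alarm)
        elif not flagged and label == 1:
            fn += 1          # fraud missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up validation labels (1 = fraud) and model scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.55, 0.60, 0.65, 0.75, 0.80, 0.30, 0.95]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.7 lifts precision from 0.67 to 1.00 while recall drops from 1.00 to 0.75, which is the trade-off the threshold choice has to balance. With scikit-learn available, `sklearn.metrics.precision_recall_curve` computes the same quantities across all thresholds.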
The trained model was deployed as a REST API using FastAPI.
- User submits transaction data.
- FastAPI validates the request using Pydantic schemas.
- Feature engineering transformations are applied.
- Features are aligned with training-time feature columns.
- The XGBoost model generates fraud probabilities.
- The optimized threshold converts probabilities into final predictions.
- The API returns both fraud probability and fraud prediction.
{
  "fraud_probability": 0.87,
  "prediction": 1
}

Where:
- prediction = 1 indicates fraud
- prediction = 0 indicates non-fraud
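The last steps of the request flow (aligning features with the training-time columns, scoring, and applying the threshold) might look like the sketch below. It is written without FastAPI or XGBoost so that it runs on its own: `stub_model` is a stand-in for the trained model, and `FEATURE_COLUMNS`, the one-hot `type_*` columns, and the use of `>=` at the threshold are assumptions. The 0.7 threshold is the value reported in this document.

```python
# Hypothetical training-time column order; the real list would be saved with the model.
FEATURE_COLUMNS = ["amount", "oldbalanceOrg", "newbalanceOrig",
                   "type_TRANSFER", "type_CASH_OUT"]
THRESHOLD = 0.7  # selected threshold reported in this document


def align_features(features: dict, columns: list) -> list:
    """Order features as at training time; missing columns become 0, extras are dropped."""
    return [float(features.get(name, 0.0)) for name in columns]


def stub_model(row: list) -> float:
    """Stand-in for the trained XGBoost model; always returns the same probability."""
    return 0.87


def to_response(fraud_probability: float, threshold: float = THRESHOLD) -> dict:
    """Convert a probability into the response shape shown above."""
    return {
        "fraud_probability": round(fraud_probability, 2),
        "prediction": int(fraud_probability >= threshold),  # assumed: >= counts as fraud
    }


# Made-up, already-engineered request features with one unexpected field.
request_features = {
    "amount": 9000.0, "oldbalanceOrg": 10000.0, "newbalanceOrig": 1000.0,
    "type_TRANSFER": 1, "unexpected_field": 5,
}
row = align_features(request_features, FEATURE_COLUMNS)
print(row)                           # [9000.0, 10000.0, 1000.0, 1.0, 0.0]
print(to_response(stub_model(row)))  # {'fraud_probability': 0.87, 'prediction': 1}
```

Aligning against a saved column list keeps the model input stable even when a request omits an optional field or includes one the model never saw.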
flowchart LR
A[Client Request] --> B[FastAPI]
B --> C[Feature Engineering]
C --> D[XGBoost Model]
D --> E[Threshold Optimization]
E --> F[Prediction Response]
- Python
- Pandas
- NumPy
- Scikit-learn
- XGBoost
- Joblib
- FastAPI
- Pydantic
- Model monitoring and drift detection
- Automated retraining pipeline
- Experiment tracking using MLflow
- Real-time streaming predictions
- CI/CD integration using GitHub Actions
- Kubernetes deployment for scalability
The dataset is highly imbalanced, with fraudulent transactions representing approximately 0.13% of all observations.
This means traditional accuracy metrics are not sufficient for evaluating model performance, and greater emphasis should be placed on precision, recall, and F1-score.
Feature importance analysis showed that the following variables contributed significantly to fraud detection:
- newbalanceOrig
- amount_to_balance_ratio
- balancediff_Org_including_amount
These features capture abnormal balance movements and unusual transaction behavior that are commonly associated with fraudulent activity.
The selected production model is XGBoost.
Performance at the selected threshold of 0.7:
| Metric | Value |
|---|---|
| Precision | 95% |
| Recall | 94% |
| F1 Score | 94% |
- The model correctly identifies approximately 94% of fraudulent transactions.
- A precision of 95% indicates that most transactions flagged as fraud are truly fraudulent, reducing unnecessary investigations.
- The selected threshold balances fraud detection capability with operational efficiency.
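As a consistency check on the table above, F1 is the harmonic mean of precision and recall. The second half of the sketch translates the reported rates into counts for a hypothetical batch of 1,000 fraudulent transactions; that batch size is an assumption for illustration, not a figure from the evaluation.

```python
precision = 0.95  # from the table above
recall = 0.94     # from the table above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.945

# Hypothetical batch containing 1,000 fraudulent transactions.
n_fraud = 1_000
caught = round(recall * n_fraud)       # frauds the model flags
missed = n_fraud - caught              # frauds that slip through
flagged_total = caught / precision     # all flagged transactions implied by the precision
false_alarms = round(flagged_total - caught)

print(caught, missed, false_alarms)    # 940 60 49
```

The computed F1 of about 0.945 agrees with the reported 94%, and under these assumptions roughly 49 legitimate transactions would be sent for review for every 940 frauds caught.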
Deploy the XGBoost model as an initial fraud screening layer and route flagged transactions for additional verification before approval.
- As fraud patterns evolve, model performance may degrade over time if retraining is not performed regularly.
- Customer transaction behavior may change due to:
  - New payment methods
  - Economic conditions
  - Seasonal effects
  - Product changes
- The PaySim dataset is a simulated representation of financial transactions. Although it captures many realistic fraud patterns, real-world transaction data may contain additional complexities not represented in the dataset.
- Despite strong model performance, some fraudulent transactions may still go undetected because of the class imbalance.
- Model performance depends on the chosen classification threshold. Different business objectives may require adjusting the threshold to prioritize either fraud detection (higher recall) or fewer false alarms (higher precision).