An industry-grade machine learning system designed to automate credit risk adjudication while maintaining high auditability and statistical rigor. This project features a modular Python architecture, automated training pipelines, and a production-ready REST API.
- Scientific Decisioning: Evaluate the statistical significance of model improvements (XGBoost vs. Baseline) to ensure deployment is justified.
- Auditability: Maintain an immutable record of model versions, training policies, and feature schemas.
- Risk-Centric Optimization: Model performance is optimized for the detection of high-risk applicants to minimize potential defaults.
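As a rough illustration of the auditability goal, the sketch below builds a run manifest that records the model version, training policy, and feature schema version. It is not the project's artifacts.py: the `RunManifest` class, its field names, and the helper functions are hypothetical, and the version format (UTC timestamp plus dataset label) is only inferred from the sample `model_version` value shown in the /explain response later in this README.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned once the object exists
class RunManifest:
    """Hypothetical manifest layout; the real schema may differ."""
    model_version: str
    model: str
    dataset_version: str
    training_data_policy: str
    feature_schema_version: str
    notes: str = ""


def make_model_version(dataset_version: str, now: datetime) -> str:
    """Join a UTC timestamp and the dataset label (assumed format)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{dataset_version}"


def build_manifest(model, dataset_version, training_data_policy,
                   feature_schema_version, notes="", now=None):
    """Create a manifest for one training run."""
    now = now or datetime.now(timezone.utc)
    return RunManifest(
        model_version=make_model_version(dataset_version, now),
        model=model,
        dataset_version=dataset_version,
        training_data_policy=training_data_policy,
        feature_schema_version=feature_schema_version,
        notes=notes,
    )


def manifest_to_json(manifest: RunManifest) -> str:
    """Serialize with sorted keys so manifests diff cleanly between runs."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)
```

A frozen dataclass only prevents accidental mutation in memory. Immutability of the stored record would still depend on how the file is written and protected.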
CreditRisk/
├── frontend/ # React + TypeScript inference UI
├── artifacts/ # Saved model pipelines and manifest metadata
├── data/ # Training and test CSV files
├── notebooks/ # Exploration, comparison, and analysis notebooks
├── src/creditrisk/ # Application package
│ ├── main.py # Training pipeline entrypoint
│ ├── api.py # FastAPI inference service
│ ├── data.py # Data loading and type casting
│ ├── preprocess.py # Feature engineering and preprocessing
│ ├── train.py # Model training helpers
│ ├── evaluate.py # Evaluation logic
│ ├── artifacts.py # Artifact persistence and versioning
│ └── logger.py # Logging setup
└── tests/ # API and behavior tests
Before moving to production, we performed a comparative analysis between the baseline (Logistic Regression) and the challenger (XGBoost) to ensure the performance gain was not due to random noise.
| Metric | Logistic Regression | XGBoost | Delta |
|---|---|---|---|
| Accuracy | 0.8897 | 0.9266 | +3.7% |
| ROC-AUC | 0.9626 | 0.9836 | +2.1% |
| F1-Score | 0.88 | 0.92 | +4.5% |
Note: F1-score is reported for Class 0 (Loan Rejected). In a credit risk context, we prioritize the precision and recall of high-risk identifications to minimize financial exposure.
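To make the note concrete, here is a minimal sketch of computing precision, recall, and F1 with Class 0 treated as the class of interest. The function name and the toy labels are illustrative only and are not taken from evaluate.py. With scikit-learn, `f1_score(y_true, y_pred, pos_label=0)` should give the same F1 value.

```python
def class_metrics(y_true, y_pred, positive_label=0):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up): 0 = loan rejected, 1 = loan approved
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]
metrics = class_metrics(y_true, y_pred, positive_label=0)
```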
- McNemar’s Test: A McNemar’s test on the paired model errors yielded a $p$-value of $4.1463 \times 10^{-38}$. This indicates that the two models' error distributions differ significantly at the $\alpha = 0.05$ level.
- Bootstrap Analysis: We ran $10{,}000$ bootstrap iterations to compute a 95% confidence interval for the difference in accuracy between XGBoost and Logistic Regression. The interval is $[0.0315, 0.0423]$, with an effect size of $3.67\%$. Since the interval does not contain zero, the improvement is statistically robust.
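The two checks above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook code: it uses the continuity-corrected chi-square form of McNemar's test and a percentile bootstrap over paired rows, and the function names are hypothetical. The analysis in this repo may have used a different variant (for example an exact binomial test).

```python
import math
import random


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test on paired predictions.

    b counts rows where only model A is correct, c where only model B is correct.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in rows if a == t and m != t)
    c = sum(1 for t, a, m in rows if a != t and m == t)
    if b + c == 0:
        return b, c, 0.0, 1.0  # the models never disagree on correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return b, c, stat, p_value


def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_iter=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A), resampling paired rows."""
    rng = random.Random(seed)
    # Per-row difference in correctness: +1 only B right, -1 only A right, 0 otherwise
    diffs = [int(m == t) - int(a == t) for t, a, m in zip(y_true, pred_a, pred_b)]
    n = len(diffs)
    samples = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_iter)
    )
    lower = samples[int((alpha / 2) * n_iter)]
    upper = samples[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return sum(diffs) / n, lower, upper
```

Resampling whole rows keeps each pair of predictions together, which is what makes the comparison paired. If the resulting interval excludes zero, that supports the same conclusion as the McNemar test.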
- Data & Storage
Retrieval: data.py implements memory-optimized type casting and schema validation for CSV/SQL sources.
Artifacts: Fully fitted sklearn pipelines (preprocessor + classifier) are serialized with metadata manifests to artifacts/.
- The Training Pipeline (main.py)
Decoupled training logic allows for high-velocity experimentation:
# Example: Retraining due to data drift
uv run python -m creditrisk.main --model xgb --dataset-version v1 --training-data-policy initial --notes "Adjusting for drift"
- Reliability & Evaluation
API Tests: Automated tests in tests/test_api.py validate endpoint responses and input edge cases.
Observability: Structured JSON logging in api.py provides an audit trail for every prediction, including the model version and probability score.
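The structured logging described under Observability could look roughly like the following. This is a sketch using only the standard `logging` module. The formatter class, logger name, and log field names (`ts`, `level`, `message`) are assumptions, and the actual setup in logger.py and api.py may differ.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    AUDIT_FIELDS = ("model_version", "pred", "proba")  # assumed field names

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record
        for key in self.AUDIT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def get_audit_logger(name="creditrisk.audit", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_prediction(logger, model_version, pred, proba):
    """Emit one audit record for a single prediction."""
    logger.info(
        "prediction",
        extra={"model_version": model_version, "pred": pred, "proba": proba},
    )
```

Calling `log_prediction(get_audit_logger(), "20260413T060654Z-data_v1", 1, 0.92)` would write a single JSON line that a log pipeline can filter by `model_version`.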
The training pipeline is the same code used by Docker Compose. Run it directly when you want to retrain or experiment with different arguments:
uv run python -m creditrisk.main
Pass the same CLI flags you would pass in Compose:
uv run python -m creditrisk.main \
--model xgb \
--data data_v1.csv \
--dataset-version data_v1 \
--training-data-policy combined \
--feature-schema-version v1 \
--notes "manual retrain"
Common options:
- --data: Training CSV file name under data/
- --model: lr, svc, or xgb
- --dataset-version: Version label recorded in the manifest
- --training-data-policy: initial, combined, or new_only
- --feature-schema-version: Feature-engineering version label
- --notes: Free-text run metadata
- --no-persist: Train without writing artifacts
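For readers who want to see how these options fit together, here is a hedged `argparse` sketch that accepts the flags listed above. It is an assumption about how such a parser could be written, not a copy of main.py. In particular, the real defaults are not documented here, so the sketch leaves them unset.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags; defaults are deliberately left unset."""
    parser = argparse.ArgumentParser(prog="creditrisk.main")
    parser.add_argument("--data", help="Training CSV file name under data/")
    parser.add_argument("--model", choices=["lr", "svc", "xgb"],
                        help="Model family to train")
    parser.add_argument("--dataset-version",
                        help="Version label recorded in the manifest")
    parser.add_argument("--training-data-policy",
                        choices=["initial", "combined", "new_only"],
                        help="Which data the run was trained on")
    parser.add_argument("--feature-schema-version",
                        help="Feature-engineering version label")
    parser.add_argument("--notes", default="", help="Free-text run metadata")
    parser.add_argument("--no-persist", action="store_true",
                        help="Train without writing artifacts")
    return parser


# Example: parse the same flags shown in the manual retrain command
args = build_parser().parse_args([
    "--model", "xgb",
    "--data", "data_v1.csv",
    "--dataset-version", "data_v1",
    "--training-data-policy", "combined",
    "--feature-schema-version", "v1",
    "--notes", "manual retrain",
])
```

Restricting `--model` and `--training-data-policy` with `choices` makes a mistyped value fail before any training starts.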
This project is built to eliminate "it works on my machine" issues while avoiding unnecessary retraining cost in cloud environments.
API Container: Starts independently and serves the latest saved artifact from artifacts/.
Pipeline Container (optional): Runs only when you explicitly request training.
# Start API only (default; no training)
docker compose up --build api
# Run one-off training only when needed
docker compose --profile train run --rm pipeline
# Run one-off training with custom CLI args (same flags as local main.py)
docker compose --profile train run --rm pipeline --model lr --data data_v1.csv
# Example with additional metadata flags
docker compose --profile train run --rm pipeline \
--model xgb \
--data data_v1.csv \
--dataset-version data_v1 \
--training-data-policy combined \
--feature-schema-version v1 \
--notes "manual retrain before release"
The repo includes a Vite + React + TypeScript UI for single-applicant inference and explanation review.
- Submit borrower attributes via a validated form (same payload shape as API request model).
- Call POST /explain and display:
  - prediction summary (pred, proba, model_version)
  - grouped SHAP attributions (feature_explanations)
  - transformed-column SHAP attributions (transformed_feature_explanations)
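The relationship between the two attribution lists can be illustrated with a small sketch. Because SHAP values are additive, the per-column values of a one-hot encoded feature can be summed to get one value for the original feature. The function below is a guess at that grouping step, not the backend's implementation: the prefix-matching rule, the `raw_features` argument, and the `numeric__` prefix in the sample data are assumptions.

```python
from collections import defaultdict


def group_shap(transformed, raw_features):
    """Sum transformed-column SHAP values back onto their originating raw feature."""
    # Try longer names first so a longer feature name wins over a shorter prefix
    candidates = sorted(raw_features, key=len, reverse=True)
    totals = defaultdict(float)
    for item in transformed:
        # Drop the "<transformer>__" prefix, e.g. "categorical__"
        column = item["feature"].split("__", 1)[-1]
        owner = next(
            (f for f in candidates if column == f or column.startswith(f + "_")),
            column,  # fall back to the column name if nothing matches
        )
        totals[owner] += item["shap_value"]
    grouped = [
        {"feature": name, "shap_value": value, "abs_shap_value": abs(value)}
        for name, value in totals.items()
    ]
    # Largest absolute contribution first
    return sorted(grouped, key=lambda g: g["abs_shap_value"], reverse=True)


# Sample values; the "numeric__" prefix is hypothetical
transformed = [
    {"feature": "numeric__loan_amount", "shap_value": 0.18},
    {"feature": "categorical__occupation_status_employed", "shap_value": -0.09},
    {"feature": "categorical__occupation_status_self_employed", "shap_value": -0.02},
]
grouped = group_shap(transformed, ["loan_amount", "occupation_status"])
```

Here the two occupation_status columns combine into a single grouped attribution, which is easier for an adjudicator to read than one value per encoded column.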
# Build and run API + frontend together
docker compose up --build api frontend
The frontend runs as a Vite dev server inside Docker, and Vite proxies /explain to the API container over the Compose network.
Direct browser access:
- Frontend: http://localhost:5173
- API: http://localhost:8000
If you want to bypass the proxy and call the API directly, set:
VITE_API_BASE=http://localhost:8000 npm run dev
For local development, this repo can run frontend and backend together in Docker Compose.
For production, they are usually split:
- Frontend: static hosting or an edge platform like Vercel/Netlify
- Backend/core logic: API service on a container host or serverless runtime such as Cloud Run, ECS/Fargate, Fly.io, or Render
The frontend should only talk to the backend API. Model inference, SHAP explanation, and any training logic stay in the backend layer or in separate worker jobs.
Cloud Migration: Transitioning compute to AWS Fargate and storage to S3 for higher scalability.
Observability: Integrating Prometheus metrics to track real-time drift in the debt_to_income_ratio feature.
Explainability: Expose SHAP-based feature attributions through a dedicated /explain endpoint for adjudicator transparency.
Example /explain response:
{
"pred": 1,
"proba": 0.92,
"model_version": "20260413T060654Z-data_v1",
"expected_value": -0.41,
"feature_explanations": [
{"feature": "loan_amount", "shap_value": 0.18, "abs_shap_value": 0.18},
{"feature": "occupation_status", "shap_value": -0.11, "abs_shap_value": 0.11}
],
"transformed_feature_explanations": [
{"feature": "categorical__occupation_status_employed", "shap_value": -0.09, "abs_shap_value": 0.09},
{"feature": "categorical__occupation_status_self_employed", "shap_value": -0.02, "abs_shap_value": 0.02}
],
"transformed_feature_count": 31
}
Production Reliability: Calculate inference latency (
Core: Python 3.11, uv, scikit-learn, XGBoost, shap.
Serving: FastAPI, Uvicorn, Pydantic.
DevOps: Docker, Docker Compose, Pytest.