Arkam11/smartflow-data-platform

SmartFlow Data Platform

An end-to-end data engineering platform covering ingestion, transformation, data modelling, machine learning, orchestration and CI/CD.

What this project covers

  • ETL/ELT pipeline with Apache Spark (Bronze → Silver → Gold)
  • Data Mart with star schema (dbt + PostgreSQL)
  • ML pipeline for customer churn prediction (scikit-learn)
  • AI-powered data annotation using LLM APIs
  • Data quality validation with Great Expectations
  • Pipeline orchestration with Apache Airflow
  • Containerisation with Docker and Docker Compose
  • CI/CD with GitHub Actions
  • Unit testing with pytest
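
The Bronze → Silver → Gold flow listed above can be illustrated with a minimal sketch. This is plain Python standing in for the Spark jobs, not the project's actual pipeline code: the records, the field names (`customer_id`, `amount`, `country`) and the functions `to_silver` and `to_gold` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records standing in for the Bronze layer (data as ingested).
bronze = [
    {"customer_id": "c1", "amount": "10.50", "country": "us"},
    {"customer_id": "c1", "amount": "4.50", "country": "US"},
    {"customer_id": "c2", "amount": "not-a-number", "country": "de"},
    {"customer_id": None, "amount": "3.00", "country": "fr"},
]


def to_silver(rows):
    """Clean and type the raw rows; drop records that cannot be repaired."""
    silver_rows = []
    for row in rows:
        if not row["customer_id"]:
            continue  # no key to join on, so the record is unusable
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount is not numeric
        silver_rows.append(
            {
                "customer_id": row["customer_id"],
                "amount": amount,
                "country": row["country"].upper(),
            }
        )
    return silver_rows


def to_gold(rows):
    """Aggregate cleaned rows into a per-customer total (a reporting table)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping each layer as a pure function of the previous one also makes the transformations straightforward to cover with pytest, since every step can be checked against a small in-memory input.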

Tech stack

Python 3.10 · PySpark 4.1 · PostgreSQL · dbt · Airflow · Docker · GitHub Actions

Getting started

git clone https://github.com/YOUR_USERNAME/smartflow-data-platform.git
cd smartflow-data-platform
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
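
Once `.env` exists, the pipeline code needs to read it. The sketch below shows one way to parse `KEY=VALUE` lines using only the standard library. It is an assumption, not the project's loader: the project may use a package such as python-dotenv instead, `parse_env` is a hypothetical helper, and the keys shown are placeholders (the real ones are defined in `.env.example`).

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical contents; replace with open(".env").read() in a real checkout.
example = """
# Database connection (placeholder keys)
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
"""

settings = parse_env(example)
print(settings["POSTGRES_HOST"])
```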

Project structure

See STRUCTURE.md for a full explanation of the folder layout.
