This project demonstrates a complete ETL (Extract, Transform, Load) pipeline, following a Data Lake architecture with Bronze, Silver and Gold layers, built using Python, Pandas and PostgreSQL.
The goal of this project is to practice real-world Data Engineering concepts, including data ingestion, normalization, storage optimization, relational modeling and analytical consumption.
The pipeline processes data from multiple sources and evolves it across layers:
- Bronze: Raw data ingestion (CSV and JSON)
- Silver: Data normalization and optimization (Parquet)
- Gold: Relational modeling and analytical queries (PostgreSQL)
The final output is a clean, enriched and queryable dataset, ready for analytics and visualization.
Purpose:
Store raw data with minimal transformation, preserving original structure.
- Read user data from CSV files
- Consume external API (ViaCEP) to enrich data with address information
- Handle API errors (timeouts, invalid CEPs, connection issues)
- Persist raw outputs into the Bronze layer
Technologies:
- Python
- Pandas
- Requests
- REST API (ViaCEP)
Purpose:
Clean, standardize and optimize data for analytical processing.
- Automatic ingestion of CSV and JSON files from Bronze
- Conversion to columnar format (Parquet)
- Removal of duplicate records
- Handling of complex columns (lists converted to strings)
- Generic pipeline capable of processing multiple files
Benefits:
- Improved performance and compression
- Consistent schemas
- Data ready for relational modeling
Technologies:
- Python
- Pandas
- Parquet
Purpose:
Provide data ready for business analysis and visualization.
- Load Parquet files from Silver into PostgreSQL
- Dynamic table creation based on DataFrame schema
- Programmatic data insertion using
psycopg2 - Relational modeling using SQL joins
- Analytical query joining users with address data
- Enrich user data with geographic attributes (city, state, region)
- Use of
LEFT JOINto preserve all user records - Removal of duplicates with
DISTINCT
- Jupyter Notebook for data exploration and visualization
- Ready for dashboards or BI tools
Technologies:
- PostgreSQL
- SQL
- psycopg2
- Pandas