A professional-grade Data Engineering pipeline that extracts, transforms, and visualizes global weather and air pollution data in near real time, refreshed hourly.
This project implements a complete ETL (Extract, Transform, Load) pipeline, built as a portfolio-ready data engineering solution. It pulls data from multiple OpenWeatherMap API endpoints, processes it using pandas, ensures data quality through custom validation, and serves it via a premium Streamlit dashboard.
- Automated ETL Pipeline: Full lifecycle from API ingestion to local storage.
- SQL & Big Data Formats: Simultaneously stores data in a SQLite database (.db) and an industry-standard Parquet file (.parquet).
- Data Quality Assurance: Integrated validation layer that checks for missing values, out-of-range temperatures, and corrupted data.
- Real-time Dashboard: Premium Streamlit UI featuring interactive maps, air quality health insights, and comparative analysis charts.
- Autonomous Scheduling: A background scheduler that triggers the pipeline every hour to keep data fresh.
- Professional Logging: Dual-stream logging (Console + File) to track pipeline health and performance.
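The dual-stream logging mentioned above can be sketched with the standard `logging` module. This is a minimal illustration, not the project's actual code: the function name `build_logger`, the logger name, the log file name `pipeline.log`, and the message format are all assumptions.

```python
import logging
import sys


def build_logger(name="pipeline", log_path="pipeline.log"):
    """Return a logger that writes every record to the console and to a file.

    The name, path, and format are illustrative assumptions.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if logger.handlers:
        # Already configured: avoid attaching duplicate handlers.
        return logger
    formatter = logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
    handlers = (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("Pipeline run started")
```

Attaching both handlers to one logger means a single `log.info(...)` call reaches the terminal and the log file with the same format.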
- Extract: Python requests calls to OpenWeatherMap (Weather and Pollution APIs).
- Transform: pandas merge and normalization. Temperature conversion (Kelvin to Celsius) and AQI categorization.
- Validate: Logic checks to prevent "garbage-in, garbage-out" scenarios.
- Load:
- SQLite: For relational queries and dashboard serving.
- Parquet: For high-performance analytical storage (Data Lake style).
- Visualize: Streamlit & Plotly interactive interface.
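The Transform and Validate steps above can be sketched as follows. The project does this work with pandas; the sketch uses plain dictionaries to stay dependency-free, and the same functions could be applied column-wise to a DataFrame. The field names (`city`, `temp_k`, `aqi`) and the temperature bounds are assumptions made for illustration. The five AQI labels follow the OpenWeatherMap 1 to 5 scale.

```python
# OpenWeatherMap reports AQI as an integer from 1 (Good) to 5 (Very Poor).
AQI_LABELS = {1: "Good", 2: "Fair", 3: "Moderate", 4: "Poor", 5: "Very Poor"}


def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to Celsius, rounded to 2 decimals."""
    return round(kelvin - 273.15, 2)


def transform(weather, pollution):
    """Merge one weather reading and one pollution reading for the same city.

    Input keys ("city", "temp_k", "aqi") are hypothetical field names.
    """
    aqi = pollution["aqi"]
    return {
        "city": weather["city"],
        "temp_c": kelvin_to_celsius(weather["temp_k"]),
        "aqi": aqi,
        "aqi_label": AQI_LABELS.get(aqi, "Unknown"),
    }


def validate(row, min_c=-90.0, max_c=60.0):
    """Return a list of problems; an empty list means the row passes.

    The temperature bounds are assumed sanity limits, not project values.
    """
    problems = [f"missing {key}" for key, value in row.items() if value is None]
    temp = row.get("temp_c")
    if temp is not None and not (min_c <= temp <= max_c):
        problems.append("temperature out of range")
    if row.get("aqi") not in AQI_LABELS:
        problems.append("aqi outside 1-5")
    return problems
```

Returning a list of problems rather than raising on the first one lets the pipeline log every issue with a row before deciding whether to drop it.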
aqt/
├── fetch_data.py # API Extraction logic
├── transform_data.py # Merging, Data Cleaning & Validation
├── database_manager.py # Loading logic (SQL & Parquet)
├── run_pipeline.py # Main ETL orchestrator with Logging
├── auto_scheduler.py # Background automation script
├── dashboard.py # Streamlit visualization app
└── requirements.txt # Project dependencies
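As a rough idea of what the SQLite half of the loading logic in database_manager.py might look like, here is a sketch using the standard `sqlite3` module. The table name `readings`, its columns, and the function name `load_rows` are hypothetical and not taken from the project.

```python
import sqlite3


def load_rows(conn, rows):
    """Append transformed rows to a SQLite table and return the row count.

    Table and column names are illustrative assumptions.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "city TEXT, temp_c REAL, aqi INTEGER, fetched_at TEXT)"
    )
    # Named placeholders are filled from each dictionary in rows.
    conn.executemany(
        "INSERT INTO readings VALUES (:city, :temp_c, :aqi, :fetched_at)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # use a .db file path in practice
    sample = [
        {"city": "Delhi", "temp_c": 27.0, "aqi": 5, "fetched_at": "2024-01-01T00:00:00"},
    ]
    print(load_rows(connection, sample))
```

For the Parquet half, pandas offers `DataFrame.to_parquet`, which needs a Parquet engine such as pyarrow installed; how the project wires that up is not shown here.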
- Python 3.8+
- OpenWeatherMap API Key
pip install -r requirements.txt
- Run a single data fetch:
python run_pipeline.py
- Start the automated scheduler:
python auto_scheduler.py
- Launch the Dashboard:
streamlit run dashboard.py
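The hourly trigger that auto_scheduler.py provides can be approximated with a plain loop. This is a sketch under assumptions, not the project's scheduler: the function name `run_hourly`, the `max_runs` limit, and the injectable `sleep` parameter are all illustrative, added so the loop can be stopped and tested.

```python
import time


def run_hourly(job, interval_seconds=3600, max_runs=None, sleep=time.sleep):
    """Call job repeatedly, sleeping interval_seconds between runs.

    max_runs=None loops forever; a number stops after that many runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:
            # Keep the scheduler alive if a single pipeline run fails.
            print(f"pipeline run failed: {exc}")
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    # Short demo: two runs, one second apart.
    run_hourly(lambda: print("running pipeline"), interval_seconds=1, max_runs=2)
```

Catching exceptions inside the loop is a deliberate choice: one failed API call should not end the background process, only skip that hour's refresh.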
The dashboard provides a health-centric view of global cities:
- AQI Level 1 (Good): "Air quality is satisfactory."
- AQI Level 5 (Very Poor): "Health alert: stay indoors!"
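A lookup like the one below could produce these health messages from an AQI value. Only the two messages listed above are included; the names `HEALTH_MESSAGES` and `health_insight` are hypothetical, and levels without a documented message return `None` rather than an invented text.

```python
# Messages for levels 1 and 5 are quoted from the dashboard description;
# levels 2 to 4 are intentionally left out because their text is not given.
HEALTH_MESSAGES = {
    1: ("Good", "Air quality is satisfactory."),
    5: ("Very Poor", "Health alert: stay indoors!"),
}


def health_insight(aqi):
    """Return 'AQI Level N (Label): message', or None if no message is defined."""
    entry = HEALTH_MESSAGES.get(aqi)
    if entry is None:
        return None
    label, message = entry
    return f"AQI Level {aqi} ({label}): {message}"


if __name__ == "__main__":
    print(health_insight(5))
```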
Naga Mohan Madicharla
A Data Engineering beginner project exploring APIs, pandas, and automation.