I am a Junior Data Engineer focused on data quality, reliability, and governance in Big Data environments.
I work mainly with Python and PySpark, handling large-scale data processing in Data Lakes using Bronze, Silver, and Gold layers.
I am especially interested in building reliable and auditable data pipelines, ensuring data consistency from ingestion to analytical consumption.
- Data Engineering fundamentals
- Data validation and profiling
- Data comparison between heterogeneous sources
- Schema normalization and data type alignment
- Data quality checks and reconciliation
- Python
- SQL
- Apache Spark / PySpark
- Azure Synapse Analytics
- Azure Data Lake Storage (ADLS Gen2)
- AWS (S3, basic services and concepts)
- Databricks (fundamentals and notebooks)
- Delta Lake
- Parquet
- CSV
- Git & GitHub
- Jupyter Notebooks
- Kubernetes (basic concepts)
- Generic notebooks for data validation and table comparison
- Data profiling scripts
- Comparisons between CSV, Parquet, and Delta datasets
- Automated Excel reports for data conformity and discrepancies
- Practical projects focused on data quality and governance
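
As a rough illustration of the table comparison and reconciliation idea in the list above, here is a minimal sketch in plain Python. The function name `reconcile`, the sample rows, and the `id` key column are hypothetical and not taken from any project in this repository; an actual notebook would more likely operate on PySpark DataFrames rather than lists of dictionaries.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def reconcile(source: List[Row], target: List[Row], key: str) -> Dict[str, list]:
    """Compare two row sets by a key column and report discrepancies.

    Hypothetical helper for illustration only; assumes `key` is present
    and unique in both row sets.
    """
    # Index each side by the key column for direct lookup.
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}

    # Keys that exist on one side only.
    missing_in_target = sorted(src.keys() - tgt.keys())
    missing_in_source = sorted(tgt.keys() - src.keys())

    # For keys present on both sides, compare column by column.
    mismatches = []
    for k in sorted(src.keys() & tgt.keys()):
        for col in sorted(set(src[k]) | set(tgt[k])):
            left, right = src[k].get(col), tgt[k].get(col)
            if left != right:
                mismatches.append((k, col, left, right))

    return {
        "missing_in_target": missing_in_target,
        "missing_in_source": missing_in_source,
        "mismatches": mismatches,
    }


# Made-up sample rows standing in for two layers of the same table.
bronze = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.0},
]
silver = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.0},
    {"id": 4, "amount": 40.0},
]

report = reconcile(bronze, silver, key="id")
print(report)
```

The report separates rows missing on either side from rows whose values differ, which is the kind of breakdown a conformity report would typically need.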
My goal is to grow as a Data Engineer by strengthening my skills in distributed data processing, modern data architectures, and cloud-based data platforms, while contributing to reliable and scalable data solutions.
- 💼 LinkedIn: Miguel Ozana
- 📧 Email: miguelozana@gmail.com

⭐ If you find something useful here, feel free to star the repository!
