This repository contains code and configuration for a pyspark data pipeline using AWS cloud infrastructure
Repo structure:
main.py: entrypoint of the spark application
mbdw: package containing application logic
|-- model: contains all aspects of the data model
|-- data_quality: contains functionality to assess data quality (no implemented)
|-- storage_interface.py: an interface for loading data and writing data to / from storage
Andreas Kreitschmann, e-mail: a.kreitschmann@gmail.com
For an architecture overview and details on the data model, refer to the docs folder.
This project is a python project and can thus be built with standard python tooling. The tested approaches include:
python -m build
Since this application was built to be run on AWS infrastructure, the aws-emr-cli Python library offers a convenient alternative for packaging the application:
emr package --entry-point main.py
Further information on how the second option works in detail, visit this page
Tests are implemented using the pytest library. As usual, they reside in the tests folder.
Execute all tests by running pytest -vvv in the projects root directory.
This repo comes with a pre-configured Github actions workflow that executes all tests automatically.
For details see the .github/workflows folder.
This project loads, transforms and writes data using Apache Spark. It can be run locally in standalone mode (during development). To run the application for production loads, it should be run on a Spark cluster instead.
In order to run Spark workloads on AWS (tested with different flavours of the AWS EMR service), artefacts need to be first deployed to a S3 bucket from which the Spark cluster can read. This can be either achieved manually or using the aws-emr-cli tool mentioned above. In this repo's CI/CD workflow, a deploy step is included.
You can simply edit the file .github/workflows/python-app.yml to adjust the target S3_BUCKET. See steps starting with Deploy package...
Once your pyspark package is deployed to S3 (see previous point), you can run the application on any of the available AWS EMR flavours. During development, the infrastructure for this project has been provisioned manually. A future improvement of this package is to include automated provisioning of infrastructure, e.g. an AWS EMR Serverless instance / application (or AWS on EC2). Details on the setup of an AWS EMR Serverless application can be found here
Once the Spark cluster is available, an easy way to start the Spark is using the aws-emr-cli, like so:
emr run --application-id <your EMR serverless application ID> \
--job-role <job role created during setup of EMR serverless application>
--s3-code-uri s3://<S3 Bucket>/<prefix where the CI/CD pipeline deployed the code>/master/
--entry-point main.py
--job-name pr_issues_transform
--wait
Note that this command can be run from any environment (also from outside of AWS). All you need is to configure the respective AWS access, e.g. via aws-cli. More on this here. This can be useful during development e.g. in a local environment.
For productive workloads, a scheduler like Apache Airflow should be used.
Upon successful termination, the application writes parquet files back to the specified S3 bucket. These can now be loaded into a Data Warehouse, e.g. AWS Redshift using a simple COPY command.
- implement data quality check module
- load final parquet data into Redshift table & expose SQL endpoint
- set up automatic provisioning of AWS EMR Serverless application (has been provisioned manually so far)