mb-data-warehouse

About

This repository contains code and configuration for a pyspark data pipeline using AWS cloud infrastructure

Repo structure:

main.py: entrypoint of the spark application
mbdw: package containing application logic
|-- model: contains all aspects of the data model
|-- data_quality: contains functionality to assess data quality (no implemented)
|-- storage_interface.py: an interface for loading data and writing data to / from storage

Maintainer

Andreas Kreitschmann, e-mail: a.kreitschmann@gmail.com

Background

For an architecture overview and details on the data model, refer to the docs folder.

Building & testing the project

Build

This project is a python project and can thus be built with standard python tooling. The tested approaches include:

python -m build

Since this application was built to be run on AWS infrastructure, the aws-emr-cli Python library offers a convenient alternative for packaging the application:

emr package --entry-point main.py

Further information on how the second option works in detail, visit this page

Test

Tests are implemented using the pytest library. As usual, they reside in the tests folder. Execute all tests by running pytest -vvv in the projects root directory.

This repo comes with a pre-configured Github actions workflow that executes all tests automatically. For details see the .github/workflows folder.

How to use the project

This project loads, transforms and writes data using Apache Spark. It can be run locally in standalone mode (during development). To run the application for production loads, it should be run on a Spark cluster instead.

Deploy the app

In order to run Spark workloads on AWS (tested with different flavours of the AWS EMR service), artefacts need to be first deployed to a S3 bucket from which the Spark cluster can read. This can be either achieved manually or using the aws-emr-cli tool mentioned above. In this repo's CI/CD workflow, a deploy step is included.

You can simply edit the file .github/workflows/python-app.yml to adjust the target S3_BUCKET. See steps starting with Deploy package...

Run the app

Once your pyspark package is deployed to S3 (see previous point), you can run the application on any of the available AWS EMR flavours. During development, the infrastructure for this project has been provisioned manually. A future improvement of this package is to include automated provisioning of infrastructure, e.g. an AWS EMR Serverless instance / application (or AWS on EC2). Details on the setup of an AWS EMR Serverless application can be found here

Once the Spark cluster is available, an easy way to start the Spark is using the aws-emr-cli, like so:

    emr run --application-id <your EMR serverless application ID> \
         --job-role <job role created during setup of EMR serverless application>
         --s3-code-uri s3://<S3 Bucket>/<prefix where the CI/CD pipeline deployed the code>/master/
         --entry-point main.py
         --job-name pr_issues_transform
         --wait

Note that this command can be run from any environment (also from outside of AWS). All you need is to configure the respective AWS access, e.g. via aws-cli. More on this here. This can be useful during development e.g. in a local environment.

For productive workloads, a scheduler like Apache Airflow should be used.

Upon successful termination, the application writes parquet files back to the specified S3 bucket. These can now be loaded into a Data Warehouse, e.g. AWS Redshift using a simple COPY command.

Open Issues

implement data quality check module
load final parquet data into Redshift table & expose SQL endpoint
set up automatic provisioning of AWS EMR Serverless application (has been provisioned manually so far)

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
.github/workflows		.github/workflows
.idea		.idea
docs		docs
mbdw		mbdw
scripts		scripts
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
conf.py		conf.py
emr_run.py		emr_run.py
index.rst		index.rst
main.py		main.py
make.bat		make.bat
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mb-data-warehouse

About

Maintainer

Background

Building & testing the project

Build

Test

How to use the project

Deploy the app

Run the app

Open Issues

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mb-data-warehouse

About

Maintainer

Background

Building & testing the project

Build

Test

How to use the project

Deploy the app

Run the app

Open Issues

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages