GitHub - speropoulos/airflow_project

Apache Airflow ETL Data Pipeline

My workplace's current CRM is poorly managed, and some coworkers and I wanted to create our own dashboards so that we could better understand the data from our work with clients and stores. So I set out to build the data pipeline myself.

Diagram showing the different technologies used (see sharecrm pipeline diagram.png in this repository):

Each batch is then sent to the designated Amazon S3 bucket to update the preexisting table.

Webscraping

I scrape selected tables from the ShareCRM instance that my workplace uses, extracting the data with Selenium in Python to automate the scraping.
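As a rough illustration of this step, here is a minimal sketch of reading one HTML table with Selenium. It is not the code from scrape_visit_info.py: the function names, the `url` and `table_selector` arguments, the headless Chrome setup, and the assumption that the table has `thead` and `tbody` sections are all hypothetical, and any login step the CRM requires is left out.

```python
def rows_to_records(header, rows):
    """Pair each row's cell values with the column names from the header."""
    return [dict(zip(header, row)) for row in rows]


def scrape_table(url, table_selector, timeout=30):
    """Open `url` in headless Chrome and return one dict per table row.

    `url` and `table_selector` are placeholders; the real page and
    selector depend on the CRM being scraped.
    """
    # Imported here so rows_to_records can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the table is present before reading it.
        table = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, table_selector))
        )
        header = [
            th.text.strip()
            for th in table.find_elements(By.CSS_SELECTOR, "thead th")
        ]
        rows = [
            [td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
            for tr in table.find_elements(By.CSS_SELECTOR, "tbody tr")
        ]
        return rows_to_records(header, rows)
    finally:
        # Always close the browser, even if the scrape fails.
        driver.quit()
```

Keeping the row-to-record conversion in its own function means it can be checked without launching a browser.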

Airflow DAG code

Then I wrote Python code that transforms the JSON data response. I deployed and scheduled that code using Apache Airflow running on an Amazon EC2 Ubuntu machine.
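The transform and scheduling steps might look roughly like the sketch below. It is not the code from scrape_visit_info_dag.py. The README does not state the output format, so CSV is used purely as an assumption, as are the function names, the `scrape_visit_info` DAG id, the start date, and the daily schedule. The `schedule` argument assumes Airflow 2.4 or later.

```python
import csv
import io
import json
from datetime import datetime


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text.

    CSV is an assumed target format, not one confirmed by the project.
    """
    records = json.loads(json_text)
    if not records:
        return ""
    # Collect column names in first-seen order so rows with extra keys still fit.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


def build_dag(run_pipeline):
    """Wrap `run_pipeline` (a no-argument callable) in a single-task DAG."""
    # Imported here so json_to_csv can be used without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="scrape_visit_info",       # hypothetical id
        start_date=datetime(2024, 1, 1),  # hypothetical start date
        schedule="@daily",                # hypothetical schedule
        catchup=False,                    # do not backfill past runs
    ) as dag:
        PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
    return dag
```

In an actual DAG file, the result of `build_dag(...)` would need to be assigned to a module-level variable so that the Airflow scheduler can discover it.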

Amazon S3 Bucket

After Airflow runs successfully, there is a newly updated object in the S3 bucket. That data can then be loaded into any data warehouse you like!
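For reference, writing a batch to S3 could be sketched as follows with boto3. The function name, bucket, key, and content type are hypothetical, and credentials are assumed to come from the environment (for example an IAM role attached to the EC2 instance). Uploading to the same key each run replaces the previous object, which is one way to keep a single up-to-date table.

```python
def upload_text(text, bucket, key, s3_client=None, content_type="text/csv"):
    """Upload `text` to s3://bucket/key, replacing any existing object.

    `bucket`, `key` and `content_type` are placeholders chosen for
    illustration; pass `s3_client` to reuse or stub the S3 client.
    """
    if s3_client is None:
        # Imported here so a stub client can be used without boto3 installed.
        import boto3

        s3_client = boto3.client("s3")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=text.encode("utf-8"),
        ContentType=content_type,
    )
    return f"s3://{bucket}/{key}"
```

A call such as `upload_text(csv_text, "my-bucket", "visit_info.csv")` would return the object's S3 URI; both names in that call are made up.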

Folders and files (7 commits):
.DS_Store
.gitignore
README.md
scrape_visit_info.py
scrape_visit_info_dag.py
sharecrm pipeline diagram.png
