- The goal of this project was to create an ETL-Query (Extract, Transform, Load, Query) pipeline using Python and Databricks to join and query individual-level voter registration and vote history data for North Carolina.
- I extracted data from the NC State Board of Elections website, cleaned and transformed the data into a standard UTF-16 tab-separated CSV format, and loaded the data into a Databricks warehouse.
- Finally, I was able to establish a connection to Databricks and query the data directly to perform joins, aggregations, and sorting operations.
This project uses data on voter registration and vote history for individuals in Durham County, North Carolina.
More information and a link to the data is available at:
- https://www.ncsbe.gov/results-data/voter-registration-data
- https://www.ncsbe.gov/results-data/voter-history-data
- In the
mylibdirectory,extract.pyextracts the raw data from the link to the NCSBE website.- The
extract_zip()function takes in a URL and a destination directory name and creates a filepath for a zipped file, downloads the zipped file, and extracts the file to the named destination directory. - The
test_encoding_zippedfile()function takes in a URL for a zipped.txtfile, downloads the file, and checks the encoding. Thecheck_txtencoding_directory()function takes a directory and checks the encoding for each.txtfile in the directory. This function can be easily updated to check the encoding of.csvfiles.
- The
- The
transform_load.pyscript performs multiple cleaning and transforming operations to prepare the data to be loaded into the Databricks database.- Both functions
transform_voterreg()andtransform_votehistory()take in their respective downloaded datasets as arguments, along with the name of a county, the date of transformation, and the destination directory. The functions check the column names against a pre-defined list of expected column names, and transforms the datasets into UTF-16 encoded tab-separated CSVs. - The above step was necessary because voter registration data in North Carolina is saved in the "Windows-1521" encoding. Vote history data is saved as "UTF-8" encoding.
- The
trim_dataset()function was used to sort the data by "ncid" (North Carolina ID) and reduce each dataset to a set number of observations for testing. - Both
load_voterreg()andload_votehistory()functions take in a dataset as an argument before reading them line by line into a iterable reader object. Then, using thepython-dotenvmodule, we connect to the Databricks warehouse and check for existing tables. If no table exists, we create a new table with a pre-defined schema, and insert each line of the csv reader object into the table. Once this is completed, we close the connection to conserve resources.
- Both functions
- Finally, the
query.pyscript includes thegeneral_query()function required to execute various SQL operations. This function connections to the Databricks warehouse and takes in a SQL query formatted as a string.- This script also includes a
log_query()helper function that is included in thegeneral_query()function to record the syntax and result of each SQL query.
- This script also includes a
- This query retrieves data from our two tables (ped19_voterreg and ped19_voterhist), performs an inner join based on the "ncid" column, groups the data by the party the individual voted for, and orders the count for each party in descending order.
- Note: To perform this operation within our server warehouse limit, I trimmed the voter registration and vote history datasets to the first 5,000 observations.
PeterdeGuzman_Mini6/
βββ __pycache__/
βββ .devcontainer/
β βββ devcontainer.json
β βββ Dockerfile
βββ .pytestcache/
βββ .ruff_cache/
βββ __pycache__
βββ .github/
β βββ workflows/
β βββ cicd.yml
βββ mylib/
β βββ extract.py
β βββ transform_load.py
β βββ query.py
βββ data/
βββ data_documentation
βββ .gitignore
βββ proof_test.png
βββ main.py
βββ Makefile
βββ README.md
βββ Requirements.txt
βββ test_main.py
