This project involves extracting, transforming, and loading (ETL) baseball statistics data into an external database Databricks, and then performing complex SQL queries on the data.
The data used in this project comes FiveThirtyEight's public dataset: https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/goose/goose_rawdata.csv
- Utilizes Databricks SQL for big data processing and analytics
- Implements Common Table Expressions (CTE) for efficient data manipulation
- Employs complex SQL Query with joins, aggregations and sorting techniques
- Uses Lint and code format to maintain code quality, and uses tests for key functionalities
- Applies secure environment variables including "DATABRICKS_KEY", "SERVER_HOSTNAME", "HTTP_PATH" for Databricks authentication.
File mylib/extract.py: Fetches baseball statistics from a URL and saves it locally as Goose.csvFile "mylib/transform_load.py: Transforms and loads data into Databricks database using the "databricks-sql-connector".
- File "mylib/query.py": Executes a complex SQL query on the loaded data
SELECT Statement
The query selects the following columns:
- team: The name of the team.
- league: The league to which the team belongs.
- COUNT(name) AS total_players: This counts the total number of players associated with each team and assigns it an alias
total_players. - ROUND(AVG(goose_eggs), 1) AS avg_goose_eggs: This calculates the average number of goose eggs for each team, rounding it to one decimal place, and assigns it the alias
avg_goose_eggs. - ROUND(AVG(broken_eggs), 1) AS avg_broken_eggs: This computes the average number of broken eggs for each team, also rounding it to one decimal place, with the alias
avg_broken_eggs.
FROM Clause
The data is sourced from the default.goosedb table, which contains the relevant information for the query.
GROUP BY Clause
The results are grouped by team and league. This means that the aggregation functions (like COUNT and AVG) will operate within each combination of team and league, providing statistics specific to each group.
The goose_stats CTE will return a summarized table containing the following information for each team and league:
- Total Players (
total_players): The total number of players associated with each team. - Average Goose Eggs (
avg_goose_eggs): The average number of goose eggs scored by players on each team. - Average Broken Eggs (
avg_broken_eggs): The average number of broken eggs for each team.
Create Goose_stats Successfully 😃
The main query then joins this aggregated data back to the original player-level data, and then sorts the results by the team's average "goose eggs" in descending order.
- SELECT Statement
The query selects all columns (
*) from the combined results of thedefault.goosedbtable and thegoose_statsCTE. - FROM Clause
The data is sourced from the
default.goosedbtable, which contains player and team information. - JOIN Operation
The query performs an inner join with the
goose_statsCTE. The join condition specifies that theteamandleaguecolumns in both thedefault.goosedbtable and thegoose_statsCTE must match. This effectively combines the records based on their team and league affiliations. - ORDER BY Clause
The results are ordered by
avg_goose_eggsin descending order. This means that teams with higher average goose eggs will appear first in the result set.
The query will return a combined table that includes all columns from both the default.goosedb table and the goose_stats CTE. The expected results include:
- Detailed player information from the
default.goosedbtable. - Corresponding average goose eggs and other metrics from the
goose_statsCTE. - The records will be sorted in descending order by the average number of goose eggs (
avg_goose_eggs), allowing users to easily identify the teams with the highest performance based on this metric.
Join and Sort Data Successfully 😃
https://github.com/nogibjj/sqlite-lab https://fivethirtyeight.com/features/kenley-jansen-is-the-model-of-a-modern-reliever/






