Skip to content

timstan-db/utility-intelligence

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

US Utility Intelligence on Databricks

Natural-language exploration of US electric utility data on Databricks. ~7,400 utilities, ~30,000 power plants, generation, sales, reliability, demand response, distributed generation, FERC Form 1 financials, and county-level service territories -- all queryable in plain English through an AI/BI Genie Space.

The repo is a single Databricks Asset Bundle (DAB) that handles the full ETL: downloads public data, lands it in a Unity Catalog volume, builds raw and curated star-schema layers via Spark Declarative Pipelines (SDP), and creates the Genie Space over the curated tables.

What you get

Raw schema <catalog>.raw_utility_intelligence -- 23 tables landed from EIA Forms 860/861/923, EPA eGRID 2022, PUDL FERC Form 1 + EIA, and Census TIGER/Line.

Curated schema <catalog>.curated_utility_intelligence -- a clean star schema with 4 dimension tables and 14 fact tables, joined and typed, column-level comments included.

Dimensions
dim_utility ~7,400 US electric utilities (EIA + PUDL crosswalk)
dim_plant ~30,000 power plants (EIA 860 + eGRID emissions)
dim_generator Generators with nameplate capacity, fuel, prime mover
dim_county County boundary polygons (Census TIGER/Line 2024)
Facts
fact_generation Monthly net generation by plant, prime mover, fuel
fact_fuel_receipts Fuel deliveries with heat content, sulfur, cost
fact_plant_ownership Generator ownership shares by utility
fact_sales Annual sales (MWh, $, customers) by sector
fact_reliability SAIDI / SAIFI / CAIDI by utility
fact_demand_response DR program enrollment, savings, costs
fact_net_metering Net-metering capacity, installations, exports
fact_ami_deployment Smart meter deployment by sector
fact_distributed_generation DG capacity and customers by tech
fact_energy_efficiency EE program savings, peak reduction, costs
fact_ferc1_plants FERC Form 1 plant capacity, generation, capex
fact_ferc1_expenses FERC Form 1 operating expenses
fact_ferc1_revenues FERC Form 1 revenues by class
fact_service_territory Counties served per utility-year

A Genie Space named US Utility Intelligence over the 18 curated tables, with sample questions and modeling instructions baked in.

Prerequisites

  • A Databricks workspace with Unity Catalog and serverless compute enabled.
  • The Databricks CLI v0.228+ installed and configured for your workspace (e.g. databricks configure).
  • A UC catalog you can write to (defaults to main).
  • A serverless SQL warehouse in the workspace (any will do -- the Genie setup picks one automatically).

Deploy

From the repo root:

databricks bundle deploy

# 1. Land + curate the data (downloads ~1 GB, runs two SDP pipelines)
databricks bundle run ingest
databricks bundle run curate

# 2. Create the Genie Space
databricks bundle run setup

When setup finishes, the job output prints the Genie Space URL. Open it and try:

Which 10 utilities have the highest total electricity sales in MWh?

Show me the top 20 power plants by CO2 emissions

What is the total installed solar capacity by state?

Overriding catalog or warehouse

export BUNDLE_VAR_catalog=YOUR_CATALOG
export BUNDLE_VAR_default_warehouse_id=YOUR_WAREHOUSE_ID

Or pass --var=catalog=... / --var=default_warehouse_id=... on each databricks bundle invocation.

What the bundle does

  1. Ingest (ingest job)

    • Downloads source files in parallel from EIA, EPA, PUDL, and Census (src/01_download.py).
    • Converts Excel/shapefile sources to Parquet and writes to a UC volume.
    • Runs the raw_tables SDP pipeline to build 23 materialized views in raw_utility_intelligence (src/raw_*.sql).
    • Applies table and column comments plus UC tags (src/03_raw_table_comments.py).
  2. Curate (curate job)

    • Runs the curated_tables SDP pipeline to build a clean star schema in curated_utility_intelligence (src/dim_*.sql, src/fact_*.sql).
    • Applies curated table and column comments plus UC tags (src/03_curated_table_comments.py).
  3. Setup (setup job)

    • Creates or updates the Genie Space US Utility Intelligence over the 18 curated tables, with sample questions and modeling instructions (src/01_create_genie_space.py).

Data sources

Source What Year
EIA Form 860 Plant & generator characteristics 2023
EIA Form 861 Utility sales, revenue, reliability 2023
EIA Form 923 Monthly generation & fuel 2023
EPA eGRID Plant/generator emissions 2022
PUDL FERC financials, service territory, crosswalks Stable
Census TIGER/Line County boundary polygons 2024

All public, no credentials required.

Notes

  • Serverless only. Both SDP pipelines and the Genie Space's warehouse use Databricks serverless compute.
  • The first ingest run downloads ~1 GB and runs the SDP pipeline -- expect 10-20 minutes on a fresh deployment. Subsequent runs are incremental.
  • Genie Spaces aren't yet a native Databricks Asset Bundle resource (see databricks/cli#4191), so creation happens via the REST API from a notebook task in the bundle.

Contributing

This repo is published from an upstream source, so pull requests here will be overwritten on the next release. Please file an issue for bugs, questions, or improvement ideas -- fixes will be made upstream and reflected here on the next push.

License

MIT -- see LICENSE.

About

Natural-language exploration of US electric utility data on Databricks via an AI/BI Genie Space. Single-bundle DAB combining ingestion (EIA, EPA, PUDL, Census), curated star schema, and Genie Space creation.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages