Natural-language exploration of US electric utility data on Databricks. ~7,400 utilities, ~30,000 power plants, generation, sales, reliability, demand response, distributed generation, FERC Form 1 financials, and county-level service territories -- all queryable in plain English through an AI/BI Genie Space.
The repo is a single Databricks Asset Bundle (DAB) that handles the full ETL: downloads public data, lands it in a Unity Catalog volume, builds raw and curated star-schema layers via Spark Declarative Pipelines (SDP), and creates the Genie Space over the curated tables.
Raw schema <catalog>.raw_utility_intelligence -- 23 tables landed from
EIA Forms 860/861/923, EPA eGRID 2022, PUDL FERC Form 1 + EIA, and Census
TIGER/Line.
Curated schema <catalog>.curated_utility_intelligence -- a clean star
schema with 4 dimension tables and 14 fact tables, joined and typed,
column-level comments included.
| Dimensions | |
|---|---|
dim_utility |
~7,400 US electric utilities (EIA + PUDL crosswalk) |
dim_plant |
~30,000 power plants (EIA 860 + eGRID emissions) |
dim_generator |
Generators with nameplate capacity, fuel, prime mover |
dim_county |
County boundary polygons (Census TIGER/Line 2024) |
| Facts | |
|---|---|
fact_generation |
Monthly net generation by plant, prime mover, fuel |
fact_fuel_receipts |
Fuel deliveries with heat content, sulfur, cost |
fact_plant_ownership |
Generator ownership shares by utility |
fact_sales |
Annual sales (MWh, $, customers) by sector |
fact_reliability |
SAIDI / SAIFI / CAIDI by utility |
fact_demand_response |
DR program enrollment, savings, costs |
fact_net_metering |
Net-metering capacity, installations, exports |
fact_ami_deployment |
Smart meter deployment by sector |
fact_distributed_generation |
DG capacity and customers by tech |
fact_energy_efficiency |
EE program savings, peak reduction, costs |
fact_ferc1_plants |
FERC Form 1 plant capacity, generation, capex |
fact_ferc1_expenses |
FERC Form 1 operating expenses |
fact_ferc1_revenues |
FERC Form 1 revenues by class |
fact_service_territory |
Counties served per utility-year |
A Genie Space named US Utility Intelligence over the 18 curated tables, with sample questions and modeling instructions baked in.
- A Databricks workspace with Unity Catalog and serverless compute enabled.
- The Databricks CLI
v0.228+ installed and configured
for your workspace (e.g.
databricks configure). - A UC catalog you can write to (defaults to
main). - A serverless SQL warehouse in the workspace (any will do -- the Genie setup picks one automatically).
From the repo root:
databricks bundle deploy
# 1. Land + curate the data (downloads ~1 GB, runs two SDP pipelines)
databricks bundle run ingest
databricks bundle run curate
# 2. Create the Genie Space
databricks bundle run setupWhen setup finishes, the job output prints the Genie Space URL. Open it and
try:
Which 10 utilities have the highest total electricity sales in MWh?
Show me the top 20 power plants by CO2 emissions
What is the total installed solar capacity by state?
export BUNDLE_VAR_catalog=YOUR_CATALOG
export BUNDLE_VAR_default_warehouse_id=YOUR_WAREHOUSE_IDOr pass --var=catalog=... / --var=default_warehouse_id=... on each
databricks bundle invocation.
-
Ingest (
ingestjob)- Downloads source files in parallel from EIA, EPA, PUDL, and Census
(
src/01_download.py). - Converts Excel/shapefile sources to Parquet and writes to a UC volume.
- Runs the
raw_tablesSDP pipeline to build 23 materialized views inraw_utility_intelligence(src/raw_*.sql). - Applies table and column comments plus UC tags
(
src/03_raw_table_comments.py).
- Downloads source files in parallel from EIA, EPA, PUDL, and Census
(
-
Curate (
curatejob)- Runs the
curated_tablesSDP pipeline to build a clean star schema incurated_utility_intelligence(src/dim_*.sql,src/fact_*.sql). - Applies curated table and column comments plus UC tags
(
src/03_curated_table_comments.py).
- Runs the
-
Setup (
setupjob)- Creates or updates the Genie Space US Utility Intelligence over the
18 curated tables, with sample questions and modeling instructions
(
src/01_create_genie_space.py).
- Creates or updates the Genie Space US Utility Intelligence over the
18 curated tables, with sample questions and modeling instructions
(
| Source | What | Year |
|---|---|---|
| EIA Form 860 | Plant & generator characteristics | 2023 |
| EIA Form 861 | Utility sales, revenue, reliability | 2023 |
| EIA Form 923 | Monthly generation & fuel | 2023 |
| EPA eGRID | Plant/generator emissions | 2022 |
| PUDL | FERC financials, service territory, crosswalks | Stable |
| Census TIGER/Line | County boundary polygons | 2024 |
All public, no credentials required.
- Serverless only. Both SDP pipelines and the Genie Space's warehouse use Databricks serverless compute.
- The first
ingestrun downloads ~1 GB and runs the SDP pipeline -- expect 10-20 minutes on a fresh deployment. Subsequent runs are incremental. - Genie Spaces aren't yet a native Databricks Asset Bundle resource (see databricks/cli#4191), so creation happens via the REST API from a notebook task in the bundle.
This repo is published from an upstream source, so pull requests here will be overwritten on the next release. Please file an issue for bugs, questions, or improvement ideas -- fixes will be made upstream and reflected here on the next push.
MIT -- see LICENSE.