databricks-template — production-ready ETL + agentic development for Databricks

A PySpark project template for Databricks built for both human and AI-assisted (agentic) development. Medallion architecture, Python packaging, unit + integration tests, CI/CD via Declarative Automation Bundles, DQX data quality, service-principal-based production deploys — and a curated CLAUDE.md so AI assistants understand the project's conventions on the first session.

🚀 Overview

This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It is built for agentic development and brings software engineering best practices—modular architecture, automated unit and integration testing, CI/CD, structured logging, service-principal-based production guardrails—into the world of data engineering. By combining a clean project structure with robust development and deployment jobs, it helps teams move faster with confidence.

You're encouraged to adapt the structure and tooling to suit your project's specific needs and environment.

Let's connect on LinkedIn.

🧪 Technologies

Databricks Free Edition (Serverless)
Databricks Runtime 18.0 LTS
Databricks Unity Catalog
Databricks Declarative Automation Bundles (formerly Asset Bundles)
Databricks CLI
Databricks Python SDK
Databricks DQX
Databricks AI Dev Kit
Claude Code
PySpark 4.1
Python 3.12+
GitHub Actions
Pytest

📦 Features

This project template demonstrates how to:

use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a CLAUDE.md that documents the project's conventions — CLI surface, catalog/schema model, runtime parameters, production guardrails, and a constraints section recording the gotchas we've hit. Design decisions made in collaboration with the agent are encoded as rules in CLAUDE.md so they survive across sessions.
structure PySpark code inside classes/packages and utilize a Python wheel package, instead of notebooks.
package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use a CI/CD pipeline with GitHub Actions. Generate job definitions with environment-specific settings using the Databricks SDK.
isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
run integration tests by setting the input data and validating the output data.
utilize job tags to track issues, costs, and ownership.
utilize the coverage package to generate test coverage reports.
utilize uv as a project/package manager.
use the medallion architecture pattern.
lint and format code with ruff and pre-commit.
use a Makefile to automate repetitive tasks.
utilize the argparse package to build a flexible command-line interface to start the jobs.
utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
utilize service principals to run production code.
utilize the Databricks SDK for Python to manage workspaces and accounts and analyze costs. Refer to the scripts folder for examples.
utilize Databricks Unity Catalog to manage permissions and get data lineage.
utilize Databricks Lakeflow Jobs to execute a DAG. Yes, you don't need Airflow to manage your DAGs here!!!
utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.

🧠 Resources

Agentic development:

Debates on the use of notebooks vs. Python packaging:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other resources:

📁 Folder Structure

databricks-template/
│
├── .github/                       # CI/CD automation
│   └── workflows/
│       └── onpush.yml             # GitHub Actions pipeline
│
├── src/                           # Main source code
│   └── template/                  # Python package
│       ├── main.py                # Entry point with CLI (argparse)
│       ├── config.py              # Configuration management
│       ├── baseTask.py            # Base class for all tasks
│       ├── commonSchemas.py       # Shared PySpark schemas
│       ├── job1/                  # Job-specific tasks
│       │   ├── extract_source1.py
│       │   ├── extract_source2.py        # DQX validation + quarantine
│       │   ├── generate_orders.py
│       │   ├── generate_orders_agg.py
│       │   ├── health_check.py           # Prod smoke task (runs first)
│       │   ├── integration_setup.py
│       │   └── integration_validate.py
│       └── job2/                  # Additional job tasks
│
├── tests/                          # Unit tests
│   ├── job1/
│   │   └── unit_test.py            # Pytest unit tests
│   └── job2/
│
├── resources/                      # Databricks workflow templates
│   └── jobs.yml                    # Generated job definition (auto-created)
│
├── scripts/                              # Helper scripts
│   ├── sdk_generate_template_job.py      # Job definition generator (Databricks SDK)
│   ├── sdk_init_workspace.py             # Workspace initialization (SP, catalogs, schemas, grants)
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   └── sdk_workspace_and_account.py      # Workspace and account management
│
├── docs/                           # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                        # Build artifacts (Python wheel)
├── coverage_reports/            # Test coverage reports
│
├── databricks.yml               # Declarative Automation Bundle config
├── pyproject.toml               # Python project configuration (uv)
├── Makefile                     # Build automation
├── .pre-commit-config.yaml      # Pre-commit hooks (ruff)
└── README.md                    # This file

CI/CD pipeline

Databricks Jobs

Task Output

Data Lineage (Unity Catalog)

Quarantine table (generated by Databricks DQX)

Instructions

(Optional) Install Databricks AI Dev Kit and Claude Code.
Create a Databricks Free Edition workspace.
Install and configure the Databricks CLI on your local machine. Check databricks.yml for the CLI version the bundle expects. Follow the instructions here.
Set up the Python environment and run unit tests on your local machine.
```
 make sync && make test
```
Initialize the workspace. First create an external location in Databricks and update the storage-root parameter in the Makefile (for more details, see Overview of external locations). Then run the command below, which creates the catalogs, schemas, service principal, and the required grants:
```
 make init
```

Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

 [dev]
 host          = https://xxxx.cloud.databricks.com/
 token         = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                 
 [staging]
 host          = https://xxxx.cloud.databricks.com/
 client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
 client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

 [prod]
 host          = https://xxxx.cloud.databricks.com/
 client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
 client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Deploy and execute on the dev workspace.
```
 make deploy env=dev
```
Configure CI/CD automation: add the workspace host and the service principal ID and secret as GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).
(Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
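The .databrickscfg layout shown in the service principal step can be sanity-checked before deploying. The following is a minimal sketch, not part of the template: `classify_profiles` is a hypothetical helper, and it assumes only that the file is INI-style with one section per profile, as in the example above.

```python
import configparser

EXPECTED_PROFILES = ("dev", "staging", "prod")


def classify_profiles(cfg_text: str) -> dict:
    """Report how each expected profile authenticates (hypothetical helper).

    Returns a mapping of profile name to one of:
    "token", "service-principal", "unknown", or "missing".
    """
    # interpolation=None so a literal "%" inside a secret is not interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)

    result = {}
    for profile in EXPECTED_PROFILES:
        if not parser.has_section(profile):
            result[profile] = "missing"
        elif parser.has_option(profile, "client_id") and parser.has_option(
            profile, "client_secret"
        ):
            result[profile] = "service-principal"
        elif parser.has_option(profile, "token"):
            result[profile] = "token"
        else:
            result[profile] = "unknown"
    return result
```

With the configuration shown earlier, dev would be reported as "token" and staging and prod as "service-principal". To check a real file, pass the text of your own .databrickscfg to the function.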

Job-level parameters (runtime, overridable per-run)

These are defined as JobParameterDefinition in scripts/sdk_generate_template_job.py and threaded into every task as CLI args via {{job.parameters.*}}. Operators can override them for a single run using the Databricks Jobs UI "Run with different parameters" dialog — no code change or redeployment needed.

| Parameter | CLI arg | Purpose | Default (dev/staging) | Default (prod) |
| --- | --- | --- | --- | --- |
| `log_level` | `--log-level` | `DEBUG` / `INFO` / `WARNING`. Bump to `DEBUG` for a single prod run during incident response. | `INFO` | `INFO` |
| `quarantine_fail_ratio` | `--quarantine-fail-ratio` | Hard-fail `extract_source2` if more than this fraction of rows is quarantined by DQX. The dev/staging default of `1.0` effectively disables the check so demo seed data still ingests. | `1.0` | `0.1` |
| `seed_date` | `--seed-date` | ISO-8601 date (e.g. `2024-03-15`) for the `seed_sources` task. Empty string (default) resolves to today's date at runtime. Override per-run to backfill a specific day. | `""` → today | `""` → today |
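To make the table concrete, here is a minimal sketch of how a task entry point might receive and interpret these three parameters. It is an illustration under stated assumptions, not the code in src/template/main.py: the function names `build_parser`, `resolve_seed_date`, and `exceeds_quarantine_ratio` are hypothetical, and the defaults mirror the dev/staging column.

```python
import argparse
from datetime import date
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three job-level parameters (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Example task entry point")
    parser.add_argument(
        "--log-level", default="INFO", choices=["DEBUG", "INFO", "WARNING"]
    )
    # 1.0 means "never hard-fail": a ratio can never exceed 1.0.
    parser.add_argument("--quarantine-fail-ratio", type=float, default=1.0)
    # Empty string means "use today's date at runtime".
    parser.add_argument("--seed-date", default="")
    return parser


def resolve_seed_date(raw: str, today: Optional[date] = None) -> date:
    """Turn the --seed-date value into a date; empty string resolves to today."""
    if raw == "":
        return today if today is not None else date.today()
    return date.fromisoformat(raw)


def exceeds_quarantine_ratio(quarantined: int, total: int, fail_ratio: float) -> bool:
    """True when strictly more than fail_ratio of the rows were quarantined."""
    if total == 0:
        return False  # Assumption: an empty input is not treated as a failure.
    return quarantined / total > fail_ratio
```

For example, `build_parser().parse_args(["--log-level", "DEBUG", "--seed-date", "2024-03-15"])` models a single overridden run. With the prod default of 0.1, 11 quarantined rows out of 100 would trip the check while 10 would not.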

Deploy-time environment variables (CI/build machine only)

Read by scripts/sdk_generate_template_job.py when generating resources/jobs.yml — never on Databricks serverless compute.

| Variable | Purpose | Default |
| --- | --- | --- |
| `TEMPLATE_ALERT_EMAILS` | Comma-separated recipients for prod `JobEmailNotifications` (on_failure + on_duration_warning). Wired from CI secret of the same name. | `data-platform-oncall@example.com` |
| `TEMPLATE_SP_APP_ID` | Override the service principal `application_id` looked up by display name. Used by CI to avoid the SCIM lookup. | resolved from `SP_DISPLAY_NAME` |
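The two variables above could be read along the following lines. This is a sketch under assumptions rather than the generator script itself: `alert_recipients` and `service_principal_app_id` are hypothetical names, and `lookup` stands in for whatever SCIM lookup by display name the real script performs.

```python
import os
from typing import Callable, List, Mapping, Optional

DEFAULT_ALERT_EMAILS = "data-platform-oncall@example.com"


def alert_recipients(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Split TEMPLATE_ALERT_EMAILS into a clean list of addresses."""
    env = os.environ if env is None else env
    raw = env.get("TEMPLATE_ALERT_EMAILS", DEFAULT_ALERT_EMAILS)
    # Tolerate spaces around commas and ignore empty entries.
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


def service_principal_app_id(
    env: Optional[Mapping[str, str]] = None,
    lookup: Optional[Callable[[], str]] = None,
) -> str:
    """Prefer TEMPLATE_SP_APP_ID; otherwise fall back to a lookup callable.

    `lookup` is a hypothetical stand-in for the lookup by display name.
    """
    env = os.environ if env is None else env
    override = env.get("TEMPLATE_SP_APP_ID")
    if override:
        return override
    if lookup is None:
        raise RuntimeError("TEMPLATE_SP_APP_ID is unset and no lookup was provided")
    return lookup()
```

Passing `env` explicitly keeps the functions easy to unit test; calling them with no arguments reads the real process environment on the CI or build machine.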

Production guardrails

databricks.yml sets mode: production on the prod target — DABs enforces that the deployer identity equals the run-as identity (the SP). make deploy env=prod from a developer's local machine will fail by design; only CI can push prod.
run_as and permissions on every staging/prod job are pinned to the service principal's application_id (not ${workspace.current_user.userName}), wired by scripts/sdk_generate_template_job.py.
health_check task runs first in prod and fails fast on a broken wheel, missing grant, or unreachable SQL warehouse — before any medallion table is touched.
Wheel version pinning: _project_version() reads pyproject.toml to produce the exact wheel filename in the bundle's JobEnvironment.dependencies, so a forgotten rebuild can't silently deploy an old wheel.
Per-environment retries: 0 in dev (fast feedback), 2 in staging/prod (transient failure resilience). Retries on staging/prod back off MIN_RETRY_INTERVAL_MS (60s) before re-attempting, giving transient lock/metastore blips time to clear.
Per-task timeouts: each task has its own timeout_seconds (300s for health-check, 900s for extracts, 1800s for transforms) so a single hung task can't consume the whole job budget.
Schema-drift guard: all writes use overwriteSchema=false so an upstream change in column type or order fails the task loudly instead of silently propagating bad data.
Queued runs, not skipped: prod has max_concurrent_runs=1 paired with queue.enabled=true — if a run is still in flight when the next 5 a.m. tick arrives, the new run queues rather than getting silently dropped.
Health-rule-backed duration alert: the on_duration_warning_threshold_exceeded email is wired to a JobsHealthRule on RUN_DURATION_SECONDS > 1800 (30 min). Without that rule, the email would be wired to an event that can never fire.
Cancelled/skipped runs don't page: notification_settings.no_alert_for_canceled_runs and no_alert_for_skipped_runs are both true, so manual cancellations or upstream-condition skips don't generate failure alerts.
