A PySpark project template for Databricks built for both human and AI-assisted (agentic) development. It covers medallion architecture, Python packaging, unit and integration tests, CI/CD via Declarative Automation Bundles, DQX data quality, service-principal-based production deploys, and a curated `CLAUDE.md` so AI assistants understand the project's conventions from the first session.
This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It is built for agentic development and brings software engineering best practices (modular architecture, automated unit and integration testing, CI/CD, structured logging, service-principal-based production guardrails) into the world of data engineering. By combining a clean project structure with robust development and deployment jobs, it helps teams move faster with confidence.
You're encouraged to adapt the structure and tooling to suit your project's specific needs and environment.
Let's connect on LinkedIn.
- Databricks Free Edition (Serverless)
- Databricks Runtime 18.0 LTS
- Databricks Unity Catalog
- Databricks Declarative Automation Bundles (formerly Asset Bundles)
- Databricks CLI
- Databricks Python SDK
- Databricks DQX
- Databricks AI Dev Kit
- Claude Code
- PySpark 4.1
- Python 3.12+
- GitHub Actions
- Pytest
This project template demonstrates how to:
- use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a `CLAUDE.md` that documents the project's conventions: CLI surface, catalog/schema model, runtime parameters, production guardrails, and a constraints section recording the gotchas we've hit. Design decisions made in collaboration with the agent are encoded as rules in `CLAUDE.md` so they survive across sessions.
- structure PySpark code inside classes/packages and utilize a Python wheel package, instead of notebooks.
- package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use a CI/CD pipeline with GitHub Actions. Generate job definitions to run with environment-specific conditions using Databricks SDK.
- isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
- run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
- run integration tests by setting the input data and validating the output data.
- utilize job tags to track issues, costs, and ownership.
- utilize the coverage package to generate test coverage reports.
- utilize uv as a project/package manager.
- use the medallion architecture pattern.
- lint and format code with ruff and pre-commit.
- use a Makefile to automate repetitive tasks.
- utilize the argparse package to build a flexible command-line interface to start the jobs.
- utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
- utilize service principals to run production code.
- utilize the Databricks SDK for Python to manage workspaces and accounts and analyze costs. Refer to the `scripts` folder for examples.
- utilize Databricks Unity Catalog to manage permissions and get data lineage.
- utilize Databricks Lakeflow Jobs to execute a DAG. Yes, you don't need Airflow to manage your DAGs here!!!
- utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
Agentic development:
- Claude Code: 5 Essentials for Data Engineering
- Mastering Claude Code in 30 minutes
- Introducing Databricks AI Dev Kit - Skills, MCP server, Builder App
Debates on the use of notebooks vs. Python packaging:
- The Rise of The Notebook Engineer
- Please don't make me use Databricks notebooks
- this LinkedIn thread by Daniel Beach
- this LinkedIn thread by Ryan Chynoweth
- this LinkedIn thread by Jaco van Gelder
Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:
- CI/CD for Databricks: Advanced Asset Bundles and GitHub Actions
- Deploying Databricks Asset Bundles (DABs) at Scale
- A Prescription for Success: Leveraging DABs for Faster Deployment and Better Patient Outcomes
Other resources:
- Goodbye Pip and Poetry. Why UV Might Be All You Need
- The Spark Revolution You Didn't See Coming: How Apache Spark 4.0 in Databricks Just Changed Everything
databricks-template/
│
├── .github/                              # CI/CD automation
│   └── workflows/
│       └── onpush.yml                    # GitHub Actions pipeline
│
├── src/                                  # Main source code
│   └── template/                         # Python package
│       ├── main.py                       # Entry point with CLI (argparse)
│       ├── config.py                     # Configuration management
│       ├── baseTask.py                   # Base class for all tasks
│       ├── commonSchemas.py              # Shared PySpark schemas
│       ├── job1/                         # Job-specific tasks
│       │   ├── extract_source1.py
│       │   ├── extract_source2.py        # DQX validation + quarantine
│       │   ├── generate_orders.py
│       │   ├── generate_orders_agg.py
│       │   ├── health_check.py           # Prod smoke task (runs first)
│       │   ├── integration_setup.py
│       │   └── integration_validate.py
│       └── job2/                         # Additional job tasks
│
├── tests/                                # Unit tests
│   ├── job1/
│   │   └── unit_test.py                  # Pytest unit tests
│   └── job2/
│
├── resources/                            # Databricks workflow templates
│   └── jobs.yml                          # Generated job definition (auto-created)
│
├── scripts/                              # Helper scripts
│   ├── sdk_generate_template_job.py      # Job definition generator (Databricks SDK)
│   ├── sdk_init_workspace.py             # Workspace initialization (SP, catalogs, schemas, grants)
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   └── sdk_workspace_and_account.py      # Workspace and account management
│
├── docs/                                 # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                                 # Build artifacts (Python wheel)
├── coverage_reports/                     # Test coverage reports
│
├── databricks.yml                        # Declarative Automation Bundle config
├── pyproject.toml                        # Python project configuration (uv)
├── Makefile                              # Build automation
├── .pre-commit-config.yaml               # Pre-commit hooks (ruff)
└── README.md                             # This file
- (Optional) Install Databricks AI Dev Kit and Claude Code.
- Create a Databricks Free Edition workspace.
- Install and configure the Databricks CLI on your local machine. Check the current version in `databricks.yml`. Follow the instructions here.
- Set up the Python environment and run unit tests on your local machine: `make sync && make test`
- Initialize the workspace. Create an external location in Databricks and update the `storage-root` parameter in the Makefile. This step will create the catalogs, schemas, service principal, and the required grants. For more details, see Overview of external locations. Then run: `make init`
- Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your `.databrickscfg` file. Your configuration should look similar to this:

      [dev]
      host = https://xxxx.cloud.databricks.com/
      token = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

      [staging]
      host = https://xxxx.cloud.databricks.com/
      client_id = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
      client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

      [prod]
      host = https://xxxx.cloud.databricks.com/
      client_id = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
      client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

- Deploy and execute on the dev workspace: `make deploy env=dev`
- Configure CI/CD automation with the service principal ID and secret. Configure GitHub Actions repository secrets (`DATABRICKS_HOST`, `DATABRICKS_PRINCIPAL_ID`, `DATABRICKS_SECRET`).
- (Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
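For the `.databrickscfg` step above, it can help to confirm that every profile carries the credentials it needs before running a deploy. The sketch below parses a `.databrickscfg`-style text with the standard-library `configparser`. The `REQUIRED_KEYS` mapping and the `missing_keys` helper are hypothetical, and the rule that dev uses a token while staging and prod use service-principal credentials is inferred from the sample configuration, not from the template's code.

```python
import configparser

# Assumption drawn from the sample config: dev authenticates with a token,
# staging and prod with a service principal's client_id / client_secret.
REQUIRED_KEYS = {
    "dev": {"host", "token"},
    "staging": {"host", "client_id", "client_secret"},
    "prod": {"host", "client_id", "client_secret"},
}


def missing_keys(cfg_text: str) -> dict[str, set[str]]:
    """Return, per profile, the required keys that are absent or empty."""
    # interpolation=None so a '%' inside a secret is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    problems: dict[str, set[str]] = {}
    for profile, required in REQUIRED_KEYS.items():
        if not parser.has_section(profile):
            problems[profile] = set(required)
            continue
        missing = {
            key for key in required
            if not parser.get(profile, key, fallback="").strip()
        }
        if missing:
            problems[profile] = missing
    return problems


SAMPLE = """
[dev]
host = https://xxxx.cloud.databricks.com/
token = bbbb

[staging]
host = https://xxxx.cloud.databricks.com/
client_id = yyyy

[prod]
host = https://xxxx.cloud.databricks.com/
client_id = yyyy
client_secret = aaaa
"""

print(missing_keys(SAMPLE))  # {'staging': {'client_secret'}}
```

An empty result means all three profiles look complete. The check only inspects the file's shape and does not verify that the credentials are valid.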
These are defined as `JobParameterDefinition` in `scripts/sdk_generate_template_job.py` and threaded into every task as CLI args via `{{job.parameters.*}}`. Operators can override them for a single run using the Databricks Jobs UI "Run with different parameters" dialog; no code change or redeployment is needed.
| Parameter | CLI arg | Purpose | Default (dev/staging) | Default (prod) |
|---|---|---|---|---|
| `log_level` | `--log-level` | `DEBUG` / `INFO` / `WARNING`. Bump to `DEBUG` for a single prod run during incident response. | `INFO` | `INFO` |
| `quarantine_fail_ratio` | `--quarantine-fail-ratio` | Hard-fail `extract_source2` if more than this fraction of rows are quarantined by DQX. Defaults to disabled so demo seed data still ingests. | `1.0` | `0.1` |
| `seed_date` | `--seed-date` | ISO-8601 date (e.g. `2024-03-15`) for the `seed_sources` task. Empty string (default) resolves to today's date at runtime. Override per-run to backfill a specific day. | `""` → today | `""` → today |
Read by `scripts/sdk_generate_template_job.py` when generating `resources/jobs.yml`; never read on Databricks serverless compute.
| Variable | Purpose | Default |
|---|---|---|
| `TEMPLATE_ALERT_EMAILS` | Comma-separated recipients for prod `JobEmailNotifications` (`on_failure` + `on_duration_warning`). Wired from CI secret of the same name. | `data-platform-oncall@example.com` |
| `TEMPLATE_SP_APP_ID` | Override the service principal `application_id` looked up by display name. Used by CI to avoid the SCIM lookup. | resolved from `SP_DISPLAY_NAME` |
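A small sketch of how a generator script might read these two variables follows. The variable names and the default address come from the table above; the helper names and the `lookup_by_display_name` callable, which stands in for the SCIM lookup, are hypothetical.

```python
import os
from collections.abc import Callable, Mapping

DEFAULT_ALERT_EMAILS = "data-platform-oncall@example.com"


def alert_recipients(env: Mapping[str, str] = os.environ) -> list[str]:
    """Split TEMPLATE_ALERT_EMAILS on commas, falling back to the default."""
    raw = env.get("TEMPLATE_ALERT_EMAILS") or DEFAULT_ALERT_EMAILS
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


def service_principal_app_id(
    env: Mapping[str, str],
    lookup_by_display_name: Callable[[], str],
) -> str:
    """Prefer the TEMPLATE_SP_APP_ID override; otherwise call the lookup."""
    override = env.get("TEMPLATE_SP_APP_ID", "").strip()
    # The lookup is only invoked when no override is set, which is how CI
    # could skip the SCIM call entirely.
    return override or lookup_by_display_name()
```

Passing the environment in as a mapping keeps both helpers easy to test without touching `os.environ`.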
- `databricks.yml` sets `mode: production` on the prod target: DABs enforces that the deployer identity equals the run-as identity (the SP). `make deploy env=prod` from a developer's local machine will fail by design; only CI can push prod.
- `run_as` and `permissions` on every staging/prod job are pinned to the service principal's `application_id` (not `${workspace.current_user.userName}`), wired by `scripts/sdk_generate_template_job.py`.
- `health_check` task runs first in prod and fails fast on a broken wheel, missing grant, or unreachable SQL warehouse, before any medallion table is touched.
- Wheel version pinning: `_project_version()` reads `pyproject.toml` to produce the exact wheel filename in the bundle's `JobEnvironment.dependencies`, so a forgotten rebuild can't silently deploy an old wheel.
- Per-environment retries: 0 in dev (fast feedback), 2 in staging/prod (transient failure resilience). Retries on staging/prod back off `MIN_RETRY_INTERVAL_MS` (60s) before re-attempting, giving transient lock/metastore blips time to clear.
- Per-task timeouts: each task has its own `timeout_seconds` (300s for health-check, 900s for extracts, 1800s for transforms) so a single hung task can't consume the whole job budget.
- Schema-drift guard: all writes use `overwriteSchema=false` so an upstream change in column type or order fails the task loudly instead of silently propagating bad data.
- Queued runs, not skipped: prod has `max_concurrent_runs=1` paired with `queue.enabled=true`: if a run is still in flight when the next 5 a.m. tick arrives, the new run queues rather than getting silently dropped.
- Health-rule-backed duration alert: the `on_duration_warning_threshold_exceeded` email is wired to a `JobsHealthRule` on `RUN_DURATION_SECONDS > 1800` (30 min). Without that rule, the email would be wired to an event that can never fire.
- Cancelled/skipped runs don't page: `notification_settings.no_alert_for_canceled_runs` and `no_alert_for_skipped_runs` are both `true`, so manual cancellations or upstream-condition skips don't generate failure alerts.





