andre-salvati/databricks-template

databricks-template: production-ready ETL + agentic development for Databricks

A PySpark project template for Databricks built for both human and AI-assisted (agentic) development. It combines medallion architecture, Python packaging, unit + integration tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys with a curated CLAUDE.md, so AI assistants understand the project's conventions from the first session.


🚀 Overview

This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It is built for agentic development and brings software engineering best practices (modular architecture, automated unit and integration testing, CI/CD, structured logging, service-principal-based production guardrails) into the world of data engineering. By combining a clean project structure with robust development and deployment jobs, it helps teams move faster with confidence.

You're encouraged to adapt the structure and tooling to suit your project's specific needs and environment.

Let's connect on LinkedIn.

🧪 Technologies

  • Databricks Free Edition (Serverless)
  • Databricks Runtime 18.0 LTS
  • Databricks Unity Catalog
  • Databricks Declarative Automation Bundles (formerly Asset Bundles)
  • Databricks CLI
  • Databricks Python SDK
  • Databricks DQX
  • Databricks AI Dev Kit
  • Claude Code
  • PySpark 4.1
  • Python 3.12+
  • GitHub Actions
  • Pytest

📦 Features

This project template demonstrates how to:

  • use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a CLAUDE.md that documents the project's conventions: CLI surface, catalog/schema model, runtime parameters, production guardrails, and a constraints section recording the gotchas we've hit. Design decisions made in collaboration with the agent are encoded as rules in CLAUDE.md so they survive across sessions.
  • structure PySpark code inside classes/packages and utilize a Python wheel package, instead of notebooks.
  • package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use a CI/CD pipeline with GitHub Actions. Generate job definitions to run with environment-specific conditions using Databricks SDK.
  • isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
  • run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
  • run integration tests by setting the input data and validating the output data.
  • utilize job tags to track issues, costs, and ownership.
  • utilize the coverage package to generate test coverage reports.
  • utilize uv as a project/package manager.
  • use the medallion architecture pattern.
  • lint and format code with ruff and pre-commit.
  • use a Makefile to automate repetitive tasks.
  • utilize the argparse package to build a flexible command-line interface to start the jobs.
  • utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
  • utilize service principals to run production code.
  • utilize the Databricks SDK for Python to manage workspaces and accounts and analyze costs. Refer to the scripts folder for examples.
  • utilize Databricks Unity Catalog to manage permissions and get data lineage.
  • utilize Databricks Lakeflow Jobs to execute a DAG. Yes, you don't need Airflow to manage your DAGs here!!!
  • utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.

🧠 Resources

Agentic development:

Debates on the use of notebooks vs. Python packaging:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other resources:

📁 Folder Structure

databricks-template/
│
├── .github/                       # CI/CD automation
│   └── workflows/
│       └── onpush.yml             # GitHub Actions pipeline
│
├── src/                           # Main source code
│   └── template/                  # Python package
│       ├── main.py                # Entry point with CLI (argparse)
│       ├── config.py              # Configuration management
│       ├── baseTask.py            # Base class for all tasks
│       ├── commonSchemas.py       # Shared PySpark schemas
│       ├── job1/                  # Job-specific tasks
│       │   ├── extract_source1.py
│       │   ├── extract_source2.py        # DQX validation + quarantine
│       │   ├── generate_orders.py
│       │   ├── generate_orders_agg.py
│       │   ├── health_check.py           # Prod smoke task (runs first)
│       │   ├── integration_setup.py
│       │   └── integration_validate.py
│       └── job2/                  # Additional job tasks
│
├── tests/                         # Unit tests
│   ├── job1/
│   │   └── unit_test.py           # Pytest unit tests
│   └── job2/
│
├── resources/                     # Databricks workflow templates
│   └── jobs.yml                   # Generated job definition (auto-created)
│
├── scripts/                              # Helper scripts
│   ├── sdk_generate_template_job.py      # Job definition generator (Databricks SDK)
│   ├── sdk_init_workspace.py             # Workspace initialization (SP, catalogs, schemas, grants)
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   └── sdk_workspace_and_account.py      # Workspace and account management
│
├── docs/                          # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                          # Build artifacts (Python wheel)
├── coverage_reports/              # Test coverage reports
│
├── databricks.yml                 # Declarative Automation Bundle config
├── pyproject.toml                 # Python project configuration (uv)
├── Makefile                       # Build automation
├── .pre-commit-config.yaml        # Pre-commit hooks (ruff)
└── README.md                      # This file
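
The layout above suggests a small dispatch pattern: main.py parses the command line and hands off to a task class derived from the base class in baseTask.py. The following is a minimal sketch of that pattern, not the template's actual code. `BaseTask`, `TASKS`, `run_task`, the method names, and the one-catalog-per-environment table naming are all hypothetical, and Spark is left out so the example runs anywhere.

```python
from abc import ABC, abstractmethod


class BaseTask(ABC):
    """Hypothetical base class: holds shared config and defines the run() contract."""

    def __init__(self, env: str, log_level: str = "INFO") -> None:
        self.env = env
        self.log_level = log_level

    def table(self, schema: str, name: str) -> str:
        # Assumption: one catalog per environment, e.g. dev.bronze.source1.
        return f"{self.env}.{schema}.{name}"

    @abstractmethod
    def run(self) -> str:
        """Execute the task and return a short status message."""


class ExtractSource1(BaseTask):
    def run(self) -> str:
        # A real task would read the source and write with Spark here.
        return f"wrote {self.table('bronze', 'source1')}"


class GenerateOrders(BaseTask):
    def run(self) -> str:
        return f"wrote {self.table('silver', 'orders')}"


# Registry mapping the task name given on the command line to its class.
TASKS: dict[str, type[BaseTask]] = {
    "extract_source1": ExtractSource1,
    "generate_orders": GenerateOrders,
}


def run_task(name: str, env: str) -> str:
    """Look up a task by name, instantiate it for the environment, and run it."""
    try:
        task_cls = TASKS[name]
    except KeyError:
        raise ValueError(f"unknown task: {name!r}") from None
    return task_cls(env=env).run()
```

Keeping the registry in one place means a new task only needs a class and one dictionary entry, and each class can be unit-tested without going through the CLI.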

CI/CD pipeline



Databricks Jobs



Task Output



Data Lineage (Unity Catalog)



Quarantine table (generated by Databricks DQX)



Instructions

  1. (Optional) Install Databricks AI Dev Kit and Claude Code.

  2. Create a Databricks Free Edition workspace.

  3. Install and configure the Databricks CLI on your local machine. Check the current version in databricks.yml. Follow the instructions here.

  4. Set up the Python environment and run unit tests on your local machine.

     make sync && make test
    
  5. Initialize the workspace. First create an external location in Databricks and update the storage-root parameter in the Makefile (for more details, see Overview of external locations). Then run the command below, which creates the catalogs, schemas, service principal, and the required grants:

     make init
    
  6. Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

     [dev]
     host          = https://xxxx.cloud.databricks.com/
     token         = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                     
     [staging]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
     [prod]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
  7. Deploy and execute on the dev workspace.

     make deploy env=dev
    
  8. Configure CI/CD automation: add the service principal ID and secret as GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).

  9. (Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
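
Before deploying, it can help to confirm that each profile from step 6 carries the credentials it needs. The sketch below uses only the standard library. The function name `check_profiles` and the rule that dev uses a token while staging and prod use client_id/client_secret simply mirror the sample configuration above; they are assumptions, not something the template enforces.

```python
import configparser

# Assumed credential keys per profile, mirroring the sample .databrickscfg above.
REQUIRED_KEYS = {
    "dev": {"host", "token"},
    "staging": {"host", "client_id", "client_secret"},
    "prod": {"host", "client_id", "client_secret"},
}


def check_profiles(cfg_text: str) -> dict[str, set[str]]:
    """Return the missing keys per profile; an empty dict means nothing is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    missing: dict[str, set[str]] = {}
    for profile, required in REQUIRED_KEYS.items():
        # A profile that is absent altogether is reported as missing every key.
        present = set(parser[profile].keys()) if parser.has_section(profile) else set()
        gap = required - present
        if gap:
            missing[profile] = gap
    return missing
```

To check a real file, pass its contents, for example `check_profiles(Path.home().joinpath(".databrickscfg").read_text())` after importing `Path` from `pathlib`. The function only inspects key names, so secret values are never printed.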

Job-level parameters (runtime, overridable per-run)

These are defined as JobParameterDefinition in scripts/sdk_generate_template_job.py and threaded into every task as CLI args via {{job.parameters.*}}. Operators can override them for a single run using the Databricks Jobs UI "Run with different parameters" dialog; no code change or redeployment is needed.

| Parameter | CLI arg | Purpose | Default (dev/staging) | Default (prod) |
| --- | --- | --- | --- | --- |
| log_level | --log-level | DEBUG / INFO / WARNING. Bump to DEBUG for a single prod run during incident response. | INFO | INFO |
| quarantine_fail_ratio | --quarantine-fail-ratio | Hard-fail extract_source2 if more than this fraction of rows are quarantined by DQX. Defaults to disabled so demo seed data still ingests. | 1.0 | 0.1 |
| seed_date | --seed-date | ISO-8601 date (e.g. 2024-03-15) for the seed_sources task. Empty string (default) resolves to today's date at runtime. Override per-run to backfill a specific day. | "" → today | "" → today |

Deploy-time environment variables (CI/build machine only)

Read by scripts/sdk_generate_template_job.py when generating resources/jobs.yml; never read on Databricks serverless compute.

| Variable | Purpose | Default |
| --- | --- | --- |
| TEMPLATE_ALERT_EMAILS | Comma-separated recipients for prod JobEmailNotifications (on_failure + on_duration_warning). Wired from CI secret of the same name. | data-platform-oncall@example.com |
| TEMPLATE_SP_APP_ID | Override the service principal application_id looked up by display name. Used by CI to avoid the SCIM lookup. | resolved from SP_DISPLAY_NAME |

Production guardrails

  • databricks.yml sets mode: production on the prod target, so DABs enforces that the deployer identity equals the run-as identity (the SP). make deploy env=prod from a developer's local machine will fail by design; only CI can push prod.
  • run_as and permissions on every staging/prod job are pinned to the service principal's application_id (not ${workspace.current_user.userName}), wired by scripts/sdk_generate_template_job.py.
  • health_check task runs first in prod and fails fast on a broken wheel, missing grant, or unreachable SQL warehouse, before any medallion table is touched.
  • Wheel version pinning: _project_version() reads pyproject.toml to produce the exact wheel filename in the bundle's JobEnvironment.dependencies, so a forgotten rebuild can't silently deploy an old wheel.
  • Per-environment retries: 0 in dev (fast feedback), 2 in staging/prod (transient failure resilience). Retries on staging/prod back off MIN_RETRY_INTERVAL_MS (60s) before re-attempting, giving transient lock/metastore blips time to clear.
  • Per-task timeouts: each task has its own timeout_seconds (300s for health-check, 900s for extracts, 1800s for transforms) so a single hung task can't consume the whole job budget.
  • Schema-drift guard: all writes use overwriteSchema=false so an upstream change in column type or order fails the task loudly instead of silently propagating bad data.
  • Queued runs, not skipped: prod has max_concurrent_runs=1 paired with queue.enabled=true, so if a run is still in flight when the next 5 a.m. tick arrives, the new run queues rather than getting silently dropped.
  • Health-rule-backed duration alert: the on_duration_warning_threshold_exceeded email is wired to a JobsHealthRule on RUN_DURATION_SECONDS > 1800 (30 min). Without that rule, the email would be wired to an event that can never fire.
  • Cancelled/skipped runs don't page: notification_settings.no_alert_for_canceled_runs and no_alert_for_skipped_runs are both true, so manual cancellations or upstream-condition skips don't generate failure alerts.
