antonbarr-data/dqt

質 dqt

A Data Questioning Tool that tells you the what and surfaces the why.

Python ≥3.12 · License: MIT · PyPI · Release notes

64 detectors across 5 families (drift, outlier, time series, distribution, rule) · best F1 0.933 (holt_winters / wasserstein_1) · full results

質 (shitsu) - quality, substance, the inner nature of a thing. The kanji points to what something truly is, not how it appears. dqt is meant to work the same way: concerned with the truth of the data, not its surface. The mark is also a quiet acknowledgment of a tradition I have learned much from - one in which quality is one of its most distinguishing characteristics, and craft and precision are understood to be the same thing. - Anton Barr


Unifies your scattered data into one source of truth. Upgrades your existing models, dashboards, and queries into a causal semantic layer you didn't have to write. Picks up on trends and surfaces business insights, all wrapped in a quality harness that puts guardrails on the AI so the reports it generates stay on-spec.


The problem it solves

Without dqt: orders.amount null_fraction >= 0.05 -- threshold exceeded. Now what? Go dig through git log, dbt docs, warehouse history.

With dqt:

orders.amount null_fraction = 12.4% (baseline 0.3%)
Lineage: stg_payments -> orders -> revenue
Schema break in stg_payments 6h ago.
Causal candidate: stg_payments -> orders.amount (E-value 3.2, pending human review)
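
The lineage line in the output above can be read as a path through a dependency graph. As a rough illustration only, here is a minimal sketch of how a downstream blast radius might be computed with a breadth-first walk. The `LINEAGE` dictionary and the `blast_radius` function are hypothetical names for this example and are not dqt's API; the graph contains only the three nodes from the example output.

```python
from collections import deque

# Hypothetical adjacency map mirroring the example output: upstream -> downstream.
LINEAGE = {
    "stg_payments": ["orders"],
    "orders": ["revenue"],
    "revenue": [],
}


def blast_radius(graph, start):
    """Return every node reachable downstream of `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


affected = blast_radius(LINEAGE, "stg_payments")
print(affected)
```

Starting from `stg_payments`, the walk visits `orders` and then `revenue`, which matches the lineage shown in the example. A real column-level graph would have far more fan-out, but the traversal idea is the same.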

Four layers

  • Statistical detectors - MAD, double-MAD, isolation forest, KS, STL residuals, adjusted boxplot fences. Plus completeness, validity, freshness, schema-change, and SQL-assertion checks. Every detector returns (verdict, score, plain_english).
  • Column-level lineage - walks your dbt manifest and warehouse DDL with sqlglot. From any incident, automatic blast radius across downstream tables and metrics.
  • LLM Wiki + Semantic layer - dump tickets, SQL, and BI reports into raw/. Point Claude Code at the vault. It synthesises dataset descriptions, metric definitions, and causal edges into wiki/ from the artifacts your team already has. Based on Karpathy's LLM Wiki pattern.
  • Causal discovery - Granger causality, PCMCI+, Transfer Entropy across your metric time series. Edges are proposed, human-reviewed, then enter the production DAG annotated with lag, confidence, and E-values.
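
To make the `(verdict, score, plain_english)` return shape from the first bullet concrete, here is a minimal sketch of a MAD-based outlier-fraction detector. It is an assumption-laden illustration, not dqt's implementation: the function name, the `k` and `warn` parameters, the verdict strings, and the use of the modified z-score (with the conventional 0.6745 scaling) are all choices made for this example.

```python
from statistics import median


def mad_outlier_fraction(values, k=3.5, warn=0.01):
    """Hypothetical detector: fraction of values whose modified z-score exceeds k.

    Returns a (verdict, score, plain_english) tuple.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; count anything else as an outlier.
        outliers = sum(1 for v in values if v != med)
    else:
        # 0.6745 rescales MAD so the score is comparable to a standard z-score.
        outliers = sum(1 for v in values if abs(0.6745 * (v - med) / mad) > k)
    score = outliers / len(values)
    verdict = "pass" if score <= warn else "warn"
    relation = "within" if verdict == "pass" else "above"
    plain_english = (
        f"{score:.2%} of values are outliers -- "
        f"{relation} the {warn * 100:g}% warn threshold"
    )
    return verdict, score, plain_english


sample = list(range(1, 100)) + [10_000]  # 99 ordinary values and one extreme value
verdict, score, message = mad_outlier_fraction(sample)
print(verdict, score, message)
```

Returning a plain-English sentence alongside the numeric score means the same result can feed both an alerting rule and a human-readable report.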

Quick start

pip install dqtlib

from dqt import Check, Runner, MemoryStore

check = Check(
    schema_name="public",
    table_name="orders",
    column_name="amount",
    detector_slug="mad_outlier_fraction",
)

# `adapter` is a warehouse adapter instance created beforehand (see the Adapters doc).
result = Runner(MemoryStore()).run(check, adapter)
print(result.plain_english)
# "0.82% of values are outliers -- within the 1% warn threshold"

# Or from YAML
dqt run checks.yaml

# Exit codes: 0 = all pass, 2 = one or more failed
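
The exit codes noted above make it straightforward to gate a pipeline step on the result. The following is a minimal sketch of wrapping the CLI call from Python, for example inside a single orchestrator task. The helper names `describe_exit` and `run_checks` are hypothetical; only the `dqt run checks.yaml` command and the meanings of codes 0 and 2 come from this README, and any other code is treated here as unexpected.

```python
import subprocess

# Meanings taken from the note above; other codes are not documented here.
EXIT_MEANINGS = {
    0: "all checks passed",
    2: "one or more checks failed",
}


def describe_exit(code):
    """Translate a dqt exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_checks(path="checks.yaml"):
    """Run `dqt run <path>` and return (exit_code, description)."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    completed = subprocess.run(["dqt", "run", path], check=False)
    return completed.returncode, describe_exit(completed.returncode)
```

A caller could raise on code 2 to fail the task, and treat anything outside 0 and 2 as an infrastructure problem rather than a data-quality failure.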

Installation

pip install dqtlib                # core library + CLI
pip install "dqtlib[wiki]"        # + LLM Wiki synthesis (Anthropic Claude)
pip install "dqtlib[dashboard]"   # + local browser dashboard
pip install "dqtlib[reports]"     # + HTML profiling reports
pip install "dqtlib[causal]"      # + PCMCI+ causal discovery
pip install "dqtlib[all]"         # everything

Requires Python >= 3.12.


Warehouse support

Built for ClickHouse and BigQuery first. Snowflake and Databricks SQL are WIP.

| Engine | Status |
| --- | --- |
| ClickHouse | Supported |
| BigQuery | Supported |
| PostgreSQL | Supported |
| DuckDB / CSV / Parquet | Supported |
| Snowflake | WIP |
| Databricks SQL | WIP |

All adapters are cost-guarded (dryRun/EXPLAIN before any query) and read-only.
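
As an illustration of what "cost-guarded and read-only" can mean in practice, here is a minimal sketch of a guard that refuses non-read statements and checks an estimated cost before executing. Everything in it is an assumption for the example: the function names, the allow-list of leading keywords, and the `estimate_bytes` and `execute` callables (standing in for a dry run or EXPLAIN call and the real query call). It is not dqt's adapter protocol.

```python
import re

# Hypothetical allow-list of leading keywords treated as read-only.
READ_ONLY_KEYWORDS = ("select", "explain")


def is_read_only(sql):
    """Naive check on the first keyword; a real guard would parse the statement."""
    match = re.match(r"\s*(\w+)", sql)
    return bool(match) and match.group(1).lower() in READ_ONLY_KEYWORDS


def guarded_query(sql, estimate_bytes, execute, max_bytes):
    """Run `sql` only if it is read-only and its estimated cost is within budget.

    `estimate_bytes` and `execute` are caller-supplied callables (hypothetical
    stand-ins for a dry-run/EXPLAIN step and the actual query execution).
    """
    if not is_read_only(sql):
        raise PermissionError("only read-only statements are allowed")
    estimated = estimate_bytes(sql)
    if estimated > max_bytes:
        raise RuntimeError(
            f"estimated scan of {estimated} bytes exceeds budget of {max_bytes}"
        )
    return execute(sql)
```

Running the estimate before the query means an expensive scan is rejected without ever being billed, which is the point of checking cost up front.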


Integrations

  • dbt - reads manifest.json and semantic_models.yml directly
  • Airflow / Dagster / Prefect - runs as one Python task
  • OpenLineage - ingests events from any non-dbt pipeline
  • Claude Code - Context7 plugin for live dqt docs, Superpowers for agentic check-suite builds

Screenshots

Overview - fleet KPIs, dataset health table with sparklines, live activity feed

[Screenshot: Overview dashboard]

Incident detail - statistical evidence, distribution overlay, causal trace, AI explanation

[Screenshot: Incident detail]


Documentation

| Doc | Description |
| --- | --- |
| Getting started | First check in 5 min, drift detection, CLI, dashboard, quick-reference slug table |
| Detectors reference | All detectors with parameters and examples |
| YAML check format | Complete YAML config reference |
| CLI reference | All CLI commands including dqt wiki, dqt report |
| Python API | Check model, CheckScope, Runner, MemoryStore |
| LLM Wiki | Semantic layer synthesis from raw docs |
| Adapters | Warehouse adapter protocol |
| Local dashboard | Browser UI for check results |
| Benchmarks | F1, recall, precision across 30 trials |
| Architecture | System design, module boundaries, project layout |
| Comparison | dqt vs GE, Soda, Elementary, Dataplex |
| Release notes | Per-version changelog |

About

Anton Barr is an engineer and data geek with 25+ years building data systems. A student of 質 (shitsu): quality, substance, the inner nature of a thing. dqt is a personal project built by a practitioner who believes craft and precision are the same thing - and got tired of tools that answer what but never why.


License

MIT - see LICENSE.