Skip to content

DocumentDrivenDX/tablespec

Repository files navigation

tablespec

Python library for working with table schemas in Universal Metadata Format (UMF). Provides type-safe models, validation, profiling integration, and schema generation tools.

Features

  • Type-Safe UMF Models: Pydantic-based models with runtime validation
  • Schema Generation: Generate SQL DDL, PySpark schemas, and JSON schemas from UMF
  • Great Expectations Integration: Baseline expectation generation and constraint extraction
  • Profiling Mappers: Convert Spark DataFrame profiles and Deequ profiles to UMF
  • Validation: Table validation against UMF specifications with Great Expectations
  • Type Mappings: Convert between UMF, PySpark, JSON, and Great Expectations types
  • LLM Prompt Generation: Generate structured prompts for documentation, validation rules, relationships, and survivorship logic
  • CLI: Typer-based command-line interface with Rich output for schema management and conversion
  • Excel Conversion: Bidirectional Excel export/import for domain expert collaboration
  • Split-Format UMF: Git-friendly directory-based storage with automatic format detection
  • Sample Data Generation: Healthcare-specific, constraint-aware sample data from UMF specs
  • Domain Type Inference: Automatic detection of domain types (SSN, NPI, phone, state codes, etc.)
  • Change Management: UMF diffing, atomic change application, and git-based changelogs

Demo

asciicast

The demo walks through loading a UMF schema, generating SQL/PySpark/JSON schemas, type mappings, domain type inference, Great Expectations baseline generation, LLM prompt generation, UMF diffing, and PySpark validation with sample data generation.

Watch with narration (MP4)

Play the asciinema recording in your terminal:

# Install asciinema, then play the recording
asciinema play examples/tablespec-demo.cast

# Or run the demo live (requires tablespec[spark])
uv run python examples/demo.py

Installation

Using uv (recommended)

# Add to your uv project
uv add tablespec --index-url https://easel.github.io/tablespec/simple/

# With Spark support (for profiling and validation)
uv add tablespec[spark] --index-url https://easel.github.io/tablespec/simple/

Using pip

# Install from GitHub Pages index
pip install tablespec --index-url https://easel.github.io/tablespec/simple/

# With Spark support
pip install tablespec[spark] --index-url https://easel.github.io/tablespec/simple/

Note: This package is distributed via GitHub Pages. The --index-url flag is required.

Optional extras:

  • tablespec[spark] - Adds PySpark support for SparkToUmfMapper, TableValidator, SampleDataGenerator (with Spark FK seeding), BaselineService, and table merge. Install this extra only if you need Spark-dependent features.

Quick Start

Loading and Saving UMF Files

from tablespec import load_umf_from_yaml, save_umf_to_yaml, UMF

# Load UMF from YAML (see examples/ for sample files)
umf = load_umf_from_yaml("examples/schema.yaml")

# Access metadata
print(f"Table: {umf.table_name}")
print(f"Columns: {len(umf.columns)}")

# Modify and save
umf.description = "Updated description"
save_umf_to_yaml(umf, "updated_schema.yaml")

Creating UMF Programmatically

from tablespec import UMF, UMFColumn, Nullable

umf = UMF(
    version="1.0",
    table_name="Medical_Claims",
    description="Healthcare claims data",
    columns=[
        UMFColumn(
            name="claim_id",
            data_type="VARCHAR",
            length=50,
            description="Unique claim identifier",
            nullable=Nullable(MD=False, MP=False, ME=False)
        ),
        UMFColumn(
            name="claim_amount",
            data_type="DECIMAL",
            precision=10,
            scale=2,
            description="Claim amount in USD",
            nullable=Nullable(MD=True, MP=True, ME=True)
        )
    ]
)

save_umf_to_yaml(umf, "medical_claims.yaml")

Generating Schemas

from tablespec import generate_sql_ddl, generate_pyspark_schema, generate_json_schema

# Load UMF
umf = load_umf_from_yaml("examples/schema.yaml")
umf_dict = umf.model_dump()

# Generate SQL DDL
ddl = generate_sql_ddl(umf_dict)
print(ddl)

# Generate PySpark schema code
spark_schema = generate_pyspark_schema(umf_dict)
print(spark_schema)

# Generate JSON Schema
json_schema = generate_json_schema(umf_dict)
print(json_schema)

Documentation

Full documentation is available at easel.github.io/tablespec:

  • User Guide -- UMF format, schema generation, GX integration, CLI, Excel, and more
  • API Reference -- Complete API documentation
  • Development -- Setup, testing, and project structure

License

Apache License 2.0 - see LICENSE file for details.

About

Tools for working with tables specified by Universal Metadata Format (UMF)

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages