ryankwagner/DataHub

DataHub

DataHub is a modular metadata and DDL generation library that standardizes table definitions across data platforms. Hudi is the first platform targeted.

⚠️ NOTICE: This repository is currently CLOSED SOURCE and under active development. It is not ready for external use or distribution.

Feature Status

Module          Status            Description
core            🟢 Implemented    Foundational interfaces (Table, Schema, Field) and reusable metadata definitions.
hudi            🟡 In Progress    Hudi-specific implementations, properties management, and DDL generation logic.
api             🔴 Planned        REST API definitions for metadata management.
schema          🔴 Planned        Schema registry integrations and converters.
orchestration   🔴 Planned        Airflow/Dagster integration patterns.
observability   🔴 Planned        Data quality and lineage tracking.
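
To make the relationship between the core concepts (Table, Schema, Field) and DDL generation concrete, here is a minimal sketch in Java. It is not the library's actual API: the record shapes, the `DdlSketch` class, and the `toCreateTable` method are all hypothetical, and the output is a generic CREATE TABLE statement rather than real Hudi DDL, which would also carry table properties.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DdlSketch {
    // Hypothetical stand-ins for the core module's Field, Schema, and Table concepts.
    public record Field(String name, String type, boolean nullable) {}

    public record Schema(List<Field> fields) {}

    public record Table(String database, String name, Schema schema) {}

    // Renders a generic CREATE TABLE statement from a table definition.
    // Assumption: one column per line, NOT NULL appended for non-nullable fields.
    public static String toCreateTable(Table table) {
        String columns = table.schema().fields().stream()
            .map(f -> "  " + f.name() + " " + f.type() + (f.nullable() ? "" : " NOT NULL"))
            .collect(Collectors.joining(",\n"));
        return "CREATE TABLE " + table.database() + "." + table.name()
            + " (\n" + columns + "\n)";
    }

    public static void main(String[] args) {
        // Example table definition; the names and types are illustrative only.
        Table orders = new Table("sales", "orders", new Schema(List.of(
            new Field("order_id", "BIGINT", false),
            new Field("customer", "VARCHAR", true))));
        System.out.println(toCreateTable(orders));
    }
}
```

Keeping the table definition as plain immutable data and the rendering as a separate function is one way a platform-specific module could reuse the same definitions while emitting its own DDL dialect.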

Prerequisites

  • JDK 17+
  • Gradle (wrapper provided)
  • Docker & Docker Compose (for local environment)

Local Development

Building the Project

To build the project and run tests:

./gradlew build

Local Trino Environment

This project includes a Docker Compose setup to run a local data platform consisting of:

  • Trino: Distributed SQL query engine.
  • MinIO: S3-compatible object storage.
  • Hive Metastore: Metadata service for Trino/Hudi.
  • Postgres: Backend for Hive Metastore.

To start the environment:

docker-compose up -d

Code Style

This project enforces code style using Checkstyle. Violations will cause the build to fail. To run checks explicitly:

./gradlew check

About

Centralized metadata framework for defining, deploying, and observing data assets across the entire data lifecycle.
