Syda - AI-Powered Synthetic Data Generation

Python 3.8+ · License: MIT

Generate high-quality synthetic data with AI while preserving referential integrity

SYDA uses AI and large language models to generate realistic synthetic test data, including structured, unstructured, PDF, and HTML data. It preserves referential integrity, supports privacy compliance, and accelerates development workflows. Teams in highly regulated industries such as healthcare and banking, and in non-regulated settings such as software testing, research, and analytics, can use it to simulate diverse data scenarios without exposing sensitive information.

Documentation

For detailed documentation, examples, and API reference, visit: https://python.syda.ai/

Quick Start

pip install syda

Create a .env file:

# .env
ANTHROPIC_API_KEY=your_anthropic_api_key_here
# OR
OPENAI_API_KEY=your_openai_api_key_here
# OR
GEMINI_API_KEY=your_gemini_api_key_here
# OR
GROK_API_KEY=your_grok_api_key_here

"""
Syda 30-Second Quick Start Example
Demonstrates AI-powered synthetic data generation with perfect referential integrity
"""

from syda import SyntheticDataGenerator, ModelConfig
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

print("🚀 Starting Syda Quick Start...")

# Configure AI model
generator = SyntheticDataGenerator(
    model_config=ModelConfig(
        provider="anthropic", 
        model_name="claude-haiku-4-5-20251001"
    )
)

# Define schemas with rich descriptions for better AI understanding
schemas = {
    # Categories schema with table and column descriptions
    'categories': {
        '__table_description__': 'Product categories for organizing items in the e-commerce catalog',
        'id': {
            'type': 'number', 
            'description': 'Unique identifier for the category', 
            'primary_key': True
        },
        'name': {
            'type': 'text', 
            'description': 'Category name (Electronics, Home Decor, Sports, etc.)'
        },
        'description': {
            'type': 'text', 
            'description': 'Detailed description of what products belong in this category'
        }
    },

    # Products schema with table and column descriptions and foreign keys
    'products': {
        '__table_description__': 'Individual products available for purchase with pricing and category assignment',
        '__foreign_keys__': {
            'category_id': ['categories', 'id']  # products.category_id references categories.id
        },
        'id': {
            'type': 'number', 
            'description': 'Unique product identifier', 
            'primary_key': True
        },
        'name': {
            'type': 'text', 
            'description': 'Product name and title'
        },
        'category_id': {
            'type': 'foreign_key', 
            'description': 'Reference to the category this product belongs to'
        },
        'price': {
            'type': 'number', 
            'description': 'Product price in USD'
        }
    }
}

# Generate data with perfect referential integrity
print("📊 Generating categories and products...")
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={"categories": 5, "products": 20},
    output_dir="data"
)

print("✅ Generated realistic data with perfect foreign key relationships!")
print("📂 Check the 'data' folder for categories.csv and products.csv")
# Check data/ folder for categories.csv and products.csv
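
To confirm the foreign-key claim on your own output, a small check like the one below can help. It is a minimal sketch using only the standard library: `find_orphans` is a hypothetical helper (not part of Syda), and the inline CSV text stands in for `data/categories.csv` and `data/products.csv` with made-up rows, including one deliberately broken reference.

```python
import csv
import io

def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]

# Inline stand-ins for the generated CSV files (made-up rows for illustration).
categories_csv = "id,name\n1,Electronics\n2,Sports\n"
products_csv = "id,name,category_id\n10,Headphones,1\n11,Yoga Mat,2\n12,Mystery Item,9\n"

categories = list(csv.DictReader(io.StringIO(categories_csv)))
products = list(csv.DictReader(io.StringIO(products_csv)))

orphans = find_orphans(products, "category_id", categories, "id")
print([row["name"] for row in orphans])  # ['Mystery Item']
```

On real output, replace the `io.StringIO(...)` objects with opened files; an empty list means every product points at an existing category.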

Why Developers Love Syda

| Feature | Benefit | Example |
|---|---|---|
| Multi-AI Provider | No vendor lock-in | Claude, GPT, Gemini, Grok, Ollama, and any OpenAI-compatible API |
| Zero Orphaned Records | Perfect referential integrity | product.category_id → category.id |
| SQLAlchemy Native | Use existing models directly | Customer, Contact classes → CSV data |
| Multiple Schema Formats | Flexible input options | SQLAlchemy, YAML, JSON, Dict |
| Document Generation | AI-powered PDFs linked to data | Product catalogs, receipts, contracts |
| Custom Generators | Complex business logic | Tax calculations, pricing rules, arrays |
| Large Dataset Support | Thousands to millions of rows | Code-gen mode: 10,000 rows with ~3 LLM calls |
| Privacy-First | Protect real user data | GDPR/CCPA compliant testing |
| Database Integration | Any SQLAlchemy-compatible database | DatabaseSchemaLoader("postgresql://...") → generate → write back |
| CLI | No Python required | syda generate --schema patients.yaml --rows 1000 --large-dataset |
| Cost Tracking | Know what you're spending | Per-table & per-column cost breakdown in every run report |
| Developer Experience | Just works | Type hints, great docs, HTML run reports |

Retail Example

1. Define your schemas

category_schema.yml:

__table_name__: Category
__description__: Retail product categories

id:
  type: integer
  description: Unique category ID
  constraints:
    primary_key: true
    not_null: true
    min: 1
    max: 1000

name:
  type: string
  description: Category name
  constraints:
    not_null: true
    length: 50
    unique: true

parent_id:
  type: integer
  description: Parent category ID for hierarchical categories; set to 0 for a top-level category
  constraints:
    min: 0
    max: 1000

description:
  type: text
  description: Detailed category description
  constraints:
    length: 500

active:
  type: boolean
  description: Whether the category is active
  constraints:
    not_null: true

product_schema.yml:

__table_name__: Product
__description__: Retail products
__foreign_keys__:
  category_id: [Category, id]

id:
  type: integer
  description: Unique product ID
  constraints:
    primary_key: true
    not_null: true
    min: 1
    max: 10000

name:
  type: string
  description: Product name
  constraints:
    not_null: true
    length: 100
    unique: true

category_id:
  type: integer
  description: Category ID for the product
  constraints:
    not_null: true
    min: 1
    max: 1000

sku:
  type: string
  description: Stock Keeping Unit - unique product code
  constraints:
    not_null: true
    pattern: '^P[A-Z]{2}-\d{5}$'
    length: 10
    unique: true

price:
  type: float
  description: Product price in USD
  constraints:
    not_null: true
    min: 0.99
    max: 9999.99
    decimals: 2

stock_quantity:
  type: integer
  description: Current stock level
  constraints:
    not_null: true
    min: 0
    max: 10000

is_featured:
  type: boolean
  description: Whether the product is featured
  constraints:
    not_null: true
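
The constraints in product_schema.yml can also be checked after generation. The sketch below is a hypothetical validator, not Syda's own logic: it assumes `length` means a maximum length and `decimals` a maximum number of decimal places, and it covers only four of the columns.

```python
import re

SKU_PATTERN = re.compile(r"^P[A-Z]{2}-\d{5}$")  # pattern copied from product_schema.yml

def check_product(row):
    """Return the names of columns in one product row (a dict) that break a constraint."""
    problems = []
    # Assumption: `length: 10` is an upper bound on the SKU length.
    if not SKU_PATTERN.match(row["sku"]) or len(row["sku"]) > 10:
        problems.append("sku")
    # Assumption: `decimals: 2` means at most two decimal places.
    if not 0.99 <= row["price"] <= 9999.99 or round(row["price"], 2) != row["price"]:
        problems.append("price")
    if not 0 <= row["stock_quantity"] <= 10000:
        problems.append("stock_quantity")
    if not 1 <= row["category_id"] <= 1000:
        problems.append("category_id")
    return problems

good = {"sku": "PSM-12345", "price": 999.99, "stock_quantity": 50, "category_id": 2}
bad = {"sku": "SM-12345", "price": 0.5, "stock_quantity": 50, "category_id": 2}

print(check_product(good))  # []
print(check_product(bad))   # ['sku', 'price']
```

Note that the SKU pattern produces 9-character codes (for example PSM-12345), which fits within the declared length of 10.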

2. Generate structured data

from syda import SyntheticDataGenerator, ModelConfig
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Configure your AI model  
config = ModelConfig(
    provider="anthropic",
    model_name="claude-haiku-4-5-20251001"
)

# Create generator
generator = SyntheticDataGenerator(model_config=config)

# Define your schemas (structured data only)
schemas = {
    "categories": "category_schema.yml",
    "products": "product_schema.yml"
}

# Generate synthetic data with relationships intact
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={"categories": 5, "products": 20},
    output_dir="output",
    prompts = {
        "Category": "Generate retail product categories with hierarchical structure.",
        "Product": "Generate retail products with names, SKUs, prices, and descriptions. Ensure a good variety of prices and categories."
    }
)

# Perfect referential integrity guaranteed! 🎯
print("✅ Generated realistic data with perfect foreign key relationships!")

Output:

output/
├── categories.csv    # 5 product categories with hierarchical structure
└── products.csv      # 20 products, all with valid category_id references

3. Want to generate documents too? Add document templates!

To generate AI-powered documents along with your structured data, simply add the product catalog schema and update your code:

product_catalog_schema.yml (Document Template):

__template__: true
__description__: Product catalog page template
__name__: ProductCatalog
__depends_on__: [Product, Category]
__foreign_keys__:
  product_name: [Product, name]
  category_name: [Category, name]
  product_price: [Product, price]
  product_sku: [Product, sku]
__template_source__: templates/product_catalog.html
__input_file_type__: html
__output_file_type__: pdf

# Product information (linked to Product table)
product_name:
  type: string
  length: 100
  description: Name of the featured product

category_name:
  type: string
  length: 50
  description: Category this product belongs to

product_sku:
  type: string
  length: 10
  description: Product SKU code

product_price:
  type: float
  decimals: 2
  description: Product price in USD

# Marketing content (AI-generated)
product_description:
  type: text
  length: 500
  description: Detailed marketing description of the product

key_features:
  type: text
  length: 300
  description: Bullet points of key product features

marketing_tagline:
  type: string
  length: 100
  description: Catchy marketing tagline for the product

availability_status:
  type: string
  enum: ["In Stock", "Limited Stock", "Out of Stock", "Pre-Order"]
  description: Current availability status

Create the Jinja HTML template (templates/product_catalog.html):

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>{{ product_name }} - Product Catalog</title>
    <style>
        body {
            font-family: 'Arial', sans-serif;
            max-width: 800px;
            margin: 0 auto;
            padding: 40px;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: #333;
        }
        .catalog-page {
            background: white;
            padding: 40px;
            border-radius: 15px;
            box-shadow: 0 10px 30px rgba(0,0,0,0.2);
        }
        .product-header {
            text-align: center;
            margin-bottom: 30px;
            border-bottom: 3px solid #667eea;
            padding-bottom: 20px;
        }
        .product-name {
            font-size: 36px;
            font-weight: bold;
            color: #2c3e50;
            margin-bottom: 10px;
        }
        .category-sku {
            font-size: 16px;
            color: #7f8c8d;
            margin-bottom: 15px;
        }
        .price {
            font-size: 32px;
            color: #e74c3c;
            font-weight: bold;
        }
        .tagline {
            font-style: italic;
            font-size: 18px;
            color: #34495e;
            text-align: center;
            margin: 20px 0;
            padding: 15px;
            background: #ecf0f1;
            border-radius: 8px;
        }
        .description {
            font-size: 16px;
            line-height: 1.6;
            margin: 25px 0;
            text-align: justify;
        }
        .features {
            background: #f8f9fa;
            padding: 20px;
            border-radius: 8px;
            margin: 25px 0;
        }
        .features h3 {
            color: #2c3e50;
            margin-top: 0;
        }
        .availability {
            text-align: center;
            font-size: 18px;
            font-weight: bold;
            padding: 15px;
            border-radius: 8px;
            margin-top: 30px;
        }
        .in-stock { background: #d4edda; color: #155724; }
        .limited-stock { background: #fff3cd; color: #856404; }
        .out-of-stock { background: #f8d7da; color: #721c24; }
        .pre-order { background: #d1ecf1; color: #0c5460; }
    </style>
</head>
<body>
    <div class="catalog-page">
        <div class="product-header">
            <div class="product-name">{{ product_name }}</div>
            <div class="category-sku">{{ category_name }} Category | SKU: {{ product_sku }}</div>
            <div class="price">${{ "%.2f"|format(product_price) }}</div>
        </div>
        
        <div class="tagline">"{{ marketing_tagline }}"</div>
        
        <div class="description">
            {{ product_description }}
        </div>
        
        <div class="features">
            <h3>KEY FEATURES:</h3>
            {{ key_features }}
        </div>
        
        <div class="availability {{ availability_status.lower().replace(' ', '-') }}">
            Availability: {{ availability_status }}
        </div>
    </div>
</body>
</html>
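
The template turns each availability status into a CSS class with `availability_status.lower().replace(' ', '-')`. As a quick sanity check, the sketch below mirrors that expression in plain Python and confirms that every enum value from product_catalog_schema.yml maps to a class defined in the stylesheet. The helper name `status_class` is made up for this illustration.

```python
# Enum values from product_catalog_schema.yml and class names from the template's <style>.
STATUSES = ["In Stock", "Limited Stock", "Out of Stock", "Pre-Order"]
CSS_CLASSES = {"in-stock", "limited-stock", "out-of-stock", "pre-order"}

def status_class(status):
    """Mirror the template expression availability_status.lower().replace(' ', '-')."""
    return status.lower().replace(" ", "-")

for status in STATUSES:
    css = status_class(status)
    print(f"{status!r} -> {css!r} (styled: {css in CSS_CLASSES})")
```

If a new enum value is added to the schema, it needs a matching class in the template, otherwise the availability box renders without a background color.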

Updated Python code (with document generation):

# Same setup as before...
from syda import SyntheticDataGenerator, ModelConfig
from dotenv import load_dotenv

load_dotenv()
config = ModelConfig(provider="anthropic", model_name="claude-haiku-4-5-20251001")
generator = SyntheticDataGenerator(model_config=config)

# Define your schemas (structured data)
schemas = {
    "categories": "category_schema.yml",
    "products": "product_schema.yml",
    # 🆕 Add document templates
    "product_catalogs": "product_catalog_schema.yml"
}


# Generate both structured data AND documents
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={
        "categories": 5,
        "products": 20,
        "product_catalogs": 10  # 🆕 Add this line
    },
    output_dir="output",
    prompts = {
        "Category": "Generate retail product categories with hierarchical structure.",
        "Product": "Generate retail products with names, SKUs, prices, and descriptions. Ensure a good variety of prices and categories.",
        "ProductCatalog": "Generate compelling product catalog pages with marketing descriptions, key features, and sales copy."  # 🆕 Add this line
    }
)

print("✅ Generated structured data + AI-powered product catalogs!")

Enhanced Output:

output/
├── categories.csv           # 5 product categories with hierarchical structure
├── products.csv             # 20 products, all with valid category_id references  
└── product_catalogs/        # AI-generated marketing documents
    ├── catalog_1.pdf           # Product names match products.csv
    ├── catalog_2.pdf           # Prices match products.csv
    ├── catalog_3.pdf           # Perfect data consistency!
    ├── ...
    └── catalog_10.pdf

See It In Action

Realistic Retail Data + AI-Generated Product Catalogs

Categories Table:

id,name,parent_id,description,active
1,Electronics,0,Electronic devices and accessories,true
2,Smartphones,1,Mobile phones and accessories,true
3,Laptops,1,Portable computers and accessories,true
4,Clothing,0,Apparel and fashion items,true
5,Men's Clothing,4,Men's apparel and accessories,true

Products Table (with matching category_id):

id,name,category_id,sku,price,stock_quantity,is_featured
1,iPhone 15 Pro,2,PSM-12345,999.99,50,true
2,MacBook Air M3,3,PLA-67890,1299.99,25,true
3,Samsung Galaxy S24,2,PSA-11111,899.99,75,false
4,Dell XPS 13,3,PDE-22222,1099.99,30,false
5,Men's Cotton T-Shirt,5,PMC-33333,24.99,200,false
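
Because categories are hierarchical (parent_id 0 marks a top-level category), each product resolves to a full category path. The sketch below walks those links using the sample rows above, trimmed to the columns it needs; `category_path` is a hypothetical helper, and it assumes the hierarchy contains no cycles.

```python
import csv
import io

CATEGORIES_CSV = """id,name,parent_id
1,Electronics,0
2,Smartphones,1
3,Laptops,1
4,Clothing,0
5,Men's Clothing,4
"""
PRODUCTS_CSV = """id,name,category_id
1,iPhone 15 Pro,2
2,MacBook Air M3,3
5,Men's Cotton T-Shirt,5
"""

categories = {row["id"]: row for row in csv.DictReader(io.StringIO(CATEGORIES_CSV))}
products = list(csv.DictReader(io.StringIO(PRODUCTS_CSV)))

def category_path(category_id):
    """Follow parent_id links up to a root (parent_id 0) and join the names."""
    names = []
    while category_id != "0":
        row = categories[category_id]  # a KeyError here means a dangling reference
        names.append(row["name"])
        category_id = row["parent_id"]
    return " > ".join(reversed(names))

for product in products:
    print(f'{product["name"]}: {category_path(product["category_id"])}')
# iPhone 15 Pro: Electronics > Smartphones
# MacBook Air M3: Electronics > Laptops
# Men's Cotton T-Shirt: Clothing > Men's Clothing
```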

Generated Product Catalog PDF Content:

IPHONE 15 PRO
Smartphones Category | SKU: PSM-12345

$999.99

Revolutionary Performance, Unmatched Design

Experience the future of mobile technology with the iPhone 15 Pro. 
Featuring the powerful A17 Pro chip, this device delivers unprecedented 
performance for both work and play. The titanium design combines 
durability with elegance, while the advanced camera system captures 
professional-quality photos and videos.

KEY FEATURES:
• A17 Pro chip with 6-core GPU
• Pro camera system with 3x optical zoom  
• Titanium design with Action Button
• USB-C connectivity
• All-day battery life

"Innovation that fits in your pocket"

Availability: In Stock

🎯 Perfect Integration: The PDF catalog contains actual product names, SKUs, and prices from the CSV data, plus AI-generated marketing content - zero inconsistencies!

4. Need custom business logic? Add custom generators!

For advanced scenarios requiring custom calculations or complex business rules, you can add custom generator functions:

# Define custom generator functions
def calculate_tax(row, parent_dfs=None, **kwargs):
    """Calculate tax amount based on subtotal and tax rate"""
    subtotal = row.get('subtotal', 0)
    tax_rate = row.get('tax_rate', 8.5)  # Default 8.5%
    return round(subtotal * (tax_rate / 100), 2)

def calculate_total(row, parent_dfs=None, **kwargs):
    """Calculate final total: subtotal + tax - discount"""
    subtotal = row.get('subtotal', 0)
    tax_amount = row.get('tax_amount', 0)
    discount = row.get('discount_amount', 0)
    return round(subtotal + tax_amount - discount, 2)

def generate_receipt_items(row, parent_dfs=None, **kwargs):
    """Generate receipt items based on actual transactions"""
    items = []
    
    if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:
        products_df = parent_dfs['Product']
        transactions_df = parent_dfs['Transaction']
        
        # Get customer's transactions
        customer_id = row.get('customer_id')
        customer_transactions = transactions_df[
            transactions_df['customer_id'] == customer_id
        ]
        
        # Build receipt items from actual transaction data
        for _, tx in customer_transactions.iterrows():
            matches = products_df[products_df['id'] == tx['product_id']]
            if matches.empty:
                continue  # skip transactions that reference an unknown product
            product = matches.iloc[0]
            
            items.append({
                "product_name": product['name'],
                "sku": product['sku'],
                "quantity": int(tx['quantity']),
                "unit_price": float(product['price']),
                "item_total": round(tx['quantity'] * product['price'], 2)
            })
    
    return items

# Add custom generators to your generation
custom_generators = {
    "ProductCatalog": {
        "tax_amount": calculate_tax,
        "total": calculate_total,
        "items": generate_receipt_items
    }
}

# Generate with custom business logic
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={"categories": 5, "products": 20, "product_catalogs": 10},
    output_dir="output",
    custom_generators=custom_generators,  # 🆕 Add this line
    prompts={
        "Category": "Generate retail product categories with hierarchical structure.",
        "Product": "Generate retail products with names, SKUs, prices, and descriptions.",
        "ProductCatalog": "Generate compelling product catalog pages with marketing copy."
    }
)

print("✅ Generated data with custom business logic!")

🎯 Custom generators let you:

  • Calculate fields based on other data (taxes, totals, discounts)
  • Access related data from other tables via parent_dfs
  • Implement complex business rules (pricing logic, inventory rules)
  • Generate structured data (arrays, nested objects, JSON)

Works with Your Existing SQLAlchemy Models

Already using SQLAlchemy? Syda works directly with your existing models - no schema conversion needed!

from sqlalchemy import Column, Integer, String, Float, ForeignKey, Boolean
from sqlalchemy.orm import declarative_base, relationship  # declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from syda import SyntheticDataGenerator, ModelConfig
from dotenv import load_dotenv

load_dotenv()

Base = declarative_base()

# Your existing SQLAlchemy models
class Customer(Base):
    __tablename__ = 'customers'
    
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, comment='Customer organization name')
    industry = Column(String(50), comment='Industry sector')
    annual_revenue = Column(Float, comment='Annual revenue in USD')
    status = Column(String(20), comment='Active, Inactive, or Prospect')
    
    # Relationships work perfectly
    contacts = relationship("Contact", back_populates="customer")

class Contact(Base):
    __tablename__ = 'contacts'
    
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    first_name = Column(String(50), nullable=False)
    last_name = Column(String(50), nullable=False)
    email = Column(String(100), nullable=False, unique=True)
    position = Column(String(100), comment='Job title')
    is_primary = Column(Boolean, comment='Primary contact for customer')
    
    customer = relationship("Customer", back_populates="contacts")

# Generate data directly from your models
config = ModelConfig(provider="anthropic", model_name="claude-haiku-4-5-20251001")
generator = SyntheticDataGenerator(model_config=config)

results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Contact],
    sample_sizes={"Customer": 10, "Contact": 25},
    output_dir="crm_data"
)

print("✅ Generated CRM data with perfect foreign key relationships!")

Output:

crm_data/
├── customers.csv     # 10 companies with realistic industry data
└── contacts.csv      # 25 contacts, all with valid customer_id references

🎯 Zero Configuration: Your SQLAlchemy comments become AI generation hints, ForeignKey relationships are automatically maintained, and nullable=False constraints are respected!

Generate Data From an Existing Database

Already have a database? DatabaseSchemaLoader connects to it, infers all table schemas (columns, types, primary keys, foreign keys), generates synthetic data, and writes it back — no manual schema definition needed.

Supports any SQLAlchemy-compatible database — SQLite, PostgreSQL, MySQL, MariaDB, MS SQL Server, Oracle, and more. Pass any valid SQLAlchemy connection string and it works.

pip install syda sqlalchemy
# PostgreSQL: pip install psycopg2-binary
# MySQL:      pip install pymysql

Option A — in-memory (no intermediate files)

from syda import SyntheticDataGenerator, DatabaseSchemaLoader, ModelConfig
from dotenv import load_dotenv

load_dotenv()

loader    = DatabaseSchemaLoader("sqlite:///mydb.db")
schemas   = loader.load_schemas()          # infer schemas as dicts

generator = SyntheticDataGenerator(model_config=ModelConfig(
    provider="anthropic", model_name="claude-haiku-4-5-20251001"
))

results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={"patient": 10, "claim": 20},
    output_dir="output"
)

loader.write_to_database(results)          # write generated rows back
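
For intuition about what inferring a schema involves, the sketch below reads column types, primary keys, and foreign keys from an in-memory SQLite database using only the standard library. It is not how DatabaseSchemaLoader is implemented; the table definitions and the `describe_table` helper are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE claim (
        id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(id),
        amount REAL
    );
""")

def describe_table(conn, table):
    """Collect column types, primary keys, and foreign keys for one table."""
    columns = {}
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    for _cid, name, col_type, notnull, _default, pk in conn.execute(f"PRAGMA table_info({table})"):
        columns[name] = {"type": col_type, "primary_key": bool(pk), "not_null": bool(notnull)}
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    foreign_keys = {
        row[3]: [row[2], row[4]]  # child column -> [parent table, parent column]
        for row in conn.execute(f"PRAGMA foreign_key_list({table})")
    }
    return {"columns": columns, "foreign_keys": foreign_keys}

print(describe_table(conn, "claim")["foreign_keys"])  # {'patient_id': ['patient', 'id']}
```

The foreign-key mapping has the same shape as the `__foreign_keys__` entries used in the dict schemas earlier in this README.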

Option B — file-based (inspect or version-control schemas first)

loader = DatabaseSchemaLoader("postgresql+psycopg2://user:pass@localhost/mydb")

# Save one YAML file per table — edit them before generating if needed
schema_files = loader.save_schemas("schemas/", format="yaml")

# Reuses the `generator` configured in Option A
results = generator.generate_for_schemas(schemas=schema_files, output_dir="output")

loader.write_to_database(results)

Output:

output/
├── patient.csv      # generated rows
├── provider.csv
└── claim.csv        # all foreign keys reference valid parent rows

schemas/             # (Option B only) editable YAML schema files
├── patient.yaml
├── provider.yaml
└── claim.yaml

🎯 FK-safe writes: write_to_database() inserts rows in topological order (parents before children) so referential integrity is preserved in the target database.
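
Topological ordering simply means every parent table is written before any table that references it. The sketch below shows the idea with `graphlib.TopologicalSorter` from the standard library (Python 3.9 or newer). The dependency map is a hypothetical example based on the tables above, and this is an illustration of the technique rather than Syda's internal code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each table maps to the set of parent tables it references (hypothetical FK graph).
depends_on = {
    "patient": set(),
    "provider": set(),
    "claim": {"patient", "provider"},
}

insert_order = list(TopologicalSorter(depends_on).static_order())
print(insert_order)  # parents come first, e.g. ['patient', 'provider', 'claim']
```

A circular reference between tables would raise `graphlib.CycleError`, which is a useful early signal that the schema needs a nullable foreign key or a two-pass insert.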

Use the CLI — No Python Required

Generate synthetic data directly from the terminal without writing a single line of Python.

pip install syda
export ANTHROPIC_API_KEY=your_key   # or OPENAI_API_KEY / GEMINI_API_KEY

Validate schemas

syda validate --schema schemas/
#   [OK] patient
#   [OK] provider
#   [OK] appointment

Generate from a schema file

# Single table → CSV
syda generate --schema patients.yaml --rows 50 --output patients.csv

# Single table → JSON
syda generate --schema patients.yaml --rows 50 --output patients.json

# Multi-table directory → FK-safe CSV output
syda generate --schema schemas/ --rows 100 --output-dir ./data

# Large dataset — chunked direct mode (3 LLM calls of 50 rows each)
syda generate --schema schemas/product.yml --rows 150 --batch-size 50 --output-dir ./data

# Large dataset — code-gen mode (auto-triggered above 500 rows)
syda generate --schema schemas/ --rows 2000 --output-dir ./data

# Force code-gen for any row count
syda generate --schema schemas/ --rows 50 --large-dataset --output-dir ./data

Database workflows

# Infer schemas from a live database
syda db infer --db-url sqlite:///mydb.db --output-dir schemas/

# Generate data from a database schema
syda db generate --db-url sqlite:///mydb.db --rows 50 --output-dir ./data

# Generate and write directly back into the database
syda db generate --db-url postgresql://user:pass@localhost/mydb \
  --rows 100 --write-back --if-exists replace

CI / pipeline usage

- name: Validate schemas
  run: syda validate --schema schemas/

- name: Generate test fixtures
  run: syda generate --schema schemas/ --rows 20 --output-dir tests/fixtures
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

📖 Full CLI reference: python.syda.ai/deep_dive/cli

Generate Large Datasets

Syda has two modes that scale to thousands or millions of rows without blowing your token budget.

Direct mode — chunked LLM calls

Split generation into batch_size-row chunks with automatic retry on transient errors:

generator = SyntheticDataGenerator(
    model_config=ModelConfig(
        provider="anthropic",
        model_name="claude-haiku-4-5-20251001",
        generation_mode="auto",  # direct ≤500 rows, codegen >500
        batch_size=50,           # rows per LLM call
        max_retries=3,
    )
)
results = generator.generate_for_schemas(schemas=schemas, sample_sizes={"products": 300}, output_dir="output")
# [syda] Generating chunk 1/6 (rows 1–50 of 300)...
# [syda] Generating chunk 2/6 (rows 51–100 of 300)...
# ...
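
The chunk boundaries in that log follow directly from the row count and batch_size. A minimal sketch of the arithmetic, with `chunk_ranges` as a hypothetical helper rather than a Syda function:

```python
def chunk_ranges(total_rows, batch_size):
    """Yield 1-based inclusive (start, end) row ranges, one per LLM call."""
    for start in range(1, total_rows + 1, batch_size):
        yield start, min(start + batch_size - 1, total_rows)

chunks = list(chunk_ranges(300, 50))
for i, (start, end) in enumerate(chunks, 1):
    print(f"chunk {i}/{len(chunks)}: rows {start}-{end} of 300")
```

With 300 rows and a batch size of 50 this gives six chunks; when the row count is not a multiple of the batch size, the last chunk is simply shorter.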

Code-gen mode — LLM writes Python, runs locally

Above 500 rows (selected automatically) or when generation_mode="codegen" is set, the LLM makes one analysis call that writes Python generator functions for simple columns (IDs, dates, enums, emails). Only semantic columns (descriptions, narratives) still call the LLM at runtime, whatever the row count.

# 10,000 rows, ~3 LLM calls total (1 analysis + 2 semantic columns)
generator = SyntheticDataGenerator(
    model_config=ModelConfig(provider="grok", model_name="grok-4.3", max_tokens=16384)
)
generator.generate_for_schemas(schemas=schemas, sample_sizes={"orders": 10_000}, output_dir="output")

Generated Python functions are cached under output_dir/.syda_cache/ — re-runs are instant on cache hits.
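
To make the idea concrete, here is a hypothetical example of the kind of local generator function code-gen mode could produce for simple columns. The column names, value lists, and date range are invented for illustration; the functions Syda actually writes depend on your schema.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)  # seeded so repeated runs give the same rows

def gen_order(order_id):
    """Build one order row from cheap local rules; no LLM call is involved."""
    return {
        "id": order_id,
        "order_date": (date(2024, 1, 1) + timedelta(days=rng.randrange(365))).isoformat(),
        "status": rng.choice(["pending", "shipped", "delivered"]),
        "email": f"customer{order_id}@example.com",
    }

orders = [gen_order(i) for i in range(1, 10_001)]
print(len(orders), orders[0]["id"], orders[-1]["id"])  # 10000 1 10000
```

Generating rows this way costs no tokens, which is why the number of LLM calls stays small as the row count grows.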

force_llm column flag

Need specific columns to always use LLM generation in code-gen mode? Mark them force_llm: true in the schema:

tagline:
  type: text
  description: Short marketing tagline (one punchy sentence)
  force_llm: true   # always LLM-generated, never replaced by a Python function

Cost tracking

Every run produces a cost breakdown accessible via generator.last_report and saved as an HTML report in output_dir/:

generator.last_report.print_summary()
# Table         Rows  Mode     Calls   In tok  Out tok  Cost
# products       200  direct       4    2,168   15,291  $0.24
# orders       5,000  codegen    100   28,300   25,201  $0.46
# order_items 10,000  codegen      0        0        0  $0.00
# TOTAL       15,200             104   30,468   40,492  $0.70
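
The TOTAL line and a per-1,000-row cost can be reproduced from the table rows with a few lines of arithmetic. The tuples below are copied by hand from the summary above; this sketch does not use the `last_report` object, whose attributes are not shown here.

```python
# (table, rows, calls, input tokens, output tokens, cost in USD)
report = [
    ("products", 200, 4, 2_168, 15_291, 0.24),
    ("orders", 5_000, 100, 28_300, 25_201, 0.46),
    ("order_items", 10_000, 0, 0, 0, 0.00),
]

# Sum every numeric column to rebuild the TOTAL line.
rows, calls, in_tok, out_tok, cost = (sum(col) for col in list(zip(*report))[1:])
print(rows, calls, in_tok, out_tok, f"${cost:.2f}")  # 15200 104 30468 40492 $0.70

# Cost per 1,000 generated rows makes direct and code-gen modes comparable.
for name, n_rows, _calls, _in, _out, usd in report:
    print(f"{name}: ${usd / n_rows * 1000:.3f} per 1,000 rows")
```

In this example the direct-mode table costs about $1.20 per 1,000 rows, while the code-gen tables cost $0.092 or less.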

CLI

# Chunked direct mode
syda generate --schema schemas/product.yml --rows 300 --batch-size 50 --output-dir ./data

# Auto code-gen (>500 rows)
syda generate --schema schemas/product.yml --rows 1000 --output-dir ./data

# Force code-gen
syda generate --schema schemas/ --rows 5000 --large-dataset --output-dir ./data

📖 Full guide: python.syda.ai/deep_dive/large_dataset


Use Any OpenAI-Compatible Model

Run Syda against any OpenAI-compatible API — local models via Ollama, Groq, Together AI, Fireworks, DeepSeek, Mistral, and more — using the openai_compatible provider:

# Install and start Ollama
brew install ollama && brew services start ollama
ollama pull llama3

from syda import SyntheticDataGenerator, ModelConfig

generator = SyntheticDataGenerator(
    model_config=ModelConfig(
        provider="openai_compatible",
        model_name="llama3",           # any model your server supports
        temperature=0.7,
        max_tokens=2048,
        extra_kwargs={
            "base_url": "http://localhost:11434/v1",  # Ollama
            "api_key": "ollama",                       # any string for Ollama
            # "response_mode": "tools",  # for models with native tool-call support
            # "response_mode": "json",   # for models returning clean JSON
        }
    )
)

| response_mode | When to use |
|---|---|
| "markdown" | Default: model wraps JSON in ```json ``` fences |
| "tools" | Model supports tool calls natively |
| "json" | Model returns clean JSON without fences |

Works with: Ollama · Groq · Together AI · Fireworks · DeepSeek · Mistral · LM Studio · vLLM · Perplexity — any server that speaks the OpenAI API.
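
The difference between the "markdown" and "json" response modes comes down to whether the model's reply needs unwrapping before it is parsed. The sketch below shows one way to handle both cases with the standard library. `parse_model_json` is a hypothetical helper for illustration and is not Syda's parser.

```python
import json
import re

TICKS = "`" * 3  # a Markdown code fence marker
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)

def parse_model_json(text):
    """Parse JSON from a reply, unwrapping a fenced block when one is present."""
    match = FENCE.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)

fenced = f'Here you go:\n{TICKS}json\n[{{"id": 1, "name": "Electronics"}}]\n{TICKS}'
clean = '[{"id": 1, "name": "Electronics"}]'

print(parse_model_json(fenced) == parse_model_json(clean))  # True
```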

Contributing

We would love your contributions! Syda is an open-source project that thrives on community involvement.

Ways to Contribute

  • Report bugs - Help us identify and fix issues
  • Suggest features - Share your ideas for new capabilities
  • Improve docs - Help make our documentation even better
  • Submit code - Fix bugs, add features, optimize performance
  • Add examples - Show how Syda works in your domain
  • ⭐ Star the repo - Help others discover Syda

How to Get Started

  1. Check our Contributing Guide for detailed instructions
  2. Browse open issues to find something to work on
  3. Join discussions in our GitHub Issues and Discussions
  4. Fork the repo and submit your first pull request!

Good First Issues

Looking for ways to contribute? Check out issues labeled:

  • good first issue - Perfect for newcomers
  • help wanted - We'd especially appreciate help here
  • documentation - Help improve our docs
  • examples - Add new use cases and examples

Every contribution matters - from fixing typos to adding major features! 🙏

⭐ Star this repo if Syda helps your workflow • 📖 Read the docs for detailed guides • 🐛 Report issues to help us improve

Citation

If you use SYDA in your research, publications, or products, please cite it as follows:

APA:

Lingamgunta, R. K. K. (2025). Syda - AI-Powered Synthetic Data Generation (v0.2.0). Zenodo. https://doi.org/10.5281/zenodo.17345575

IEEE:

[1] R. K. K. Lingamgunta, “Syda - AI-Powered Synthetic Data Generation,” Zenodo, 2025. doi: 10.5281/zenodo.17345575.

BibTeX:

@software{Lingamgunta_Syda_-_AI-Powered_2025,
author = {Lingamgunta, Rama Krishna Kumar},
license = {MIT},
title = {{Syda - AI-Powered Synthetic Data Generation}},
url = {https://github.com/syda-ai/syda},
version = {0.2.0},
year = {2025}
}
