An autonomous data engineering agent that turns messy supplier CSVs into clean warehouse tables — with a human in the loop for anything risky.
Conduit watches incoming data files, compares them to the target schema, and uses an LLM to generate a safe transformation script. The human engineer just reviews the diff and clicks approve. The agent handles the schema drift, the PII masking, the rename, the null fill, the missing column, the type cast — all of it.
Built around a gatekeeper-classifier pattern: the LLM never gets the final say. Every proposal is checked against deterministic rules before execution.
clean_orders.csv → AUTO_LINK (no drift, auto-approved)
drifted_orders.csv → SCHEMA_EVOLUTION (rename + null + extra col, needs review)
conflicted_orders.csv → CONFLICT (type mismatch, blocked)
- Quick start
- Project structure
- The agent pipeline
- A realistic user journey
- Frontend pages
- Backend endpoints
- Database schema
- Configuration
- AI tools & safety
- Demo data
- Local testing
- Docker + Docker Compose
- Node.js 20+
- A Groq API key — grab one at console.groq.com
git clone https://github.com/adityaatre26/Conduit.git
cd Conduit
cp backend/.env.example backend/.env 2>/dev/null || true # if you have one
Create /home/<you>/dev/Conduit/.env (the root one, which Docker reads):
GROQ_API_KEY=gsk_your_key_here
MOCK_AI=False # set to True to skip LLM calls during development
POSTGRES_PASSWORD=password
Important: Never commit `.env` files. The repo's `.gitignore` blocks them. If you accidentally push one, GitHub's push protection will block it and you'll need to scrub history.
cd /path/to/Conduit
docker compose up -d
This brings up three containers: conduit-source-db-1, conduit-warehouse-db-1, conduit-api-1. The warehouse DB is auto-seeded from db/seed_warehouse.sql. The API is on http://localhost:8000.
Verify:
curl http://localhost:8000/api/sources
# → [{"id":1,"name":"Production_Warehouse_PG","unit_type":"POSTGRES","status":"CONNECTED"}]
The skill registry, knowledge graph, and lineage tables are populated by db/seed_extensions.sql. Run it once after the warehouse is up:
docker exec -i conduit-warehouse-db-1 psql -U user -d warehousedb < db/seed_extensions.sql
cd frontend
npm install
npm run dev
The UI is on http://localhost:3000. The Next.js dev server proxies /api/* to :8000 via next.config.js, so no CORS or env config is needed.
Open http://localhost:3000, click Ingest in the sidebar, drop db/demo_csvs/drifted_orders.csv on the upload zone, and click Generate proposal. Watch the 6-step state machine animate, then open the detail view to see the drift, generated code, and context bundle.
Conduit/
├── backend/ # FastAPI backend (Python 3.11)
│ ├── app/
│ │ ├── main.py # App factory, CORS, router mounting
│ │ ├── database.py # Async SQLAlchemy engine
│ │ ├── models.py # Core ORM models (proposals, ledger, quarantine, …)
│ │ ├── extension_models.py # Phase 2 models (skills, graph, lineage)
│ │ ├── schemas.py # Pydantic request/response models
│ │ ├── extension_schemas.py
│ │ ├── routers/ # 8 routers, 25+ endpoints
│ │ │ ├── ingest.py
│ │ │ ├── proposals.py
│ │ │ ├── audit.py
│ │ │ ├── quarantine.py
│ │ │ ├── sources.py
│ │ │ ├── skills.py
│ │ │ ├── graph.py
│ │ │ └── lineage.py
│ │ ├── services/ # Business logic
│ │ │ ├── ai_service.py # Groq LLM call
│ │ │ ├── mcp_service.py # Target schema fetch
│ │ │ ├── gateway_service.py # AUTO_LINK / SCHEMA_EVOLUTION / CONFLICT
│ │ │ ├── validation_service.py # Magic bytes + AST safety guard
│ │ │ ├── execution_service.py # Run the generated code, write to warehouse
│ │ │ ├── context_retrieval_service.py # Build the Phase 1 context bundle
│ │ │ ├── skill_registry_service.py
│ │ │ ├── graph_service.py
│ │ │ └── lineage_service.py
│ │ └── core/config.py # Pydantic settings (env-driven)
│ ├── Dockerfile
│ └── requirements.txt
│
├── frontend/ # Next.js 14 frontend (TypeScript, App Router)
│ ├── src/
│ │ ├── app/ # 12 pages, all client-rendered
│ │ │ ├── page.tsx # / — Overview
│ │ │ ├── ingest/page.tsx # /ingest — Upload + state machine
│ │ │ ├── proposals/ # /proposals + /proposals/[id]
│ │ │ ├── audit/ # /audit + /audit/[id]
│ │ │ ├── quarantine/page.tsx
│ │ │ ├── skills/ # /skills + /skills/[id]
│ │ │ ├── graph/page.tsx # /graph — BFS tools
│ │ │ ├── lineage/page.tsx # /lineage
│ │ │ └── sources/page.tsx
│ │ ├── components/ # Reusable: PageHeader, CodeBlock, badges, …
│ │ └── lib/ # api.ts (REST client), types.ts, format.ts
│ ├── next.config.js # /api/* → :8000 proxy
│ └── tailwind.config.js # Inter font, neutral palette, design tokens
│
├── db/
│ ├── seed_warehouse.sql # Auto-loaded on first container start
│ ├── seed_extensions.sql # One-time manual seed (skills, graph, lineage)
│ └── demo_csvs/ # Test files
│
├── docker-compose.yml
├── capabilities.md # Authoritative backend capability spec
├── frontend_requirements.md # Authoritative frontend spec
├── PROJECT_STATUS.txt # Build checklist
└── README.md
Every POST /api/ingest runs through 10 steps inside the backend:
flowchart TD
A[User File Upload] --> B[FastAPI Router]
B --> C{Magic Bytes Check}
C -->|Invalid| D[Reject Upload]
C -->|Valid| E[Parse CSV with Pandas]
E --> F[Schema Introspection<br/>conduit.tables_metadata + attributes_metadata]
F --> Z[Zero-overlap fast path]
Z -->|0 cols in common| I2[Return CONFLICT immediately]
Z -->|Some overlap| G[Build Context Bundle<br/>BFS knowledge graph + skill search]
G --> H[AI Proposal Generation<br/>Groq Llama 3.3 70B]
H -->|Success| I[AST Safety Guard<br/>Reject dangerous imports/builtins]
H -->|Timeout/Error| J[Resilience Fallback<br/>Use most recent cached proposal]
I -->|Invalid AST| K[Retry once with error message]
K -->|Still invalid| D
I -->|Valid| L[Gateway Classification<br/>AUTO_LINK / SCHEMA_EVOLUTION / CONFLICT]
J --> L
L --> M{Human Approval}
M -->|Approved| N[Execute in restricted namespace]
M -->|Rejected| O[Mark REJECTED, store reason]
N --> P[Write valid rows → public.orders_clean]
N --> Q[Quarantine bad rows → conduit.quarantine_records]
N --> R[Create pipeline_skills_ledger row]
N --> S[Record conduit_lineage.lineage_events row]
Two design choices worth highlighting:
- The LLM is called once, in step H. Every other endpoint just reads.
- The AI's gateway recommendation is sanity-checked by `gateway_service`. Critical issues and low confidence get force-upgraded to `CONFLICT` regardless of what the LLM said.
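The override described above can be sketched as a small deterministic function. This is a minimal illustration, not the project's `gateway_service`: the `Proposal` shape, the `MISSING_REQUIRED_COLUMN` label, and the `0.7` confidence cutoff are all assumptions, since the source names only `TYPE_MISMATCH`, missing required columns, and "low confidence" without a threshold.

```python
from dataclasses import dataclass, field

# Assumed labels and cutoff; the real values are not stated in this README.
CRITICAL_ISSUES = {"TYPE_MISMATCH", "MISSING_REQUIRED_COLUMN"}
MIN_CONFIDENCE = 0.7
KNOWN_STATUSES = {"AUTO_LINK", "SCHEMA_EVOLUTION", "CONFLICT"}


@dataclass
class Proposal:
    """Hypothetical view of what the LLM returned for one upload."""
    gateway_recommendation: str
    confidence: float
    drift_issues: list[str] = field(default_factory=list)


def classify(proposal: Proposal) -> str:
    """Return the final gateway status; deterministic rules outrank the LLM."""
    if CRITICAL_ISSUES.intersection(proposal.drift_issues):
        return "CONFLICT"
    if proposal.confidence < MIN_CONFIDENCE:
        return "CONFLICT"
    # Assumption: an unrecognized recommendation is treated as a conflict too.
    if proposal.gateway_recommendation not in KNOWN_STATUSES:
        return "CONFLICT"
    return proposal.gateway_recommendation
```

The point of the pattern is that the function never upgrades trust: it can only keep the LLM's answer or replace it with `CONFLICT`.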
Persona: Maya, data engineer. A new supplier CSV arrives in her inbox.
Maya opens http://localhost:3000. The Overview page shows her:
- Auto-resolved rate (what % of proposals auto-link without her touching them)
- 3 horizontal bars for gateway classification breakdown
- Recent proposals and executions
- Source health, skill count, graph node count
She clicks Ingest, drags drifted_orders.csv onto the upload zone. The state machine animates through 6 steps while the backend works:
- Validating file magic bytes
- Introspecting target schema
- Building context bundle
- Generating transformation script
- Validating AST
- Classifying gateway policy
The run takes about 5s with MOCK_AI=True and 5–15s with real Groq. The animation is purely visual: there is no stream from the backend, so the client time-steps it and fast-forwards when the API call returns.
A "Proposal review" card appears with 4 stat tiles + the drift table. For drifted_orders.csv she sees:
- Gateway: yellow Schema Evolution
- Confidence: 85%
- Drift items: 3 (RENAME `order_amount` → `amount_usd`, NULL_VIOLATION in `order_status`, EXTRA_COLUMN `discount_code`)
- PII columns: customer_email
She clicks Open detail view → to read the actual generated code.
The proposal detail page has 4 tabs:
- Drift — the per-column issue breakdown
- Generated code — the actual Python script that will run
- AI prompt — what was sent to Groq, plus the raw response (post-execution only)
- AI context bundle — the related skills, entities, dependencies, and PII columns the agent considered (Phase 1)
She clicks Approve & execute. The backend:
- Executes the generated code in a restricted namespace
- Writes valid rows to `public.orders_clean`
- Quarantines failed rows to `conduit.quarantine_records`
- Creates a `conduit.pipeline_skills_ledger` row
- Updates the proposal to status `EXECUTED`
- Records a `conduit_lineage.lineage_events` row
The right sidebar switches from "Approve" to "Execution" with row counts.
Maya checks Audit for the immutable record, Lineage for the operation log, and Quarantine if any rows failed.
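The approve step in this journey (run the script in a restricted namespace, keep the valid rows, quarantine the rest) could look roughly like the sketch below. It assumes the generated script defines a `transform(row)` function that raises on rows it cannot fix; that contract, the `SAFE_BUILTINS` allowlist, and the `raw_data` / `failure_reason` keys are hypothetical and are not taken from `execution_service.py`.

```python
# Assumed allowlist: the only builtins the generated code may see.
SAFE_BUILTINS = {
    "len": len, "str": str, "int": int, "float": float,
    "dict": dict, "ValueError": ValueError,
}

# Stand-in for an LLM-generated script that already passed the AST guard.
GENERATED = '''
def transform(row):
    row["amount_usd"] = float(row.pop("order_amount"))
    if not row.get("order_status"):
        raise ValueError("order_status is null")
    return row
'''


def run_generated(code: str, rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Execute `code` with limited builtins and split rows into valid / quarantined."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(code, namespace)  # names outside SAFE_BUILTINS (open, __import__, ...) are undefined
    transform = namespace["transform"]
    valid, quarantined = [], []
    for row in rows:
        try:
            valid.append(transform(dict(row)))  # copy so the raw row stays intact
        except Exception as exc:
            quarantined.append({"raw_data": row, "failure_reason": str(exc)})
    return valid, quarantined
```

Restricting `__builtins__` narrows what the script can reach, but it is not a security sandbox on its own, which is why the AST check runs first.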
All pages are client-rendered ("use client") and styled with Tailwind. The font is Inter, and the palette is neutral grays, with success / warning / danger colors used only to direct attention.
| Route | Page | Purpose |
|---|---|---|
| `/` | Overview | Morning dashboard: auto-resolved rate, classification breakdown, recent activity, source health |
| `/ingest` | Ingest | Drop a file, watch the 6-step state machine, review the proposal |
| `/proposals` | Proposals list | Paginated queue of all past proposals, filterable by gateway status, searchable by target table |
| `/proposals/[id]` | Proposal detail | Drift + code + AI prompt + context bundle tabs, approve/reject |
| `/audit` | Audit ledger | Immutable execution history (compliance) |
| `/audit/[id]` | Audit detail | Full audit row with executed script, AI prompt, raw response |
| `/quarantine` | Quarantine | Rows that failed validation, with raw data + failure reason |
| `/skills` | Skill registry | Catalog of reusable transformation skills (grid/table toggle, category/status filters) |
| `/skills/[id]` | Skill detail | A specific skill's scripts, examples, issue references |
| `/graph` | Knowledge graph | Browser for nodes/edges, plus forward BFS (lineage explorer) and reverse BFS (impact analyzer) |
| `/lineage` | Lineage events | Chronological log of all pipeline operations |
| `/sources` | Sources | Registered warehouses and their connection status |
| Method | Endpoint | Purpose |
|---|---|---|
| POST | `/api/ingest` | The big one. Full pipeline: validate → parse → schema compare → LLM → AST → classify → save |
| GET | `/api/proposals` | Paginated list of all proposals (`?status=` filter, `?limit=`, `?offset=`) |
| GET | `/api/proposals/{id}` | Single proposal with drift, code, confidence, PII |
| POST | `/api/proposals/{id}/approve` | Execute the saved code against the warehouse, write rows, update audit |
| POST | `/api/proposals/{id}/reject` | Mark proposal as REJECTED with a reason |
| GET | `/api/proposals/{id}/context` | The Phase 1 context bundle: related skills, entities, dependencies, PII, business context |
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/api/audit` | Audit ledger (`?proposal_id=` filter for one row, `?limit=`, `?offset=`) |
| GET | `/api/audit/{id}` | Single audit row with full LLM prompt/response |
| GET | `/api/quarantine` | All quarantined rows |
| GET | `/api/quarantine/{proposal_id}` | Quarantined rows for a specific proposal |
| GET | `/api/sources` | Registered warehouses and connection status |
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/api/skills` | List skills (`?category=`, `?status=`, `?limit=`, `?offset=`) |
| GET | `/api/skills/search?q=` | Keyword search across name/description/use_cases |
| GET | `/api/skills/{id}` | Skill with scripts, examples, issue references |
| POST | `/api/skills` | Register a new skill |
| PATCH | `/api/skills/{id}` | Update skill status, owner, description |
| POST | `/api/skills/{id}/scripts` | Attach a script reference to a skill |
| POST | `/api/skills/{id}/issues` | Link a past incident to a skill |
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/api/graph/nodes` | All graph nodes (`?node_type=` filter) |
| POST | `/api/graph/nodes` | Create a node (idempotent: get-or-create on node_type + entity_id) |
| GET | `/api/graph/edges` | All graph edges (`?relation_type=` filter) |
| POST | `/api/graph/edges` | Create an edge (idempotent on source + target + relation) |
| GET | `/api/graph/neighbors/{node_id}` | 1-hop neighbors (`?direction=in…`) |
| GET | `/api/graph/lineage/{entity}` | Forward BFS: what does this entity reach? (`?max_depth=`) |
| GET | `/api/graph/impact/{entity}` | Reverse BFS: what depends on this entity? |
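The two BFS endpoints in the table above differ only in which way edges are followed. The sketch below illustrates that idea with an in-memory edge list; the node names and the `DEPENDS_ON` / `CONTAINS_PII` relation labels are made up for the example, and the real traversal in `graph_service.py` reads from `conduit_graph.graph_edges` and may differ.

```python
from collections import deque

# Hypothetical edges as (source, relation, target).
EDGES = [
    ("revenue_kpi", "DEPENDS_ON", "orders_clean"),
    ("orders_clean", "DEPENDS_ON", "supplier_feed"),
    ("orders_clean", "CONTAINS_PII", "customer_email"),
]


def bfs(start: str, edges, reverse: bool = False, max_depth: int = 3) -> dict[str, int]:
    """Return {node: hop count} for nodes reachable from `start` within `max_depth`."""
    adjacency: dict[str, list[str]] = {}
    for source, _relation, target in edges:
        # Reverse traversal walks each edge from target back to source.
        a, b = (target, source) if reverse else (source, target)
        adjacency.setdefault(a, []).append(b)

    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] >= max_depth:
            continue  # do not expand past the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    del depths[start]
    return depths


# Forward: what does revenue_kpi reach?
print(bfs("revenue_kpi", EDGES))
# Reverse: what depends on supplier_feed?
print(bfs("supplier_feed", EDGES, reverse=True))
```

With edges pointing from dependent to dependency, the forward walk plays the role of the lineage explorer and the reverse walk plays the role of the impact analyzer.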
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/api/lineage` | All lineage events, most recent first (`?limit=`, `?offset=`) |
| GET | `/api/lineage/{proposal_id}` | Lineage events for a specific proposal, in execution order |
Five schemas in warehousedb, one in sourcedb:
| Schema | Tables | What lives here |
|---|---|---|
| `public` | `orders_clean` | The actual business data, written only after a proposal is approved and executed |
| `conduit` | `proposals`, `pipeline_skills_ledger`, `quarantine_records`, `tables_metadata`, `attributes_metadata`, `sub_projects`, `warehouse_units` | Core pipeline state |
| `conduit_skills` | `skills`, `skill_scripts`, `skill_examples`, `skill_issue_references`, `proposal_contexts` | The skill registry + per-proposal context bundles |
| `conduit_graph` | `graph_nodes`, `graph_edges` | The relationship graph (tables, skills, KPIs, PII columns, projects) |
| `conduit_lineage` | `lineage_events` | Per-operation breadcrumbs (ingest, transform, validate, quarantine, evolve) |
conduit.proposals is the heart of the system — every uploaded file becomes a row here with status PENDING → APPROVED → EXECUTED (or REJECTED / FAILED).
sourcedb is currently a placeholder — registered via /api/sources but not actually read from yet. Future work.
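The proposal lifecycle in `conduit.proposals` can be pictured as a small state machine. The README gives the happy path (PENDING → APPROVED → EXECUTED) and names REJECTED and FAILED as alternatives; exactly which state each alternative branches from is an assumption in the sketch below, as are the `Status` and `advance` names.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "PENDING"
    APPROVED = "APPROVED"
    EXECUTED = "EXECUTED"
    REJECTED = "REJECTED"
    FAILED = "FAILED"


# Assumed transition table: rejection happens at review time,
# failure happens while executing an approved proposal.
ALLOWED = {
    Status.PENDING: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.EXECUTED, Status.FAILED},
    Status.EXECUTED: set(),
    Status.REJECTED: set(),
    Status.FAILED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return `new` if the transition is allowed, otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Keeping the terminal states without outgoing transitions matches the idea of an immutable audit record: an executed proposal is never re-opened.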
| Env var | Where | Default | Purpose |
|---|---|---|---|
| `GROQ_API_KEY` | Root `.env` | (none) | Required. API key for Groq LLM calls |
| `MOCK_AI` | Root `.env` | `False` | If True, AI service returns stub responses without calling Groq (dev only) |
| `POSTGRES_PASSWORD` | Root `.env` | `password` | Used by both DB containers |
| `ENVIRONMENT` | Root `.env` | `development` | Logged in API responses |
| `WAREHOUSE_DB_URL` | Root `.env` | `postgresql+asyncpg://…` | Backend connects here (overrides the per-service default) |
| `SOURCE_DB_URL` | Root `.env` | `postgresql+asyncpg://…` | Future use |
Docker quirk: `docker compose up -d --force-recreate` is required to pick up changes to `.env` after the first run. Plain `restart` keeps the old env.
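To make the table above concrete, here is a standard-library sketch of reading those variables with the listed defaults. The project itself uses Pydantic settings in `core/config.py`; this stand-in only illustrates the env-driven idea. The `Settings` fields, the `load_settings` helper, and the rule that the key may be empty when `MOCK_AI=True` are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    mock_ai: bool
    postgres_password: str
    environment: str


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 'True', 'true', or '1'."""
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    mock_ai = _as_bool(env.get("MOCK_AI", "False"))
    api_key = env.get("GROQ_API_KEY", "")
    # Assumption: the key is only enforced when real LLM calls will be made.
    if not api_key and not mock_ai:
        raise RuntimeError("GROQ_API_KEY is required unless MOCK_AI=True")
    return Settings(
        groq_api_key=api_key,
        mock_ai=mock_ai,
        postgres_password=env.get("POSTGRES_PASSWORD", "password"),
        environment=env.get("ENVIRONMENT", "development"),
    )
```

Failing at startup when the key is missing surfaces a misconfigured `.env` before the first ingest rather than in the middle of one.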
- Model: Groq `llama-3.3-70b-versatile` for fast, structured JSON output. Configurable in `backend/app/services/ai_service.py`.
- AST Safety Guard: Every piece of AI-generated code is parsed with `ast.parse` and checked against an imports/builtins blocklist before execution. The blocklist rejects `os`, `sys`, `subprocess`, `eval`, `exec`, and any file/network I/O. If validation fails, the system retries once with the failure reason appended to the prompt.
- Rule-based Gateway Override: The LLM's `gateway_recommendation` is sanity-checked. Critical mismatches (TYPE_MISMATCH, missing required columns) and low confidence are force-upgraded to `CONFLICT` regardless of what the LLM said.
- Resilience Fallback: If Groq is unreachable, the system fetches the most recent successfully executed proposal for the same source/target schema and serves it as a fallback. The user sees a `reasoning_note` explaining the substitution.
- Context Bundle (Phase 1): The bundle is built (BFS the knowledge graph for related entities, dependencies, PII columns; search the skill registry for matching skills) and stored alongside the proposal. It is not yet injected into the LLM prompt; that's Phase 2. The new "AI context bundle" tab on the proposal detail page is the user-facing window into what's been collected.
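The AST Safety Guard above can be illustrated with a short walk over the parsed tree. This is a sketch of the technique, not the code in `validation_service.py`: `os`, `sys`, `subprocess`, `eval`, and `exec` come from the README, while `socket`, `shutil`, `pathlib`, `open`, `compile`, and `__import__` are assumed stand-ins for "any file/network I/O".

```python
import ast

BLOCKED_MODULES = {"os", "sys", "subprocess", "socket", "shutil", "pathlib"}
BLOCKED_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def check_code(source: str) -> list[str]:
    """Return a list of violations; an empty list means the code passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # Compare the top-level package, so `os.path` is caught as `os`.
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                violations.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"call to {node.func.id}()")
    return violations
```

Returning the violations as text, rather than a bare boolean, fits the retry step: the failure reason can be appended to the prompt as-is. A blocklist like this only catches direct names, so it works best paired with the restricted namespace at execution time.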
Three CSVs in db/demo_csvs/:
| File | What it tests | Expected classification |
|---|---|---|
| `clean_orders.csv` | Schema matches target exactly | AUTO_LINK |
| `drifted_orders.csv` | Column rename, null, extra column | SCHEMA_EVOLUTION |
| `conflicted_orders.csv` | Type mismatches, missing required | CONFLICT |
The target is public.orders_clean (the only registered table in conduit.tables_metadata).
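To see why the three demo files land in different buckets, it helps to sketch the deterministic comparison between a CSV header and the target schema, including the zero-overlap fast path from the pipeline diagram. Everything below is illustrative: the `TARGET` columns and nullability flags are invented, the `MISSING_COLUMN` label is an assumption, and the sketch uses the standard `csv` module rather than pandas to stay dependency-free.

```python
import csv
import io

# Hypothetical target schema as {column: nullable}. The real one is read from
# conduit.tables_metadata and conduit.attributes_metadata.
TARGET = {
    "order_id": False,
    "amount_usd": False,
    "order_status": False,
    "customer_email": True,
}


def detect_drift(csv_text: str, target: dict[str, bool]) -> dict:
    """Compare a CSV against `target` and list structural and null issues."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    common = [c for c in header if c in target]
    if not common:
        # Zero columns in common: no point in calling the LLM.
        return {"fast_path": "CONFLICT", "issues": []}

    issues = [("EXTRA_COLUMN", c) for c in header if c not in target]
    issues += [("MISSING_COLUMN", c) for c in target if c not in header]

    null_columns = set()
    for row in reader:
        for column in common:
            if not target[column] and not (row[column] or "").strip():
                null_columns.add(column)
    issues += [("NULL_VIOLATION", c) for c in sorted(null_columns)]
    return {"fast_path": None, "issues": issues}
```

Note that a rename surfaces here as one extra column plus one missing column; pairing the two into a single RENAME item is not attempted in this sketch.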
# Backend
cd Conduit
docker compose up -d
docker exec -i conduit-warehouse-db-1 psql -U user -d warehousedb < db/seed_extensions.sql
# Frontend
cd frontend
npm install
npm run dev
Then in another terminal:
# Sanity: backend is up
curl http://localhost:8000/api/sources
# → [{"id":1,"name":"Production_Warehouse_PG","unit_type":"POSTGRES","status":"CONNECTED"}]
# Run an ingest
curl -X POST http://localhost:8000/api/ingest \
-F "file=@../db/demo_csvs/clean_orders.csv" \
-F "target_table=orders_clean" | python3 -m json.tool
The response will include a proposal_id. You can then:
# Get the proposal back
curl http://localhost:8000/api/proposals/<id>
# Get the context bundle
curl http://localhost:8000/api/proposals/<id>/context
# Approve it
curl -X POST http://localhost:8000/api/proposals/<id>/approve \
-H "Content-Type: application/json" \
-d '{"human_approver_id": "demo_engineer_01"}'
# Verify rows landed
docker exec conduit-warehouse-db-1 psql -U user -d warehousedb \
-c "SELECT COUNT(*) FROM public.orders_clean;"
- Open `http://localhost:3000`
- Overview: see health metrics populate as data flows in
- Ingest → drop `drifted_orders.csv` → watch the state machine → review the proposal
- Click Open detail view → switch between Drift / Generated code / AI prompt / AI context bundle tabs
- Click Approve & execute → watch the right sidebar switch to "Execution"
- Visit Audit for the immutable record
- Visit Quarantine if any rows failed
- Visit Skills → click `pii_masking` → see its scripts and examples
- Visit Graph → type `orders_clean` in the Lineage explorer → Run → see the BFS reach
- Visit Graph → type `orders_clean` in the Impact analyzer → Run → see what depends on it
- Visit Lineage → filter by operation type, search by proposal_id
Backend: Python 3.11, FastAPI, SQLAlchemy 2.0 async, asyncpg, Pydantic v2, Groq SDK, pandas, ast (stdlib)
Frontend: Next.js 14, React 18, TypeScript 5.5, Tailwind CSS 3.4, Inter font
Infrastructure: Docker Compose, PostgreSQL 15
LLM: Groq llama-3.3-70b-versatile
Internal project — see project owner.