Sumukha87/Insight-Engine

⚡ Insight Engine

Cross-Domain Innovation Discovery via GraphRAG

Python FastAPI Next.js Neo4j Qdrant MLflow DVC Docker License

A B2B SaaS platform that finds cross-domain innovation opportunities no keyword search can surface.

Architecture · Demo · Tech Stack · Quickstart · Roadmap


The Problem

Global R&D is siloed.

A battery-longevity breakthrough in aerospace may directly solve an open problem in cardiology, yet researchers in one field rarely see work published in the other. A nanomaterial developed for semiconductors could transform drug delivery, but it stays buried in domain-specific literature that outsiders seldom read.

Insight Engine is the matchmaker. It ingests scientific papers at scale, builds a unified knowledge graph across all domains, and uses GraphRAG to surface cross-domain innovation opportunities that no keyword search or standard vector RAG can find.


4 Intelligence Modules

| Module | Nickname | What It Does |
|---|---|---|
| Cross-Pollination Discovery | The Matchmaker | Finds technologies from Domain A that solve problems in Domain B |
| Trend Velocity Tracking | The Early Warning System | Detects technologies being cited across multiple industries before they reach the mainstream |
| Patent Portfolio De-Risking | The Shield | Surfaces prior art via graph topology, not keyword similarity |
| Automated Gap Analysis | The Opportunity Finder | Uses unconnected graph nodes to flag research white-spaces |
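
As a rough illustration of the Automated Gap Analysis idea, the sketch below flags entities that appear in no edge of a small in-memory edge list. The sample entities and the `find_unconnected` helper are hypothetical; the real module presumably queries Neo4j rather than Python sets.

```python
from __future__ import annotations

from typing import Iterable


def find_unconnected(nodes: Iterable[str], edges: Iterable[tuple[str, str]]) -> list[str]:
    """Return the nodes that appear in no edge, sorted by name."""
    connected: set[str] = set()
    for source, target in edges:
        connected.add(source)
        connected.add(target)
    return sorted(node for node in nodes if node not in connected)


# Hypothetical toy data, not taken from the real graph.
nodes = ["nitinol", "stent", "graphene aerogel", "solid-state electrolyte"]
edges = [("nitinol", "stent")]

print(find_unconnected(nodes, edges))
```

On a graph of this project's size the same question is better asked inside the database, but the set-difference logic is the same.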

Knowledge Graph Stats

| Metric | Value |
|---|---|
| Scientific papers ingested | 229,498 (12 domains, arXiv) |
| Entity nodes in graph | 1,529,916 |
| Cross-domain relationship edges | 1,837,582 |
| Vector embeddings (768-dim) | 1,529,916 / 1,529,916 (100%) |
| Domains covered | Aerospace · Medical · Materials · Energy · Biotech · Robotics · Quantum · Nanotechnology · Environment · Semiconductors · Pharma · Neuroscience |
| End-to-end query latency (p95) | ~32s (embed 1s + Qdrant 0.2s + Neo4j 12s + LLM 14s) |
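
The latency row is easier to reason about as a budget. The sketch below only does arithmetic on the figures quoted above: the four listed stages add up to about 27 s, which leaves roughly 5 s of the ~32 s p95 not attributed to any listed stage. The variable names are illustrative.

```python
# Stage timings in seconds, copied from the latency row above.
stage_seconds = {"embed": 1.0, "qdrant": 0.2, "neo4j": 12.0, "llm": 14.0}
reported_p95 = 32.0  # approximate end-to-end figure from the same row

accounted = sum(stage_seconds.values())
unaccounted = reported_p95 - accounted

for stage, seconds in stage_seconds.items():
    share = seconds / reported_p95
    print(f"{stage:>6}: {seconds:5.1f}s  ({share:.0%} of p95)")
print(f"listed stages: {accounted:.1f}s, not attributed: {unaccounted:.1f}s")
```

Neo4j traversal and LLM synthesis together account for most of the budget, so those are the stages where optimization would pay off first.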

Architecture

┌─────────────────────────────────────────────────────┐
│  Browser — Next.js 14 App Router  :3000             │
│  Dashboard · Auth · Graph Explorer · Citations      │
└────────────────────┬────────────────────────────────┘
                     │ HTTP / REST
┌────────────────────▼────────────────────────────────┐
│  FastAPI  :8000                                     │
│  /auth/*  /query  /health  /metrics                 │
│  JWT + refresh token auth · Prometheus metrics      │
└──┬──────────┬──────────┬──────────┬─────────────────┘
   │          │          │          │
┌──▼───┐ ┌───▼───┐ ┌────▼───┐ ┌───▼────┐
│Postgr│ │ Neo4j │ │ Qdrant │ │ Ollama │
│ :5432│ │ :7687 │ │ :6333  │ │ :11434 │
│users │ │ graph │ │vectors │ │Mistral │
└──────┘ └───────┘ └────────┘ └────────┘
              ▲
┌─────────────┴───────────────────────────────────────┐
│  NLP Pipeline (DVC-managed, MLflow-tracked)         │
│  arXiv fetch → spaCy NER → Relation Extraction      │
│  → Neo4j graph loader → Qdrant embedding pipeline   │
└─────────────────────────────────────────────────────┘

5-Layer Stack:

| Layer | Components |
|---|---|
| L5: Application | FastAPI REST API + Next.js 14 frontend |
| L4: Knowledge Graph | Neo4j Community (1.5M entities, 1.8M edges) |
| L3: Vector Search | Qdrant (768-dim nomic-embed-text embeddings) |
| L2: NLP Pipeline | spaCy 3.7 + SciSpacy NER · Rule-based relation extraction |
| L1: Ingestion | arXiv bulk API · DVC pipeline · MLflow experiment tracking |
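
For the vector search layer, a minimal sketch of requesting one 768-dim embedding from Ollama over HTTP might look like the following. It uses Ollama's `/api/embeddings` endpoint on the port shown in the architecture diagram. The helper names are hypothetical and this is not necessarily how `embedding_pipeline.py` does it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # port from the architecture diagram
EMBED_MODEL = "nomic-embed-text"
EXPECTED_DIM = 768


def build_embedding_request(text: str) -> urllib.request.Request:
    """Build the POST request for a single embedding."""
    body = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(response_body: bytes) -> list:
    """Extract the vector and check it has the dimension Qdrant expects."""
    vector = json.loads(response_body)["embedding"]
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} dims, got {len(vector)}")
    return vector


def embed(text: str, timeout: float = 30.0) -> list:
    """Call Ollama and return the embedding (needs a running Ollama server)."""
    with urllib.request.urlopen(build_embedding_request(text), timeout=timeout) as resp:
        return parse_embedding(resp.read())
```

Checking the dimension at the boundary catches a wrong model name early, before a mismatched vector reaches the Qdrant collection.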

GraphRAG Query Flow

User query (natural language)
    │
    ▼
Entity extraction — spaCy NER on query text
    │
    ▼
Embedding — nomic-embed-text via Ollama (768-dim)
    │
    ▼
Qdrant ANN search — filtered to entities with graph edges
    │  (has_edges payload flag — eliminates 1M+ dead-end seeds instantly)
    ▼
Neo4j graph traversal — cross-domain paths up to 4 hops
    │  MATCH path = (seed)-[:RELATES_TO*1..4]-(target {domain: $target})
    ▼
Subgraph context assembly — entities + relations + source paper IDs
    │
    ▼
LLM synthesis — Mistral 7B via Ollama
    │  "Given these cross-domain graph paths, answer: [query]. Cite sources."
    ▼
Response: { answer, paths[], sources[], confidence, latency_ms }
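
Two steps of the flow above can be sketched without any running services. The first function builds a JSON body for Qdrant's points search endpoint, restricted by the `has_edges` payload flag. The second is a breadth-first, in-memory stand-in for the Cypher traversal. The function names and toy graph are hypothetical. Unlike Cypher, which forbids repeating a relationship within a path, this sketch forbids repeating a node, which is a simplification.

```python
from collections import deque


def qdrant_search_body(vector, limit=10):
    """JSON body for a Qdrant points search that keeps only seeds with graph edges."""
    return {
        "vector": vector,
        "limit": limit,
        "with_payload": True,
        "filter": {"must": [{"key": "has_edges", "match": {"value": True}}]},
    }


def cross_domain_paths(graph, domains, seed, target_domain, max_hops=4):
    """Return node paths of 1..max_hops edges from seed to nodes in target_domain."""
    paths = []
    queue = deque([[seed]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # hop budget already used up
        for neighbour in graph.get(path[-1], ()):
            if neighbour in path:
                continue  # do not revisit a node on the same path
            new_path = path + [neighbour]
            if domains.get(neighbour) == target_domain:
                paths.append(new_path)
            queue.append(new_path)
    return paths


# Hypothetical toy graph: undirected adjacency lists plus a domain per node.
graph = {
    "nitinol": ["superelastic alloy"],
    "superelastic alloy": ["nitinol", "stent"],
    "stent": ["superelastic alloy"],
}
domains = {"nitinol": "aerospace", "superelastic alloy": "materials", "stent": "medical"}

print(cross_domain_paths(graph, domains, "nitinol", "medical"))
```

Filtering on `has_edges` before traversal matters because a seed with no edges can never produce a path, however similar its embedding is to the query.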

Demo Query

Query: "What aerospace materials could improve cardiac implant longevity?"

🔍 Searching knowledge graph...
   Seed entities found: titanium alloy, carbon fiber composite, shape memory alloy
   Cross-domain paths discovered: 20
   Source papers resolved: 14

💡 Answer:
   Three aerospace materials show strong cross-domain potential for cardiac implants:

   1. Nitinol (shape memory alloy) — developed for aerospace actuators, already
      transitioning to stents. Graph path: aerospace/actuators → materials/superelastic
      → medical/cardiovascular. Cited in 847 papers across both domains.

   2. Carbon fiber composites — used in structural aerospace components, now emerging
      in radiolucent implant housings. 12 direct cross-domain RELATES_TO edges found.

   3. Titanium-Zirconium alloys — aerospace fatigue resistance research (Ti-Zr-Nb)
      maps directly onto long-term implant load cycling requirements.

📚 Sources: 14 papers cited (arXiv: 2021–2024)
⏱  Latency: 31.4s

Tech Stack

AI / ML

spaCy · Ollama · nomic · PyTorch · LlamaIndex

MLOps

MLflow · DVC · Airflow · Prometheus · Grafana

Databases & Vector Stores

Neo4j · Qdrant · PostgreSQL · Redis

Backend & Frontend

FastAPI · Next.js · TypeScript · Tailwind

Infrastructure

Docker


Project Structure

insight-engine/
├── src/
│   ├── ingestion/          # arXiv fetcher — 12 domains, 229K papers
│   ├── nlp/
│   │   ├── ner_pipeline.py         # spaCy NER → 10.7M entities extracted
│   │   └── relation_extractor.py   # 2.3M relations extracted
│   ├── graph/
│   │   ├── graph_loader.py         # JSONL → Neo4j MERGE
│   │   ├── embedding_pipeline.py   # entities → Qdrant (768-dim)
│   │   └── graphrag_query.py       # full GraphRAG query engine
│   ├── backend/            # FastAPI app — auth, query, health
│   └── frontend/           # Next.js 14 App Router dashboard
├── dags/                   # Apache Airflow DAGs
├── dvc.yaml                # 4-stage reproducible pipeline
├── params.yaml             # all hyperparams version-controlled
├── docker-compose.yml      # full stack — one command startup
└── requirements/           # nlp.txt · graph.txt · api.txt · mlops.txt
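
The "JSONL → Neo4j MERGE" step in `graph_loader.py` could look roughly like the sketch below, which turns one JSONL relation record into parameters for an idempotent Cypher statement. The record fields (`source`, `target`, `paper_id`), the `Entity` label, and the helper name are assumptions made for illustration. Only the `RELATES_TO` relationship type comes from this README.

```python
import json

# Idempotent upsert: MERGE creates each node and edge only if missing.
# The Entity label and the name/paper_id properties are assumed, not confirmed.
MERGE_RELATION = (
    "MERGE (a:Entity {name: $source}) "
    "MERGE (b:Entity {name: $target}) "
    "MERGE (a)-[r:RELATES_TO]->(b) "
    "SET r.paper_id = $paper_id"
)


def relation_to_params(jsonl_line: str) -> dict:
    """Parse one JSONL line into the parameter dict used by MERGE_RELATION."""
    record = json.loads(jsonl_line)
    return {
        "source": record["source"],
        "target": record["target"],
        "paper_id": record["paper_id"],
    }


# Hypothetical input line.
line = '{"source": "nitinol", "target": "stent", "paper_id": "example-0001"}'
print(relation_to_params(line))
```

With the official Neo4j Python driver, the pair would be passed to `session.run(MERGE_RELATION, params)`. Using parameters instead of string formatting keeps entity names containing quotes from breaking the query.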

Quickstart

Prerequisites: Docker Desktop, NVIDIA GPU (8GB+ VRAM recommended), Ollama

# 1. Clone
git clone https://github.com/Sumukha87/Insight-Engine.git
cd Insight-Engine

# 2. Environment
cp .env.example .env
# Edit .env with your passwords

# 3. Start full stack
docker compose up -d

# 4. Pull LLM models (first time only)
docker exec ollama ollama pull mistral:v0.3
docker exec ollama ollama pull nomic-embed-text

# 5. Run pipeline (fetch → NER → relations → graph)
# Assumes a virtualenv already exists at .venv with DVC installed
source .venv/bin/activate
dvc repro

# 6. Open dashboard (open is macOS-only; on Linux use xdg-open)
open http://localhost:3000

Services after startup:

| Service | URL | Purpose |
|---|---|---|
| Frontend | http://localhost:3000 | Query dashboard |
| API | http://localhost:8000/docs | FastAPI Swagger |
| Neo4j Browser | http://localhost:7474 | Graph explorer |
| MLflow | http://localhost:5000 | Experiment tracking |
| Grafana | http://localhost:3001 | Metrics dashboard |
| pgAdmin | http://localhost:5050 | Database GUI |
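
To confirm the services above actually came up after `docker compose up -d`, a small standard-library probe can test each port. This is a hedged sketch: the port numbers are taken from the table above, the helper names are hypothetical, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports from the services table above.
SERVICES = {
    "Frontend": 3000,
    "API": 8000,
    "Neo4j Browser": 7474,
    "MLflow": 5000,
    "Grafana": 3001,
    "pgAdmin": 5050,
}


def is_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES) -> dict:
    """Probe every service and map its name to an up/down flag."""
    return {name: is_listening(port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<14} {'up' if up else 'DOWN'}")
```

For a real health check, the API's `/health` route listed in the architecture diagram is the better target.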

Roadmap

  • Phase 1 — Data ingestion pipeline (229K papers, 12 domains)
  • Phase 2 — Knowledge graph + vector embeddings + GraphRAG engine
  • Phase 3 — FastAPI backend + Next.js dashboard + JWT auth
  • Phase 4 — Sigma.js graph explorer · Grafana dashboards · Cloudflare Tunnel
  • Phase 5 — Patent data (USPTO) · Clinical trials · GitHub activity signals
  • Phase 6 — Multi-tenant SaaS · usage quotas · billing

About the Author

Built by SriSumukha S — Infrastructure & Full-Stack Engineer at JEMS Inc., Japan, working on IT systems for environmental and waste management.

At work: Spring Boot APIs · Next.js frontends · Terraform on AWS & GCP. Built an in-house CRM that replaced Salesforce and is still in production.

On the side: building production AI systems from scratch — knowledge graphs, LLM pipelines, and MLOps tooling.

Certifications: AWS AI Practitioner · AWS Cloud Practitioner · JLPT N3

LinkedIn Portfolio


Built with curiosity · Powered by open-source · Running locally on an RTX 4060
