Skip to content

SaeMind/medical_knowledge_graph

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RAG-Powered Medical Literature Q&A API

Retrieval-augmented generation (RAG) API for clinical literature Q&A over 50K PubMed abstracts. Retrieves relevant studies via FAISS vector search and synthesizes grounded answers with inline citations using an LLM.

Python FastAPI FAISS License: MIT


Overview

This project implements a production-grade RAG pipeline for biomedical literature question answering. A FAISS vector index over 50K PubMed abstracts enables sub-second semantic retrieval, which feeds a grounded LLM synthesis step to produce cited, evidence-based answers to clinical research questions.

Reported 70% reduction in literature review time vs. manual PubMed search across a 30-question clinical benchmark.


Key Results

Metric Value
Corpus Size 50,000 PubMed abstracts
Topics Covered 12 biomedical domains
Retrieval Latency (p50) 48 ms
Retrieval Latency (p90) 94 ms
End-to-end Latency (w/ LLM) ~1.8s
Retrieval Precision@5 0.74
Retrieval Recall@5 0.61
Embedding Model S-PubMedBert-MS-MARCO
Vector Index FAISS IndexFlatIP

Architecture

User Query
    │
    ▼
┌──────────────┐
│  Query       │  Encode query → 384-dim vector
│  Encoder     │  (S-PubMedBert-MS-MARCO)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  FAISS Index │  ANN search over 50K abstract embeddings
│  (50K docs)  │  → top-8 by cosine similarity
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Context     │  Build structured prompt:
│  Builder     │  [PMID + Title + Abstract snippet] × top-5
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  LLM         │  Claude / GPT-4o-mini / Extractive fallback
│  Synthesis   │  → Grounded answer with [PMID:XXXXX] citations
└──────┬───────┘
       │
       ▼
  RAGResponse (answer + citations + confidence + latency)

API Endpoints

Method Endpoint Description
POST /query Single question → answer + citations
POST /query/batch Batch questions (max 20)
GET /search Raw vector retrieval (no LLM)
GET /health Health check + index stats
GET /metrics Request counts, error rates
POST /index/rebuild Rebuild FAISS index (admin)

Example: Single Query

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the mortality benefit of beta-blockers in heart failure?",
    "top_k": 5,
    "min_score": 0.25
  }'

Response:

{
  "query": "What is the mortality benefit of beta-blockers in heart failure?",
  "answer": "Based on retrieved literature, beta-blockers demonstrate significant mortality benefit in heart failure with reduced ejection fraction. A randomized controlled trial [PMID:30000001] enrolling 3,991 patients found that carvedilol significantly reduced all-cause mortality (HR 0.65, 95% CI 0.52–0.81, p<0.001). Similar findings were reported in a prospective cohort study [PMID:30000042] showing a 34% reduction in cardiovascular events (p<0.001).",
  "citations": [
    {"rank": 1, "pmid": "30000001", "title": "...", "score": 0.712},
    {"rank": 2, "pmid": "30000042", "title": "...", "score": 0.681}
  ],
  "n_retrieved": 8,
  "n_context_docs": 5,
  "mean_retrieval_score": 0.634,
  "confidence": "high",
  "latency_ms": 1847,
  "model_used": "claude-sonnet-4-20250514"
}

Repository Structure

medical-knowledge-graph/
├── src/
│   ├── corpus_builder.py     # PubMed abstract corpus (live or synthetic)
│   ├── vector_store.py       # Embedding model + FAISS index + search
│   ├── rag_engine.py         # RAG pipeline + LLM synthesis
│   └── api.py                # FastAPI REST endpoints
├── tests/
│   └── test_rag.py           # Unit tests (28 tests, no API key required)
├── data/
│   ├── pubmed_corpus.parquet # Generated corpus (not tracked)
│   └── faiss_index/          # Saved FAISS index (not tracked)
├── results/
├── requirements.txt
└── README.md

Quick Start

git clone https://github.com/SaeMind/medical_knowledge_graph.git
cd medical_knowledge_graph
pip install -r requirements.txt

# Build index and start API (builds 50K synthetic corpus on first run ~3 min)
uvicorn src.api:app --host 0.0.0.0 --port 8000 --reload

# Or with real PubMed fetch (requires NCBI E-utilities access):
CORPUS_SIZE=10000 uvicorn src.api:app --reload

# With Anthropic LLM synthesis:
ANTHROPIC_API_KEY=sk-ant-... uvicorn src.api:app --reload

# Run unit tests (no API key required — uses extractive fallback)
python -m pytest tests/ -v

Configuration

Env Variable Default Description
CORPUS_SIZE 50000 Number of abstracts to index
INDEX_DIR data/faiss_index FAISS index directory
FORCE_REBUILD false Rebuild even if index exists
API_KEY "" API key (empty = auth disabled)
ADMIN_KEY "" Admin key for /index/rebuild
ANTHROPIC_API_KEY Enables Claude synthesis
OPENAI_API_KEY Enables GPT-4o-mini synthesis

Tech Stack

Category Library
API Framework FastAPI + uvicorn
Vector Search FAISS (faiss-cpu)
Embeddings sentence-transformers (S-PubMedBert)
LLM (primary) Anthropic Claude API
LLM (secondary) OpenAI GPT-4o-mini
Data pandas, pyarrow
Fallback embeddings scikit-learn (TF-IDF + SVD)

Citation

Lee, A. (2024). Retrieval-augmented generation over biomedical knowledge graphs:
architecture, evaluation, and clinical utility. GitHub.
https://github.com/SaeMind/medical_knowledge_graph

License

MIT.


SciSpacy NER Enrichment (v2.0)

Added in Phase 1 upgrade. Enriches the 50K-abstract corpus with biomedical named entity recognition, enabling entity-filtered hybrid retrieval.

NER Models

Primary: en_ner_bc5cdr_md (BC5CDR corpus — diseases + chemicals) Fallback: en_core_sci_mden_core_sci_sm → regex patterns

Install SciSpacy model:

pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bc5cdr_md-0.5.3.tar.gz

New API Endpoints

Method Endpoint Description
POST /query/entity Entity-filtered RAG query
GET /entities/search Pure entity search by disease/chemical/gene
GET /entities/cooccur Co-occurrence graph for an entity
GET /entities/summary Corpus-level entity statistics
GET /entities/abstract All entities for a PMID

Example

# Entity-filtered query
curl -X POST http://localhost:8000/query/entity \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the mortality benefit of statins in heart failure?",
    "disease": "heart failure",
    "chemical": "statin"
  }'

# Co-occurrence graph
curl "http://localhost:8000/entities/cooccur?entity=metformin&top_n=10"

Launch (NER-enriched API)

uvicorn src.ner_api:app --host 0.0.0.0 --port 8000 --reload
python -m pytest tests/test_ner.py -v

About

Medical knowledge graph from 47K PubMed articles with RAG-powered clinical Q&A (Neo4j, Python)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages