WBChain3/MedicalRAG

MedicalRAG

A local-first Retrieval-Augmented Generation (RAG) system for querying knowledge from a corpus of documents. Everything runs on your machine — no cloud APIs, no external services.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    INGESTION PIPELINE                        │
│                                                             │
│  Source Docs ──► Text Extraction ──► Chunking ──► Embed ──► │
│  (PDF, etc.)       (parser + OCR        (overlapping    │   │
│                     fallback)            word windows)   ▼   │
│                                                    Vector    │
│                                                    Store     │
│                                                 (ChromaDB)   │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      QUERY PIPELINE                          │
│                                                             │
│  User Query ──► Retrieve ──► Context Assembly ──► Prompt ──►│
│                   (top-k       (retrieved chunks     │       │
│                 similarity     formatted with        │       │
│                   search)      source metadata)      │       │
│                                                      ▼       │
│                                               Local LLM      │
│                                              (via Ollama)    │
│                                                      │       │
│                                                      ▼       │
│                                              Response +      │
│                                              Sources         │
└─────────────────────────────────────────────────────────────┘

The system operates in two phases:

Ingestion: Source documents are parsed, split into overlapping word-based chunks, embedded into vectors, and stored in a local vector database. Re-ingestion uses a diff-based approach: only new or changed documents are reprocessed.

Query: A user question is embedded and used to retrieve the most relevant chunks from the vector store. The retrieved context is assembled into a prompt alongside the conversation history and sent to a locally running LLM. Model-internal reasoning tags are stripped from the response before it is displayed.
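The final clean-up step of the query phase can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the model wraps its reasoning in <think>...</think> tags (the tag name is an assumption, since the README does not name it), and strip_reasoning is a hypothetical helper.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; the real tag may differ.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)
_OPEN_THINK = re.compile(r"<think>.*\Z", flags=re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove reasoning blocks from a model response (hypothetical helper)."""
    # Drop every complete <think>...</think> block, including multi-line ones.
    cleaned = _THINK_BLOCK.sub("", text)
    # If a block was opened but never closed (e.g. a truncated generation),
    # drop everything from the opening tag to the end of the text.
    cleaned = _OPEN_THINK.sub("", cleaned)
    return cleaned.strip()
```

Handling the unterminated case separately keeps partial reasoning from leaking into the displayed answer when generation is cut short.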

Technology Stack

Layer              Technology
-----------------  ----------------------------------------------------
Language           Python 3.12
Vector Store       ChromaDB (persistent, HNSW index, cosine similarity)
Embeddings         Sentence-Transformers (all-MiniLM-L6-v2, 384-dim)
LLM Runtime        Ollama (local)
Document Parsing   PyMuPDF + Tesseract OCR fallback
Configuration      Pydantic Settings (.env / environment variables)
Testing            pytest

All ML inference (embedding + generation) runs locally on CPU or CUDA GPU.
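To make the "cosine similarity" and top-k retrieval entries concrete, here is a dependency-free sketch of how chunks could be ranked against a query vector. It is a brute-force illustration only: ChromaDB uses an approximate HNSW index rather than an exhaustive scan, and the function names below are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    # Highest similarity first; sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

In the real pipeline the vectors would be 384-dimensional embeddings and k would come from the TOP_K setting.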

Project Structure

├── med_ai/                  # Core library package
│   ├── config.py            # Typed settings with pydantic-settings
│   └── logging_config.py    # Rotating file + console logging setup
├── scripts/                 # Standalone utility scripts
│   └── process_pdfs.py      # Document parsing and chunking pipeline
├── tests/                   # Test suite (pytest)
│   ├── conftest.py          # Shared fixtures and mocks
│   ├── test_config.py
│   ├── test_process_pdfs.py
│   ├── test_rag_pipeline.py
│   └── test_reingestion.py
├── rag_pipeline.py          # Orchestration: ingestion, retrieval, generation, CLI
├── data/                    # Source documents + processed chunks (gitignored)
├── chroma_db/               # Persistent vector store (gitignored)
└── Modelfile                # Ollama model configuration

Configuration

All settings are managed through med_ai/config.py using pydantic-settings and can be overridden via a .env file or environment variables:

Setting          Default             Description
---------------  ------------------  -----------------------------------------
CHROMA_PATH      chroma_db           Vector store directory
COLLECTION_NAME  medical_papers      ChromaDB collection name
EMBEDDING_MODEL  all-MiniLM-L6-v2    Sentence-Transformer model
LLM_MODEL        med-assistant-fast  Ollama model tag
TOP_K            8                   Number of chunks to retrieve
CHUNK_SIZE       200                 Words per chunk
CHUNK_OVERLAP    20                  Overlap between consecutive chunks
DEVICE           cuda                Inference device (auto-falls back to CPU)
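As an illustration of how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below splits text into overlapping word windows. It assumes the overlap is measured in words, like the chunk size, and that words are separated by whitespace; chunk_words is a hypothetical name, and the actual logic in scripts/process_pdfs.py may differ.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows that share `overlap` words with their neighbour."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end, to avoid a trailing chunk
        # that would contain only words already covered.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, each window advances by 180 words, so a 500-word document yields three chunks starting at words 0, 180, and 360.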

Data Flow

  1. Place source documents in the configured input directory
  2. Run the document processing script to extract text and generate overlapping chunks (saved as JSON)
  3. Run the RAG pipeline — it ingests chunks into the vector store (skipping unchanged sources) then opens an interactive REPL
  4. Each query retrieves relevant context from the vector store, constructs a prompt with conversation history, calls the local LLM, and returns the answer with source citations
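Step 4 can be sketched as a small prompt builder. This is an illustrative sketch only: the prompt wording, the chunk dictionary keys (source, text), the history format, and the build_prompt name are all assumptions, not the layout used by rag_pipeline.py.

```python
def build_prompt(question, chunks, history, max_messages=4):
    """Assemble retrieved chunks, recent history, and the question into one prompt.

    chunks:  list of {"source": str, "text": str} dicts (hypothetical shape)
    history: list of (role, content) tuples, oldest first (hypothetical shape)
    """
    # Number each chunk so the model can cite sources as [1], [2], ...
    context_blocks = [
        f"[{n}] (source: {chunk['source']})\n{chunk['text']}"
        for n, chunk in enumerate(chunks, start=1)
    ]
    parts = [
        "Answer using only the context below and cite sources by number.",
        "Context:\n" + "\n\n".join(context_blocks),
    ]
    # Keep only the most recent messages so the prompt stays bounded.
    recent = history[-max_messages:]
    if recent:
        lines = [f"{role}: {content}" for role, content in recent]
        parts.append("Conversation so far:\n" + "\n".join(lines))
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Numbering the chunks in the prompt is one simple way to let the answer be mapped back to source citations afterwards.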

Re-ingestion is source-aware: if a processed document is re-generated under the same filename, old chunks are deleted and new ones are added. Unchanged documents are skipped.
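The source-aware behaviour described above can be sketched as a planning step that decides what to add, replace, or skip. The README does not say how a change is detected, so this sketch assumes a content hash per source file; fingerprint and plan_reingestion are hypothetical names.

```python
import hashlib


def fingerprint(chunks: list[str]) -> str:
    """Stable SHA-256 digest over a document's chunks (assumed change detector)."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab"] and ["a", "b"] differ
    return digest.hexdigest()


def plan_reingestion(processed: dict[str, list[str]], stored: dict[str, str]) -> dict:
    """Compare freshly processed sources against fingerprints already stored.

    processed: source filename -> list of chunk texts
    stored:    source filename -> fingerprint recorded at last ingestion
    """
    plan = {"add": [], "replace": [], "skip": []}
    for source, chunks in processed.items():
        if source not in stored:
            plan["add"].append(source)      # new document
        elif stored[source] != fingerprint(chunks):
            plan["replace"].append(source)  # delete old chunks, then add new ones
        else:
            plan["skip"].append(source)     # unchanged
    return {action: sorted(names) for action, names in plan.items()}
```

Sources in the "replace" list would have their existing chunks deleted from the vector store before the new ones are inserted.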

Testing

Tests use pytest with external services (ChromaDB, Ollama, Sentence-Transformers) mocked via unittest.mock. No live dependencies or network access is required. Run with:

pytest
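A test in this style might look like the sketch below, which replaces the LLM client with a unittest.mock.MagicMock so nothing is called over the network. The answer_question wrapper and the client's generate method are hypothetical stand-ins for illustration; they are not the project's functions or the Ollama client API.

```python
from unittest.mock import MagicMock


def answer_question(llm_client, prompt: str) -> str:
    """Hypothetical wrapper: request a completion and return its trimmed text."""
    reply = llm_client.generate(prompt)
    return reply["response"].strip()


def test_answer_question_uses_mocked_llm():
    # The mock stands in for a real LLM client, so no model needs to be running.
    fake_client = MagicMock()
    fake_client.generate.return_value = {"response": "  Mocked answer.  "}

    assert answer_question(fake_client, "What is RAG?") == "Mocked answer."
    # Verify the wrapper forwarded the prompt exactly once, unchanged.
    fake_client.generate.assert_called_once_with("What is RAG?")
```

Because the client is passed in as an argument, the same function can be exercised with a mock in tests and a real client at runtime.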

Requirements

  • Python 3.12+
  • Ollama installed locally with a compatible model
  • (Optional) CUDA-capable GPU for GPU-accelerated inference
  • (Optional) Tesseract OCR for scanned documents

About

Local RAG for querying sensitive medical data using Ollama.
