MedicalRAG

A local-first Retrieval-Augmented Generation (RAG) system for querying knowledge from a corpus of documents. Everything runs on your machine — no cloud APIs, no external services.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    INGESTION PIPELINE                        │
│                                                             │
│  Source Docs ──► Text Extraction ──► Chunking ──► Embed ──► │
│  (PDF, etc.)       (parser + OCR        (overlapping    │   │
│                     fallback)            word windows)   ▼   │
│                                                    Vector    │
│                                                    Store     │
│                                                 (ChromaDB)   │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      QUERY PIPELINE                          │
│                                                             │
│  User Query ──► Retrieve ──► Context Assembly ──► Prompt ──►│
│                   (top-k       (retrieved chunks     │       │
│                 similarity     formatted with        │       │
│                   search)      source metadata)      │       │
│                                                      ▼       │
│                                               Local LLM      │
│                                              (via Ollama)    │
│                                                      │       │
│                                                      ▼       │
│                                              Response +      │
│                                              Sources         │
└─────────────────────────────────────────────────────────────┘

The system operates in two phases:

Ingestion — Source documents are parsed, split into overlapping word-based chunks, embedded into vectors, and stored in a local vector database. Re-ingestion uses a diff-based approach: only new or changed documents are reprocessed.

Query — A user question is embedded and used to retrieve the most relevant chunks from the vector store. Retrieved context is assembled into a prompt alongside conversation history and sent to a locally-running LLM. The response is cleaned of model-internal reasoning tags before display.
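The final cleanup step of the query phase can be sketched as follows. This is a minimal illustration, not the project's actual code: the tag name `<think>` and the helper `strip_reasoning` are assumptions, since the README only says that model-internal reasoning tags are removed.

```python
import re

# Assumption: the model wraps its internal reasoning in <think>...</think>.
# The README does not name the tag, so adjust the pattern to your model.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove reasoning blocks and surrounding whitespace from an LLM reply."""
    cleaned = _THINK_BLOCK.sub("", text)
    return cleaned.strip()


if __name__ == "__main__":
    raw = "<think>Check the retrieved chunks first.</think>\nThe answer is in chunk 2."
    print(strip_reasoning(raw))
```

The non-greedy match with `re.DOTALL` lets one reply contain several multi-line reasoning blocks without swallowing the text between them.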

Technology Stack

Layer	Technology
Language	Python 3.12
Vector Store	ChromaDB (persistent, HNSW index, cosine similarity)
Embeddings	Sentence-Transformers (`all-MiniLM-L6-v2`, 384-dim)
LLM Runtime	Ollama (local)
Document Parsing	PyMuPDF + Tesseract OCR fallback
Configuration	Pydantic Settings (`.env` / environment variables)
Testing	pytest

All ML inference (embedding + generation) runs locally on CPU or CUDA GPU.

Project Structure

├── med_ai/                  # Core library package
│   ├── config.py            # Typed settings with pydantic-settings
│   └── logging_config.py    # Rotating file + console logging setup
├── scripts/                 # Standalone utility scripts
│   └── process_pdfs.py      # Document parsing and chunking pipeline
├── tests/                   # Test suite (pytest)
│   ├── conftest.py          # Shared fixtures and mocks
│   ├── test_config.py
│   ├── test_process_pdfs.py
│   ├── test_rag_pipeline.py
│   └── test_reingestion.py
├── rag_pipeline.py          # Orchestration: ingestion, retrieval, generation, CLI
├── data/                    # Source documents + processed chunks (gitignored)
├── chroma_db/               # Persistent vector store (gitignored)
└── Modelfile                # Ollama model configuration

Configuration

All settings are managed through med_ai/config.py using pydantic-settings and can be overridden via a .env file or environment variables:

Setting	Default	Description
`CHROMA_PATH`	`chroma_db`	Vector store directory
`COLLECTION_NAME`	`medical_papers`	ChromaDB collection name
`EMBEDDING_MODEL`	`all-MiniLM-L6-v2`	Sentence-Transformer model
`LLM_MODEL`	`med-assistant-fast`	Ollama model tag
`TOP_K`	`8`	Number of chunks to retrieve
`CHUNK_SIZE`	`200`	Words per chunk
`CHUNK_OVERLAP`	`20`	Words shared between consecutive chunks
`DEVICE`	`cuda`	Inference device (falls back to CPU automatically when CUDA is unavailable)
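How `CHUNK_SIZE` and `CHUNK_OVERLAP` interact can be sketched with a small word-window splitter. This is an illustrative sketch that assumes whitespace tokenization; the function name `chunk_words` is hypothetical and the real logic in `scripts/process_pdfs.py` may differ.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows; each window repeats the last
    `overlap` words of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(400))
    for chunk in chunk_words(sample):
        tokens = chunk.split()
        print(len(tokens), tokens[0], tokens[-1])
```

With the defaults, each window advances by 180 words, so a sentence that straddles a chunk boundary is usually fully contained in at least one chunk.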

Data Flow

1. Place source documents in the configured input directory.
2. Run the document processing script to extract text and generate overlapping chunks (saved as JSON).
3. Run the RAG pipeline. It ingests chunks into the vector store (skipping unchanged sources), then opens an interactive REPL.
4. Each query retrieves relevant context from the vector store, constructs a prompt with conversation history, calls the local LLM, and returns the answer with source citations.
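The prompt construction in the final step above might look roughly like this. It is a sketch under stated assumptions: `RetrievedChunk`, `build_prompt`, the instruction wording, and the `max_turns` limit are all hypothetical and not taken from `rag_pipeline.py`.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    source: str  # hypothetical metadata field, e.g. the originating file name


def build_prompt(
    question: str,
    chunks: list[RetrievedChunk],
    history: list[tuple[str, str]],
    max_turns: int = 3,
) -> str:
    """Assemble numbered context, recent conversation turns, and the question."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk.source})\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )
    # Keep only the most recent turns so the prompt stays bounded.
    recent = history[-max_turns:] if max_turns > 0 else []
    dialogue = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)

    parts = [
        "Answer using only the context below and cite sources by number.",
        "Context:\n" + context,
    ]
    if dialogue:
        parts.append("Conversation so far:\n" + dialogue)
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [RetrievedChunk("Example passage.", "paper_a.pdf")]
    print(build_prompt("What does the passage say?", demo, history=[]))
```

Numbering the chunks gives the model a stable handle for citations, and the same numbers can be mapped back to `source` values when the answer is displayed.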

Re-ingestion is source-aware: if a processed document is re-generated under the same filename, old chunks are deleted and new ones are added. Unchanged documents are skipped.
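One way to decide which sources to add, replace, or skip is to compare a content hash per source. The README does not say how changes are detected, so the SHA-256 comparison and the names `content_hash` and `plan_reingestion` below are assumptions made for illustration.

```python
import hashlib


def content_hash(chunks: list[str]) -> str:
    """Hash a source's chunks in order; any edit to the text changes the digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk.encode("utf-8"))
        digest.update(b"\x00")  # separator so chunk boundaries affect the hash
    return digest.hexdigest()


def plan_reingestion(
    stored: dict[str, str],
    incoming: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Classify each incoming source by comparing it with the stored digests.

    stored:   source name -> digest recorded at the last ingestion
    incoming: source name -> freshly processed chunks
    """
    plan: dict[str, list[str]] = {"add": [], "replace": [], "skip": []}
    for source, chunks in sorted(incoming.items()):
        digest = content_hash(chunks)
        if source not in stored:
            plan["add"].append(source)
        elif stored[source] != digest:
            plan["replace"].append(source)  # delete old chunks, then add new ones
        else:
            plan["skip"].append(source)
    return plan


if __name__ == "__main__":
    stored = {"a.pdf": content_hash(["same"]), "b.pdf": content_hash(["old"])}
    incoming = {"a.pdf": ["same"], "b.pdf": ["new"], "c.pdf": ["fresh"]}
    print(plan_reingestion(stored, incoming))
```

For sources in the `replace` bucket, the ChromaDB side would typically be a `collection.delete(where={"source": name})` followed by `collection.add(...)`, assuming each chunk is stored with a `source` metadata field.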

Testing

Tests use pytest with external services (ChromaDB, Ollama, Sentence-Transformers) mocked via unittest.mock. No live dependencies or network access is required. Run with:

pytest

Requirements

Python 3.12+
Ollama installed locally with a compatible model
(Optional) CUDA-capable GPU for GPU-accelerated inference
(Optional) Tesseract OCR for scanned documents
