A local-first Retrieval-Augmented Generation (RAG) system for querying knowledge from a corpus of documents. Everything runs on your machine — no cloud APIs, no external services.
┌─────────────────────────────────────────────────────────────┐
│                     INGESTION PIPELINE                      │
│                                                             │
│  Source Docs ──► Text Extraction ──► Chunking ──► Embed ──► │
│  (PDF, etc.)     (parser + OCR       (overlapping    │      │
│                   fallback)           word windows)  ▼      │
│                                                    Vector   │
│                                                    Store    │
│                                                  (ChromaDB) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                       QUERY PIPELINE                        │
│                                                             │
│  User Query ──► Retrieve ──► Context Assembly ──► Prompt ──►│
│                 (top-k       (retrieved chunks       │      │
│                  similarity   formatted with         │      │
│                  search)      source metadata)       │      │
│                                                      ▼      │
│                                                  Local LLM  │
│                                                (via Ollama) │
│                                                      │      │
│                                                      ▼      │
│                                                 Response +  │
│                                                   Sources   │
└─────────────────────────────────────────────────────────────┘
The system operates in two phases:
Ingestion — Source documents are parsed, split into overlapping word-based chunks, embedded into vectors, and stored in a local vector database. Re-ingestion uses a diff-based approach: only new or changed documents are reprocessed.
Query — A user question is embedded and used to retrieve the most relevant chunks from the vector store. Retrieved context is assembled into a prompt alongside conversation history and sent to a locally-running LLM. The response is cleaned of model-internal reasoning tags before display.
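The overlapping word-window chunking used during ingestion can be sketched as follows. This is a minimal illustration, not the code in scripts/process_pdfs.py: the function name `chunk_words` is hypothetical, and it assumes the overlap is counted in words, like the chunk size.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows that share `overlap` words with the previous one.

    Hypothetical helper; the defaults mirror CHUNK_SIZE and CHUNK_OVERLAP.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 180 words, so a 450-word document would yield three chunks, the last one shorter than the others.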
| Layer | Technology |
|---|---|
| Language | Python 3.12 |
| Vector Store | ChromaDB (persistent, HNSW index, cosine similarity) |
| Embeddings | Sentence-Transformers (all-MiniLM-L6-v2, 384-dim) |
| LLM Runtime | Ollama (local) |
| Document Parsing | PyMuPDF + Tesseract OCR fallback |
| Configuration | Pydantic Settings (.env / environment variables) |
| Testing | pytest |
All ML inference (embedding + generation) runs locally on CPU or CUDA GPU.
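To make the "cosine similarity" entry in the table concrete, here is a small pure-Python sketch of top-k retrieval. It is an exact brute-force search written for illustration only; ChromaDB performs an approximate search through its HNSW index, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: dict[str, list[float]], k: int = 8):
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the real pipeline the vectors would be 384-dimensional embeddings from all-MiniLM-L6-v2 rather than hand-written lists.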
├── med_ai/ # Core library package
│ ├── config.py # Typed settings with pydantic-settings
│ └── logging_config.py # Rotating file + console logging setup
├── scripts/ # Standalone utility scripts
│ └── process_pdfs.py # Document parsing and chunking pipeline
├── tests/ # Test suite (pytest)
│ ├── conftest.py # Shared fixtures and mocks
│ ├── test_config.py
│ ├── test_process_pdfs.py
│ ├── test_rag_pipeline.py
│ └── test_reingestion.py
├── rag_pipeline.py # Orchestration: ingestion, retrieval, generation, CLI
├── data/ # Source documents + processed chunks (gitignored)
├── chroma_db/ # Persistent vector store (gitignored)
└── Modelfile # Ollama model configuration
All settings are managed through med_ai/config.py using pydantic-settings and can be overridden via a .env file or environment variables:
| Setting | Default | Description |
|---|---|---|
| CHROMA_PATH | chroma_db | Vector store directory |
| COLLECTION_NAME | medical_papers | ChromaDB collection name |
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | Sentence-Transformer model |
| LLM_MODEL | med-assistant-fast | Ollama model tag |
| TOP_K | 8 | Number of chunks to retrieve |
| CHUNK_SIZE | 200 | Words per chunk |
| CHUNK_OVERLAP | 20 | Overlap between consecutive chunks |
| DEVICE | cuda | Inference device (auto-falls back to CPU) |
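The override behaviour can be illustrated with a standard-library stand-in. The project itself uses pydantic-settings; the sketch below swaps in a plain dataclass so it runs without extra dependencies, and it assumes each environment variable is the upper-cased field name, as the table suggests. The `load_settings` helper is hypothetical.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the settings table.
    chroma_path: str = "chroma_db"
    collection_name: str = "medical_papers"
    embedding_model: str = "all-MiniLM-L6-v2"
    llm_model: str = "med-assistant-fast"
    top_k: int = 8
    chunk_size: int = 200
    chunk_overlap: int = 20
    device: str = "cuda"


def load_settings(env=None) -> Settings:
    """Build Settings, letting environment variables override the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(Settings):
        raw = env.get(field.name.upper())
        if raw is not None:
            # Cast the raw string to the type of the field's default value.
            overrides[field.name] = type(field.default)(raw)
    return Settings(**overrides)
```

For example, `load_settings({"TOP_K": "3", "DEVICE": "cpu"})` would return settings with `top_k == 3` and `device == "cpu"`, leaving every other value at its default.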
- Place source documents in the configured input directory
- Run the document processing script to extract text and generate overlapping chunks (saved as JSON)
- Run the RAG pipeline — it ingests chunks into the vector store (skipping unchanged sources) then opens an interactive REPL
- Each query retrieves relevant context from the vector store, constructs a prompt with conversation history, calls the local LLM, and returns the answer with source citations
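The prompt-construction and response-cleaning steps of a query might look like the sketch below. Everything here is an assumption made for illustration: the prompt layout, the chunk keys `source` and `text`, the helper names, and the `<think>` tag name (the actual reasoning tag depends on the model).

```python
import re

# Assumed tag name; adjust to whatever the configured model emits.
REASONING_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(response: str) -> str:
    """Remove model-internal reasoning blocks before showing the answer."""
    return REASONING_BLOCK.sub("", response).strip()


def build_prompt(question: str, chunks: list[dict], history: list[tuple[str, str]]) -> str:
    """Assemble retrieved chunks, prior turns, and the new question into one prompt."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    turns = "\n".join(f"{role}: {message}" for role, message in history)
    return (
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering each chunk and carrying its source name into the prompt is one way to let the answer cite its sources afterwards.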
Re-ingestion is source-aware: if a processed document is re-generated under the same filename, old chunks are deleted and new ones are added. Unchanged documents are skipped.
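One way to decide which sources need work is to compare a content fingerprint per source against what was stored last time. The sketch below assumes SHA-256 hashing of the chunk texts and uses plain dictionaries in place of the vector store; how the project actually detects changes is not stated here, and both function names are hypothetical.

```python
import hashlib


def fingerprint(chunks: list[str]) -> str:
    """Hash a source's chunks so any change in content changes the digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk.encode("utf-8"))
        digest.update(b"\x00")  # separator so chunk boundaries affect the hash
    return digest.hexdigest()


def plan_reingestion(processed: dict[str, list[str]], stored: dict[str, str]):
    """Classify sources as new, changed, or unchanged.

    processed: source name -> freshly generated chunks
    stored:    source name -> fingerprint recorded at the last ingestion
    """
    to_add, to_replace, unchanged = [], [], []
    for source, chunks in processed.items():
        if source not in stored:
            to_add.append(source)        # never ingested before
        elif stored[source] != fingerprint(chunks):
            to_replace.append(source)    # delete old chunks, then add new ones
        else:
            unchanged.append(source)     # skip entirely
    return to_add, to_replace, unchanged
```

Only the sources in the first two lists would need embedding, which is what keeps repeated runs cheap.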
Tests use pytest with external services (ChromaDB, Ollama, Sentence-Transformers) mocked via unittest.mock. No live dependencies or network access are required. Run with:
pytest
- Python 3.12+
- Ollama installed locally with a compatible model
- (Optional) CUDA-capable GPU for GPU-accelerated inference
- (Optional) Tesseract OCR for scanned documents