A fully offline RAG-based document question answering system optimized for Windows PCs. Features semantic search, hybrid retrieval, and CPU-based LLM inference with GGUF models.
- Offline-First Design: No internet required after initial setup
- Multi-format Support: PDF, DOCX, PPTX, TXT, MD documents
- Hybrid Retrieval: BM25 + Vector search with Reciprocal Rank Fusion (RRF)
- Window Expansion: Automatically fetches adjacent context chunks
- Smart Chunking: Paragraph and sentence boundary aware
- Cross-Encoder Reranking: Optional MS MARCO MiniLM for precise ranking
The application uses GGUF models via llama-cpp-python for fully offline inference:
- Default Model: Gemma 4 E2B (Q5_K_M GGUF, ~3.1GB) — bundled
- Set via: `RAG_GGUF_PATH` environment variable or `--gguf-path` CLI option
- No GPU required
- No network access required
- ~5-10 tokens/second on standard CPU
- Windows 11 (64-bit)
- Intel Core i5 11th generation or newer (or equivalent AMD Ryzen 5000+)
- Intel integrated graphics (present on all 11th gen+ Intel CPUs) — no discrete GPU required
- 16GB RAM
- ~4GB free storage for model + app
- Performance: ~5-7 tokens/second
- Intel Core i7 12th generation or newer (or equivalent AMD Ryzen 7000+)
- Intel Iris Xe integrated graphics or discrete GPU
- 32GB RAM
- SSD for vector database
- Performance: ~10-15 tokens/second
- High-end CPU (Intel Core i9 or AMD Ryzen 9)
- 64GB RAM
- Performance: ~15-20 tokens/second (CPU-only with GGUF)
- Dynamic Text Wrapping: Chat messages automatically wrap based on window width — text reflows as you resize the window
- Empty State Guide: Friendly placeholder shown when no documents are loaded, with sample questions and quick-start button
- Operation Cancellation: Cancel long-running operations (ingestion, querying, engine init) via Cancel button or Escape key
- CTkTooltip Class: Non-blocking hover tooltips with 500ms delay for all settings fields
- Contextual Help: Each RAG configuration field has descriptive hint text explaining its purpose
- Dark Theme Tooltips: Tooltips use dark background (#3a3a4e) with white text for consistent visibility
- Real-time UI Updates: Font size slider now applies to all widgets immediately when saved
- Debug Mode: Toggle debug-level logging for troubleshooting
- Log File Persistence: Customizable log file path with automatic persistence
- Auto-Reconfiguration: RAG settings (chunk size, n_results, etc.) trigger engine reinitialization when changed
- Thread-Safe RAG Engine: Full serialization via `asyncio.to_thread()` wrapping for blocking endpoints
- ChromaDB Locking: `RLock` for vector store operations preventing concurrent access corruption
- BM25 Index Threadsafety: Incremental add operations protected by RLock for safe concurrent document ingestion
- Lazy LLM Initialization: On-demand LLM loading reduces memory footprint for CLI/API modes
- Cancellation Propagation: `cancellation_event` passed through query processing for responsive long-operation termination
- Memory Budget Checks: Pre-ingestion memory validation prevents OOM errors on large document sets
- QueryTransformer Singleton: Shared transformer instance across requests with thread-safe initialization
- Cross-Encoder threadsafety: `__new__` pattern ensures single instance with RLock for concurrent reranking
- Neighborhood Expansion: Increased k from 3 to 5 chunks for better context coverage in streaming mode
- Embedding Batch Normalization: Consistent batch sizes for predictable memory usage during ingestion
- Thinking Indicator: Animated "Thinking..." with dots while LLM generates responses
- Smart Regeneration: "Regenerate" button replaces the last assistant message instead of creating duplicates
- Feedback System: Working thumbs up/down buttons that persist to database
- Conversation Context Menu: Right-click options to delete or rename conversations
- Time Display: Relative timestamps in sidebar (e.g., "2 min ago", "Yesterday")
- Enter Key Submission: Press Enter to submit questions (no need to click "Ask" button)
- Escape Key: Clears input field or cancels active operations
- Ctrl+Enter: Alternative shortcut for submitting questions
- Ctrl+L: Quick clear chat shortcut
- Ctrl+,: Open settings dialog shortcut
- Inline Typing Indicator: "Thinking..." indicator appears in chat area while processing (replaces status bar overwrite)
- Clear Chat Confirmation: Clear button requires a second click within 3 seconds to prevent accidental deletion
- Settings Switch Labels: CTkSwitch widgets now display descriptive text labels ("Enable Hybrid Search", "Enable Reranking")
- Windows 10 or later
- Python 3.10+
- pip package manager
- Clone or download the repository
  cd doc_qa_app
- Install dependencies
  pip install -r requirements.txt
- Download required models
  GGUF Model (Required for LLM inference)
  # Default model: Gemma 4 E2B (Q5_K_M) is bundled
  # To use a custom model, download any GGUF format model
  # From Hugging Face: https://huggingface.co/models?search=gguf
  Embedding Model (Required for search)
  # BAAI/bge-small-en-v1.5 is automatically downloaded on first use
  # Can be manually downloaded if needed for offline installation
- Run the application
  GUI Mode (default):
  python main.py
  CLI Mode:
  python main.py --cli
  API Server:
  python main.py --api --port 8080
- Download the offline installer bundle
  - Includes Python embeddable, wheels, and model files
- Extract the bundle
  - Unzip to a directory on your machine
- Install
  - Run the provided installer or execute `main.py`
- No internet required after installation
| Variable | Description | Default |
|---|---|---|
| `RAG_DB_PATH` | Vector database location | ./doc_qa_db |
| `RAG_GGUF_PATH` | Path to GGUF model file | - |
| `RAG_CHUNK_SIZE` | Document chunk size (words) | 512 |
| `RAG_N_RESULTS` | Context chunks to retrieve | 3 |
| `RAG_MAX_TOKENS` | Max response tokens | 1024 |
| `RAG_TEMPERATURE` | LLM temperature | 0.3 |
| `API_PORT` | API server port | 8080 |
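As a rough illustration of how these variables and their defaults could be read, here is a minimal sketch. The `RagSettings` class and `load_settings` function are hypothetical names used only for this example and are not taken from the application's code; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RagSettings:
    """Hypothetical settings container; field names are illustrative only."""
    db_path: str
    gguf_path: Optional[str]
    chunk_size: int
    n_results: int
    max_tokens: int
    temperature: float
    api_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return RagSettings(
        db_path=env.get("RAG_DB_PATH", "./doc_qa_db"),
        # No default is documented for the model path, so None means "not set".
        gguf_path=env.get("RAG_GGUF_PATH") or None,
        chunk_size=int(env.get("RAG_CHUNK_SIZE", "512")),
        n_results=int(env.get("RAG_N_RESULTS", "3")),
        max_tokens=int(env.get("RAG_MAX_TOKENS", "1024")),
        temperature=float(env.get("RAG_TEMPERATURE", "0.3")),
        api_port=int(env.get("API_PORT", "8080")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes it easy to try overrides without touching the real environment.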
Set both environment variables to enable authentication:
| Variable | Description | Example |
|---|---|---|
| `ENABLE_AUTH` | Enable authentication (any value enables) | true |
| `API_KEY` | Secret API key for authentication | your-secure-api-key |
# Linux/macOS (bash)
export ENABLE_AUTH=true
export API_KEY="your-secure-api-key"
python main.py --api --port 8080
# Windows (PowerShell)
$env:ENABLE_AUTH="true"
$env:API_KEY="your-secure-api-key"
python main.py --api --port 8080
All API requests require authentication headers:
- API Key: `X-API-Key: <your-api-key>`
- JWT Bearer Token: `Authorization: Bearer <jwt-token>`
import os
import requests
# Read the API key from the environment (set API_KEY before running this script)
headers = {
    "X-API-Key": os.environ["API_KEY"]
}
# Make authenticated request
response = requests.post("http://localhost:8080/ask", json={
    "question": "What are the main findings?",
    "n_results": 3
}, headers=headers)
print(response.json())
- Always use HTTPS in production
- Rotate API keys regularly
- Store API keys in environment variables, never in code
- See USAGE.md for complete authentication documentation
Backend Selection:
The application uses GGUF models only via llama-cpp-python.
If RAG_GGUF_PATH is set, that model is used. Otherwise, defaults to bundled Gemma 4.
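The selection rule can be sketched as below. This is an assumption-laden illustration: `resolve_model_path` is a hypothetical helper, the precedence of the CLI option over the environment variable is assumed rather than documented, and the bundled file name is taken from the troubleshooting section of this README.

```python
import os
from typing import Mapping, Optional

# File name as it appears in the troubleshooting section; its location is assumed.
BUNDLED_MODEL = "gemma-4-E2B-it-Q5_K-M.gguf"


def resolve_model_path(cli_path: Optional[str] = None,
                       env: Mapping[str, str] = os.environ) -> str:
    """Pick the GGUF model: explicit CLI path, then RAG_GGUF_PATH, then the bundled model."""
    return cli_path or env.get("RAG_GGUF_PATH") or BUNDLED_MODEL


if __name__ == "__main__":
    print(resolve_model_path())
```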
GUI Mode:
- Click "Ingest" button
- Select document folder (folder-based ingestion)
- Wait for processing to complete
Note: GUI supports folder-based batch ingestion. For single-file upload, use API or CLI mode.
CLI Mode:
# Ingest all documents in a directory
python main.py --ingest "C:\Documents\reports"
# Ingest a single file
python main.py --ingest "C:\Documents\report.pdf"
API Mode:
import requests
# Ingest entire directory
response = requests.post("http://localhost:8080/ingest", json={
    "directory": "C:/Documents/reports"
})
print(response.json())
# Upload and ingest single file
with open("C:/Documents/report.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8080/ingest/file",
        files={"file": ("report.pdf", f, "application/pdf")}
    )
print(response.json())
GUI Mode:
- Type your question in the input field
- Press Enter or click "Ask"
- View the answer with source citations
CLI Mode:
# Single question
python main.py --query "What are the main findings?"
# Interactive mode
python main.py --cli
API Mode:
import requests
response = requests.post("http://localhost:8080/ask", json={
    "question": "What are the main findings?",
    "n_results": 3
})
print(response.json())
Combines BM25 keyword search with vector semantic search using RRF fusion:
- BM25: Fast keyword matching
- Vector: Semantic understanding
- RRF Fusion: Combines both for optimal results
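To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The function name `rrf_fuse` and the constant `k=60` (the value commonly used in the RRF literature) are assumptions for illustration; the application's own implementation and constant may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["c1", "c2", "c3"]     # keyword ranking (illustrative IDs)
    vector_hits = ["c3", "c1", "c4"]   # semantic ranking (illustrative IDs)
    print(rrf_fuse([bm25_hits, vector_hits]))
```

Chunks that appear near the top of both lists (`c1`, `c3`) end up ahead of chunks that only one retriever found, which is the property that makes RRF useful without needing to normalize BM25 and cosine scores against each other.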
Automatically fetches adjacent chunks around retrieved results:
- Configurable window size (default: 1 chunk)
- Ensures context continuity
- Improves answer quality for multi-part questions
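The idea behind window expansion can be sketched with chunk positions alone. This example assumes chunks of a document are numbered consecutively from 0; `expand_window` is a hypothetical helper, not the application's function.

```python
from typing import Iterable, List


def expand_window(hit_indices: Iterable[int], total_chunks: int, window: int = 1) -> List[int]:
    """Return the retrieved chunk indices plus `window` neighbors on each side."""
    expanded = set()
    for i in hit_indices:
        lo = max(0, i - window)                  # clamp at the first chunk
        hi = min(total_chunks - 1, i + window)   # clamp at the last chunk
        expanded.update(range(lo, hi + 1))
    # Sorted so the assembled context keeps document order.
    return sorted(expanded)


if __name__ == "__main__":
    print(expand_window([0, 5], total_chunks=10))  # [0, 1, 4, 5, 6]
```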
MS MARCO TinyBERT reranker (enabled by default):
- Ranks retrieved chunks by relevance after initial retrieval
- Higher accuracy than pure hybrid search
- Lightweight (~85MB) — optimized for minimum-spec hardware
- Can be disabled via Settings dialog
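The reranking stage can be illustrated with a pluggable scoring function. In the application that role is played by the cross-encoder model; the word-overlap scorer below is only a stand-in so the sketch runs without downloading a model, and both `rerank` and `overlap_score` are hypothetical names.

```python
from typing import Callable, List, Sequence


def overlap_score(query: str, chunk: str) -> float:
    """Stand-in relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)


def rerank(query: str, chunks: Sequence[str],
           score_pair: Callable[[str, str], float] = overlap_score,
           top_k: int = 3) -> List[str]:
    """Score every (query, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [chunk for _, chunk in scored[:top_k]]


if __name__ == "__main__":
    candidates = ["the cat sat", "quarterly revenue grew", "revenue findings summary"]
    print(rerank("revenue findings", candidates, top_k=2))
```

A real cross-encoder would replace `overlap_score` with a model call that scores each pair jointly, which is slower than the first-stage retrieval and is why it is applied only to the handful of retrieved chunks.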
Keyword-based query expansion (disabled by default):
- Extracts key terms from questions to improve retrieval
- Note: The LLM-based step-back transformation is not wired (latency cost too high for minimum-spec hardware)
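A keyword-based expansion of this kind might look like the sketch below. The stopword list, the minimum word length, and the name `extract_key_terms` are all assumptions made for illustration; the application's actual term extraction is not documented here.

```python
import re
from typing import List

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {
    "a", "an", "and", "are", "for", "how", "in", "is", "of", "on",
    "the", "to", "was", "what", "when", "where", "which", "who", "why",
}


def extract_key_terms(question: str, max_terms: int = 5) -> List[str]:
    """Keep distinct non-stopword terms longer than two characters, in question order."""
    terms: List[str] = []
    for word in re.findall(r"[a-z0-9]+", question.lower()):
        if word not in STOPWORDS and len(word) > 2 and word not in terms:
            terms.append(word)
    return terms[:max_terms]


if __name__ == "__main__":
    print(extract_key_terms("What are the main findings of the report?"))
```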
LLM Settings:
- GGUF Model Path: Path to `.gguf` model file
RAG Settings:
- Chunk Size: Number of words per chunk
- Results to Retrieve: Number of chunks for context
- Max Tokens: Maximum response length
- Temperature: Response creativity (0.0-1.0)
Advanced Settings:
- Hybrid Search: Enable/disable BM25+Vector search
- Window Expansion: Number of adjacent chunks to fetch
- Cross-Encoder Reranking: Enable/disable reranking
python main.py [OPTIONS]
Options:
  --api               Run API server
  --cli               Run in interactive CLI mode
  --ingest PATH       Ingest documents from directory
  --query QUESTION    Ask a question
  --db-path PATH      Path to vector database (default: ./doc_qa_db)
  --model-path PATH   Path to GGUF model file (legacy alias for --gguf-path)
  --gguf-path PATH    GGUF model path
  --port PORT         API server port (default: 8080)
  --chunk-size SIZE   Chunk size in words (default: 512)
  --chunk-overlap N   Chunk overlap in words (default: 50)
┌─────────────────────────────────────────────────────────────┐
│                      Document Q&A App                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │   Document   │    │ Vector Store │    │ LLM Interface│   │
│  │   Processor  │───▶│  (ChromaDB+  │    │  (GGUF-only) │   │
│  │              │    │   BM25+RRF)  │◀───│              │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                    │          │
│         └───────────────────┴────────────────────┘          │
│                             │                               │
│                      ┌──────▼──────┐                        │
│                      │  RAG Engine │                        │
│                      │   (Query    │                        │
│                      │  Processing)│                        │
│                      └──────┬──────┘                        │
│                             │                               │
│                      ┌──────▼──────┐                        │
│                      │  GUI / API  │                        │
│                      └─────────────┘                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Document Processor
- Extracts text from PDF, DOCX, PPTX, TXT, MD
- Semantic chunking with paragraph/sentence boundaries
- Chunk overlap for context continuity
Vector Store
- ChromaDB for semantic vector storage
- BM25Index for keyword-based search
- Reciprocal Rank Fusion (RRF) for hybrid results
- Window expansion for context fetching
LLM Interface
- GGUF via llama-cpp-python (CPU-only, fully offline)
RAG Engine
- Query processing and routing
- Hybrid search orchestration
- Context assembly and answer generation
- Source citation tracking
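The chunk size and overlap settings used by the Document Processor can be illustrated with a plain sliding window over words. This sketch deliberately ignores the paragraph and sentence boundary handling described above, and `chunk_words` is a hypothetical name; only the defaults (512 words, 50 words of overlap) come from the CLI options.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

With a chunk size of 4 and an overlap of 1, each chunk starts on the last word of the previous one, which is how overlap preserves context across chunk boundaries.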
Solution 1: GGUF Model Not Found
# Check if model file exists (default bundled model)
dir gemma-4-E2B-it-Q5_K-M.gguf
# If not, download from:
# https://huggingface.co/google/gemma-4-2b-it-gguf
Solution 2: Wrong Model Path
- Check Settings dialog for correct path
- Use "Browse" button to select model file
pip install chromadb --break-system-packages
pip install sentence-transformers
# CPU-only build (recommended)
pip install llama-cpp-python
# With CUDA support (if you have an NVIDIA GPU)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
- Embedding model (~80MB) downloads on first use
- Subsequent runs use cached model
- BM25 index is built on first ingestion
Solution 1: Reduce chunk size
python main.py --chunk-size 128
Solution 2: Increase chunk overlap
python main.py --chunk-size 256 --chunk-overlap 100
Solution 3: Reduce number of results
$env:RAG_N_RESULTS=2
Check BM25 is enabled:
# In API, check config
from rag_engine import create_engine_from_env
engine = create_engine_from_env()
print(engine.config.hybrid_search)  # Should be True
Verify both backends loaded:
# Check vector store stats
stats = engine.vector_store.get_stats()
print(f"Embedding model: {stats['embedding_model']}")
print(f"BM25 index: {'Ready' if engine.vector_store.bm25_index else 'Not built'}")
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/stats` | GET | Engine statistics |
| `/ask` | POST | Ask a question |
| `/search` | POST | Search documents |
| `/ingest` | POST | Ingest directory |
| `/ingest/file` | POST | Upload and ingest file |
| `/documents` | GET | List documents |
| `/documents` | DELETE | Clear all documents |
import os
import requests
# Configure the engine
os.environ["RAG_GGUF_PATH"] = "path/to/gemma-4-E2B-it-Q5_K-M.gguf"
# Start API server in another terminal
# python main.py --api --port 8080
# Ask a question
response = requests.post("http://localhost:8080/ask", json={
    "question": "What are the main findings?",
    "n_results": 3
})
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Inference time: {result['inference_time']:.2f}s")
pip install pyinstaller
python build.py
The executable will be created in dist/DocumentQA.exe.
To create an offline installer:
# Prepare installer files
python scripts/build_installer.py
# Manually download:
# 1. GGUF model to build_installer/models/
# 2. Embedding model to build_installer/embeddings/
# 3. Python embeddable to python_embeddable/
# Run Inno Setup
iscc build_installer/setup.iss
This creates an offline installer with all dependencies and models included.
doc_qa_app/
├── main.py # Main entry point
├── app_gui.py # GUI application (customtkinter)
├── api_server.py # FastAPI REST server
├── rag_engine.py # RAG orchestration
├── document_processor.py # Document extraction & semantic chunking
├── vector_store.py # Vector search (ChromaDB + BM25 + RRF)
├── llm_interface.py # LLM interface (GGUF-only)
├── reranking.py # Cross-encoder reranking
├── query_transformer.py # Query transformation
├── utils.py # Utility functions (RRF fusion)
├── requirements.txt # Python dependencies
├── build.py # PyInstaller build script
├── scripts/
│ └── build_installer.py # Inno Setup preparation
└── README.md # This file
- Offline-Only: No data leaves your machine
- No Cloud Services: All processing is local
- Model Bundling: Models are stored locally
- Portable: Can be run from USB drive
MIT License - See LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- ChromaDB - Vector database
- Sentence Transformers - Embedding models
- llama-cpp-python - GGUF inference
- PyMuPDF - PDF processing
- CustomTkinter - Modern GUI toolkit
Version: 2.2.0
Last Updated: 2026-05-17
Hardware: CPU-only optimized for Intel 11th gen i5 and above (16GB RAM minimum)