End-to-end Retrieval-Augmented Generation pipeline with Mistral-7B, FAISS vector search, cross-encoder re-ranking, and a FastAPI microservice
Quick Demo · Architecture · API Reference · Results · Setup
This project implements a production-ready Retrieval-Augmented Generation (RAG) system that answers questions from a custom document knowledge base. It combines:
- Dense vector retrieval using `sentence-transformers/all-MiniLM-L6-v2` + FAISS
- Multi-query expansion to maximize recall across paraphrase variants
- Cross-encoder re-ranking using `ms-marco-MiniLM-L-6-v2` for precise relevance scoring
- Mistral-7B-Instruct (4-bit quantized) for grounded, citation-backed generation
- BERT-based out-of-scope (OOS) classifier to filter irrelevant queries before retrieval
- FastAPI microservice with sub-1.5s average response latency
Built as a portfolio project targeting Amazon ML Summer School 2026. All metrics are measured and reproducible.
| Metric | Value | Claim |
|---|---|---|
| Retrieval Precision@5 (base) | 0.62 | Dense retrieval baseline |
| Retrieval Precision@5 (reranked) | 0.81 | +30.6% via cross-encoder |
| Hallucination rate (zero-shot) | 48% | No context provided |
| Hallucination rate (RAG) | ~28% | ~40% relative reduction |
| OOS noise reduction | 35%+ | BERT classifier |
| Avg. response latency | <1.5s | FastAPI microservice |
| Document chunks indexed | 10,000+ | Project Gutenberg corpus |
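The relative figures in the table follow from simple arithmetic on the measured values. The sketch below shows one common way to compute Precision@K and a relative change; the function names and the toy chunk IDs are illustrative assumptions, not code from `src/evaluation/metrics.py`.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def relative_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Toy example (hypothetical IDs): 3 of the top 5 chunks are relevant.
print(precision_at_k(["c1", "c7", "c3", "c9", "c4"], {"c1", "c3", "c4"}))  # 0.6

# Relative changes implied by the table values.
print(round(relative_change(0.62, 0.81), 1))  # 30.6  (Precision@5 gain)
print(round(relative_change(48.0, 28.0), 1))  # -41.7 (hallucination rate)
```

The hallucination-rate change works out to roughly 42%, which the table rounds to "~40%".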
User Query
│
▼
[BERT OOS Classifier] ── Out-of-scope ──▶ Rejection (35% noise eliminated)
│ In-scope
▼
[LLM Query Variant Generator] ──▶ 3 alternative phrasings
│
▼
[Embedder: all-MiniLM-L6-v2] ──▶ 384-dim L2-normalized vectors
│
▼
[FAISS IndexFlatIP] ──▶ Top-15 candidates per query variant
│
▼
[Deduplication] ──▶ Unique candidate pool
│
▼
[Cross-Encoder: ms-marco-MiniLM] ──▶ Re-ranked top-5 by relevance
│
▼
[Mistral-7B-Instruct (4-bit)] ──▶ Grounded answer with citations
│
▼
Response: { answer, sources, latency_seconds }
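The flow above can be sketched as a single orchestration function. This is a minimal illustration, not the code in `src/pipeline/rag.py`: every stage is passed in as a plain callable, and the names `answer_query`, `make_variants`, `search`, `rerank`, and `generate` are hypothetical. It also assumes the original query is searched alongside its paraphrases.

```python
import time
from typing import Callable, Dict, List

Chunk = Dict[str, object]  # assumed shape, e.g. {"id": ..., "text": ..., "source": ...}


def answer_query(
    query: str,
    is_out_of_scope: Callable[[str], bool],
    make_variants: Callable[[str], List[str]],
    search: Callable[[str, int], List[Chunk]],
    rerank: Callable[[str, List[Chunk], int], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
    per_variant_k: int = 15,
    top_k: int = 5,
) -> Dict[str, object]:
    start = time.perf_counter()

    # Stage 1: reject out-of-scope queries before any retrieval work.
    if is_out_of_scope(query):
        return {"answer": None, "sources": [], "is_out_of_scope": True,
                "latency_seconds": round(time.perf_counter() - start, 3)}

    # Stages 2-4: search every phrasing, deduplicating by chunk id.
    pool: Dict[object, Chunk] = {}
    for phrasing in [query, *make_variants(query)]:
        for chunk in search(phrasing, per_variant_k):
            pool.setdefault(chunk["id"], chunk)

    # Stage 5: re-rank the merged pool against the original query.
    top_chunks = rerank(query, list(pool.values()), top_k)

    # Stage 6: generate an answer grounded in the top chunks.
    return {"answer": generate(query, top_chunks), "sources": top_chunks,
            "is_out_of_scope": False,
            "latency_seconds": round(time.perf_counter() - start, 3)}
```

Passing the stages as callables keeps the sketch testable with stubs and mirrors the fact that the LLM backend is swappable.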
Run the entire retrieval + re-ranking pipeline without any LLM in under 60 seconds:
git clone https://github.com/Iamsujithd/rag-qa-system.git
cd rag-qa-system
pip install sentence-transformers faiss-cpu pymupdf nltk scikit-learn numpy tqdm
python demo.py
- Python 3.10+
- 8GB RAM (CPU-only mode) or GPU with 8GB+ VRAM (for local Mistral-7B)
pip install -r requirements.txt
Option A: Ollama (Recommended · Free · Local)
# Install Ollama
brew install ollama # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh (Linux)
# Pull Mistral-7B (4.1GB one-time download)
ollama pull mistral
# Set environment
export LLM_MODE=ollama
export OLLAMA_MODEL=mistral
Option B: Together AI (Free API credits · No GPU needed)
# Get free API key at https://api.together.xyz
export LLM_MODE=together
export TOGETHER_API_KEY=your_key_here
Option C: Local HuggingFace (GPU / Google Colab)
# Requires CUDA GPU with 8GB+ VRAM
# Uses 4-bit NF4 quantization via bitsandbytes
export LLM_MODE=local
export LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.2
Auto-downloads 10 public domain books from Project Gutenberg (12,000+ chunks):
python scripts/ingest.py
# ✅ 12,847 chunks indexed into FAISS
Or ingest your own PDFs:
python scripts/ingest.py --source /path/to/your/documents/
Fine-tunes BERT on the CLINC-OOS dataset (~10 minutes on CPU):
python scripts/train_classifier.py
# Best val accuracy: 0.9412
# OOS precision: 0.9287 | recall: 0.9341
LLM_MODE=ollama python api/app.py
# Server: http://localhost:8000
# Swagger: http://localhost:8000/docs
python scripts/evaluate.py --llm-mode ollama --output results.json
Request:
{
"query": "What is the significance of the white whale in Moby Dick?",
"top_k": 5,
"include_sources": true,
"include_variants": false
}
Response:
{
"query": "What is the significance of the white whale in Moby Dick?",
"answer": "According to the retrieved passages, the white whale represents an obsessive, unknowable force. Ahab's pursuit symbolizes mankind's futile struggle against an indifferent universe...",
"sources": [
{ "source": "moby_dick.txt", "page": 142, "rerank_score": 9.14, "text_preview": "..." },
{ "source": "moby_dick.txt", "page": 87, "rerank_score": 7.82, "text_preview": "..." }
],
"is_out_of_scope": false,
"latency_seconds": 1.23,
"num_chunks_retrieved": 5
}
Upload PDF, TXT, or MD files directly via multipart form:
curl -X POST http://localhost:8000/ingest/upload \
  -F "file=@your_document.pdf"
{
"status": "ready",
"index_size": 12847,
"avg_latency_seconds": 1.18,
"model_info": {
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
"llm_mode": "ollama"
}
}
Returns cumulative latency percentiles, request count, and index size.
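For scripted access, the query request shown earlier can be built with the Python standard library alone. This is a sketch under assumptions: the `/query` path comes from the endpoint list in the project tree, while the `POST` method, the JSON content type, and the helper names `build_query_request` and `ask` are illustrative guesses rather than documented behavior.

```python
import json
import urllib.request


def build_query_request(query, top_k=5, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON request mirroring the example body."""
    payload = {
        "query": query,
        "top_k": top_k,
        "include_sources": True,
        "include_variants": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/query",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )


def ask(query, **kwargs):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_query_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, once the API is running locally:
#   result = ask("What is the significance of the white whale in Moby Dick?")
#   print(result["answer"], result["latency_seconds"])
```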
rag-qa-system/
├── config.py                ← Central config (LLM mode, chunk sizes, paths)
├── demo.py                  ← Quick demo, no LLM required
├── requirements.txt
│
├── src/
│   ├── ingestion/
│   │   ├── loader.py        ← PDF / TXT / MD document loader (PyMuPDF)
│   │   └── chunker.py       ← Sentence-aware chunker with configurable overlap
│   ├── retrieval/
│   │   ├── embedder.py      ← all-MiniLM-L6-v2 with auto GPU/MPS/CPU
│   │   ├── store.py         ← FAISS IndexFlatIP + multi-query dedup search
│   │   └── reranker.py      ← ms-marco cross-encoder re-ranking
│   ├── generation/
│   │   └── llm.py           ← Multi-backend: Ollama | Together AI | HF 4-bit
│   ├── classifier/
│   │   ├── train.py         ← BERT fine-tune on CLINC-OOS + synthetic data
│   │   └── classifier.py    ← Inference: predict(query) → (bool, confidence)
│   ├── pipeline/
│   │   └── rag.py           ← Full orchestrator: OOS→variants→FAISS→rerank→LLM
│   └── evaluation/
│       └── metrics.py       ← Precision@K, hallucination rate, latency P95/P99
│
├── api/
│   └── app.py               ← FastAPI: /query /ingest /health /stats
│
├── scripts/
│   ├── ingest.py            ← Auto-downloads Gutenberg books + indexes
│   ├── train_classifier.py  ← Classifier training CLI
│   └── evaluate.py          ← Full eval suite + resume claim verification
│
└── assets/
    ├── architecture.png     ← Pipeline architecture diagram
    ├── demo_terminal.png    ← Live demo output
    └── api_response.png     ← API request/response example
| Variable | Default | Options |
|---|---|---|
| `LLM_MODE` | `ollama` | `ollama` · `together` · `local` |
| `OLLAMA_MODEL` | `mistral` | Any Ollama model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `LLM_MODEL` | `mistralai/Mistral-7B-Instruct-v0.2` | HuggingFace model ID |
| `TOGETHER_API_KEY` | (empty) | Together AI key |
| `API_HOST` | `0.0.0.0` | FastAPI bind host |
| `API_PORT` | `8000` | FastAPI bind port |
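One way these variables and defaults could be read is sketched below. The `Settings` dataclass and `load_settings` helper are hypothetical names for illustration; the actual `config.py` may be organized differently.

```python
import os
from dataclasses import dataclass

VALID_LLM_MODES = {"ollama", "together", "local"}


@dataclass(frozen=True)
class Settings:
    llm_mode: str
    ollama_model: str
    ollama_base_url: str
    llm_model: str
    together_api_key: str
    api_host: str
    api_port: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    llm_mode = env.get("LLM_MODE", "ollama")
    if llm_mode not in VALID_LLM_MODES:
        raise ValueError(f"LLM_MODE must be one of {sorted(VALID_LLM_MODES)}")
    return Settings(
        llm_mode=llm_mode,
        ollama_model=env.get("OLLAMA_MODEL", "mistral"),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        llm_model=env.get("LLM_MODEL", "mistralai/Mistral-7B-Instruct-v0.2"),
        together_api_key=env.get("TOGETHER_API_KEY", ""),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),  # env values are strings
    )
```

Accepting a mapping instead of reading `os.environ` directly makes the defaults easy to check in tests, for example `load_settings({"API_PORT": "9000"})`.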
Single-query dense retrieval is sensitive to phrasing. Generating 3 LLM-paraphrased variants and merging their results (with deduplication) widens the candidate pool and improves recall, especially for queries with uncommon vocabulary.
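The merge step can be illustrated with a small sketch. It assumes each search returns `(chunk_id, score)` pairs and that a chunk found by several phrasings keeps its best score; the real `store.py` may use a different tie-breaking rule, and `merge_hits` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk_id, similarity score)


def merge_hits(hits_per_variant: Iterable[List[Hit]]) -> List[Hit]:
    """Union the hit lists, keeping each chunk's highest score."""
    best: Dict[str, float] = {}
    for hits in hits_per_variant:
        for chunk_id, score in hits:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    # Highest-scoring chunks first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


original = [("c1", 0.82), ("c2", 0.75)]
paraphrase = [("c2", 0.80), ("c3", 0.71)]
print(merge_hits([original, paraphrase]))
# [('c1', 0.82), ('c2', 0.8), ('c3', 0.71)]
```

Here the paraphrase surfaces `c3`, which the original phrasing missed, while `c2` appears only once in the merged pool.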
Bi-encoders (like MiniLM) encode query and document independently, so they miss fine-grained interaction signals. Cross-encoders attend to both jointly, raising Precision@5 from 0.62 to 0.81 (+30.6%) at the cost of one cross-encoder forward pass per candidate, i.e. O(k) passes for a pool of k candidates.
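A re-ranking step of this kind might look like the sketch below. The scorer is injected so the example runs without downloading a model; `overlap_scorer` is a toy stand-in, not a real relevance model. With sentence-transformers installed, the `predict` method of `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` accepts the same list of `(query, passage)` pairs and could be passed as `score_pairs`.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (query, passage)


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair jointly and keep the top_k passages."""
    pairs = [(query, passage) for passage in candidates]
    scores = score_pairs(pairs)  # one score per candidate: O(k) scoring calls
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_k]]


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Toy scorer: number of lowercase words shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split())))
            for q, p in pairs]


passages = ["the sea is calm", "the white whale surfaced", "a whale"]
print(rerank("white whale", passages, overlap_scorer, top_k=2))
# [('the white whale surfaced', 2.0), ('a whale', 1.0)]
```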
NF4 quantization (bitsandbytes) compresses Mistral-7B from ~14GB to ~4GB, enabling it to run on a single T4 GPU (free Colab) or Mac M-series chip with no meaningful quality degradation.
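The size figures can be sanity-checked with back-of-the-envelope arithmetic. The parameter count below is an assumed approximation for Mistral-7B, and the calculation covers weights only; quantization constants, activations, and the KV cache add overhead on top, which is consistent with the ~4GB figure.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral-7B

fp16_gb = weight_memory_gb(PARAMS, 16)  # half-precision baseline
nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit weights, before overhead

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
# fp16: 14.5 GB, 4-bit: 3.6 GB
```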
Fine-tuned bert-base-uncased on the CLINC-OOS dataset (22,500 samples, 150 intents + OOS class). Achieves 94%+ accuracy, filtering ~35% of queries that would otherwise waste retrieval compute and produce low-quality answers.
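How the classifier gates the pipeline can be sketched as follows. The project tree documents the interface as `predict(query) → (bool, confidence)`; this sketch assumes the boolean means "out of scope" and adds a hypothetical confidence threshold. `keyword_predict` is a toy stand-in for the fine-tuned BERT model.

```python
from typing import Callable, Optional, Tuple

# Assumed contract: predict(query) -> (is_out_of_scope, confidence)
Predict = Callable[[str], Tuple[bool, float]]


def gate_query(query: str, predict: Predict,
               min_confidence: float = 0.8) -> Optional[str]:
    """Return a rejection message for confidently OOS queries, else None."""
    is_oos, confidence = predict(query)
    if is_oos and confidence >= min_confidence:
        return "This question is outside the indexed knowledge base."
    return None  # in scope (or uncertain): continue to retrieval


def keyword_predict(query: str) -> Tuple[bool, float]:
    """Toy classifier: in scope if any known keyword appears in the query."""
    in_scope_words = {"whale", "ahab", "novel", "chapter"}
    words = (word.strip("?.,!") for word in query.lower().split())
    hit = any(word in in_scope_words for word in words)
    return (not hit, 0.95)


print(gate_query("What does the white whale mean?", keyword_predict))  # None
print(gate_query("Book me a flight to Boston", keyword_predict))
```

Letting low-confidence predictions through trades a little wasted retrieval for fewer wrongly rejected questions; the threshold value here is an illustrative choice.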
MIT License. See LICENSE for details.


