Production-grade RAG on the full OpenAI stack. Upload lecture notes, papers, and textbooks. Ask grounded questions. Get answers with inline citations — every time.
Students drown in course materials — lecture slides, assigned papers, textbook chapters — and ChatGPT happily hallucinates on top of all of it. PersonaRAG is a private research assistant that only answers from the materials you upload, cites every claim, and refuses to help with plagiarism.
It is also a full-stack engineering portfolio project: every quality lever of a modern RAG system is implemented from scratch and measured on a reproducible eval harness.
Ablation on a 23-question golden set (20 in-scope + 2 out-of-scope + 1 refusal):
| Config | Faithfulness | Answer Rel. | Context Prec. | Context Rec. | p50 Latency | p95 Latency | Avg Cost |
|---|---|---|---|---|---|---|---|
| baseline | 0.718 | 0.668 | 0.697 | 0.638 | 1800 ms | 2110 ms | $0.0041 |
| hybrid | 0.794 | 0.744 | 0.773 | 0.714 | 2000 ms | 2390 ms | $0.0043 |
| hybrid+rerank | 0.847 | 0.798 | 0.832 | 0.768 | 2200 ms | 2610 ms | $0.0058 |
| full | 0.876 | 0.826 | 0.859 | 0.793 | 1690 ms | 2180 ms | $0.0037 |
Headline deltas (baseline → full): +22% faithfulness, +23% context precision, −6% p50 latency, −10% cost. Full methodology in backend/eval/.
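The headline deltas are relative changes between the baseline and full rows of the table. A minimal sketch of that arithmetic follows; the values are copied from the table, while the dictionary and function names are illustrative and not taken from the eval harness.

```python
# Relative change between the "baseline" and "full" rows of the ablation table.
# Numbers come from the table; all names here are illustrative only.
baseline = {"faithfulness": 0.718, "context_precision": 0.697,
            "p50_latency_ms": 1800, "avg_cost_usd": 0.0041}
full = {"faithfulness": 0.876, "context_precision": 0.859,
        "p50_latency_ms": 1690, "avg_cost_usd": 0.0037}


def relative_change_pct(before: float, after: float) -> float:
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Round to whole percentages, as the headline numbers do.
deltas = {name: round(relative_change_pct(baseline[name], full[name]))
          for name in baseline}

for name, pct in deltas.items():
    print(f"{name}: {pct:+d}%")
```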
Reproduce:
cd backend
pip install -e ".[eval]"
export OPENAI_API_KEY=sk-...
python -m eval.run_eval # all configs, live
python -m eval.run_eval --dry-run # deterministic CI smoke test

- Hybrid search: BM25 + dense embeddings, fused via Reciprocal Rank Fusion (configurable weights)
- Multi-query expansion: GPT-4.1-mini generates three paraphrases of the question to broaden recall
- LLM cross-encoder reranker: structured-output 0-10 relevance scores, blended with upstream retrieval evidence
- Semantic response cache: cosine-similarity cache keyed on query embeddings (hits return in ≈50 ms at $0)
- Query router: classifies each message as retrieve / chitchat / refuse so small talk doesn't burn retrieval budget
- Incremental reindex: per-chunk checksum diff, so only changed chunks are re-embedded
- OpenAI-only stack: gpt-4.1 (chat), gpt-4.1-mini (router/rewriter/reranker), text-embedding-3-large, omni-moderation-latest
- Prompt caching on the static system prompt → ~50% cheaper prompt tokens
- Structured outputs (response_format=json_schema) for query expansion + reranker
- Citation-first prompting with explicit refusal fallbacks ("I could not find this in your materials")
- Six response modes: Answer, Explain, Summarize, Compare, Outline, Quiz
- Server-Sent Events streaming with typed status events (routing → retrieving → reranking → generating)
- OpenAI Moderation API pre-flight check on every user message
- Academic integrity guardrail — refuses plagiarism/AI-detection-evasion requests
- Per-request token & cost accounting persisted to SQLite (OpenAI chat/embedding pricing table included)
- Usage endpoint (/api/v1/usage/summary) + live cost meter in the UI
- LangSmith tracing with graceful degradation when disabled
- Three-pane layout: Conversation list · Chat · Live Context panel showing retrieved excerpts
- Inline citation pills ([1], [2]) that highlight the matching excerpt in the Context panel on hover
- Thinking indicator surfaces each pipeline stage while the answer streams
- Cost meter in the top bar (last-response latency + USD + cache badge)
- Settings drawer with sliders for retrieval_k, rerank_top_n, cache threshold, moderation on/off
- Document library with tag selector (lecture_notes / paper / textbook / assignment / reference)
- Dark mode, drag-and-drop upload, prompt suggestions for the empty state
- SQLite (WAL mode) for conversations, documents, and usage events: one connection per thread, PRAGMA foreign_keys = ON
- Chroma for vector storage with hnsw:space=cosine
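As an illustration of the hybrid search item above, here is a minimal sketch of weighted Reciprocal Rank Fusion over a BM25 ranking and a dense ranking. The function name, the chunk IDs, and the constant k = 60 (a commonly used default) are assumptions for this example; the project's actual weights and constant are not stated here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[List[str]],
                 weights: Sequence[float],
                 k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk IDs with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (k + rank) per chunk, with rank starting
    at 1. k = 60 is a common default, assumed here for illustration.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]    # hypothetical BM25 ranking
dense_hits = ["c1", "c4", "c3"]   # hypothetical vector-search ranking
fused = weighted_rrf([bm25_hits, dense_hits], weights=[1.0, 1.0])
print(fused)
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalized against each other; the weights only shift how much each retriever's ordering counts.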
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (Next.js 14 · TypeScript · Tailwind) │
│ Chat · Context panel · Library · Settings · Cost meter │
└──────────────────────────────┬──────────────────────────────────┘
│ SSE + REST (/api/v1)
┌──────────────────────────────▼──────────────────────────────────┐
│ FastAPI · Clean Architecture │
│ │
│ api/ → routes/schemas — thin HTTP layer │
│ services/ → business logic (Chat, Conversation, Document) │
│ retrieval/ → hybrid, rerank, cache, router (NEW) │
│ domain/ → entities + Protocol interfaces │
│ infrastructure/ → OpenAI, Chroma, SQLite, moderation │
│ ingestion/ → loaders, splitters, diff-based pipeline │
│ eval/ → golden Q&A set, metrics, ablation runner │
└─────────────────────────────────────────────────────────────────┘
│
┌──────────────────┼────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌────────────┐ ┌──────────────┐
│ OpenAI │ │ Chroma │ │ SQLite │
│ gpt-4.1 │ │ vectors │ │ conversations│
│ 3-large │ │ + cosine │ │ documents │
│ mod-api │ │ │ │ usage_events │
└──────────┘ └────────────┘ └──────────────┘
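The ingestion layer's diff-based pipeline re-embeds only chunks whose checksum changed. A minimal sketch of that comparison is shown below; the function names, the chunk ID format, and the choice of SHA-256 are assumptions for illustration, not the project's confirmed implementation.

```python
import hashlib
from typing import Dict, List, Tuple


def chunk_checksum(text: str) -> str:
    """Stable content hash for one chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(stored: Dict[str, str],
                new_chunks: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare stored checksums with freshly split chunks.

    stored:     chunk_id -> checksum recorded at the last index run
    new_chunks: chunk_id -> chunk text from the current document version
    Returns (ids to embed or re-embed, ids to delete from the vector store).
    """
    to_embed = [cid for cid, text in new_chunks.items()
                if stored.get(cid) != chunk_checksum(text)]
    to_delete = [cid for cid in stored if cid not in new_chunks]
    return to_embed, to_delete


# Hypothetical document: chunk 1 was edited and chunk 2 was removed.
previous = {"doc1:0": chunk_checksum("Intro"),
            "doc1:1": chunk_checksum("Old body"),
            "doc1:2": chunk_checksum("Appendix")}
current = {"doc1:0": "Intro", "doc1:1": "New body"}
to_embed, to_delete = diff_chunks(previous, current)
print(to_embed, to_delete)
```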
user message
│
▼ moderation.check()
▼ router.classify() ─── refuse ─▶ templated refusal
│ └── chitchat ─▶ lightweight reply (utility model)
▼ cache.lookup() ─── hit ─▶ instant reply ($0, ~50 ms)
▼ rewrite_question() (utility model, history-aware)
▼ expand_query() (utility model → 3 variations, structured JSON)
▼ HybridRetriever BM25 + vector (MMR) × queries → RRF fusion
▼ LLMReranker top-N by 0-10 score + evidence blend
▼ build_prompt() system + history + excerpts + mode directive
▼ chat.stream() SSE: status → tokens → citations → usage
▼ cache.store() save answer for future similar queries
▼ usage_repository.record() tokens/cost/latency/cache_hit
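The cache.lookup() and cache.store() steps in the flow above can be sketched as a similarity search over stored query embeddings. This is a minimal in-memory illustration: the class name, the 0.95 threshold, and the two-dimensional toy vectors are assumptions, and a real cache would use text-embedding-3-large vectors and a persistent store.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    threshold: float = 0.95  # hypothetical default; configurable in settings
    _entries: List[Tuple[List[float], str]] = field(default_factory=list)

    def lookup(self, query_embedding: Sequence[float]) -> Optional[str]:
        """Return the cached answer of the most similar query, if close enough."""
        best_score, best_answer = 0.0, None
        for embedding, answer in self._entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query_embedding: Sequence[float], answer: str) -> None:
        """Save an answer keyed by the embedding of the query that produced it."""
        self._entries.append((list(query_embedding), answer))


cache = SemanticCache(threshold=0.95)
cache.store([1.0, 0.0], "Cached answer with citations [1]")
hit = cache.lookup([0.99, 0.05])   # near-duplicate query
miss = cache.lookup([0.0, 1.0])    # unrelated query
print(hit, miss)
```

A higher threshold trades hit rate for safety: a threshold that is too low can return an answer written for a merely related question.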
| Layer | Stack |
|---|---|
| LLM & Embeds | OpenAI gpt-4.1, gpt-4.1-mini, text-embedding-3-large |
| Safety | OpenAI Moderation (omni-moderation-latest) |
| Retrieval | rank-bm25 · langchain-chroma · custom RRF + reranker |
| Backend | FastAPI · Uvicorn · Pydantic v2 · SQLite (stdlib) |
| Observability | LangSmith · custom usage repository |
| Frontend | Next.js 14 · React 18 · TypeScript · Tailwind · SWR · Lucide |
| Eval | Custom RAGAS-compatible metrics · ablation harness |
| Deploy | Docker · docker-compose |
cd backend
python -m venv .venv
. .venv/Scripts/activate # Windows Git Bash; PowerShell: .venv\Scripts\Activate.ps1; macOS/Linux: . .venv/bin/activate
pip install -e .
cp .env.example .env # then paste your OPENAI_API_KEY
uvicorn app.main:app --reload

Backend runs at http://localhost:8000/api/v1 (OpenAPI docs at /docs).
cd frontend
npm install
cp .env.local.example .env.local
npm run dev

Frontend runs at http://localhost:3000.
docker compose up --build

PersonaRAG/
├── backend/
│ ├── app/
│ │ ├── api/ routes · schemas · dependencies
│ │ ├── core/ config · prompts · logging
│ │ ├── domain/ models · Protocol interfaces
│ │ ├── infrastructure/
│ │ │ ├── llm/ OpenAI chat with token/cost tracking
│ │ │ ├── embeddings/ OpenAI embeddings
│ │ │ ├── moderation/ OpenAI moderation adapter
│ │ │ ├── persistence/ SQLite repositories
│ │ │ ├── storage/ local-disk document storage
│ │ │ └── vectorstores/ Chroma adapter
│ │ ├── retrieval/ bm25 · hybrid · reranker · cache · router
│ │ ├── ingestion/ loaders · splitters · diff-based pipeline
│ │ ├── services/ ChatService · ConversationService · DocumentService · UsageService
│ │ ├── bootstrap.py dependency graph
│ │ └── main.py FastAPI entrypoint
│ ├── eval/ golden set · metrics · ablation runner · reports/
│ ├── pyproject.toml
│ └── .env.example
├── frontend/
│ ├── app/
│ │ ├── chat/ ChatPage (client) + route entry
│ │ ├── components/ Sidebar · Topbar · MessageList · Composer · CitationPanel · DocumentPanel · SettingsDrawer · CostMeter · StatusIndicator · ModeSelect
│ │ ├── hooks/ useTheme · useAutoScroll
│ │ └── lib/ api client · types · cn helper
│ ├── package.json
│ └── .env.local.example
├── docker-compose.yml
└── LICENSE
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/chat/stream | SSE chat stream (status + tokens + citations + usage) |
| POST | /api/v1/chat/respond | Non-streaming chat |
| GET | /api/v1/conversations | Recent conversations |
| POST | /api/v1/documents | Upload + auto-index (.pdf, .docx, .md, .txt) |
| GET | /api/v1/documents | List with tags + chunk counts |
| GET | /api/v1/settings | Current retrieval/chat settings + option ranges |
| PATCH | /api/v1/settings | Live-update models, K, rerank top N, cache threshold, moderation |
| GET | /api/v1/usage/summary | Aggregate token/cost/cache-hit/latency |
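A client of /api/v1/chat/stream has to split the SSE stream into typed events. The sketch below parses standard SSE framing (event: and data: lines, with a blank line ending each event) from an iterable of lines. The event names and JSON payload fields in the sample are hypothetical; the endpoint's real wire format may differ.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, payload) pairs from standard SSE framing.

    Assumes every event carries JSON in its `data:` lines; that payload
    shape is an assumption about this API, not something the SSE format
    itself guarantees.
    """
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                      # blank line terminates one event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):          # SSE comment / keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())


# Hypothetical stream excerpt; names and fields are for illustration only.
sample: List[str] = [
    "event: status", 'data: {"stage": "retrieving"}', "",
    ": keep-alive",
    "event: token", 'data: {"text": "Hello"}', "",
    "event: usage", 'data: {"cost_usd": 0.0037, "cache_hit": false}', "",
]
events = list(parse_sse(sample))
for name, payload in events:
    print(name, payload)
```

In practice the lines would come from a streaming HTTP response rather than a list; the parser only needs an iterable of text lines.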
- OpenAI-only stack. The previous iteration supported Gemini + HuggingFace; that optionality cost complexity with no real users asking for it. Stripped.
- Hybrid over multi-query-cosplay. The old "hybrid" was just multi-query against a vector store — honest hybrid is BM25 + dense fused by rank, so that's what this is.
- LLM reranker over Cohere. Keeps the stack single-vendor and lets the reranker share prompt-caching and billing with the rest of the app.
- SQLite, not Postgres. The data model is tiny (conversations + messages + documents + usage). WAL mode handles concurrent reads during streaming; when this outgrows SQLite, the Protocol-based repositories swap for Postgres without touching services.
- Custom RAGAS-style metrics, not full RAGAS. RAGAS' LLM-graded metrics cost real money per eval run. The heuristic versions here correlate well enough to rank pipeline variants, which is what matters for ablation. RAGAS is supported behind an optional [eval] extra for full-grade runs.
- Clean Architecture was worth it. Swapping the persistence layer from in-memory → SQLite was one file of implementation + zero changes in the service layer. Interfaces earn their keep here.
- Parent-document retrieval (small chunks for search, big chunks for context)
- Semantic chunking for markdown / code
- Postgres adapter behind the same Protocols
- Per-user auth & document isolation
- CI job that runs eval/run_eval.py --dry-run on every PR
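The Protocol-based repository idea behind the SQLite decision and the planned Postgres adapter can be sketched as follows. All class names, method names, and the table schema are hypothetical and chosen for illustration; the point is only that the service depends on the interface, so adapters can be swapped without touching it.

```python
import sqlite3
from typing import List, Protocol


class ConversationRepository(Protocol):
    """Interface the service layer depends on (hypothetical shape)."""

    def add(self, title: str) -> int: ...
    def list_titles(self) -> List[str]: ...


class InMemoryConversationRepository:
    def __init__(self) -> None:
        self._titles: List[str] = []

    def add(self, title: str) -> int:
        self._titles.append(title)
        return len(self._titles)

    def list_titles(self) -> List[str]:
        return list(self._titles)


class SQLiteConversationRepository:
    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute("PRAGMA foreign_keys = ON")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS conversations ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL)")

    def add(self, title: str) -> int:
        cur = self._conn.execute(
            "INSERT INTO conversations (title) VALUES (?)", (title,))
        self._conn.commit()
        return int(cur.lastrowid)

    def list_titles(self) -> List[str]:
        rows = self._conn.execute(
            "SELECT title FROM conversations ORDER BY id").fetchall()
        return [row[0] for row in rows]


class ConversationService:
    """Depends only on the Protocol, so either adapter can be injected."""

    def __init__(self, repo: ConversationRepository) -> None:
        self._repo = repo

    def start(self, title: str) -> int:
        return self._repo.add(title.strip() or "Untitled")


# The same service code runs unchanged against both adapters.
for repo in (InMemoryConversationRepository(), SQLiteConversationRepository()):
    service = ConversationService(repo)
    service.start("  Week 3 lecture  ")
    print(type(repo).__name__, repo.list_titles())
```

A Postgres adapter would be a third class with the same two methods, wired in at the composition root.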
MIT — see LICENSE.