tiny-rag-lab is a learning-first RAG engine/laboratory for understanding how
classic retrieval-augmented generation works end to end.
The goal is to keep the RAG lifecycle visible: document loading, text normalization, chunking, metadata, embeddings, local vector search, retrieval, prompt assembly, answer generation, citations, evaluation, and failure inspection.
Phase 1 through Phase 2.0 are complete. No phase is currently active.
- Phase 1 - Naive Classic RAG: full pipeline from corpus to grounded answers with citations
- Phase 1.5 - Retrieval Mechanics: BM25 keyword retrieval, hybrid retrieval, and retriever comparison flags
- Phase 1.6 - Evaluation Harness: retrieval quality metrics (rag eval) against a prepared QA set
- Phase 1.7 - Observability And Debugging: retrieve/ask traces, stage latency, and optional JSON trace output
- Phase 1.8 - RAG Failure Lab: curated failure cases and rag diagnose for baseline vs. intervention retrieval
- Phase 1.9 - Reranking: fake and cross-encoder reranker interfaces with retrieve/eval/ask/diagnose integration
- Phase 2.0 - Answer Quality Judging: fake and OpenAI-compatible judge paths for answer metrics and answer-side failure diagnosis
Completed phase contracts:
- Phase index
- Phase 1 spec · taskboard
- Phase 1.5 spec · taskboard
- Phase 1.6 spec · taskboard
- Phase 1.7 spec · taskboard
- Phase 1.8 spec · taskboard
- Phase 1.9 spec · taskboard
- Phase 2.0 spec · taskboard
Phase 1 delivers a minimal but complete CLI-first RAG baseline:
local corpus -> documents -> normalized text -> chunks -> embeddings
-> local vector index -> query embedding -> cosine retrieval
-> grounded prompt -> generated answer with citations
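Two stages of the flow above, chunking with overlap and cosine retrieval, can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names chunk_text, cosine, and top_k are hypothetical, and measuring chunk size in characters is an assumption (the real pipeline may count tokens or split on other boundaries).

```python
import math


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Assumption: sizes are counted in characters, not tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5):
    """Rank chunk ids by cosine similarity to the query vector."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is enough for a small local index, which fits the decision to avoid a vector database in Phase 1.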
Key decisions:
- Python implementation
- argparse CLI
- primary corpus: IBM watsonxDocsQA
- local embeddings: sentence-transformers/all-MiniLM-L6-v2
- fake embedder and fake generator for tests
- local index files under .tiny-rag/index/
- no vector database in Phase 1
- no LangChain/LlamaIndex/Haystack wrapper in Phase 1
Phase 1.5 adds inspectable retrieval strategies to compare dense vector search, BM25 keyword search, and hybrid retrieval with Reciprocal Rank Fusion.
query + index -> dense retrieval | BM25 retrieval -> optional RRF fusion
-> ranked chunks and eval reports tagged with retriever=dense|bm25|hybrid
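Reciprocal Rank Fusion combines ranked lists using only rank positions, so dense and BM25 scores never need to share a scale. The sketch below is illustrative: rrf_fuse is a hypothetical name, and k=60 is the constant commonly used with RRF, not a value confirmed for this project.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. k=60 is an assumed default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c1", "c2", "c3"]
bm25 = ["c3", "c1", "c4"]
fused = rrf_fuse([dense, bm25])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists rise above chunks that rank well in only one of them.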
Phase 1.6 adds a rag eval command that measures retrieval quality against the
prepared qa.jsonl evaluation set. Four deterministic metrics are reported:
hit rate @ k, MRR, context precision, and context recall.
qa.jsonl + index -> embed questions -> retrieve top-k -> compare to gold docs
-> hit rate, MRR, context precision, context recall
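The four metrics can be computed per question and then averaged. The sketch below uses common document-level definitions, which are an assumption here: the project's phase spec may define context precision and recall differently. The names score_question and aggregate are hypothetical.

```python
def score_question(retrieved: list[str], gold: set[str]) -> dict[str, float]:
    """Score one question; `retrieved` is the ranked top-k list of doc ids.

    Assumed definitions:
      hit               1 if any gold doc appears in the top-k
      reciprocal_rank   1 / rank of the first gold doc, else 0
      context_precision share of retrieved docs that are gold
      context_recall    share of gold docs that were retrieved
    """
    hits = [doc_id in gold for doc_id in retrieved]
    first_hit = next((rank for rank, hit in enumerate(hits, start=1) if hit), None)
    found = {doc_id for doc_id in retrieved if doc_id in gold}
    return {
        "hit": 1.0 if first_hit else 0.0,
        "reciprocal_rank": 1.0 / first_hit if first_hit else 0.0,
        "context_precision": sum(hits) / len(retrieved) if retrieved else 0.0,
        "context_recall": len(found) / len(gold) if gold else 0.0,
    }


def aggregate(per_question: list[dict[str, float]]) -> dict[str, float]:
    """Average per-question scores into hit rate, MRR, precision, and recall."""
    if not per_question:
        return {}
    n = len(per_question)
    return {key: sum(row[key] for row in per_question) / n for key in per_question[0]}
```

Averaging "hit" gives hit rate @ k and averaging "reciprocal_rank" gives MRR; every step is deterministic, so repeated runs against the same index return the same numbers.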
Phase 1.7 adds trace records and human-readable trace output for retrieve and ask flows. Traces expose the retriever, top-k, ranked chunks, scores, citations, prompt/answer context, and stage latency.
query + retrieval/ask flow -> trace fields -> readable trace and optional JSON
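One way to collect such a trace is a small record with a timing helper per stage. This is a sketch under assumptions: the Trace class, its field names, and the stage() helper are hypothetical and do not reflect the project's actual trace schema.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    """Hypothetical trace record; field names are illustrative only."""
    query: str
    retriever: str
    top_k: int
    chunks: list = field(default_factory=list)
    latency_ms: dict = field(default_factory=dict)

    @contextmanager
    def stage(self, name: str):
        """Record wall-clock latency for one pipeline stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latency_ms[name] = (time.perf_counter() - start) * 1000.0

    def to_json(self) -> str:
        """Serialize the trace for optional JSON output."""
        return json.dumps(asdict(self), indent=2)


trace = Trace(query="question text", retriever="dense", top_k=5)
with trace.stage("retrieve"):
    # A real flow would call the retriever here.
    trace.chunks.append({"chunk_id": "c1", "score": 0.83})
print(trace.to_json())
```

Recording latency in a finally block means a stage that raises still shows up in the trace, which helps when inspecting failures.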
Phase 1.8 adds a failure lab for curated retrieval failure scenarios. The
rag diagnose command compares each case's baseline and intervention retrieval
config, labels heuristic failure modes, and reports whether failures were
confirmed, fixed, moved, or unchanged.
failure cases + index -> baseline retrieval + intervention retrieval
-> failure labels, metrics, and diagnosis report
Phase 1.9 adds a reranker abstraction and optional second-pass reranking for
retrieve, eval, ask, and diagnose workflows. The default none path remains
unchanged; fake rerankers keep tests offline, and the cross-encoder path is
lazy and gated.
initial candidates -> optional reranker -> final top-k chunks
-> traces and reports with reranker metadata
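The shape of such an abstraction can be sketched with a typing.Protocol and a deterministic fake. All names here (Reranker, FakeReranker, retrieve_final) and the word-overlap scoring are assumptions for illustration; the project's real interface may differ.

```python
from typing import Optional, Protocol


class Reranker(Protocol):
    """Hypothetical interface: score (chunk_id, text) candidates for a query."""

    def rerank(self, query: str, candidates: list[tuple[str, str]]) -> list[tuple[str, float]]:
        ...


class FakeReranker:
    """Offline, deterministic stand-in: score = number of shared query words."""

    def rerank(self, query: str, candidates: list[tuple[str, str]]) -> list[tuple[str, float]]:
        query_words = set(query.lower().split())
        scored = [
            (chunk_id, float(len(query_words & set(text.lower().split()))))
            for chunk_id, text in candidates
        ]
        # Highest score first; ties broken by chunk id for stable output.
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def retrieve_final(
    query: str,
    candidates: list[tuple[str, str]],
    reranker: Optional[Reranker],
    top_k: int,
) -> list[str]:
    """Apply the optional second pass, then cut to the final top-k chunk ids."""
    if reranker is None:
        # The "none" path keeps the first-stage order untouched.
        return [chunk_id for chunk_id, _ in candidates][:top_k]
    return [chunk_id for chunk_id, _ in reranker.rerank(query, candidates)][:top_k]
```

Because the fake needs no model download, tests can exercise the full retrieve, rerank, and truncate path offline. A cross-encoder implementation would satisfy the same protocol and load its model only on first use.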
Phase 2.0 adds answer-quality judging behind a fakeable interface. rag eval
can print retrieval metrics plus answer metrics, rag ask can include a judge
verdict in the trace, and rag diagnose can cover answer-side failures such
as unsupported answers and citation mismatches.
retrieved context + generated answer -> judge verdict
-> answer metrics, trace verdicts, and answer-side diagnosis
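A fake judge can be a few deterministic rules over the answer's citations. The sketch below is an assumption-heavy illustration: the Verdict and FakeJudge names, the verdict labels, and the bracketed [chunk_id] citation format are all hypothetical and not taken from the project.

```python
import re
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str   # assumed labels: "supported", "unsupported", "citation_mismatch"
    reason: str


class FakeJudge:
    """Offline judge: checks citations only, never calls a model."""

    # Assumption: answers cite chunks as [chunk_id].
    CITATION = re.compile(r"\[([^\[\]]+)\]")

    def judge(self, answer: str, context_ids: set[str]) -> Verdict:
        cited = set(self.CITATION.findall(answer))
        if not cited:
            return Verdict("unsupported", "answer cites no retrieved chunk")
        unknown = sorted(cited - context_ids)
        if unknown:
            return Verdict("citation_mismatch", f"cited but not retrieved: {unknown}")
        return Verdict("supported", "all citations point at retrieved chunks")


judge = FakeJudge()
print(judge.judge("Use the REST API [c1].", {"c1", "c2"}).label)
```

An OpenAI-compatible judge could return the same Verdict shape, so eval, ask, and diagnose code would not need to know which judge produced it.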
rag index --corpus PATH --index-dir .tiny-rag/index --chunk-size 800 --chunk-overlap 120
rag retrieve "question text" --index-dir .tiny-rag/index --top-k 5 --retriever dense
rag retrieve "question text" --index-dir .tiny-rag/index --top-k 5 --retriever bm25
rag retrieve "question text" --index-dir .tiny-rag/index --top-k 5 --retriever hybrid
rag ask "question text" --index-dir .tiny-rag/index --top-k 5
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --top-k 5 --retriever dense
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --top-k 5 --retriever bm25
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --top-k 5 --retriever hybrid
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --judge fake --generator fake
rag diagnose --cases-file tests/fixtures/failure/cases.jsonl --index-dir .tiny-rag/index
rag diagnose --cases-file tests/fixtures/failure/cases.jsonl --index-dir .tiny-rag/index --judge fake --generator fake
Help is available for each command:
uv run rag --help
uv run rag index --help
uv run rag retrieve --help
uv run rag ask --help
uv run rag eval --help
uv run rag diagnose --help
Install/sync dependencies:
uv sync --group dev
Run tests:
uv run pytest --tb=short -q
Prepare the primary corpus after dependencies are installed:
uv run python scripts/prepare_watsonx_docsqa.py --inspect
uv run python scripts/prepare_watsonx_docsqa.py --output-dir corpus/watsonx-docsqa
Generated corpora and indexes are intentionally ignored by git:
corpus/
.tiny-rag/
- Proposal: project purpose, philosophy, and non-goals
- Roadmap: directional phase sequence
- Architecture: conceptual RAG planes and boundaries
- Agent guidelines: collaboration, review, and handoff workflow
- File structure: quick repository map
- Phase docs: active phase pointer and phase contracts
For implementation work, the phase spec and taskboard under docs/phases/ are
the source of truth.