🍜 East Asian Cuisine RAG Assistant

A production-grade Retrieval-Augmented Generation (RAG) system for East Asian cuisine question answering, built entirely from scratch without any RAG frameworks.

Results • UI Screenshots • Architecture • Quick Start • Methodology • Ablation Study • Project Structure

Overview

This project implements a complete end-to-end RAG pipeline for a culinary question-answering assistant specialising in East Asian cuisine (Chinese, Japanese, Korean). Built as part of the COMP64702 Transforming Text Into Meaning coursework at the University of Manchester.

Rather than using off-the-shelf RAG frameworks like LangChain or LlamaIndex, every component was designed, implemented, and evaluated from scratch — demonstrating deep understanding of how modern NLP systems work under the hood.

What makes this project stand out

No RAG framework used — every component built from first principles
Hybrid retrieval combining dense FAISS and sparse BM25 fused via Reciprocal Rank Fusion
Semantic chunking using cosine similarity between sentence embeddings to detect topic boundaries
Cross-encoder reranking for precision improvement over bi-encoder retrieval
23 evaluation metrics across generation, retrieval, faithfulness, and efficiency dimensions
Ablation study across 7 system configurations with statistical significance testing
Corrected faithfulness metric — content-word overlap with full retrieved chunks, stopwords removed
Interactive Gradio UI with live strategy comparison and one-click batch inference

Results

Generation Quality — RAG System vs No-Context Baseline

Metric	No-Context Baseline	Our RAG System	Improvement	p-value
BLEU-1	0.1325	0.2847	+114.9%	p < 0.001
BLEU-2	0.0629	0.2092	+232.6%	p < 0.001
BLEU-4	0.0245	0.1393	+468.6%	p < 0.001
ROUGE-1	0.2145	0.3725	+73.7%	p < 0.001
ROUGE-2	0.0556	0.2095	+276.8%	p < 0.001
ROUGE-L	0.1543	0.3280	+112.6%	p < 0.001
BERTScore F1	0.7815	0.8334	+6.6%	p < 0.001
METEOR	0.2184	0.3206	+46.8%	p < 0.001
Answer F1	0.1853	0.3379	+82.4%	p < 0.001
Answer Correctness	0.8141	0.8398	+3.2%	p < 0.001
Faithfulness†	0.000	0.6547	—	p < 0.001
Hallucination Rate†	100%	10.9%	-89.1pp	p < 0.001

† Faithfulness computed as content-word overlap (stopwords removed) with full retrieved chunk text via chunk_id lookup, threshold = 0.40. Hallucination rate = fraction of answers below threshold.

Retrieval Quality

Metric	Score
MRR@1	0.8478
MRR@3	0.9058
MRR@5	0.9080
Recall@5	0.9783
Recall@10	0.9783
NDCG@5	0.9130
Context Precision	0.6783
Mean Latency	~12s / query

Results by Question Type

Question Type	N	ROUGE-L	BERTScore	BLEU-1	METEOR
Factual	21	0.3454	0.8374	0.3062	0.3345
Ingredient	16	0.3332	0.8464	0.2817	0.3400
Cultural	27	0.3415	0.8267	0.2846	0.3119
Procedural	26	0.3047	0.8289	0.2741	0.3148
Comparative	2	0.2232	0.8384	0.2220	0.2104

Note: Comparative questions have only n=2 samples — insufficient for reliable per-type statistics.

Summary Scorecard

	Value
Total metrics evaluated	23
Metrics where RAG wins	20 / 23
Significant improvements (p < 0.001)	16 / 23
Queries evaluated	92 (train set)

UI Screenshots

Ask a Question

Screenshot of the Ask a Question tab — shows the question input, retrieved chunks, and generated answer

Compare Prompting Strategies

Screenshot of the Compare Strategies tab — shows all 4 strategies side by side for the same question

Results & Ablation

Screenshot of the Results & Ablation tab — shows the full evaluation tables

Batch Inference

Screenshot of the Batch Inference tab — shows the file upload and progress log

System Info

Screenshot of the System Info tab — shows pipeline configuration and design decisions

To add your own screenshots: Create a docs/screenshots/ folder in the repo, save your screenshots there with the filenames above, then commit and push. The images will automatically appear here on GitHub.

Ablation Study

A systematic ablation study was conducted across 7 configurations, removing one component at a time to quantify each component's contribution. All comparisons use paired t-tests against the full system on 30 queries per configuration.

Configuration	ROUGE-L	BERTScore	NDCG@5	MRR@5	Cost vs Full
Full system ✅	0.3533	0.8381	0.9276	0.9500	—
Cross-Encoder reranker	0.2921	0.8194	0.8818	0.8806	-0.0611
Dense only (no BM25)	0.3171	0.8237	0.9295	0.9500	-0.0362
BM25 only (no dense)	0.2805	0.8200	0.7358	0.7167	-0.0727
Fixed chunking (256w)	0.1611	0.7874	0.0333	0.0333	-0.1922 ✅ p=0.0007
Zero-shot prompting	0.2193	0.7946	0.9276	0.9500	-0.1340 ✅ p=0.0016
No RAG (baseline)	0.1495	0.7683	0.0000	0.0000	-0.2038 ✅ p=0.0002

Key findings:

Semantic chunking is the most impactful single component — removing it costs -0.1922 ROUGE-L (p=0.0007)
Structured prompting costs -0.1340 ROUGE-L if replaced with zero-shot (p=0.0016)
RAG context improves ROUGE-L by +0.2038 over no-context baseline (p=0.0002)
Hybrid retrieval outperforms BM25-only by +0.0727 ROUGE-L and +0.19 NDCG@5

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    INGESTION  (offline, once)                 │
│                                                              │
│  237 docs  →  Semantic  →  BGE      →  FAISS + BM25         │
│  318k words   chunker      embedder    vector store          │
│               3,356 chunks  384-dim    saved to disk         │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    INFERENCE  (per query)                     │
│                                                              │
│  Question → BGE embed → FAISS dense  ─┐                     │
│                       → BM25 sparse  ─┼→ RRF fusion         │
│                                        │                     │
│                             fused list → Cross-encoder       │
│                                          reranker            │
│                                          ↓                   │
│                              Top-5 most relevant chunks      │
│                                          ↓                   │
│                           Structured prompt builder          │
│                                          ↓                   │
│                        Qwen2.5-0.5B-Instruct                 │
│                                          ↓                   │
│                                     Final answer             │
└──────────────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Python 3.11+, Conda recommended
~4GB disk space for models
Free HuggingFace account for Qwen access

Installation

git clone https://github.com/irohan0/rag-culinary-assistant.git
cd rag-culinary-assistant

conda create -n rag-project python=3.11
conda activate rag-project

pip install -r requirements.txt

echo "HF_TOKEN=your_token_here" > .env
# Get your free token at: https://huggingface.co/settings/tokens

Run the UI

python app.py
# Opens automatically at http://localhost:7860

Command-line inference

# Handles demo day sources format and our own queries format automatically
python run_inference.py --input "east_asia_sample_qa.json" --output outputs/output_payload.json

Run notebooks in order

01_data_collection.ipynb      → scrape 237 documents (~5 min)
02_benchmark_creation.ipynb   → generate 116 QA pairs
03_ingestion.ipynb            → chunk + embed + index (~10 min)
04_inference.ipynb            → run RAG pipeline on train set
05_evaluation.ipynb           → compute all 23 metrics
06_ablation_evaluation.ipynb  → ablation study across 7 configs

Note: vector_store/ index files are not committed due to size. Run 03_ingestion.ipynb to rebuild them in ~10 minutes.

Methodology

1. Data Collection

Source	Documents	Words	Licence
Wikipedia	184	252,937	CC-BY-SA 4.0
Wikibooks	21	7,901	CC-BY-SA 4.0
Around the World in 80 Cuisines (Blog)	32	57,977	Public
Total	237	~318,000	—

All scraping performed ethically: 1.5s crawl delay, transparent User-Agent, academic use only.

2. Benchmark Dataset

116 QA pairs auto-generated using Qwen2.5-0.5B-Instruct as an LLM-assisted annotation tool, with manual comparative and multi-hop pairs added for cross-document reasoning. Split 80/20 into train (92) and test (24) sets.

3. Chunking — 4 Strategies Compared

Strategy	Chunks	Avg Words	Selected
Fixed-size 256w	~4,100	256	—
Fixed-size 512w	~2,200	512	—
Sentence-based	~3,800	~200	—
Semantic (threshold=0.45)	3,356	~180	✅

Ablation confirms this is the most impactful component: -0.1922 ROUGE-L if removed (p=0.0007).

4. Embedding — 2 Models Compared

Model	Dim	Selected
all-MiniLM-L6-v2	384	—
BAAI/bge-small-en-v1.5	384	✅ Retrieval-optimised contrastive training

5. Retrieval — 3 Strategies Compared

Strategy	ROUGE-L	NDCG@5	Selected
Dense only	0.3171	0.9295	—
BM25 only	0.2805	0.7358	—
Hybrid + RRF + Reranking	0.3533	0.9276	✅

6. Prompting — 4 Strategies Compared

Strategy	ROUGE-L	Selected
Zero-shot	0.2193	—
Few-shot	—	—
Chain-of-thought	—	—
Structured/constrained	0.3533	✅

7. Evaluation — 23 Metrics

Generation: BLEU-1/2/4, ROUGE-1/2/L, BERTScore F1, METEOR, Answer F1, Exact Match, Answer Relevance, Answer Correctness

RAG-specific: Faithfulness (content-word overlap, stopwords removed, full chunks), Hallucination Rate, Context Precision

Retrieval: MRR@1/3/5, Recall@5/10, NDCG@5/10

Efficiency: Mean latency per query (~12s)

Statistical significance via paired t-tests (p < 0.001 for 16 out of 23 metrics).

Project Structure

rag-culinary-assistant/
├── app.py                           # Gradio UI (5 tabs)
├── run_inference.py                 # CLI inference — handles all JSON formats
├── requirements.txt
├── east_asia_sample_qa.json         # Demo day sample input
├── .env                             # HF_TOKEN (not committed)
│
├── src/
│   ├── chunker.py                   # Fixed, sentence, semantic chunking
│   ├── embedder.py                  # BGE embedding wrapper
│   ├── retriever.py                 # Dense, BM25, hybrid + RRF + reranking + MMR
│   ├── generator.py                 # Qwen + 4 prompting strategies
│   └── evaluator.py                 # All 23 metrics + corrected faithfulness
│
├── notebooks/
│   ├── 01 data_collection.ipynb
│   ├── 02 benchmark_creation.ipynb
│   ├── 03 ingestion.ipynb
│   ├── 04 inference.ipynb
│   ├── 05 evaluation.ipynb
│   └── 06 ablation_evaluation.ipynb
│
├── data/
│   ├── raw/                         # Scraped documents (not committed)
│   └── benchmark/
│       ├── benchmark_full.json      # All 116 QA pairs
│       ├── train_set.json           # 92 training pairs
│       └── test_set_queries_only.json
│
├── vector_store/
│   ├── index.faiss                  # FAISS dense index (not committed)
│   ├── chunks.pkl                   # Chunk text + metadata (not committed)
│   ├── bm25.pkl                     # BM25 index (not committed)
│   └── metadata.json
│
├── outputs/
│   ├── train_outputs.json           # Inference outputs (committed for reproducibility)
│   ├── output_payload.json          # Demo day submission file
│   ├── evaluation/
│   │   └── evaluation_report_complete.json
│   └── ablation/
│       └── ablation_report.json
│
└── docs/
    └── screenshots/                 # UI screenshots (add your own here)
        ├── ask_tab.png
        ├── compare_tab.png
        ├── results_tab.png
        ├── batch_tab.png
        └── system_tab.png

Tech Stack

Category	Technology
LLM	Qwen2.5-0.5B-Instruct
Embeddings	BAAI/bge-small-en-v1.5 (Sentence Transformers)
Reranker	cross-encoder/ms-marco-MiniLM-L-6-v2
Vector Store	FAISS IndexFlatIP
Sparse Retrieval	BM25 (rank-bm25)
UI	Gradio
Evaluation	ROUGE, BERTScore, METEOR, BLEU, SciPy
Scraping	BeautifulSoup4, Requests
NLP	NLTK, scikit-learn

Academic Context

Course: COMP64702 Transforming Text Into Meaning Institution: University of Manchester Year: 2024/25 Reference: Yu et al., 2025. Evaluation of Retrieval-Augmented Generation: A Survey. Springer.

Author

Kavin Sundarr, Rohan Inamdar

Built without RAG frameworks — every component designed, implemented, and ablation-tested from scratch.

23 evaluation metrics · 16 significant at p<0.001 · 7-configuration ablation study

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data/benchmark		data/benchmark
docs/screenshots		docs/screenshots
notebooks		notebooks
outputs		outputs
src		src
vector_store		vector_store
.gitignore		.gitignore
README.md		README.md
app.py		app.py
east_asia_sample_qa.json		east_asia_sample_qa.json
gitignore		gitignore
input_payload.json		input_payload.json
requirements.txt		requirements.txt
run_inference.py		run_inference.py

Folders and files

Latest commit

History

Repository files navigation

🍜 East Asian Cuisine RAG Assistant

Overview

What makes this project stand out

Results

Generation Quality — RAG System vs No-Context Baseline

Retrieval Quality

Results by Question Type

Summary Scorecard

UI Screenshots

Ask a Question

Compare Prompting Strategies

Results & Ablation

Batch Inference

System Info

Ablation Study

Architecture

Quick Start

Prerequisites

Installation

Run the UI

Command-line inference

Run notebooks in order

Methodology

1. Data Collection

2. Benchmark Dataset

3. Chunking — 4 Strategies Compared

4. Embedding — 2 Models Compared

5. Retrieval — 3 Strategies Compared

6. Prompting — 4 Strategies Compared

7. Evaluation — 23 Metrics

Project Structure

Tech Stack

Academic Context

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages