Reproducible benchmark evaluation of VRIN's Hybrid RAG system against industry-standard datasets.
| Benchmark | Samples | Accuracy | 95% CI | Best Competitor | Gap |
|---|---|---|---|---|---|
| MultiHop-RAG | 384 | 95.1% | [90.5%, 99.7%] | 78.9% (GPT 5.2 w/ evidence) | +16.2pp |
| RAGBench FinQA | 384 | 97.5% | ±4.5% | 47.2% (LLaMA 3.3-70B) | +50.3pp |
- Source: yixuantt/MultiHopRAG
- Paper: MultiHop-RAG: Benchmarking RAG for Multi-Hop Queries
- Dataset: 2,556 queries requiring cross-document reasoning (2-4 documents)
- Sampling: Stratified by `question_type` for representative evaluation
- Evaluation: LLM-based answer normalization + semantic matching (see Evaluation Methodology)
- Source: rungalileo/ragbench
- Paper: RAGBench: Explainable Benchmark for RAG Systems
- Dataset: ~2,300 financial QA pairs requiring numerical reasoning over tables + text
- Evaluation: Numerical matching with 1% tolerance (handles percentage conversions)
- Purpose: Compare VRIN against raw GPT with evidence documents in context
- Setup: Same evidence documents given to GPT directly (simulating copy/paste into ChatGPT)
- Same evaluation: Uses identical LLM normalizer as VRIN for fair comparison
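
The FinQA scoring rule listed above (numerical matching with 1% tolerance, handling percentage conversions) might be sketched roughly as follows. The helper names `extract_numbers`, `numbers_match`, and `response_matches` are hypothetical and are not taken from `benchmark_utils.py`. Treating "percentage conversions" as a factor-of-100 rescale in either direction is also an assumption.

```python
import math
import re

# Hypothetical helpers for illustration; not the contents of benchmark_utils.py.
_NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")


def extract_numbers(text: str) -> list[float]:
    """Pull numeric literals out of free text, dropping thousands separators."""
    values = []
    for token in _NUMBER.findall(text):
        # Remove commas and any sentence-ending period captured by the regex.
        cleaned = token.replace(",", "").rstrip(".")
        values.append(float(cleaned))
    return values


def numbers_match(expected: float, actual: float, rel_tol: float = 0.01) -> bool:
    """True if actual is within rel_tol of expected, allowing a x100 rescale."""
    # Assumption: 0.125 and 12.5 (i.e. 12.5%) should count as the same answer.
    candidates = (expected, expected * 100.0, expected / 100.0)
    return any(
        math.isclose(actual, target, rel_tol=rel_tol, abs_tol=1e-9)
        for target in candidates
    )


def response_matches(expected_answer: str, response: str) -> bool:
    """Compare the first number in the expected answer to every number in the response."""
    expected_values = extract_numbers(expected_answer)
    if not expected_values:
        return False
    expected = expected_values[0]
    return any(numbers_match(expected, value) for value in extract_numbers(response))
```

Under these assumptions, `response_matches("12.5%", "Revenue grew by roughly 12.5 percent.")` returns `True`. Accepting a factor-of-100 rescale is permissive and could admit false positives, so a stricter variant might only rescale when a percent sign is present.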
# Clone this repository
git clone https://github.com/Programmer7129/vrin-benchmarks.git
cd vrin-benchmarks
# Install dependencies
pip install -r requirements.txt
# Set your VRIN API key
export TEST_ACC_API_KEY="vrin_xxxx"
# Download datasets (first time only)
python multihop_rag/scripts/download_data.py
python ragbench_finqa/scripts/download_data.py
# Run benchmarks
python run_multihop_benchmark.py # MultiHop-RAG (384 samples)
python run_finqa_benchmark.py # FinQA (384 samples)
# Run GPT baseline comparison (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-xxxx"
python run_gpt_baseline_benchmark.py

vrin-benchmarks/
├── README.md # This file
├── requirements.txt # Python dependencies
├── benchmark_utils.py # Shared utilities (statistics, evaluation, normalization)
├── run_multihop_benchmark.py # MultiHop-RAG evaluation script
├── run_finqa_benchmark.py # FinQA evaluation script
├── run_gpt_baseline_benchmark.py # GPT baseline comparison script
│
├── multihop_rag/
│   ├── data/ # Dataset (downloaded via script)
│   │   ├── queries_train.json
│   │   ├── corpus_train.json
│   │   └── dataset_info.json
│   ├── results/ # Benchmark result JSONs
│   ├── logs/ # Execution logs
│   └── scripts/
│       └── download_data.py # HuggingFace dataset downloader
│
├── ragbench_finqa/
│   ├── data/ # Dataset (downloaded via script)
│   ├── results/
│   ├── logs/
│   └── scripts/
│       └── download_data.py
│
└── gpt_baseline/
    ├── results/ # GPT comparison results
    └── logs/
VRIN returns detailed, well-reasoned responses. For example, when the benchmark expects "Yes", VRIN might respond:
"Based on the evidence from both documents, the data strongly supports that the acquisition timeline aligns with the reported Q3 earnings..."
This is correct — VRIN is conveying "Yes" through detailed reasoning. To fairly evaluate this, we use a three-stage evaluation pipeline:
- Direct match: Check if the expected answer appears as a substring in the response
- LLM normalization: Use GPT-4o-mini to extract the core answer from the verbose response, then match
- Semantic fallback: Pattern-based detection of Yes/No/Similar/Different indicators
This ensures VRIN isn't penalized for being thorough — only the intent of the answer matters, not the exact keyword.
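
The three stages described above could be arranged roughly as in the sketch below. It is illustrative only: the function name `evaluate`, the regular expressions, and the returned labels are assumptions, and the LLM step is represented by an injectable `normalizer` callable rather than a real GPT-4o-mini call. Only the Yes/No part of the semantic fallback is shown; the Similar/Different indicators are omitted.

```python
import re
from typing import Callable, Optional

# Hypothetical indicator patterns; the real pattern lists are not given in this README.
YES_PATTERNS = re.compile(
    r"\b(yes|supports|confirms|aligns with|consistent with)\b", re.IGNORECASE
)
NO_PATTERNS = re.compile(
    r"\b(no|does not|doesn't|contradicts|inconsistent with)\b", re.IGNORECASE
)


def evaluate(
    expected: str,
    response: str,
    normalizer: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the label of the first stage that matches, or None if none do."""
    expected_norm = expected.strip().lower()
    response_norm = response.strip().lower()
    if not expected_norm:
        return None

    # Stage 1: the expected answer appears as a substring of the response.
    if expected_norm in response_norm:
        return "direct"

    # Stage 2: a normalizer (an LLM in practice) condenses the verbose response.
    if normalizer is not None:
        extracted = normalizer(response).strip().lower()
        if extracted == expected_norm:
            return "llm_exact"
        if extracted and expected_norm in extracted:
            return "llm_partial"

    # Stage 3: pattern-based yes/no detection, used only when unambiguous.
    if expected_norm in ("yes", "no"):
        says_yes = bool(YES_PATTERNS.search(response))
        says_no = bool(NO_PATTERNS.search(response))
        if says_yes != says_no:
            detected = "yes" if says_yes else "no"
            if detected == expected_norm:
                return "semantic"
    return None
```

With the example response quoted earlier and an expected answer of "Yes", this sketch falls through to the third stage and returns `"semantic"`. A plain substring test can also match short answers inside longer words (for example "no" inside "know"), so a stricter variant might use word-boundary matching in the first stage.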
We follow BetterBench guidelines for rigorous AI benchmarking:
- Reproducible sampling: Seed=42, stratified by question type
- Confidence intervals: 95% CI with finite population correction
- Progress checkpoints: Results logged every 10 questions
- Full transparency: Raw per-question results in JSON
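
As a minimal sketch of the reproducible sampling item above, proportional stratified sampling with a fixed seed might look like this. The function name `stratified_sample` and the proportional allocation rule are assumptions. The benchmark scripts may allocate samples differently.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=42):
    """Draw about n items, giving each stratum a share proportional to its size."""
    rng = random.Random(seed)  # Local generator so the global RNG state is untouched.
    strata = defaultdict(list)
    for item in items:
        strata[item[key]].append(item)

    total = len(items)
    sample = []
    # Iterate strata in sorted order so the draw sequence is deterministic.
    for name in sorted(strata):
        group = strata[name]
        quota = min(len(group), round(n * len(group) / total))
        sample.extend(rng.sample(group, quota))
    return sample
```

Calling `stratified_sample(queries, "question_type", 384)` twice on the same input returns the same questions. Because each quota is rounded independently, the total can differ from `n` by a few items, so an exact-size sample would need an extra adjustment step.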
Margins are calculated dynamically using finite population correction:
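Formula: MOE = z * sqrt(p(1-p)/n) * sqrt((N-n)/(N-1)), where z is the critical value (1.96 for 95% confidence), p is the proportion, n is the sample size, and N is the population size.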
| Benchmark | Population (N) | Sample (n) | Calculated Margin |
|---|---|---|---|
| MultiHop-RAG | 2,556 | 384 | ±4.6% |
| FinQA | ~2,300 | 384 | ±4.5% |
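
The margin formula can be checked with a few lines of Python. This is a sketch, not the code in `benchmark_utils.py`: the function name `margin_of_error` is hypothetical, and the default `p = 0.5` is an assumption, chosen because it is the most conservative value.

```python
import math


def margin_of_error(n: int, population: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a proportion with finite population correction."""
    standard_error = math.sqrt(p * (1.0 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


# MultiHop-RAG: n = 384 sampled from N = 2,556 queries.
moe = margin_of_error(n=384, population=2556)
print(f"MultiHop-RAG margin: ±{moe * 100:.1f}%")  # prints "MultiHop-RAG margin: ±4.6%"
```

With `p = 0.5` the result matches the ±4.6% in the table. Plugging in the observed accuracy (`p = 0.951`) gives a narrower margin of about ±2.0%, which suggests the published margins use the conservative value.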
| Metric | Score |
|---|---|
| Overall Accuracy | 95.1% (365/384) |
| 95% Confidence Interval | [90.5%, 99.7%] |
| Margin of Error | ±4.6% |
Accuracy by Question Type:
| Type | Accuracy | Detail |
|---|---|---|
| Inference | 99.2% | 122/123 |
| Comparison | 94.6% | 122/129 |
| Temporal | 89.8% | 79/88 |
| Null (insufficient info) | 95.5% | 42/44 |
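
As a quick consistency check, the per-type counts in the table above can be summed to recover the overall figure. The snippet only uses numbers already shown in the table.

```python
# (correct, total) per question type, copied from the table above.
counts = {
    "Inference": (122, 123),
    "Comparison": (122, 129),
    "Temporal": (79, 88),
    "Null": (42, 44),
}

correct = sum(c for c, _ in counts.values())
total = sum(t for _, t in counts.values())

for name, (c, t) in counts.items():
    print(f"{name}: {100 * c / t:.1f}%")
print(f"Overall: {100 * correct / total:.1f}% ({correct}/{total})")
```

The last line prints `Overall: 95.1% (365/384)`, in agreement with the overall accuracy reported earlier.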
Match Type Breakdown:
| Match Type | Count | Description |
|---|---|---|
| Direct match | 297 | Expected keyword found in response |
| LLM normalized (partial) | 31 | LLM extracted matching answer |
| LLM normalized (exact) | 20 | LLM extracted exact answer |
| Semantic (yes) | 3 | Yes indicators detected |
| No match | 33 | Incorrect answer |
| Metric | VRIN | GPT 5.2 (w/ evidence) | Delta |
|---|---|---|---|
| Overall Accuracy | 95.1% | 78.9% | +16.2pp |
| Inference queries | 99.2% | 98.4% | +0.8pp |
| Comparison queries | 94.6% | 79.1% | +15.5pp |
| Temporal queries | 89.8% | 40.9% | +48.9pp |
| Null queries | 95.5% | 100.0% | -4.5pp |
VRIN outperforms GPT 5.2 by 16.2 percentage points overall, with the largest gaps on temporal (+48.9pp) and comparison (+15.5pp) queries. GPT 5.2 is ahead only on null queries (100.0% vs 95.5%). Note that GPT 5.2 receives the oracle evidence documents directly in context, whereas VRIN retrieves from a noisy 609-article corpus.
| System | Accuracy |
|---|---|
| VRIN (Hybrid RAG) | 95.1% |
| GPT 5.2 (w/ evidence in context) | 78.9% |
| Multi-Meta RAG + GPT-4 | 63.0% |
| IRCoT + GPT-4 | 58.2% |
| Standard RAG + GPT-4 | 47.3% |
Published baselines from MultiHop-RAG paper
| System | Accuracy |
|---|---|
| VRIN (Hybrid RAG) | 97.5% |
| LLaMA 3.3-70B | 47.2% |
| GPT-4 (baseline) | 42.8% |
| Claude 3 Opus | 39.1% |
Published baselines from RAGBench paper
- Entity-Centric Extraction: Structured facts (subject-predicate-object triples) instead of raw chunks
- Hybrid Retrieval: Knowledge graph traversal + vector search fusion with confidence-scored multi-hop
- Table-Aware Processing: Preserves row/column relationships during extraction
- Multi-Hop Reasoning: Graph traversal connects facts across documents automatically
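
To make the idea of connecting subject-predicate-object triples across documents concrete, here is a generic breadth-first traversal over a toy triple store. VRIN's actual implementation is not described in this README, so everything here is an assumption for illustration: the example triples are invented, and the names `build_graph` and `find_path` are hypothetical.

```python
from collections import defaultdict, deque

# Invented example facts; not drawn from any benchmark dataset or VRIN index.
triples = [
    ("Acme Corp", "acquired", "Widget Inc"),
    ("Widget Inc", "reported_revenue", "Q3 2023"),
    ("Acme Corp", "headquartered_in", "Berlin"),
]


def build_graph(facts):
    """Index triples by subject so outgoing edges can be looked up quickly."""
    graph = defaultdict(list)
    for subject, predicate, obj in facts:
        graph[subject].append((predicate, obj))
    return graph


def find_path(graph, start, goal, max_hops=3):
    """Breadth-first search returning the chain of triples linking start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # Do not expand beyond the hop limit.
        for predicate, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, predicate, neighbor)]))
    return None


graph = build_graph(triples)
path = find_path(graph, "Acme Corp", "Q3 2023")
```

Here `path` holds two triples, one per hop, which is the kind of cross-document chain a multi-hop question requires. A real system would also need entity resolution and some scoring of candidate paths, neither of which is shown.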
# Use the same parameters as our published run
export TEST_ACC_API_KEY="vrin_xxxx"
python run_multihop_benchmark.py # Seed=42 (hardcoded)
python run_finqa_benchmark.py # Seed=42 (hardcoded)
# GPT baseline comparison
export OPENAI_API_KEY="sk-xxxx"
python run_gpt_baseline_benchmark.py

Results are saved to {benchmark}/results/ with timestamps.
- Complex nested tables may not extract perfectly
- Very large tables (50+ rows) can exceed chunk sizes
- "Null" queries (insufficient information) improved from 63.6% to 95.5% with adaptive bail-out
MIT License - Feel free to use, modify, and distribute.
@misc{vrin-benchmarks-2026,
title={VRIN Hybrid RAG Benchmark Evaluation},
author={VRIN Team},
year={2026},
url={https://github.com/Programmer7129/vrin-benchmarks}
}

- Questions: Open a GitHub issue
- Website: vrin.cloud