VRIN Benchmark Suite

Reproducible benchmark evaluation of VRIN's Hybrid RAG system against industry-standard datasets.

Results Summary

Benchmark	Samples	Accuracy	95% CI	Best Competitor	Gap
MultiHop-RAG	384	95.1%	[90.5%, 99.7%]	78.9% (GPT 5.2 w/ evidence)	+16.2pp
RAGBench FinQA	384	97.5%	±4.5%	47.2% (LLaMA 3.3-70B)	+50.3pp

Benchmarks

1. MultiHop-RAG

Source: yixuantt/MultiHopRAG
Paper: MultiHop-RAG: Benchmarking RAG for Multi-Hop Queries
Dataset: 2,556 queries requiring cross-document reasoning (2-4 documents)
Sampling: Stratified by question_type for representative evaluation
Evaluation: LLM-based answer normalization + semantic matching (see Evaluation Methodology)

2. RAGBench FinQA

Source: rungalileo/ragbench
Paper: RAGBench: Explainable Benchmark for RAG Systems
Dataset: ~2,300 financial QA pairs requiring numerical reasoning over tables + text
Evaluation: Numerical matching with 1% tolerance (handles percentage conversions)

3. GPT Baseline (Comparison)

Purpose: Compare VRIN against raw GPT with evidence documents in context
Setup: Same evidence documents given to GPT directly (simulating copy/paste into ChatGPT)
Same evaluation: Uses identical LLM normalizer as VRIN for fair comparison

Quick Start

# Clone this repository
git clone https://github.com/Programmer7129/vrin-benchmarks.git
cd vrin-benchmarks

# Install dependencies
pip install -r requirements.txt

# Set your VRIN API key
export TEST_ACC_API_KEY="vrin_xxxx"

# Download datasets (first time only)
python multihop_rag/scripts/download_data.py
python ragbench_finqa/scripts/download_data.py

# Run benchmarks
python run_multihop_benchmark.py    # MultiHop-RAG (384 samples)
python run_finqa_benchmark.py       # FinQA (384 samples)

# Run GPT baseline comparison (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-xxxx"
python run_gpt_baseline_benchmark.py

Repository Structure

vrin-benchmarks/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── benchmark_utils.py                 # Shared utilities (statistics, evaluation, normalization)
├── run_multihop_benchmark.py          # MultiHop-RAG evaluation script
├── run_finqa_benchmark.py             # FinQA evaluation script
├── run_gpt_baseline_benchmark.py      # GPT baseline comparison script
│
├── multihop_rag/
│   ├── data/                          # Dataset (downloaded via script)
│   │   ├── queries_train.json
│   │   ├── corpus_train.json
│   │   └── dataset_info.json
│   ├── results/                       # Benchmark result JSONs
│   ├── logs/                          # Execution logs
│   └── scripts/
│       └── download_data.py           # HuggingFace dataset downloader
│
├── ragbench_finqa/
│   ├── data/                          # Dataset (downloaded via script)
│   ├── results/
│   ├── logs/
│   └── scripts/
│       └── download_data.py
│
└── gpt_baseline/
    ├── results/                       # GPT comparison results
    └── logs/

Evaluation Methodology

LLM-Based Answer Normalization

VRIN returns detailed, well-reasoned responses. For example, when the benchmark expects "Yes", VRIN might respond:

"Based on the evidence from both documents, the data strongly supports that the acquisition timeline aligns with the reported Q3 earnings..."

This is correct — VRIN is conveying "Yes" through detailed reasoning. To fairly evaluate this, we use a three-stage evaluation pipeline:

Direct match: Check if the expected answer appears as a substring in the response
LLM normalization: Use GPT-4o-mini to extract the core answer from the verbose response, then match
Semantic fallback: Pattern-based detection of Yes/No/Similar/Different indicators

This ensures VRIN isn't penalized for being thorough — only the intent of the answer matters, not the exact keyword.

Statistical Approach

We follow BetterBench guidelines for rigorous AI benchmarking:

Reproducible sampling: Seed=42, stratified by question type
Confidence intervals: 95% CI with finite population correction
Progress checkpoints: Results logged every 10 questions
Full transparency: Raw per-question results in JSON

Sample Size Calculation

Margins are calculated dynamically using finite population correction:

Formula: MOE = z * sqrt(p(1-p)/n) * sqrt((N-n)/(N-1))

Benchmark	Population (N)	Sample (n)	Calculated Margin
MultiHop-RAG	2,556	384	±4.6%
FinQA	~2,300	384	±4.5%

Detailed Results

MultiHop-RAG (Latest: Feb 2026)

Metric	Score
Overall Accuracy	95.1% (365/384)
95% Confidence Interval	[90.5%, 99.7%]
Margin of Error	±4.6%

Accuracy by Question Type:

Type	Accuracy	Detail
Inference	99.2%	122/123
Comparison	94.6%	122/129
Temporal	89.8%	79/88
Null (insufficient info)	95.5%	42/44

Match Type Breakdown:

Match Type	Count	Description
Direct match	297	Expected keyword found in response
LLM normalized (partial)	31	LLM extracted matching answer
LLM normalized (exact)	20	LLM extracted exact answer
Semantic (yes)	3	Yes indicators detected
No match	33	Incorrect answer

GPT 5.2 Baseline (Feb 2026, 384 samples)

Metric	VRIN	GPT 5.2 (w/ evidence)	Delta
Overall Accuracy	95.1%	78.9%	+16.2pp
Inference queries	99.2%	98.4%	+0.8pp
Comparison queries	94.6%	79.1%	+15.5pp
Temporal queries	89.8%	40.9%	+48.9pp
Null queries	95.5%	100.0%	-4.5pp

VRIN outperforms GPT 5.2 by +16.2 percentage points overall, with the largest gaps on temporal (+48.9pp) and comparison (+15.5pp) queries. GPT 5.2 receives oracle evidence documents directly in context; VRIN retrieves from a noisy 609-article corpus.

Comparison with Published Baselines

MultiHop-RAG Leaderboard

System	Accuracy
VRIN (Hybrid RAG)	95.1%
GPT 5.2 (w/ evidence in context)	78.9%
Multi-Meta RAG + GPT-4	63.0%
IRCoT + GPT-4	58.2%
Standard RAG + GPT-4	47.3%

Published baselines from MultiHop-RAG paper

RAGBench FinQA Leaderboard

System	Accuracy
VRIN (Hybrid RAG)	97.5%
LLaMA 3.3-70B	47.2%
GPT-4 (baseline)	42.8%
Claude 3 Opus	39.1%

Published baselines from RAGBench paper

Why VRIN Performs Better

Entity-Centric Extraction: Structured facts (subject-predicate-object triples) instead of raw chunks
Hybrid Retrieval: Knowledge graph traversal + vector search fusion with confidence-scored multi-hop
Table-Aware Processing: Preserves row/column relationships during extraction
Multi-Hop Reasoning: Graph traversal connects facts across documents automatically

Reproducing Results

# Use the same parameters as our published run
export TEST_ACC_API_KEY="vrin_xxxx"
python run_multihop_benchmark.py    # Seed=42 (hardcoded)
python run_finqa_benchmark.py       # Seed=42 (hardcoded)

# GPT baseline comparison
export OPENAI_API_KEY="sk-xxxx"
python run_gpt_baseline_benchmark.py

Results are saved to {benchmark}/results/ with timestamps.

Known Limitations

Complex nested tables may not extract perfectly
Very large tables (50+ rows) can exceed chunk sizes
"Null" queries (insufficient information) improved from 63.6% to 95.5% with adaptive bail-out

License

MIT License - Feel free to use, modify, and distribute.

Citation

@misc{vrin-benchmarks-2026,
  title={VRIN Hybrid RAG Benchmark Evaluation},
  author={VRIN Team},
  year={2026},
  url={https://github.com/Programmer7129/vrin-benchmarks}
}

Contact

Questions: Open a GitHub issue
Website: vrin.cloud

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

VRIN Benchmark Suite

Results Summary

Benchmarks

1. MultiHop-RAG

2. RAGBench FinQA

3. GPT Baseline (Comparison)

Quick Start

Repository Structure

Evaluation Methodology

LLM-Based Answer Normalization

Statistical Approach

Sample Size Calculation

Detailed Results

MultiHop-RAG (Latest: Feb 2026)

GPT 5.2 Baseline (Feb 2026, 384 samples)

Comparison with Published Baselines

MultiHop-RAG Leaderboard

RAGBench FinQA Leaderboard

Why VRIN Performs Better

Reproducing Results

Known Limitations

License

Citation

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
financial-analyst		financial-analyst
gpt_baseline		gpt_baseline
multihop_rag		multihop_rag
musique		musique
ragbench_finqa		ragbench_finqa
.gitignore		.gitignore
README.md		README.md
benchmark_utils.py		benchmark_utils.py
requirements.txt		requirements.txt
run_finqa_benchmark.py		run_finqa_benchmark.py
run_gpt_baseline_benchmark.py		run_gpt_baseline_benchmark.py
run_multihop_benchmark.py		run_multihop_benchmark.py
run_multihop_query_only.py		run_multihop_query_only.py
run_null_queries_only.py		run_null_queries_only.py

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

VRIN Benchmark Suite

Results Summary

Benchmarks

1. MultiHop-RAG

2. RAGBench FinQA

3. GPT Baseline (Comparison)

Quick Start

Repository Structure

Evaluation Methodology

LLM-Based Answer Normalization

Statistical Approach

Sample Size Calculation

Detailed Results

MultiHop-RAG (Latest: Feb 2026)

GPT 5.2 Baseline (Feb 2026, 384 samples)

Comparison with Published Baselines

MultiHop-RAG Leaderboard

RAGBench FinQA Leaderboard

Why VRIN Performs Better

Reproducing Results

Known Limitations

License

Citation

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages