VRIN Benchmark Suite

Reproducible benchmark evaluation of VRIN's Hybrid RAG system against industry-standard datasets.

Results Summary

| Benchmark | Samples | Accuracy | 95% CI | Best Competitor | Gap |
|---|---|---|---|---|---|
| MultiHop-RAG | 384 | 95.1% | [90.5%, 99.7%] | 78.9% (GPT 5.2 w/ evidence) | +16.2pp |
| RAGBench FinQA | 384 | 97.5% | ±4.5% | 47.2% (LLaMA 3.3-70B) | +50.3pp |

Benchmarks

1. MultiHop-RAG

2. RAGBench FinQA

3. GPT Baseline (Comparison)

  • Purpose: Compare VRIN against raw GPT with evidence documents in context
  • Setup: Same evidence documents given to GPT directly (simulating copy/paste into ChatGPT)
  • Same evaluation: Scored with the same LLM normalizer used for VRIN, for a fair comparison
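
As a rough illustration of the setup above, the evidence documents can be concatenated into a single prompt ahead of the question. This is a minimal sketch only: `build_baseline_prompt`, the document labels, and the instruction wording are assumptions, not the contents of `run_gpt_baseline_benchmark.py`.

```python
def build_baseline_prompt(question: str, evidence_docs: list[str]) -> str:
    """Assemble a prompt that places every evidence document in context.

    Hypothetical helper: the real baseline script may format its prompt differently.
    """
    numbered = "\n\n".join(
        f"[Document {i}]\n{doc.strip()}"
        for i, doc in enumerate(evidence_docs, start=1)
    )
    return (
        "Answer the question using only the documents below.\n\n"
        f"{numbered}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )
```

The resulting string would be sent as the user message to the baseline model, and the reply scored like any VRIN response.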

Quick Start

# Clone this repository
git clone https://github.com/Programmer7129/vrin-benchmarks.git
cd vrin-benchmarks

# Install dependencies
pip install -r requirements.txt

# Set your VRIN API key
export TEST_ACC_API_KEY="vrin_xxxx"

# Download datasets (first time only)
python multihop_rag/scripts/download_data.py
python ragbench_finqa/scripts/download_data.py

# Run benchmarks
python run_multihop_benchmark.py    # MultiHop-RAG (384 samples)
python run_finqa_benchmark.py       # FinQA (384 samples)

# Run GPT baseline comparison (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-xxxx"
python run_gpt_baseline_benchmark.py

Repository Structure

vrin-benchmarks/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── benchmark_utils.py                 # Shared utilities (statistics, evaluation, normalization)
├── run_multihop_benchmark.py          # MultiHop-RAG evaluation script
├── run_finqa_benchmark.py             # FinQA evaluation script
├── run_gpt_baseline_benchmark.py      # GPT baseline comparison script
│
├── multihop_rag/
│   ├── data/                          # Dataset (downloaded via script)
│   │   ├── queries_train.json
│   │   ├── corpus_train.json
│   │   └── dataset_info.json
│   ├── results/                       # Benchmark result JSONs
│   ├── logs/                          # Execution logs
│   └── scripts/
│       └── download_data.py           # HuggingFace dataset downloader
│
├── ragbench_finqa/
│   ├── data/                          # Dataset (downloaded via script)
│   ├── results/
│   ├── logs/
│   └── scripts/
│       └── download_data.py
│
└── gpt_baseline/
    ├── results/                       # GPT comparison results
    └── logs/

Evaluation Methodology

LLM-Based Answer Normalization

VRIN returns detailed, well-reasoned responses. For example, when the benchmark expects "Yes", VRIN might respond:

"Based on the evidence from both documents, the data strongly supports that the acquisition timeline aligns with the reported Q3 earnings..."

This is correct: VRIN conveys "Yes" through detailed reasoning. To score such responses fairly, we use a three-stage evaluation pipeline:

  1. Direct match: Check if the expected answer appears as a substring in the response
  2. LLM normalization: Use GPT-4o-mini to extract the core answer from the verbose response, then match
  3. Semantic fallback: Pattern-based detection of Yes/No/Similar/Different indicators

This ensures VRIN isn't penalized for being thorough: what counts is the intent of the answer, not the exact keyword.
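
The three stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `benchmark_utils.py`: the function name `evaluate`, the pattern lists, and the returned labels are hypothetical, and the LLM call is replaced by an injectable `normalizer` callable so the example runs offline.

```python
import re
from typing import Callable, Optional

# Hypothetical indicator patterns for the semantic fallback (stage 3).
YES_PATTERNS = [r"\byes\b", r"\bsupports?\b", r"\bconfirms?\b", r"\baligns?\b"]
NO_PATTERNS = [r"\bno\b", r"\bdoes not\b", r"\bcontradicts?\b"]


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def evaluate(expected: str, response: str,
             normalizer: Optional[Callable[[str], str]] = None) -> str:
    """Return a match label: direct, llm_exact, llm_partial, semantic_*, or no_match."""
    exp, resp = _norm(expected), _norm(response)

    # Stage 1: the expected answer appears verbatim in the response.
    if exp in resp:
        return "direct"

    # Stage 2: ask a normalizer (an LLM in practice) for the core answer, then match.
    if normalizer is not None:
        extracted = _norm(normalizer(response))
        if extracted == exp:
            return "llm_exact"
        if extracted and (exp in extracted or extracted in exp):
            return "llm_partial"

    # Stage 3: pattern-based fallback for yes/no style answers.
    if exp == "yes" and any(re.search(p, resp) for p in YES_PATTERNS):
        return "semantic_yes"
    if exp == "no" and any(re.search(p, resp) for p in NO_PATTERNS):
        return "semantic_no"
    return "no_match"
```

In a real run, `normalizer` would wrap a GPT-4o-mini request; passing a plain function keeps the stage ordering easy to test in isolation.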

Statistical Approach

We follow BetterBench guidelines for rigorous AI benchmarking:

  1. Reproducible sampling: Seed=42, stratified by question type
  2. Confidence intervals: 95% CI with finite population correction
  3. Progress checkpoints: Results logged every 10 questions
  4. Full transparency: Raw per-question results in JSON
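
A seeded, stratified draw along the lines of point 1 might look like the sketch below. It assumes each question is a dict with a `question_type` field and uses proportional allocation with largest-remainder rounding; both the field name and the allocation rule are assumptions, since the benchmark scripts' exact procedure is not described here.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=42):
    """Draw n items, allocating slots to each stratum in proportion to its size."""
    rng = random.Random(seed)  # local RNG, so the draw is repeatable
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)

    total = len(items)
    quotas = {k: n * len(v) / total for k, v in groups.items()}
    counts = {k: int(q) for k, q in quotas.items()}

    # Hand out the slots lost to rounding, largest fractional remainder first.
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1

    sample = []
    for k in sorted(groups):  # fixed stratum order keeps the output deterministic
        sample.extend(rng.sample(groups[k], counts[k]))
    return sample
```

Calling the function twice with the same inputs and seed returns the same questions in the same order, which is the property the published run relies on.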

Sample Size Calculation

Margins are calculated dynamically using finite population correction:

Formula: MOE = z * sqrt(p(1-p)/n) * sqrt((N-n)/(N-1))

| Benchmark | Population (N) | Sample (n) | Calculated Margin |
|---|---|---|---|
| MultiHop-RAG | 2,556 | 384 | ±4.6% |
| FinQA | ~2,300 | 384 | ±4.5% |
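
The formula can be checked with a few lines of Python. The sketch below assumes z = 1.96 for a 95% interval and the conservative worst-case proportion p = 0.5; with those values it reproduces the ±4.6% shown for MultiHop-RAG. The function name `margin_of_error` is illustrative, and the FinQA row is not checked because its population is only given approximately.

```python
import math


def margin_of_error(n: int, N: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a sample of n drawn from a finite population of N."""
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((N - n) / (N - 1))  # finite population correction
    return z * standard_error * fpc


moe = margin_of_error(n=384, N=2556)
print(f"±{moe:.1%}")  # ±4.6%
```

Substituting the observed accuracy for p would give a narrower margin; p = 0.5 is the widest the interval can be for a given n and N.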

Detailed Results

MultiHop-RAG (Latest: Feb 2026)

| Metric | Score |
|---|---|
| Overall Accuracy | 95.1% (365/384) |
| 95% Confidence Interval | [90.5%, 99.7%] |
| Margin of Error | ±4.6% |

Accuracy by Question Type:

| Type | Accuracy | Detail |
|---|---|---|
| Inference | 99.2% | 122/123 |
| Comparison | 94.6% | 122/129 |
| Temporal | 89.8% | 79/88 |
| Null (insufficient info) | 95.5% | 42/44 |
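
Per-type figures like these can be recomputed from the raw per-question results. The sketch below assumes each record carries a `question_type` and a boolean `correct` field, which is a guess at the JSON schema rather than its documented layout. The synthetic records are rebuilt from the counts in the table above and sum back to the overall 365/384.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Return {question_type: (correct, total, accuracy)} from per-question records."""
    tally = defaultdict(lambda: [0, 0])
    for row in results:
        counts = tally[row["question_type"]]
        counts[0] += 1 if row["correct"] else 0
        counts[1] += 1
    return {qtype: (c, n, c / n) for qtype, (c, n) in tally.items()}


# Rebuild records from the reported (correct, total) counts per type.
reported = {"inference": (122, 123), "comparison": (122, 129),
            "temporal": (79, 88), "null": (42, 44)}
results = [{"question_type": q, "correct": i < c}
           for q, (c, n) in reported.items() for i in range(n)]

by_type = accuracy_by_type(results)
correct = sum(c for c, _, _ in by_type.values())
total = sum(n for _, n, _ in by_type.values())
print(f"{correct}/{total} = {correct / total:.1%}")  # 365/384 = 95.1%
```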

Match Type Breakdown:

| Match Type | Count | Description |
|---|---|---|
| Direct match | 297 | Expected keyword found in response |
| LLM normalized (partial) | 31 | LLM extracted matching answer |
| LLM normalized (exact) | 20 | LLM extracted exact answer |
| Semantic (yes) | 3 | Yes indicators detected |
| No match | 33 | Incorrect answer |

GPT 5.2 Baseline (Feb 2026, 384 samples)

| Metric | VRIN | GPT 5.2 (w/ evidence) | Delta |
|---|---|---|---|
| Overall Accuracy | 95.1% | 78.9% | +16.2pp |
| Inference queries | 99.2% | 98.4% | +0.8pp |
| Comparison queries | 94.6% | 79.1% | +15.5pp |
| Temporal queries | 89.8% | 40.9% | +48.9pp |
| Null queries | 95.5% | 100.0% | -4.5pp |

VRIN outperforms GPT 5.2 by 16.2 percentage points overall, with the largest gaps on temporal (+48.9pp) and comparison (+15.5pp) queries; GPT 5.2 scores higher only on null queries (100.0% vs. 95.5%). GPT 5.2 receives oracle evidence documents directly in context, while VRIN retrieves from a noisy 609-article corpus.

Comparison with Published Baselines

MultiHop-RAG Leaderboard

| System | Accuracy |
|---|---|
| VRIN (Hybrid RAG) | 95.1% |
| GPT 5.2 (w/ evidence in context) | 78.9% |
| Multi-Meta RAG + GPT-4 | 63.0% |
| IRCoT + GPT-4 | 58.2% |
| Standard RAG + GPT-4 | 47.3% |

Published baselines are from the MultiHop-RAG paper.

RAGBench FinQA Leaderboard

| System | Accuracy |
|---|---|
| VRIN (Hybrid RAG) | 97.5% |
| LLaMA 3.3-70B | 47.2% |
| GPT-4 (baseline) | 42.8% |
| Claude 3 Opus | 39.1% |

Published baselines are from the RAGBench paper.

Why VRIN Performs Better

  1. Entity-Centric Extraction: Structured facts (subject-predicate-object triples) instead of raw chunks
  2. Hybrid Retrieval: Knowledge graph traversal + vector search fusion with confidence-scored multi-hop
  3. Table-Aware Processing: Preserves row/column relationships during extraction
  4. Multi-Hop Reasoning: Graph traversal connects facts across documents automatically
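
To make points 1 and 4 concrete, here is a toy illustration of storing facts as subject-predicate-object triples and chaining them across documents with a breadth-first search. It is not VRIN's implementation: the data model, function names, and the example entities are all made up for the sketch, and it leaves out vector search and confidence scoring entirely.

```python
from collections import defaultdict, deque

# (subject, predicate, object, source_document); a hypothetical fact format.
Triple = tuple[str, str, str, str]


def build_graph(triples: list[Triple]) -> dict:
    """Index triples by subject so outgoing facts can be looked up quickly."""
    graph = defaultdict(list)
    for subject, predicate, obj, doc in triples:
        graph[subject].append((predicate, obj, doc))
    return graph


def multi_hop(graph: dict, start: str, goal: str, max_hops: int = 3):
    """Return the shortest chain of facts linking start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue
        for predicate, obj, doc in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(node, predicate, obj, doc)]))
    return None


# Invented example facts drawn from two different documents.
facts = [
    ("AcmeCorp", "acquired", "Widgets Ltd", "doc_a"),
    ("Widgets Ltd", "headquartered_in", "Berlin", "doc_b"),
]
chain = multi_hop(build_graph(facts), "AcmeCorp", "Berlin")
```

Here `chain` holds two facts, one from each document, which is the kind of cross-document link that a chunk-level retriever has to recover from raw text instead.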

Reproducing Results

# Use the same parameters as our published run
export TEST_ACC_API_KEY="vrin_xxxx"
python run_multihop_benchmark.py    # Seed=42 (hardcoded)
python run_finqa_benchmark.py       # Seed=42 (hardcoded)

# GPT baseline comparison
export OPENAI_API_KEY="sk-xxxx"
python run_gpt_baseline_benchmark.py

Results are saved to {benchmark}/results/ with timestamps.
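
For readers adapting the scripts, a timestamped results file could be written along these lines. The `results_YYYYMMDD_HHMMSS.json` naming and the `save_results` helper are assumptions made for illustration; the actual filenames produced by the benchmark scripts may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(benchmark_dir, payload, now=None):
    """Write payload as JSON under <benchmark_dir>/results/ with a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    out_dir = Path(benchmark_dir) / "results"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"results_{now:%Y%m%d_%H%M%S}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Keeping the timestamp in the filename means repeated runs never overwrite each other, so earlier results stay available for comparison.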

Known Limitations

  • Complex nested tables may not extract perfectly
  • Very large tables (50+ rows) can exceed chunk sizes
  • "Null" queries (insufficient information) improved from 63.6% to 95.5% with adaptive bail-out

License

MIT License - Feel free to use, modify, and distribute.

Citation

@misc{vrin-benchmarks-2026,
  title={VRIN Hybrid RAG Benchmark Evaluation},
  author={VRIN Team},
  year={2026},
  url={https://github.com/Programmer7129/vrin-benchmarks}
}

Contact
