Evaluation tool for the FinRetrieval financial QA benchmark.
FinRetrieval is a benchmark for evaluating LLM agents on financial question-answering tasks that require retrieving historical data from external sources (web search or structured databases via MCP).
📄 Technical Paper · 📊 Dataset

Install from source:

    git clone https://github.com/daloopa/finretrieval.git
    cd finretrieval
    pip install -e .

Create a `.env` file (auto-loaded by the CLI):

    cp .env.example .env
    # Edit .env with your API keys

Or export directly:

    export ANTHROPIC_API_KEY=...   # For Claude models
    export OPENAI_API_KEY=...      # For OpenAI models and LLM judge
    export GOOGLE_API_KEY=...      # For Gemini models

Download the benchmark questions:

    pip install datasets
    python -c "
    from datasets import load_dataset
    questions = load_dataset('daloopa/finretrieval', data_files='questions.parquet', split='train')
    questions.to_parquet('questions.parquet')
    print('Saved questions.parquet')
    "

To use Daloopa's MCP tools for structured financial data:

    # Copy the example config
    cp .mcp.example.json .mcp.json
    # Edit .mcp.json and replace YOUR_BEARER_TOKEN_HERE with your token
    # See "MCP Server Access" section below for how to get a token

Run collection:

    finretrieval collect \
      --questions questions.parquet \
      --model opus-4.5 \
      --output-dir output/

Score the responses:

    finretrieval score \
      --questions questions.parquet \
      --responses output/opus4.5-responses.jsonl \
      --output-dir output/

`finretrieval collect` gathers agent responses for benchmark questions:

    # Basic collection
    finretrieval collect -q questions.parquet -m opus-4.5

    # With reasoning/extended thinking enabled
    finretrieval collect -q questions.parquet -m gpt-5.2 --reasoning

    # WebSearch only (no MCP tools)
    finretrieval collect -q questions.parquet -m gemini-3 --no-mcp

    # Resume interrupted collection
    finretrieval collect -q questions.parquet -m opus-4.5 --resume

    # Custom MCP config path
    finretrieval collect -q questions.parquet -m opus-4.5 --mcp-config my-mcp.json

    # Custom timeout (default: 600s)
    finretrieval collect -q questions.parquet -m opus-4.5 --timeout 300

    # Test with limited questions (for validation before full run)
    finretrieval collect -q questions.parquet -m opus-4.5 --limit 5

Output files:

- `output/{config}-responses.jsonl` - Agent responses
- `output/{config}-tool_traces.jsonl` - Tool call traces
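
The `.jsonl` extension indicates JSON Lines (one JSON object per line), so both output files can be inspected with a few lines of Python. The following is a minimal sketch: the helper name `read_jsonl` is ours and not part of the `finretrieval` package, and it makes no assumption about which fields each record contains.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("output/opus4.5-responses.jsonl")` would load the responses written by the basic collection command; printing the keys of the first record is a quick way to see the actual schema.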
`finretrieval score` scores agent responses against ground truth using an LLM judge:

    # Score responses (uses gpt-5.2 as default judge)
    finretrieval score -q questions.parquet -r output/opus4.5-responses.jsonl

    # Custom judge model
    finretrieval score -q questions.parquet -r output/opus4.5-responses.jsonl --model gpt-4o-mini

Output files:

- `output/{config}-scores.jsonl` - Scoring results with accuracy metrics
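
If you want to recompute an overall accuracy from per-question score records yourself, a sketch along the following lines may help. The field name `is_correct` is a hypothetical placeholder, not a documented key of the scores file; check the actual keys in your `scores.jsonl` and pass the right one via `field`.

```python
def accuracy(records, field="is_correct"):
    """Fraction of records whose `field` value is truthy.

    `field` defaults to a hypothetical key name; the real scores file
    may use a different one.
    """
    records = list(records)
    if not records:
        raise ValueError("no score records to aggregate")
    correct = sum(1 for record in records if record.get(field))
    return correct / len(records)
```

The function takes any iterable of dictionaries, so it can be fed records parsed from the JSON Lines file one object per line.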
The benchmark evaluates 14 configurations across 3 providers:
| Configuration | Provider | Model | Tools | Reasoning |
|---|---|---|---|---|
| `opus4.5` | Anthropic | Claude Opus 4.5 | MCP | No |
| `opus4.5_reasoning` | Anthropic | Claude Opus 4.5 | MCP | Yes |
| `opus4.5_webonly` | Anthropic | Claude Opus 4.5 | WebSearch | No |
| `opus4.5_webonly_reasoning` | Anthropic | Claude Opus 4.5 | WebSearch | Yes |
| `sonnet4.5` | Anthropic | Claude Sonnet 4.5 | MCP | No |
| `sonnet4.5_reasoning` | Anthropic | Claude Sonnet 4.5 | MCP | Yes |
| `gpt5.2` | OpenAI | GPT-5.2 | MCP | No |
| `gpt5.2_reasoning` | OpenAI | GPT-5.2 | MCP | Yes |
| `gpt5.2_webonly` | OpenAI | GPT-5.2 | WebSearch | No |
| `gpt5.2_webonly_reasoning` | OpenAI | GPT-5.2 | WebSearch | Yes |
| `gemini3pro` | Google | Gemini 3 Pro | MCP | No |
| `gemini3pro_reasoning` | Google | Gemini 3 Pro | MCP | Yes |
| `gemini3pro_webonly` | Google | Gemini 3 Pro | WebSearch | No |
| `gemini3pro_webonly_reasoning` | Google | Gemini 3 Pro | WebSearch | Yes |

Notes:
- Sonnet 4.5 does not support WebSearch-only mode (only MCP variants are available)
- Configuration names are auto-derived from CLI flags (`--model`, `--reasoning`, `--no-mcp`)

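
The naming pattern in the table can be sketched as follows. This is an illustration of the pattern, not the tool's actual implementation: only the `opus-4.5` to `opus4.5` pairing is visible in the commands above, so the other CLI model names (in particular `sonnet-4.5`) and the helper names are assumptions.

```python
from itertools import product

# CLI model name -> configuration prefix. Only opus-4.5 -> opus4.5 is
# confirmed by the example commands; the other pairs are assumptions.
MODEL_PREFIX = {
    "opus-4.5": "opus4.5",
    "sonnet-4.5": "sonnet4.5",  # assumed CLI name
    "gpt-5.2": "gpt5.2",
    "gemini-3": "gemini3pro",
}

# Models without a WebSearch-only variant (see the note on Sonnet 4.5).
MCP_ONLY = {"sonnet-4.5"}


def config_name(model, reasoning=False, no_mcp=False):
    """Build a configuration name such as 'opus4.5_webonly_reasoning'."""
    if no_mcp and model in MCP_ONLY:
        raise ValueError(f"{model} has no WebSearch-only configuration")
    name = MODEL_PREFIX[model]
    if no_mcp:
        name += "_webonly"
    if reasoning:
        name += "_reasoning"
    return name


def all_configs():
    """Enumerate every valid (model, tools, reasoning) combination."""
    names = []
    for model, no_mcp, reasoning in product(MODEL_PREFIX, (False, True), (False, True)):
        if no_mcp and model in MCP_ONLY:
            continue  # skip the unsupported Sonnet WebSearch-only variants
        names.append(config_name(model, reasoning=reasoning, no_mcp=no_mcp))
    return names
```

Enumerating four models with two flags each and dropping the two unsupported Sonnet variants gives the 14 configurations listed in the table.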
## MCP Server Access

The benchmark uses Daloopa's MCP server for structured financial data access.

Server URL: https://mcp.daloopa.com/server/mcp
MCP requires a bearer token for authentication. Two options:

- From API key:

      curl -X POST https://mcp.daloopa.com/auth/token \
        -H "Content-Type: application/json" \
        -d '{"api_key": "YOUR_API_KEY"}'

- Via OAuth: Interactive login at Daloopa MCP (used by Claude.ai/ChatGPT connectors)

Once you have a token:

- Copy the example config: `cp .mcp.example.json .mcp.json`
- Replace `YOUR_BEARER_TOKEN_HERE` with your token

See Daloopa MCP Documentation for API key setup and available tools.
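
For scripted setups, the token request from the curl example can also be built in Python with the standard library. This is a minimal sketch: the helper name `build_token_request` is ours, and the function only constructs the request so that it can be inspected before anything is sent.

```python
import json
import urllib.request

TOKEN_URL = "https://mcp.daloopa.com/auth/token"


def build_token_request(api_key):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the request. The shape of the response body is not described here, so inspect it (or consult the Daloopa MCP documentation) rather than assuming a particular field name for the token.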

| Variable | Required For | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | Claude models | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI models, scoring | OpenAI API key |
| `GOOGLE_API_KEY` | Gemini models | Google AI API key |

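
Before a long run it can be useful to confirm that the required keys are set. The sketch below encodes the table as a lookup; the family labels (`claude`, `openai`, `gemini`) and the helper name `missing_keys` are illustrative choices, not identifiers used by the CLI.

```python
import os

# Model family -> environment variable, taken from the table.
REQUIRED_KEY = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

# Per the table, scoring needs the OpenAI key.
JUDGE_KEY = "OPENAI_API_KEY"


def missing_keys(family, scoring=False, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = [REQUIRED_KEY[family]]
    if scoring and JUDGE_KEY not in needed:
        needed.append(JUDGE_KEY)
    return [name for name in needed if not env.get(name)]
```

Calling `missing_keys("claude", scoring=True)` checks the current environment. Note that it does not read the `.env` file, so export the variables first or pass a dictionary via `env`.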
MIT License - see LICENSE for details.