FinRetrieval Evaluation Tool

Evaluation tool for the FinRetrieval financial QA benchmark.

FinRetrieval is a benchmark for evaluating LLM agents on financial question-answering tasks that require retrieving historical data from external sources, such as web search and structured databases accessed via MCP.

📄 Technical Paper · 📊 Dataset

Installation

git clone https://github.com/daloopa/finretrieval.git
cd finretrieval
pip install -e .

Quick Start

1. Set API Keys

Create a .env file (auto-loaded by the CLI):

cp .env.example .env
# Edit .env with your API keys

Or export directly:

export ANTHROPIC_API_KEY=...  # For Claude models
export OPENAI_API_KEY=...     # For OpenAI models and LLM judge
export GOOGLE_API_KEY=...     # For Gemini models

2. Download Benchmark Questions

pip install datasets
python -c "
from datasets import load_dataset
questions = load_dataset('daloopa/finretrieval', data_files='questions.parquet', split='train')
questions.to_parquet('questions.parquet')
print('Saved questions.parquet')
"

3. Configure MCP (Optional)

To use Daloopa's MCP tools for structured financial data:

# Copy the example config
cp .mcp.example.json .mcp.json

# Edit .mcp.json and replace YOUR_BEARER_TOKEN_HERE with your token
# See "MCP Server Access" section below for how to get a token

4. Run Collection

finretrieval collect \
  --questions questions.parquet \
  --model opus-4.5 \
  --output-dir output/

5. Score Responses

finretrieval score \
  --questions questions.parquet \
  --responses output/opus4.5-responses.jsonl \
  --output-dir output/

Commands

Collection

Collect agent responses for benchmark questions.

# Basic collection
finretrieval collect -q questions.parquet -m opus-4.5

# With reasoning/extended thinking enabled
finretrieval collect -q questions.parquet -m gpt-5.2 --reasoning

# WebSearch only (no MCP tools)
finretrieval collect -q questions.parquet -m gemini-3 --no-mcp

# Resume interrupted collection
finretrieval collect -q questions.parquet -m opus-4.5 --resume

# Custom MCP config path
finretrieval collect -q questions.parquet -m opus-4.5 --mcp-config my-mcp.json

# Custom timeout (default: 600s)
finretrieval collect -q questions.parquet -m opus-4.5 --timeout 300

# Test with limited questions (for validation before full run)
finretrieval collect -q questions.parquet -m opus-4.5 --limit 5

Output files:

  • output/{config}-responses.jsonl - Agent responses
  • output/{config}-tool_traces.jsonl - Tool call traces
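
When several variants need to be collected, it can help to build the command lines programmatically. The following is a minimal sketch, not part of the tool: `build_collect_command` is a hypothetical helper that only assembles the flags shown in the examples above. It assumes the flags can be combined freely, which the examples do not confirm.

```python
import shlex


def build_collect_command(model, questions="questions.parquet", output_dir=None,
                          reasoning=False, no_mcp=False, resume=False,
                          timeout=None, limit=None):
    """Return the argv list for one `finretrieval collect` run (hypothetical helper)."""
    cmd = ["finretrieval", "collect", "--questions", questions, "--model", model]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if reasoning:
        cmd.append("--reasoning")
    if no_mcp:
        cmd.append("--no-mcp")
    if resume:
        cmd.append("--resume")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


# Print a smoke-test plan: five questions per variant before a full run.
for reasoning in (False, True):
    for no_mcp in (False, True):
        cmd = build_collect_command("opus-4.5", output_dir="output/",
                                    reasoning=reasoning, no_mcp=no_mcp, limit=5)
        print(shlex.join(cmd))
        # To execute a command: subprocess.run(cmd, check=True)
```

The sketch only prints the commands, so it is safe to run without API keys; passing each list to `subprocess.run` would start the actual collection.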

Scoring

Score agent responses against ground truth using an LLM judge.

# Score responses (uses gpt-5.2 as default judge)
finretrieval score -q questions.parquet -r output/opus4.5-responses.jsonl

# Custom judge model
finretrieval score -q questions.parquet -r output/opus4.5-responses.jsonl --model gpt-4o-mini

Output files:

  • output/{config}-scores.jsonl - Scoring results with accuracy metrics
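
The response, tool-trace, and score files are all JSONL (one JSON object per line), so they can be inspected with the standard library. The sketch below makes no assumption about field names, which this README does not document; `read_jsonl` and `summarize` are hypothetical helpers that report only the record count and the top-level keys they encounter.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def summarize(path):
    """Return (record count, sorted top-level keys seen across all records)."""
    count, keys = 0, set()
    for record in read_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)
```

For example, `summarize("output/opus4.5-scores.jsonl")` would show how many records were written and which fields are available before any further analysis.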

Configurations

The benchmark evaluates 14 configurations across 3 providers:

Configuration                 Provider   Model              Tools      Reasoning
opus4.5                       Anthropic  Claude Opus 4.5    MCP        No
opus4.5_reasoning             Anthropic  Claude Opus 4.5    MCP        Yes
opus4.5_webonly               Anthropic  Claude Opus 4.5    WebSearch  No
opus4.5_webonly_reasoning     Anthropic  Claude Opus 4.5    WebSearch  Yes
sonnet4.5                     Anthropic  Claude Sonnet 4.5  MCP        No
sonnet4.5_reasoning           Anthropic  Claude Sonnet 4.5  MCP        Yes
gpt5.2                        OpenAI     GPT-5.2            MCP        No
gpt5.2_reasoning              OpenAI     GPT-5.2            MCP        Yes
gpt5.2_webonly                OpenAI     GPT-5.2            WebSearch  No
gpt5.2_webonly_reasoning      OpenAI     GPT-5.2            WebSearch  Yes
gemini3pro                    Google     Gemini 3 Pro       MCP        No
gemini3pro_reasoning          Google     Gemini 3 Pro       MCP        Yes
gemini3pro_webonly            Google     Gemini 3 Pro       WebSearch  No
gemini3pro_webonly_reasoning  Google     Gemini 3 Pro       WebSearch  Yes

Notes:

  • Sonnet 4.5 does not support WebSearch-only mode (only MCP variants are available)
  • Configuration names are auto-derived from CLI flags (--model, --reasoning, --no-mcp)

MCP Server Access

The benchmark uses Daloopa's MCP server for structured financial data access.

Server URL: https://mcp.daloopa.com/server/mcp

Getting a Bearer Token

The MCP server requires a bearer token for authentication. There are two ways to obtain one:

  1. From API key:

    curl -X POST https://mcp.daloopa.com/auth/token \
      -H "Content-Type: application/json" \
      -d '{"api_key": "YOUR_API_KEY"}'
  2. Via OAuth: Interactive login at Daloopa MCP (used by Claude.ai/ChatGPT connectors)
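
The API-key option can also be scripted. The sketch below mirrors the curl call using only the Python standard library. The endpoint, header, and request body come from the curl example; the assumption that the response is JSON, and the helper names `build_token_request` and `fetch_token_response`, are not confirmed by this README.

```python
import json
import urllib.request

TOKEN_URL = "https://mcp.daloopa.com/auth/token"


def build_token_request(api_key):
    """Build the same POST request as the curl example."""
    body = json.dumps({"api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token_response(api_key, timeout=30):
    """Send the request and return the parsed response.

    Assumption: the server replies with a JSON document. Its field names
    are not documented here, so the payload is returned unmodified.
    """
    request = build_token_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_token_response("YOUR_API_KEY")` performs the network request; inspect the returned payload to locate the bearer token.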

Configuration

  1. Copy the example config: cp .mcp.example.json .mcp.json
  2. Replace YOUR_BEARER_TOKEN_HERE with your token

See Daloopa MCP Documentation for API key setup and available tools.
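
The two configuration steps can be automated with a plain text substitution, which avoids assuming anything about the JSON layout of the example file. This is a minimal sketch; `write_mcp_config` is a hypothetical helper, and it relies only on the placeholder string named above.

```python
from pathlib import Path

PLACEHOLDER = "YOUR_BEARER_TOKEN_HERE"


def write_mcp_config(token, example=".mcp.example.json", target=".mcp.json"):
    """Copy the example config to `target`, substituting the bearer token."""
    text = Path(example).read_text(encoding="utf-8")
    if PLACEHOLDER not in text:
        # Fail loudly rather than write a config that still lacks a token.
        raise ValueError(f"{example} does not contain {PLACEHOLDER}")
    target_path = Path(target)
    target_path.write_text(text.replace(PLACEHOLDER, token), encoding="utf-8")
    return target_path
```

Because the resulting .mcp.json contains a credential, it should stay out of version control.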

Environment Variables

Variable           Required For            Description
ANTHROPIC_API_KEY  Claude models           Anthropic API key
OPENAI_API_KEY     OpenAI models, scoring  OpenAI API key
GOOGLE_API_KEY     Gemini models           Google AI API key
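
A quick pre-flight check can catch a missing key before a long run starts. The sketch below encodes the table above as a dictionary; the task labels (`claude`, `openai`, `gemini`, `scoring`) and the `missing_keys` helper are hypothetical. It reads the process environment only and does not parse a .env file.

```python
import os

# Which variable each kind of run needs, per the table above.
REQUIRED_KEYS = {
    "claude": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "gemini": ["GOOGLE_API_KEY"],
    "scoring": ["OPENAI_API_KEY"],
}


def missing_keys(tasks, env=None):
    """Return the required variables that are unset or empty for `tasks`."""
    env = os.environ if env is None else env
    needed = []
    for task in tasks:
        for key in REQUIRED_KEYS[task]:
            if key not in needed:  # keep order, drop duplicates
                needed.append(key)
    return [key for key in needed if not env.get(key)]
```

For instance, `missing_keys(["claude", "scoring"])` returns an empty list only when both ANTHROPIC_API_KEY and OPENAI_API_KEY are exported in the current shell.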

License

MIT License - see LICENSE for details.
