Skip to content

ahayat16/LemmaBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LemmaBench

Original implementation of the paper: LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
https://arxiv.org/html/2602.24173v1

Setup

# Python 3.10+ recommended
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Conda alternative:

conda create -n LemmaBench python=3.10 -y
conda activate LemmaBench
pip install -r requirements.txt

Set the API key(s) for the model providers you use (for example OPENAI_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY, DEEPSEEK_API_KEY).

Default model in this repo is now gpt-5 for extraction, self-containment judging, proof generation, and proof judging. OpenAI gpt-5.* variants (for example gpt-5.5) are also supported.

Pipeline Commands

Run from the repository root.

  1. Download papers for past week:
python -m src.download --timeframe pastweek --category math --out_dir ../data/arxiv

Alternative (latest recent listing):

python -m src.download --timeframe recent --category math --out_dir ../data/arxiv
  1. Extract lemmas:
python -m src.extract --input_dir ../data/arxiv --output_file ../data/arxiv/lemmas.jsonl --model_extract gpt-5
  1. Judge self-contained lemmas:
python -m src.judge_self_contained --input_file ../data/arxiv/lemmas.jsonl --output_file ../data/arxiv/lemmas_judged.jsonl --model_name gpt-5
  1. Generate proofs:
python -m src.prove --input_file ../data/arxiv/lemmas_judged.jsonl --output_file ../data/arxiv/proofs.jsonl --model_name gpt-5
  1. Judge proofs:
python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5

Proof judging modes:

  • Default (--full_proof_judge false): judge each proof step separately (one LLM call per step), then generate global feedback. This is the option by default and is expected to lead to a stricter judge.
  • Optional (--full_proof_judge true): judge the full proof in a single LLM call. This reduces the number of LLM calls

Example (single-call full-proof mode):

python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5 --full_proof_judge true
  1. Compute stats:
python -m utils.stats ../data/arxiv --xlsx

utils.stats scans recursively for files ending with judged.jsonl.

Test your own model on the live Benchmark

Option A: External prover (recommended if your prover is fully separate)

You can use this repo for extraction + judging, while generating proofs in an external system.

  1. Run up to self-contained filtering:
python -m src.extract --input_dir ../data/arxiv --output_file ../data/arxiv/lemmas.jsonl --model_extract gpt-5
python -m src.judge_self_contained --input_file ../data/arxiv/lemmas.jsonl --output_file ../data/arxiv/lemmas_judged.jsonl --model_name gpt-5
  1. Generate your own /data/arxiv/proofs.jsonl externally from /data/arxiv/lemmas_judged.jsonl

Each line should be a JSON object with at least:

  • lemma (string)
  • assumptions (string)
  • proof_steps (list of strings)

For traceability/stats we recommend to keep the fields of lemmas_judged.jsonl, that is:

  • filename, lemma_id, mode, extract_model, judge_verdicts and to add the field:
  • proof_model (example: "my_external_prover_v1")

Minimal example line:

{"lemma":"\\begin{lemma} ... \\end{lemma}","assumptions":"\\begin{itemize} ... \\end{itemize}","proof_steps":["Step 1 ...","Step 2 ..."],"proof_model":"my_external_prover_v1"}
  1. Use the judge model(s) on your external proofs:
python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5

Option B: Add another LLM provider to this repo

The central integration point is src/llm.py (LLMClient).

To add a new provider:

  1. Add provider defaults in PROVIDER_DEFAULTS (base URL + env var key).
  2. Extend infer_provider_from_model to route your model name to the provider.
  3. Initialize the provider SDK/client in LLMClient.__init__.
  4. Add the inference call branch in LLMClient.chat.

After that, you can use your model everywhere via existing CLI flags:

  • src.extract --model_extract ...
  • src.judge_self_contained --model_name ...
  • src.prove --model_name ...
  • src.judge --model_name ...

Stats Output Format

utils.stats supports both Excel and CSV outputs.

  • --xlsx writes a multi-sheet .xlsx file (good for quick analysis and sharing).
  • --csv-dir <dir> writes multiple CSV files.
  • --combined-csv <file> writes one combined domain-level CSV.

Examples:

python -m utils.stats ../data/arxiv --xlsx
python -m utils.stats ../data/arxiv --csv-dir ../data/arxiv/stats_csv
python -m utils.stats ../data/arxiv --xlsx --csv-dir ../data/arxiv/stats_csv --combined-csv ../data/arxiv/summary.csv

About

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics (https://arxiv.org/html/2602.24173v1)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages