LemmaBench

Original implementation of the paper: LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
https://arxiv.org/html/2602.24173v1

Setup

# Python 3.10+ recommended
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Conda alternative:

conda create -n LemmaBench python=3.10 -y
conda activate LemmaBench
pip install -r requirements.txt

Set the API key(s) for the model providers you use (for example OPENAI_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY, DEEPSEEK_API_KEY).

Default model in this repo is now gpt-5 for extraction, self-containment judging, proof generation, and proof judging. OpenAI gpt-5.* variants (for example gpt-5.5) are also supported.

Pipeline Commands

Run from the repository root.

Download papers for past week:

python -m src.download --timeframe pastweek --category math --out_dir ../data/arxiv

Alternative (latest recent listing):

python -m src.download --timeframe recent --category math --out_dir ../data/arxiv

Extract lemmas:

python -m src.extract --input_dir ../data/arxiv --output_file ../data/arxiv/lemmas.jsonl --model_extract gpt-5

Judge self-contained lemmas:

python -m src.judge_self_contained --input_file ../data/arxiv/lemmas.jsonl --output_file ../data/arxiv/lemmas_judged.jsonl --model_name gpt-5

Generate proofs:

python -m src.prove --input_file ../data/arxiv/lemmas_judged.jsonl --output_file ../data/arxiv/proofs.jsonl --model_name gpt-5

Judge proofs:

python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5

Proof judging modes:

Default (--full_proof_judge false): judge each proof step separately (one LLM call per step), then generate global feedback. This is the option by default and is expected to lead to a stricter judge.
Optional (--full_proof_judge true): judge the full proof in a single LLM call. This reduces the number of LLM calls

Example (single-call full-proof mode):

python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5 --full_proof_judge true

Compute stats:

python -m utils.stats ../data/arxiv --xlsx

utils.stats scans recursively for files ending with judged.jsonl.

Test your own model on the live Benchmark

Option A: External prover (recommended if your prover is fully separate)

You can use this repo for extraction + judging, while generating proofs in an external system.

Run up to self-contained filtering:

python -m src.extract --input_dir ../data/arxiv --output_file ../data/arxiv/lemmas.jsonl --model_extract gpt-5
python -m src.judge_self_contained --input_file ../data/arxiv/lemmas.jsonl --output_file ../data/arxiv/lemmas_judged.jsonl --model_name gpt-5

Generate your own /data/arxiv/proofs.jsonl externally from /data/arxiv/lemmas_judged.jsonl

Each line should be a JSON object with at least:

lemma (string)
assumptions (string)
proof_steps (list of strings)

For traceability/stats we recommend to keep the fields of lemmas_judged.jsonl, that is:

filename, lemma_id, mode, extract_model, judge_verdicts and to add the field:
proof_model (example: "my_external_prover_v1")

Minimal example line:

{"lemma":"\\begin{lemma} ... \\end{lemma}","assumptions":"\\begin{itemize} ... \\end{itemize}","proof_steps":["Step 1 ...","Step 2 ..."],"proof_model":"my_external_prover_v1"}

Use the judge model(s) on your external proofs:

python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5

Option B: Add another LLM provider to this repo

The central integration point is src/llm.py (LLMClient).

To add a new provider:

Add provider defaults in PROVIDER_DEFAULTS (base URL + env var key).
Extend infer_provider_from_model to route your model name to the provider.
Initialize the provider SDK/client in LLMClient.__init__.
Add the inference call branch in LLMClient.chat.

After that, you can use your model everywhere via existing CLI flags:

src.extract --model_extract ...
src.judge_self_contained --model_name ...
src.prove --model_name ...
src.judge --model_name ...

Stats Output Format

utils.stats supports both Excel and CSV outputs.

--xlsx writes a multi-sheet .xlsx file (good for quick analysis and sharing).
--csv-dir <dir> writes multiple CSV files.
--combined-csv <file> writes one combined domain-level CSV.

Examples:

python -m utils.stats ../data/arxiv --xlsx
python -m utils.stats ../data/arxiv --csv-dir ../data/arxiv/stats_csv
python -m utils.stats ../data/arxiv --xlsx --csv-dir ../data/arxiv/stats_csv --combined-csv ../data/arxiv/summary.csv

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
src		src
utils		utils
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LemmaBench

Setup

Pipeline Commands

Test your own model on the live Benchmark

Option A: External prover (recommended if your prover is fully separate)

Option B: Add another LLM provider to this repo

Stats Output Format

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LemmaBench

Setup

Pipeline Commands

Test your own model on the live Benchmark

Option A: External prover (recommended if your prover is fully separate)

Option B: Add another LLM provider to this repo

Stats Output Format

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages