Original implementation of the paper:
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
https://arxiv.org/html/2602.24173v1
# Python 3.10+ recommended
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtConda alternative:
conda create -n LemmaBench python=3.10 -y
conda activate LemmaBench
pip install -r requirements.txtSet the API key(s) for the model providers you use (for example OPENAI_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY, DEEPSEEK_API_KEY).
Default model in this repo is now gpt-5 for extraction, self-containment judging, proof generation, and proof judging.
OpenAI gpt-5.* variants (for example gpt-5.5) are also supported.
Run from the repository root.
- Download papers for past week:
python -m src.download --timeframe pastweek --category math --out_dir ../data/arxivAlternative (latest recent listing):
python -m src.download --timeframe recent --category math --out_dir ../data/arxiv- Extract lemmas:
python -m src.extract --input_dir ../data/arxiv --output_file ../data/arxiv/lemmas.jsonl --model_extract gpt-5- Judge self-contained lemmas:
python -m src.judge_self_contained --input_file ../data/arxiv/lemmas.jsonl --output_file ../data/arxiv/lemmas_judged.jsonl --model_name gpt-5- Generate proofs:
python -m src.prove --input_file ../data/arxiv/lemmas_judged.jsonl --output_file ../data/arxiv/proofs.jsonl --model_name gpt-5- Judge proofs:
python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5Proof judging modes:
- Default (
--full_proof_judge false): judge each proof step separately (one LLM call per step), then generate global feedback. This is the option by default and is expected to lead to a stricter judge. - Optional (
--full_proof_judge true): judge the full proof in a single LLM call. This reduces the number of LLM calls
Example (single-call full-proof mode):
python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5 --full_proof_judge true- Compute stats:
python -m utils.stats ../data/arxiv --xlsxutils.stats scans recursively for files ending with judged.jsonl.
You can use this repo for extraction + judging, while generating proofs in an external system.
- Run up to self-contained filtering:
python -m src.extract --input_dir ../data/arxiv --output_file ../data/arxiv/lemmas.jsonl --model_extract gpt-5
python -m src.judge_self_contained --input_file ../data/arxiv/lemmas.jsonl --output_file ../data/arxiv/lemmas_judged.jsonl --model_name gpt-5- Generate your own
/data/arxiv/proofs.jsonlexternally from/data/arxiv/lemmas_judged.jsonl
Each line should be a JSON object with at least:
lemma(string)assumptions(string)proof_steps(list of strings)
For traceability/stats we recommend to keep the fields of lemmas_judged.jsonl, that is:
filename,lemma_id,mode,extract_model,judge_verdictsand to add the field:proof_model(example:"my_external_prover_v1")
Minimal example line:
{"lemma":"\\begin{lemma} ... \\end{lemma}","assumptions":"\\begin{itemize} ... \\end{itemize}","proof_steps":["Step 1 ...","Step 2 ..."],"proof_model":"my_external_prover_v1"}- Use the judge model(s) on your external proofs:
python -m src.judge --input_path ../data/arxiv/proofs.jsonl --output_path ../data/arxiv/judged.jsonl --model_name gpt-5The central integration point is src/llm.py (LLMClient).
To add a new provider:
- Add provider defaults in
PROVIDER_DEFAULTS(base URL + env var key). - Extend
infer_provider_from_modelto route your model name to the provider. - Initialize the provider SDK/client in
LLMClient.__init__. - Add the inference call branch in
LLMClient.chat.
After that, you can use your model everywhere via existing CLI flags:
src.extract --model_extract ...src.judge_self_contained --model_name ...src.prove --model_name ...src.judge --model_name ...
utils.stats supports both Excel and CSV outputs.
--xlsxwrites a multi-sheet.xlsxfile (good for quick analysis and sharing).--csv-dir <dir>writes multiple CSV files.--combined-csv <file>writes one combined domain-level CSV.
Examples:
python -m utils.stats ../data/arxiv --xlsx
python -m utils.stats ../data/arxiv --csv-dir ../data/arxiv/stats_csv
python -m utils.stats ../data/arxiv --xlsx --csv-dir ../data/arxiv/stats_csv --combined-csv ../data/arxiv/summary.csv