Impossible Bench

This is the code accompanying the paper "BSBench: will your LLM find the largest prime number?"

How to run gpqa-bs

We use inference providers for Hugging Face models and the Batch API for OpenAI and Anthropic models.

HF inference providers

You will need to set HUGGINGFACE_API_KEY in .env and change bill_to in simple_evals/sampler/chat_completion_sampler_hf.ChatCompletionSamplerHF. Then, to run a custom evaluation, use:

uv run gpqa-bs-cli-hf.py --phrase "There is no correct answer" \
                 --model-name "meta-llama/Llama-3.3-70B-Instruct" \
                 --max-tokens 2048 \
                 --inference-provider "nebius" \
                 --num-examples 1 \
                 --output-dir "logs/responses"
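Since the command above depends on HUGGINGFACE_API_KEY being present in .env, a quick pre-flight check can catch a missing key before any request is sent. The following is a minimal sketch, assuming .env uses plain KEY=value lines; has_hf_key is a hypothetical helper and is not part of this repository.

```shell
# Hypothetical helper: succeeds if the given env file sets a
# non-empty HUGGINGFACE_API_KEY on a plain KEY=value line.
has_hf_key() {
    grep -Eq '^HUGGINGFACE_API_KEY=.+' "$1" 2>/dev/null
}

# Check the .env file in the current directory and report the result.
if has_hf_key .env; then
    echo "HUGGINGFACE_API_KEY is set in .env"
else
    echo "HUGGINGFACE_API_KEY is missing from .env" >&2
fi
```

If the key is missing, the repository's .env.template is a reasonable starting point for creating .env.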

To repeat experiments in the paper use:

./run-hf-gpqa.sh

OpenAI and Anthropic

uv run gpqa-bs-cli-anthropic-openai.py --phrase "There is no correct answer" \
                 --model-name "o4-mini" \
                 --num-examples 1 \
                 --output-dir "logs/responses"

To repeat experiments in the paper use:

./run-openai-anthropic-gpqa.sh

Logs

Logs are available at logs/responses.

How to get BSBench data

Extract data from data.zip (for example, 7za x data.zip) using password LgvnmKvpgKbriiGvng.
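To avoid the interactive password prompt, the password can be passed on the command line with 7za's -p switch. The sketch below wraps this in a small function; extract_data is a hypothetical helper (not part of the repository), and it assumes 7za is installed and that extracting into the current directory is what the scripts expect.

```shell
# Hypothetical helper: extract a password-protected archive without prompts.
# Defaults are the archive name and password given in the text above.
extract_data() {
    archive="${1:-data.zip}"
    password="${2:-LgvnmKvpgKbriiGvng}"
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    # x  = extract with full paths
    # -p = supply the password, -y = answer yes to overwrite prompts
    7za x "$archive" -p"$password" -y
}
```

Calling extract_data with no arguments from the repository root extracts data.zip in place.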

How to run BSBench

To repeat experiments in the paper use ./run_bsbench.sh

To run your own experiments use:

uv run run_bsbench.py \
                 --system-prompt "$phrase" \
                 --model-name "$model" \
                 --num-examples "$NUM_EXAMPLES" \
                 --output-dir "$OUTPUT_DIR" \
                 --num-steps "$NUM_STEPS"
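The command above relies on shell variables that must be defined first. As an illustration only, the sketch below sets them and loops over two models. All values are assumptions: the phrase and model names are borrowed from the gpqa-bs examples and may not be what run_bsbench.py expects, and the output directory is hypothetical. run_one is a hypothetical helper that prints each command by default instead of executing it.

```shell
# Hypothetical values; substitute your own.
phrase="There is no correct answer"
NUM_EXAMPLES=1
NUM_STEPS=1
OUTPUT_DIR="logs/bsbench/my_responses"

# Build the run_bsbench.py command for one model.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
run_one() {
    model="$1"
    set -- uv run run_bsbench.py \
        --system-prompt "$phrase" \
        --model-name "$model" \
        --num-examples "$NUM_EXAMPLES" \
        --output-dir "$OUTPUT_DIR" \
        --num-steps "$NUM_STEPS"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Model names are assumptions taken from the gpqa-bs examples.
for m in "o4-mini" "meta-llama/Llama-3.3-70B-Instruct"; do
    run_one "$m"
done
```

Setting DRY_RUN=0 in the environment makes run_one execute the commands rather than print them. Note that the printed form drops the quoting around arguments, so it is meant for inspection rather than copy and paste.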

Logs

Logs are available at logs/bsbench/responses for the single-response setting (with the initial version of the dataset) and at logs/bsbench/responses_n_times for the retrial setting (with the final dataset).
