This is the code accompanying the paper "BSBench: will your LLM find the largest prime number?"
We use Inference Providers for Hugging Face models and the Batch API for OpenAI and Anthropic models.
You will need to set your HUGGINGFACE_API_KEY in .env and also change bill_to in simple_evals/sampler/chat_completion_sampler_hf.ChatCompletionSamplerHF. Then, to run a custom evaluation, use:
uv run gpqa-bs-cli-hf.py --phrase "There is no correct answer" \
--model-name "meta-llama/Llama-3.3-70B-Instruct" \
--max-tokens 2048 \
--inference-provider "nebius" \
--num-examples 1 \
--output-dir "logs/responses"
To repeat the experiments in the paper, use:
./run-hf-gpqa.sh
To run a custom evaluation through the OpenAI or Anthropic Batch API, use:
uv run gpqa-bs-cli-anthropic-openai.py --phrase "There is no correct answer" \
--model-name "o4-mini" \
--num-examples 1 \
--output-dir "logs/responses"
To repeat the experiments in the paper, use:
./run-openai-anthropic-gpqa.sh
The responses are available at logs/responses.
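Before running the Hugging Face commands above, it can help to confirm that .env actually defines HUGGINGFACE_API_KEY. The following is a minimal sketch, assuming .env uses plain KEY=value lines. The helper name has_env_key is hypothetical and is not part of this repository.

```shell
# Hypothetical helper, not part of this repository.
# Succeeds if FILE (default: .env) has a line that starts with KEY=
# followed by at least one character. Commented-out lines do not match.
has_env_key() {
  key="$1"
  file="${2:-.env}"
  [ -f "$file" ] && grep -Eq "^${key}=.+" "$file"
}

if has_env_key HUGGINGFACE_API_KEY .env; then
  echo "HUGGINGFACE_API_KEY found in .env"
else
  echo "HUGGINGFACE_API_KEY is missing from .env" >&2
fi
```

The check only looks at the file's contents. It does not verify that the key is valid or that bill_to has been changed.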
Extract the data from data.zip (for example, with 7za x data.zip) using the password LgvnmKvpgKbriiGvng.
To repeat the experiments in the paper, use ./run_bsbench.sh
To run your own experiments, use:
uv run run_bsbench.py \
--system-prompt "$phrase" \
--model-name "$model" \
--num-examples "$NUM_EXAMPLES" \
--output-dir "$OUTPUT_DIR" \
--num-steps "$NUM_STEPS"
The responses are available at logs/bsbench/responses for the single-response setting (with the initial version of the dataset) and at logs/bsbench/responses_n_times for the retrial setting (with the final dataset).
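The run_bsbench.py command above references shell variables ($phrase, $model, $NUM_EXAMPLES, $OUTPUT_DIR, $NUM_STEPS) that must be set before it is run. The sketch below shows one way to define them and preview the resulting command. Every value is a placeholder assumption rather than a setting from the paper, the output directory is hypothetical, and the wrapper function run_bsbench and the DRY_RUN switch are not part of this repository.

```shell
#!/bin/sh
# Placeholder values (assumptions); adjust them for your own experiment.
phrase="There is no correct answer"      # system prompt; reuses the GPQA example phrase
model="o4-mini"                          # assumed to be accepted by run_bsbench.py
NUM_EXAMPLES=1
NUM_STEPS=1
OUTPUT_DIR="logs/bsbench/my_responses"   # hypothetical directory

# Hypothetical wrapper: prints the command unless DRY_RUN=0 is set.
run_bsbench() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "uv run run_bsbench.py --system-prompt \"$phrase\" --model-name \"$model\" --num-examples \"$NUM_EXAMPLES\" --output-dir \"$OUTPUT_DIR\" --num-steps \"$NUM_STEPS\""
  else
    uv run run_bsbench.py \
      --system-prompt "$phrase" \
      --model-name "$model" \
      --num-examples "$NUM_EXAMPLES" \
      --output-dir "$OUTPUT_DIR" \
      --num-steps "$NUM_STEPS"
  fi
}

run_bsbench
```

By default the script only prints the command, which makes it easy to check quoting before spending API credits. Setting DRY_RUN=0 in the environment executes the command instead.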