A reusable framework for evaluating LLM robustness on multiple-choice questions. Pluggable backends, runner templates, and config-driven experiments. Under active development.
python benchmarking research asyncio multiple-choice positional-bias llm prompting mmlu arc-challenge
Updated May 29, 2026 - Python