Summary
The mcqa resources server's default grading mode, strict_single_letter_boxed, fails to extract the answer letter when a model boxes its answer inside a LaTeX \text{} wrapper (e.g. \boxed{\text{E}} or \boxed{\text{C: <option text>}}). These responses are scored as no_answer (reward 0) even though the model chose the correct option. This silently undercounts capable models on MCQA.
Impact
In a 5-task smoke run (gpt-4.1-2025-04-14, mcqa_simple_agent, --limit 5 --num-repeats 1), the model selected the correct option on all 5/5 tasks, but the report showed:
pass@1/accuracy: 60.0
pass@1/no_answer: 40.0 (== the failure rate; zero extracted-but-wrong tasks)
The 2 "failures" were both correct answers the extractor could not parse. True accuracy was 100%; reported accuracy was 60%.
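The arithmetic behind those two metrics can be checked with a short sketch. The record shape below (a `reward` and an `extracted` flag per task) is an assumption made for illustration, not the actual schema of results/mcqa_rollouts.jsonl. Only the counts (3 scored correct, 2 unparsed, 0 parsed-but-wrong) come from the run above.

```python
# Illustrative per-task records; the field names are hypothetical and only
# the counts mirror the smoke run described above.
rollouts = [
    {"reward": 0.0, "extracted": False},
    {"reward": 1.0, "extracted": True},
    {"reward": 1.0, "extracted": True},
    {"reward": 0.0, "extracted": False},
    {"reward": 1.0, "extracted": True},
]


def summarize(rows):
    """Return accuracy, no_answer and extracted-but-wrong rates in percent."""
    n = len(rows)
    correct = sum(1 for r in rows if r["reward"] == 1.0)
    no_answer = sum(1 for r in rows if not r["extracted"])
    wrong = sum(1 for r in rows if r["extracted"] and r["reward"] == 0.0)
    return {
        "accuracy": 100 * correct / n,
        "no_answer": 100 * no_answer / n,
        "extracted_but_wrong": 100 * wrong / n,
    }


summary = summarize(rollouts)
print(summary)
```

When no_answer accounts for every failure and extracted_but_wrong is zero, the failures are all parsing failures, which suggests an extraction problem rather than model error.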
Reproduction
gym env start --resources-server mcqa --model-type openai_model
# new terminal
gym eval run --no-serve \
--agent mcqa_simple_agent \
--input resources_servers/mcqa/data/example.jsonl \
--output results/mcqa_rollouts.jsonl \
--limit 5 --num-repeats 1
Observed failing rollouts (from results/mcqa_rollouts.jsonl):
| task | gold | model output (final box) | extracted | reward |
|---|---|---|---|---|
| 1 | C | \boxed{\text{C: An individual with a family history of Huntington's disease...}} | null | 0.0 |
| 4 | E | \boxed{\text{E}} | null | 0.0 |
Root cause
resources_servers/mcqa/app.py:
STRICT_BOXED_PATTERN = re.compile(r"\\boxed\{\s*[^A-Za-z]*([A-Z])[^A-Za-z]*\s*\}")
The pattern only allows non-letters between \boxed{ and the captured uppercase letter. In \boxed{\text{E}} the \text{ prefix contains letters, so the regex never reaches E. _parse_answer_letter_strict_boxed() returns None, and verify() records extracted_answer = None, reward 0.
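The mismatch can be reproduced in isolation. The pattern below is copied from the line quoted above. The `extract` helper is a hypothetical stand-in that only calls `search` on it, so it may not mirror everything _parse_answer_letter_strict_boxed() does.

```python
import re

# Pattern as quoted from resources_servers/mcqa/app.py in this issue.
STRICT_BOXED_PATTERN = re.compile(r"\\boxed\{\s*[^A-Za-z]*([A-Z])[^A-Za-z]*\s*\}")


def extract(text):
    """Hypothetical stand-in: return the captured letter, or None if no match."""
    m = STRICT_BOXED_PATTERN.search(text)
    return m.group(1) if m else None


samples = [
    r"\boxed{B}",                      # bare letter
    r"\boxed{(B)}",                    # non-letter decoration is tolerated
    r"\boxed{\text{E}}",               # letters in "\text" block the match
    r"\boxed{\text{C: option text}}",  # same problem, plus trailing text
]
for s in samples:
    print(s, "->", extract(s))
```

The first two samples yield B. The last two yield None because `[^A-Za-z]*` stops at the `t` of `\text`, and `t` is not an uppercase letter.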
The repo already contains the helpers needed to handle this case (_strip_latex_wrappers, BOXED_CONTENT_PATTERN, _match_option_text), but they are only wired into the lenient_boxed / lenient_answer_colon modes — not the strict_single_letter_boxed default.
Note: lenient_boxed would recover task 1 (boxed content contains option C's full text) but NOT task 4 (\boxed{\text{E}} is a bare letter, not option text), so switching modes is only a partial workaround.
Proposed fix
In _parse_answer_letter_strict_boxed, extract the boxed inner content with BOXED_CONTENT_PATTERN, run _strip_latex_wrappers on it, then match a single letter (optionally a leading letter followed by : + option text). This recovers both \boxed{\text{E}} and \boxed{\text{C: ...}} while keeping strict single-letter semantics.
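A minimal sketch of that flow follows. It does not use the repo's helpers: `boxed_inner`, `strip_text_wrappers`, `parse_strict_boxed` and both regexes are hypothetical stand-ins written for this issue, and the real BOXED_CONTENT_PATTERN and _strip_latex_wrappers may behave differently. It also does not check the text after the colon against the task's options, which the real fix may want to do.

```python
import re
from typing import Optional

# Hypothetical stand-ins, not the helpers that exist in app.py.
WRAPPER_RE = re.compile(
    r"^\s*\\(?:text|textbf|mathrm|mathbf)\s*\{(.*)\}\s*$", re.DOTALL
)
# A lone uppercase letter with optional non-letter decoration,
# or a leading uppercase letter followed by ":" and arbitrary text.
LETTER_RE = re.compile(r"^[^A-Za-z]*([A-Z])(?:[^A-Za-z]*|\s*:.*)$", re.DOTALL)


def boxed_inner(text: str) -> Optional[str]:
    """Return the brace-balanced content of the last \\boxed{...}, if any."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces


def strip_text_wrappers(content: str) -> str:
    """Peel wrappers such as \\text{...} until none enclose the whole string."""
    while True:
        m = WRAPPER_RE.match(content)
        if m is None:
            return content.strip()
        content = m.group(1)


def parse_strict_boxed(text: str) -> Optional[str]:
    """Extract one answer letter from boxed content, tolerating \\text{}."""
    inner = boxed_inner(text)
    if inner is None:
        return None
    m = LETTER_RE.match(strip_text_wrappers(inner))
    return m.group(1) if m else None


for s in [r"\boxed{\text{E}}", r"\boxed{\text{C: some option}}", r"\boxed{B}"]:
    print(s, "->", parse_strict_boxed(s))
```

The sketch uses a brace counter rather than a single regex for the boxed content so that nested braces such as \text{...} do not truncate the capture. Multi-letter content like \boxed{\text{Cat}} still returns None, which keeps the single-letter requirement.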
Add regression tests in resources_servers/mcqa/tests/test_app.py for:
- \boxed{\text{E}} -> E
- \boxed{\text{C: <full option text>}} -> C
- existing \boxed{B} behavior unchanged
Severity
Medium. Correctness bug that deflates scores for any model that formats boxed answers with \text{} — common for LaTeX-tuned models. Affects benchmark fidelity and model comparisons on MCQA.
Found while validating the new evaluation/diagnose-results (BLADE) docs page; unrelated to that docs change.
cc @fsiino-nvidia @bxyu-nvidia (mcqa authors) for a look.