bug: mcqa strict_single_letter_boxed undercounts correct answers wrapped in LaTeX \text{} #1790

Description

@cwing-nvidia

Summary

The mcqa resources server's default grading mode, strict_single_letter_boxed, fails to extract the answer letter when a model boxes its answer inside a LaTeX \text{} wrapper (e.g. \boxed{\text{E}} or \boxed{\text{C: <option text>}}). These responses are scored as no_answer (reward 0) even though the model chose the correct option. This silently undercounts capable models on MCQA.

Impact

In a 5-task smoke run (gpt-4.1-2025-04-14, mcqa_simple_agent, --limit 5 --num-repeats 1), the model selected the correct option on all 5/5 tasks, but the report showed:

  • pass@1/accuracy: 60.0
  • pass@1/no_answer: 40.0 (equal to the failure rate, so no task had an answer extracted and then scored wrong)

The 2 "failures" were both correct answers the extractor could not parse. True accuracy was 100%; reported accuracy was 60%.

Reproduction

gym env start --resources-server mcqa --model-type openai_model
# new terminal
gym eval run --no-serve \
    --agent mcqa_simple_agent \
    --input resources_servers/mcqa/data/example.jsonl \
    --output results/mcqa_rollouts.jsonl \
    --limit 5 --num-repeats 1

Observed failing rollouts (from results/mcqa_rollouts.jsonl):

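| task | gold | model output (final box) | extracted | reward |
|------|------|--------------------------|-----------|--------|
| 1 | C | \boxed{\text{C: An individual with a family history of Huntington's disease...}} | null | 0.0 |
| 4 | E | \boxed{\text{E}} | null | 0.0 |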

Root cause

resources_servers/mcqa/app.py:

STRICT_BOXED_PATTERN = re.compile(r"\\boxed\{\s*[^A-Za-z]*([A-Z])[^A-Za-z]*\s*\}")

The pattern only allows non-letters between \boxed{ and the captured uppercase letter. In \boxed{\text{E}} the \text{ prefix contains letters, so the regex never reaches E. _parse_answer_letter_strict_boxed() returns None, and verify() records extracted_answer = None, reward 0.
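The mismatch can be checked in isolation. The snippet below is a minimal sketch: the pattern is copied from the line above, while `extract_strict` is a hypothetical stand-in for `_parse_answer_letter_strict_boxed()` (whose body is not shown here) and assumes it simply searches with this pattern.

```python
import re

# Pattern copied verbatim from resources_servers/mcqa/app.py as quoted above.
STRICT_BOXED_PATTERN = re.compile(r"\\boxed\{\s*[^A-Za-z]*([A-Z])[^A-Za-z]*\s*\}")


def extract_strict(text):
    """Hypothetical stand-in: return the captured letter, or None on no match."""
    match = STRICT_BOXED_PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    r"\boxed{B}",                       # bare letter
    r"\boxed{(B)}",                     # punctuation around the letter
    r"\boxed{\text{E}}",                # task 4 in the table above
    r"\boxed{\text{C: <option text>}}", # same shape as task 1
]
for sample in samples:
    print(sample, "->", extract_strict(sample))
```

The first two samples yield `B`; the two `\text{}` samples yield `None`, because the `t` of `\text` is a letter and `[^A-Za-z]*` cannot skip over it.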

The repo already contains the helpers needed to handle this case (_strip_latex_wrappers, BOXED_CONTENT_PATTERN, _match_option_text), but they are only wired into the lenient_boxed / lenient_answer_colon modes — not the strict_single_letter_boxed default.

Note: lenient_boxed would recover task 1 (boxed content contains option C's full text) but NOT task 4 (\boxed{\text{E}} is a bare letter, not option text), so switching modes is only a partial workaround.

Proposed fix

In _parse_answer_letter_strict_boxed, extract the boxed inner content with BOXED_CONTENT_PATTERN, run _strip_latex_wrappers on it, then match a single letter (optionally a leading letter followed by : + option text). This recovers both \boxed{\text{E}} and \boxed{\text{C: ...}} while keeping strict single-letter semantics.

Add regression tests in resources_servers/mcqa/tests/test_app.py for:

  • \boxed{\text{E}} -> E
  • \boxed{\text{C: <full option text>}} -> C
  • existing \boxed{B} behavior unchanged
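
A minimal sketch of the parsing order described in the proposed fix, with the three regression cases as assertions. Everything here is an assumption made for illustration: `boxed_contents`, `strip_wrappers`, and `parse_strict_boxed` are hypothetical stand-ins for `BOXED_CONTENT_PATTERN`, `_strip_latex_wrappers`, and `_parse_answer_letter_strict_boxed`, whose real implementations are not shown in this issue. The list of wrapper macros and the choice to use the last parseable box are also guesses.

```python
import re

# Hypothetical stand-ins; the repo's own helpers may behave differently.
BOXED_OPEN = re.compile(r"\\boxed\s*\{")
WRAPPER = re.compile(r"^\\(?:text|textbf|mathrm|mathbf)\s*\{(.*)\}$", re.DOTALL)
# A lone uppercase letter (punctuation allowed around it), or a leading
# uppercase letter followed by ":" and arbitrary option text.
LETTER = re.compile(r"^[^A-Za-z]*([A-Z])(?:[^A-Za-z]*$|\s*:)")


def boxed_contents(text):
    """Yield the brace-balanced content of each boxed span in text."""
    for match in BOXED_OPEN.finditer(text):
        depth, start = 1, match.end()
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    yield text[start:i]
                    break


def strip_wrappers(content):
    """Peel outer wrapper macros such as \\text{...} until none remain."""
    content = content.strip()
    while True:
        match = WRAPPER.match(content)
        if not match:
            return content
        content = match.group(1).strip()


def parse_strict_boxed(text):
    """Return the letter from the last parseable box, or None (assumed policy)."""
    letter = None
    for content in boxed_contents(text):
        match = LETTER.match(strip_wrappers(content))
        if match:
            letter = match.group(1)
    return letter


# Regression cases from the list above, plus one strictness check.
assert parse_strict_boxed(r"\boxed{\text{E}}") == "E"
assert parse_strict_boxed(r"\boxed{\text{C: <full option text>}}") == "C"
assert parse_strict_boxed(r"\boxed{B}") == "B"
assert parse_strict_boxed(r"\boxed{Both}") is None  # a word is not a letter
```

Balancing braces by hand rather than with a single regex is deliberate: a non-greedy `\{(.*?)\}` would stop at the first `}` and return `\text{E` for the task 4 output.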

Severity

Medium. Correctness bug that deflates scores for any model that formats boxed answers with \text{} — common for LaTeX-tuned models. Affects benchmark fidelity and model comparisons on MCQA.


Found while validating the new evaluation/diagnose-results (BLADE) docs page; unrelated to that docs change.

cc @fsiino-nvidia @bxyu-nvidia (mcqa authors) for a look.
