bug: mcqa strict_single_letter_boxed undercounts correct answers wrapped in LaTeX \text{}

## Summary
The `mcqa` resources server's default grading mode, `strict_single_letter_boxed`, fails to extract the answer letter when a model boxes its answer inside a LaTeX `\text{}` wrapper (e.g. `\boxed{\text{E}}` or `\boxed{\text{C: <option text>}}`). These responses are scored as `no_answer` (reward 0) even though the model chose the correct option, so MCQA accuracy is silently **undercounted for capable models**.

## Impact
In a 5-task smoke run (`gpt-4.1-2025-04-14`, `mcqa_simple_agent`, `--limit 5 --num-repeats 1`), the model selected the correct option on **all 5/5** tasks, but the report showed:

- `pass@1/accuracy`: 60.0
- `pass@1/no_answer`: 40.0  (equal to the entire failure rate, so **zero** tasks were extracted but wrong)

The 2 "failures" were both correct answers the extractor could not parse. True accuracy was 100%; reported accuracy was 60%.

## Reproduction
```bash
gym env start --resources-server mcqa --model-type openai_model
# new terminal
gym eval run --no-serve \
    --agent mcqa_simple_agent \
    --input resources_servers/mcqa/data/example.jsonl \
    --output results/mcqa_rollouts.jsonl \
    --limit 5 --num-repeats 1
```

Observed failing rollouts (from `results/mcqa_rollouts.jsonl`):

| task | gold | model output (final box) | extracted | reward |
|------|------|--------------------------|-----------|--------|
| 1 | C | `\boxed{\text{C: An individual with a family history of Huntington's disease...}}` | `null` | 0.0 |
| 4 | E | `\boxed{\text{E}}` | `null` | 0.0 |

## Root cause
`resources_servers/mcqa/app.py`:

```python
STRICT_BOXED_PATTERN = re.compile(r"\\boxed\{\s*[^A-Za-z]*([A-Z])[^A-Za-z]*\s*\}")
```

The pattern only allows non-letters between `\boxed{` and the captured uppercase letter. In `\boxed{\text{E}}` the `\text{` prefix contains letters, so the regex never reaches `E`. `_parse_answer_letter_strict_boxed()` returns `None`, and `verify()` records `extracted_answer = None`, reward 0.
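The mismatch can be checked in isolation. The snippet below is a minimal sketch that copies the pattern from above and applies it with `re.search`; the `extract` helper is hypothetical and only stands in for whatever `_parse_answer_letter_strict_boxed()` does around the regex.

```python
import re

# Pattern copied verbatim from resources_servers/mcqa/app.py as quoted above.
STRICT_BOXED_PATTERN = re.compile(r"\\boxed\{\s*[^A-Za-z]*([A-Z])[^A-Za-z]*\s*\}")


def extract(text):
    """Hypothetical helper: return the captured letter, or None if no match."""
    match = STRICT_BOXED_PATTERN.search(text)
    return match.group(1) if match else None


print(extract(r"\boxed{B}"))         # B
print(extract(r"\boxed{\text{E}}"))  # None: "\text{" contains letters before "E"
```

The second call fails because `[^A-Za-z]*` can consume the backslash but not the `t` of `\text`, so the capture group never lines up with an uppercase letter.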

The repo already contains the helpers needed to handle this case (`_strip_latex_wrappers`, `BOXED_CONTENT_PATTERN`, `_match_option_text`), but they are only wired into the `lenient_boxed` / `lenient_answer_colon` modes — not the `strict_single_letter_boxed` default.

Note: `lenient_boxed` would recover task 1 (boxed content contains option C's full text) but NOT task 4 (`\boxed{\text{E}}` is a bare letter, not option text), so switching modes is only a partial workaround.

## Proposed fix
In `_parse_answer_letter_strict_boxed`, extract the boxed inner content with `BOXED_CONTENT_PATTERN`, run `_strip_latex_wrappers` on it, then match a single letter (optionally a leading letter followed by `:` + option text). This recovers both `\boxed{\text{E}}` and `\boxed{\text{C: ...}}` while keeping strict single-letter semantics.

Add regression tests in `resources_servers/mcqa/tests/test_app.py` for:
- `\boxed{\text{E}}` -> `E`
- `\boxed{\text{C: <full option text>}}` -> `C`
- existing `\boxed{B}` behavior unchanged
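
The proposed fix and the three regression cases could be sketched roughly as follows. This is not the repo's code: `boxed_contents`, `strip_wrappers`, and `parse_strict_boxed` are hypothetical stand-ins for `BOXED_CONTENT_PATTERN`, `_strip_latex_wrappers`, and `_parse_answer_letter_strict_boxed`, whose real definitions may differ. The list of wrapper commands and the choice to grade the final box are also assumptions.

```python
import re
from typing import List, Optional

# Hypothetical stand-ins; the real helpers in app.py may be defined differently.
WRAPPER = re.compile(r"^\\(?:textbf|text|mathrm|mathbf)\{(.*)\}$", re.DOTALL)
LETTER_ONLY = re.compile(r"^[^A-Za-z]*([A-Z])[^A-Za-z]*$")  # e.g. "E", "(E)"
LETTER_COLON = re.compile(r"^([A-Z])\s*:\s*\S")             # e.g. "C: option text"


def boxed_contents(text: str) -> List[str]:
    """Return the brace-balanced contents of every \\boxed{...} in text."""
    found = []
    for match in re.finditer(r"\\boxed\{", text):
        depth, start = 1, match.end()
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    found.append(text[start:i])
                    break
    return found


def strip_wrappers(inner: str) -> str:
    """Peel off outer \\text{...}-style wrappers until none remain."""
    inner = inner.strip()
    while True:
        match = WRAPPER.match(inner)
        if not match:
            return inner
        inner = match.group(1).strip()


def parse_strict_boxed(text: str) -> Optional[str]:
    """Return the single answer letter from the final box, or None."""
    contents = boxed_contents(text)
    if not contents:
        return None
    inner = strip_wrappers(contents[-1])  # assumption: grade the final box
    match = LETTER_ONLY.match(inner) or LETTER_COLON.match(inner)
    return match.group(1) if match else None


# Regression cases listed above.
assert parse_strict_boxed(r"\boxed{\text{E}}") == "E"
assert parse_strict_boxed(r"\boxed{\text{C: An individual with a family history}}") == "C"
assert parse_strict_boxed(r"\boxed{B}") == "B"
```

Requiring either a lone letter or a letter followed by `:` keeps the strict semantics: a boxed word such as `\boxed{\text{Cat}}` still yields no answer instead of being read as `C`.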

## Severity
Medium. Correctness bug that deflates scores for any model that formats boxed answers with `\text{}` — common for LaTeX-tuned models. Affects benchmark fidelity and model comparisons on MCQA.

---
Found while validating the new `evaluation/diagnose-results` (BLADE) docs page; unrelated to that docs change.

cc @fsiino-nvidia @bxyu-nvidia (mcqa authors) for a look.
