Add apex_2025 benchmark#1724
Conversation
Signed-off-by: Wei Du <wedu@nvidia.com>
gwarmstrong
left a comment
There was a problem hiding this comment.
Thanks for this — it's a clean, minimal benchmark-only addition on top of the already-verified math_with_judge server, and the prompt + data-schema parity check out. I ran the README commands against a hosted Nemotron endpoint: ng_prepare_benchmark (12 problems) and ng_run (servers healthy) work as written, and once the fix below is applied the rollout path runs end-to-end (\boxed{} extraction + symbolic verify produce valid rewards).
One blocker before merge: the "Collecting rollouts" command crashes as written — it's missing +prompt_config=. Left a committable suggestion. A couple of small non-blocking nits inline too, and a heads-up that the branch is currently behind main.
| ng_collect_rollouts \ | ||
| +agent_name=apex_2025_math_with_judge_simple_agent \ | ||
| +input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \ | ||
| +output_jsonl_fpath=results/apex_2025_rollouts.jsonl \ | ||
| +num_repeats=4 |
There was a problem hiding this comment.
This command crashes when run as written:
KeyError: 'responses_create_params' at nemo_gym/rollout_collection.py:274.
prepare.py writes raw rows (no responses_create_params), and ng_collect_rollouts reads prompt_config only from its own CLI overrides — not from config.yaml's prompt_config: field (that one is consumed by ng_run). So with no template to apply it errors out. Adding +prompt_config=benchmarks/prompts/generic/math.yaml fixes it (verified locally end-to-end against a hosted endpoint).
| ng_collect_rollouts \ | |
| +agent_name=apex_2025_math_with_judge_simple_agent \ | |
| +input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \ | |
| +output_jsonl_fpath=results/apex_2025_rollouts.jsonl \ | |
| +num_repeats=4 | |
| ng_collect_rollouts \ | |
| +agent_name=apex_2025_math_with_judge_simple_agent \ | |
| +input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \ | |
| +output_jsonl_fpath=results/apex_2025_rollouts.jsonl \ | |
| +prompt_config=benchmarks/prompts/generic/math.yaml \ | |
| +num_repeats=4 |
There was a problem hiding this comment.
also, this should be updated to use the new gym ... cli rather than the outdated ng_collect_rollouts cli
| Reuses the `math_with_judge` resource server in **symbolic-only** mode | ||
| (`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math` | ||
| default for this benchmark. The HuggingFace `math-verify` library does | ||
| symbolic equivalence of the model-extracted `\boxed{...}` answer against | ||
| `expected_answer`. |
There was a problem hiding this comment.
Small wording nit: "to mirror NeMo Skills' eval_type=math default for this benchmark" reads as if Skills has an apex_2025 benchmark, but it doesn't (Skills only has apex-shortlist). Suggest phrasing it as matching Skills' default math eval generally:
| Reuses the `math_with_judge` resource server in **symbolic-only** mode | |
| (`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math` | |
| default for this benchmark. The HuggingFace `math-verify` library does | |
| symbolic equivalence of the model-extracted `\boxed{...}` answer against | |
| `expected_answer`. | |
| Reuses the `math_with_judge` resource server in **symbolic-only** mode | |
| (`should_use_judge: false`), matching NeMo Skills' default math eval | |
| (`eval_type=math` with no judge). The HuggingFace `math-verify` library does | |
| symbolic equivalence of the model-extracted `\boxed{...}` answer against | |
| `expected_answer`. |
| @@ -0,0 +1,24 @@ | |||
| # Chain to existing resource server + agent config | |||
There was a problem hiding this comment.
Nit: this header comment can go — the file is self-documenting, and reviewers tend to ask for rationale/narrative comments to be stripped from configs since they drift out of sync with the code.
| # Chain to existing resource server + agent config |
| # We use `_inherit_from` directives to inherit from and not use the generic config | ||
| # above to ensure this benchmark config is isolated. |
There was a problem hiding this comment.
Nit: please drop this rationale comment as well — _inherit_from is self-documenting, and config-level rationale tends to rot out of sync.
| # We use `_inherit_from` directives to inherit from and not use the generic config | |
| # above to ensure this benchmark config is isolated. |
No description provided.