Add apex_2025 benchmark by wedu-nvidia · Pull Request #1724 · NVIDIA-NeMo/Gym

wedu-nvidia · 2026-06-25T04:27:25Z

No description provided.

Signed-off-by: Wei Du <wedu@nvidia.com>

copy-pr-bot · 2026-06-25T04:27:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gwarmstrong

Thanks for this — it's a clean, minimal benchmark-only addition on top of the already-verified math_with_judge server, and the prompt + data-schema parity check out. I ran the README commands against a hosted Nemotron endpoint: ng_prepare_benchmark (12 problems) and ng_run (servers healthy) work as written, and once the fix below is applied the rollout path runs end-to-end (\boxed{} extraction + symbolic verify produce valid rewards).

One blocker before merge: the "Collecting rollouts" command crashes as written — it's missing +prompt_config=. Left a committable suggestion. A couple of small non-blocking nits inline too, and a heads-up that the branch is currently behind main.

gwarmstrong · 2026-06-25T21:50:25Z

+ng_collect_rollouts \
+    +agent_name=apex_2025_math_with_judge_simple_agent \
+    +input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+    +output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+    +num_repeats=4


This command crashes when run as written:

KeyError: 'responses_create_params' at nemo_gym/rollout_collection.py:274.

prepare.py writes raw rows (no responses_create_params), and ng_collect_rollouts reads prompt_config only from its own CLI overrides — not from config.yaml's prompt_config: field (that one is consumed by ng_run). So with no template to apply it errors out. Adding +prompt_config=benchmarks/prompts/generic/math.yaml fixes it (verified locally end-to-end against a hosted endpoint).

Suggested change

-ng_collect_rollouts \
-    +agent_name=apex_2025_math_with_judge_simple_agent \
-    +input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
-    +output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
-    +num_repeats=4
+ng_collect_rollouts \
+    +agent_name=apex_2025_math_with_judge_simple_agent \
+    +input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+    +output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+    +prompt_config=benchmarks/prompts/generic/math.yaml \
+    +num_repeats=4

Also, this should be updated to use the new gym ... CLI rather than the outdated ng_collect_rollouts CLI.
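The KeyError in this comment comes from input rows that lack `responses_create_params` when no prompt template is supplied. A quick pre-flight check on the prepared JSONL can surface that before a rollout run. The sketch below is standalone Python and is not part of NeMo Gym; the helper name `rows_missing_key` is hypothetical, and the only assumption taken from the review is the key name itself.

```python
import json
from pathlib import Path


def rows_missing_key(jsonl_path, key="responses_create_params"):
    """Return 1-based line numbers of JSONL rows that lack `key`.

    Hypothetical helper for illustration; not a NeMo Gym API.
    """
    missing = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            if key not in row:
                missing.append(lineno)
    return missing
```

If every row of `benchmarks/apex_2025/data/apex_2025_benchmark.jsonl` shows up in the returned list, the file holds raw rows and, per the explanation above, the rollout command needs `+prompt_config=...` on its own command line.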

gwarmstrong · 2026-06-25T21:50:40Z

+Reuses the `math_with_judge` resource server in **symbolic-only** mode
+(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
+default for this benchmark. The HuggingFace `math-verify` library does
+symbolic equivalence of the model-extracted `\boxed{...}` answer against
+`expected_answer`.


Small wording nit: "to mirror NeMo Skills' eval_type=math default for this benchmark" reads as if Skills has an apex_2025 benchmark, but it doesn't (Skills only has apex-shortlist). Suggest phrasing it as matching Skills' default math eval generally:

Suggested change

-Reuses the `math_with_judge` resource server in **symbolic-only** mode
-(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
-default for this benchmark. The HuggingFace `math-verify` library does
-symbolic equivalence of the model-extracted `\boxed{...}` answer against
-`expected_answer`.
+Reuses the `math_with_judge` resource server in **symbolic-only** mode
+(`should_use_judge: false`), matching NeMo Skills' default math eval
+(`eval_type=math` with no judge). The HuggingFace `math-verify` library does
+symbolic equivalence of the model-extracted `\boxed{...}` answer against
+`expected_answer`.
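For readers unfamiliar with the `\boxed{...}` step the README text refers to, here is a minimal sketch of the extraction alone. It is not how `math-verify` or the `math_with_judge` server implements it: the function names are hypothetical, and a whitespace-insensitive string comparison stands in for symbolic equivalence.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def naive_reward(response, expected_answer):
    """1.0 on an exact match after stripping whitespace, else 0.0.

    Stand-in for symbolic verification; illustration only.
    """
    answer = extract_last_boxed(response)
    if answer is None:
        return 0.0
    strip_ws = lambda s: "".join(s.split())
    return 1.0 if strip_ws(answer) == strip_ws(expected_answer) else 0.0
```

With this stand-in, `\boxed{0.5}` scores 0.0 against an expected answer of `\frac{1}{2}`, which is the kind of case a symbolic equivalence check is meant to handle.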

gwarmstrong · 2026-06-25T21:50:48Z

@@ -0,0 +1,24 @@
+# Chain to existing resource server + agent config


Nit: this header comment can go — the file is self-documenting, and reviewers tend to ask for rationale/narrative comments to be stripped from configs since they drift out of sync with the code.

Suggested change

-# Chain to existing resource server + agent config

gwarmstrong · 2026-06-25T21:50:56Z

+# We use `_inherit_from` directives to inherit from and not use the generic config
+# above to ensure this benchmark config is isolated.


Nit: please drop this rationale comment as well — _inherit_from is self-documenting, and config-level rationale tends to rot out of sync.

Suggested change

-# We use `_inherit_from` directives to inherit from and not use the generic config
-# above to ensure this benchmark config is isolated.

Add apex_2025 benchmark

9c25919

Signed-off-by: Wei Du <wedu@nvidia.com>

wedu-nvidia requested a review from gwarmstrong June 25, 2026 04:27

gwarmstrong requested changes Jun 25, 2026

wedu-nvidia wants to merge 1 commit into main from wedu/apex-2025
