Add apex_2025 benchmark #1724

Open
wedu-nvidia wants to merge 1 commit into main from wedu/apex-2025
@wedu-nvidia

Contributor

No description provided.

Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia requested a review from gwarmstrong June 25, 2026 04:27
@copy-pr-bot

copy-pr-bot Bot commented Jun 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gwarmstrong gwarmstrong left a comment

Thanks for this — it's a clean, minimal benchmark-only addition on top of the already-verified math_with_judge server, and the prompt + data-schema parity check out. I ran the README commands against a hosted Nemotron endpoint: ng_prepare_benchmark (12 problems) and ng_run (servers healthy) work as written, and once the fix below is applied the rollout path runs end-to-end (\boxed{} extraction + symbolic verify produce valid rewards).

One blocker before merge: the "Collecting rollouts" command crashes as written — it's missing +prompt_config=. Left a committable suggestion. A couple of small non-blocking nits inline too, and a heads-up that the branch is currently behind main.

Comment on lines +46 to +50
ng_collect_rollouts \
+agent_name=apex_2025_math_with_judge_simple_agent \
+input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+num_repeats=4

This command crashes when run as written:

KeyError: 'responses_create_params' at nemo_gym/rollout_collection.py:274.

prepare.py writes raw rows (no responses_create_params), and ng_collect_rollouts reads prompt_config only from its own CLI overrides — not from config.yaml's prompt_config: field (that one is consumed by ng_run). So with no template to apply it errors out. Adding +prompt_config=benchmarks/prompts/generic/math.yaml fixes it (verified locally end-to-end against a hosted endpoint).

Suggested change
ng_collect_rollouts \
+agent_name=apex_2025_math_with_judge_simple_agent \
+input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+prompt_config=benchmarks/prompts/generic/math.yaml \
+num_repeats=4
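
A quick preflight check can surface this failure mode before any rollouts are launched. The sketch below is illustrative only: `rows_missing_key` is a hypothetical helper, not part of nemo_gym, and it assumes the benchmark file is plain JSONL with one JSON object per line. It reports which rows lack `responses_create_params`, i.e. the rows that would need a `+prompt_config=` override.

```python
import json
from pathlib import Path


def rows_missing_key(jsonl_path, key="responses_create_params"):
    """Return the 1-based line numbers of JSONL rows that lack `key`.

    Hypothetical helper for illustration; not part of nemo_gym.
    """
    missing = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            if key not in row:
                missing.append(lineno)
    return missing
```

If the returned list is non-empty for benchmarks/apex_2025/data/apex_2025_benchmark.jsonl, the rows are raw and the command needs a prompt template supplied on the command line.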

Also, this should be updated to use the new gym ... CLI rather than the outdated ng_collect_rollouts CLI.

Comment on lines +9 to +13
Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
default for this benchmark. The HuggingFace `math-verify` library does
symbolic equivalence of the model-extracted `\boxed{...}` answer against
`expected_answer`.

Small wording nit: "to mirror NeMo Skills' eval_type=math default for this benchmark" reads as if Skills has an apex_2025 benchmark, but it doesn't (Skills only has apex-shortlist). Suggest phrasing it as matching Skills' default math eval generally:

Suggested change
Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`), matching NeMo Skills' default math eval
(`eval_type=math` with no judge). The HuggingFace `math-verify` library does
symbolic equivalence of the model-extracted `\boxed{...}` answer against
`expected_answer`.
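
For readers unfamiliar with the `\boxed{...}` convention, the extraction step can be pictured roughly as follows. This is a minimal stdlib-only sketch, assuming the final answer is the last `\boxed{...}` in the response; `extract_last_boxed` is a hypothetical name, and this is not how `math-verify` or the `math_with_judge` server implements it. The symbolic equivalence check itself is left to the library.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Illustrative helper only: it tracks brace depth so nested groups
    such as \\frac{1}{2} are kept intact.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer present
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
    return None  # braces never balanced


answer = extract_last_boxed(r"Thus the answer is \boxed{\frac{1}{2}}.")
```

The extracted string would then be compared against `expected_answer` by a symbolic checker rather than by plain string equality, since equivalent answers can be written in different forms.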

@@ -0,0 +1,24 @@
# Chain to existing resource server + agent config

Nit: this header comment can go — the file is self-documenting, and reviewers tend to ask for rationale/narrative comments to be stripped from configs since they drift out of sync with the code.

Suggested change: delete the "# Chain to existing resource server + agent config" header comment.

Comment on lines +5 to +6
# We use `_inherit_from` directives to inherit from and not use the generic config
# above to ensure this benchmark config is isolated.

Nit: please drop this rationale comment as well — _inherit_from is self-documenting, and config-level rationale tends to rot out of sync.

Suggested change: delete both lines of this rationale comment.
