Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 51 additions & 0 deletions benchmarks/apex_2025/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# APEX 2025

Math problems from MathArena's APEX 2025 benchmark, sourced from
`MathArena/apex_2025` on HuggingFace. This benchmark is intended as a
newer alternative to `apex_shortlist`.

## Verification

Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
default for this benchmark. The HuggingFace `math-verify` library does
symbolic equivalence of the model-extracted `\boxed{...}` answer against
`expected_answer`.
Comment on lines +9 to +13

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small wording nit: "to mirror NeMo Skills' eval_type=math default for this benchmark" reads as if Skills has an apex_2025 benchmark, but it doesn't (Skills only has apex-shortlist). Suggest phrasing it as matching Skills' default math eval generally:

Suggested change
Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
default for this benchmark. The HuggingFace `math-verify` library does
symbolic equivalence of the model-extracted `\boxed{...}` answer against
`expected_answer`.
Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`), matching NeMo Skills' default math eval
(`eval_type=math` with no judge). The HuggingFace `math-verify` library does
symbolic equivalence of the model-extracted `\boxed{...}` answer against
`expected_answer`.


## Prompt

User-only prompt, character-for-character match with NeMo Skills'
`generic/math.yaml`:

```
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.

<question>
```

## Data preparation

```bash
ng_prepare_benchmark '+config_paths=[benchmarks/apex_2025/config.yaml]'
```

Writes `data/apex_2025_benchmark.jsonl` with one row per problem:
`{"problem_idx": 1, "source": "...", "question": "...", "expected_answer": "..."}`.

## Running servers

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/apex_2025/config.yaml"
ng_run "+config_paths=[$config_paths]"
```

## Collecting rollouts

```bash
ng_collect_rollouts \
+agent_name=apex_2025_math_with_judge_simple_agent \
+input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+num_repeats=4
Comment on lines +46 to +50

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command crashes when run as written:

KeyError: 'responses_create_params' at nemo_gym/rollout_collection.py:274.

prepare.py writes raw rows (no responses_create_params), and ng_collect_rollouts reads prompt_config only from its own CLI overrides — not from config.yaml's prompt_config: field (that one is consumed by ng_run). So with no template to apply it errors out. Adding +prompt_config=benchmarks/prompts/generic/math.yaml fixes it (verified locally end-to-end against a hosted endpoint).

Suggested change
ng_collect_rollouts \
+agent_name=apex_2025_math_with_judge_simple_agent \
+input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+num_repeats=4
ng_collect_rollouts \
+agent_name=apex_2025_math_with_judge_simple_agent \
+input_jsonl_fpath=benchmarks/apex_2025/data/apex_2025_benchmark.jsonl \
+output_jsonl_fpath=results/apex_2025_rollouts.jsonl \
+prompt_config=benchmarks/prompts/generic/math.yaml \
+num_repeats=4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, this should be updated to use the new gym ... cli rather than the outdated ng_collect_rollouts cli

```
Empty file.
24 changes: 24 additions & 0 deletions benchmarks/apex_2025/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Chain to existing resource server + agent config

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this header comment can go — the file is self-documenting, and reviewers tend to ask for rationale/narrative comments to be stripped from configs since they drift out of sync with the code.

Suggested change
# Chain to existing resource server + agent config

config_paths:
- resources_servers/math_with_judge/configs/math_with_judge.yaml

# We use `_inherit_from` directives to inherit from and not use the generic config
# above to ensure this benchmark config is isolated.
Comment on lines +5 to +6

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please drop this rationale comment as well — _inherit_from is self-documenting, and config-level rationale tends to rot out of sync.

Suggested change
# We use `_inherit_from` directives to inherit from and not use the generic config
# above to ensure this benchmark config is isolated.

apex_2025_math_with_judge_resources_server:
_inherit_from: math_with_judge
resources_servers:
math_with_judge:
should_use_judge: false

apex_2025_math_with_judge_simple_agent:
_inherit_from: math_with_judge_simple_agent
responses_api_agents:
simple_agent:
resources_server:
name: apex_2025_math_with_judge_resources_server
datasets:
- name: apex_2025
type: benchmark
jsonl_fpath: benchmarks/apex_2025/data/apex_2025_benchmark.jsonl
prompt_config: benchmarks/prompts/generic/math.yaml
prepare_script: benchmarks/apex_2025/prepare.py
1 change: 1 addition & 0 deletions benchmarks/apex_2025/data/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*benchmark.jsonl
54 changes: 54 additions & 0 deletions benchmarks/apex_2025/prepare.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare the APEX 2025 benchmark data."""

import json
from pathlib import Path

from datasets import load_dataset


BENCHMARK_DIR = Path(__file__).parent
DATA_DIR = BENCHMARK_DIR / "data"
OUTPUT_FPATH = DATA_DIR / "apex_2025_benchmark.jsonl"

HF_REPO_ID = "MathArena/apex_2025"


def prepare() -> Path:
"""Download and prepare APEX 2025 data. Returns the output file path."""
DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Loading APEX 2025 data from {HF_REPO_ID}...")
ds = load_dataset(HF_REPO_ID, split="train")

count = 0
with open(OUTPUT_FPATH, "w", encoding="utf-8") as f:
for row in ds:
out = {
"problem_idx": row["problem_idx"],
"source": row["source"],
"question": row["problem"],
"expected_answer": str(row["answer"]),
}
f.write(json.dumps(out, ensure_ascii=False) + "\n")
count += 1

print(f"Wrote {count} problems to {OUTPUT_FPATH}")
return OUTPUT_FPATH


if __name__ == "__main__":
prepare()
Loading