feat: [GDPval-AA v2 Updates 6 / n] - Multi-Stage Task Sampling by vadam5 · Pull Request #1746 · NVIDIA-NeMo/Gym

vadam5 · 2026-06-26T00:38:57Z

Implements multi-stage adaptive ELO estimation to better approximate AA's ELO estimation strategy:

Grading: We sample pairwise matches between model submissions in two stages:

Balanced sampling: We first sample each model diversely, balancing exposure across tasks, judges, and opponents, to seed initial ratings.

Active sampling: After the initial phase, we transition to Elo-informed sampling that prioritizes pairings between models with similar ratings to derive the most information per comparison. We maintain balanced exposure of tasks within each model throughout the process.
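The claim that similarly rated pairings give the most information per comparison can be checked numerically. The sketch below assumes the standard Elo/Bradley-Terry logistic win probability on a 400-point scale (an assumption; AA's exact model is not stated here) and evaluates p(1 - p), which is the Fisher information of a single win/loss outcome up to a constant factor. The function names are illustrative only.

```python
def win_probability(rating_gap: float) -> float:
    """Probability of beating an opponent rated `rating_gap` points lower."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def comparison_information(rating_gap: float) -> float:
    """p * (1 - p): information from one comparison, up to a constant factor."""
    p = win_probability(rating_gap)
    return p * (1.0 - p)


if __name__ == "__main__":
    # Information shrinks as the rating gap between the two models grows.
    for gap in (0, 100, 200, 400):
        print(gap, round(comparison_information(gap), 4))
```

Under this model the value peaks at 0.25 for equally rated models and falls to roughly a third of that at a 400-point gap, which is why pairing near-equals is the efficient choice.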

GDPval-AA v2

Multi-stage adaptive ELO estimation for GDPval pairwise comparison.

Instead of comparing the evaluated model against every reference model on all
tasks, this runs a sequence of stages. Each stage:

1. fixes a set of T tasks sampled from a task-distribution JSON file (see
   responses_api_agents.stirrup_agent.task_distribution),
2. judges the evaluated model against a set of M reference models on those
   tasks (delegated to an injected judge_stage callable),
3. fits an anchored Bradley-Terry MLE ELO from that stage's win/loss/tie
   battles (reusing comparison.calculate_mle_elo), and
4. uses that estimate to choose the M references for the next stage.
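The fitting step above can be illustrated with a minimal sketch. This is not the code in comparison.calculate_mle_elo; it is a hypothetical stand-in that assumes the reference ELOs are held fixed (anchored), that a tie counts as half a win, and that ratings use the usual 400-point logistic scale. The function name `fit_anchored_elo` and its search bounds are made up for illustration.

```python
from typing import Iterable, Tuple


def expected_score(rating: float, opponent: float) -> float:
    """Logistic (Bradley-Terry) win probability on the 400-point Elo scale."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def fit_anchored_elo(
    battles: Iterable[Tuple[float, float]],
    lo: float = 0.0,
    hi: float = 3000.0,
    tol: float = 1e-6,
) -> float:
    """MLE rating for one model against references with fixed ratings.

    Each battle is (reference_elo, score), where score is 1.0 for a win,
    0.5 for a tie (assumption), and 0.0 for a loss.
    """
    battles = list(battles)
    if not battles:
        raise ValueError("need at least one battle")

    def score_gap(rating: float) -> float:
        # Proportional to the log-likelihood gradient; decreasing in rating.
        return sum(score - expected_score(rating, ref) for ref, score in battles)

    # Bisection on the gradient. If the model won (or lost) every battle,
    # the MLE is unbounded and the result is pinned to hi (or lo).
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score_gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With very few battles, as in a one-task stage, the estimate is coarse and can sit at a search bound, which is one reason later stages spend more tasks on fewer references.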

Across stages, M typically shrinks (zooming in on references whose known
ELO is closest to the evaluated model's current estimate) while T grows
(spending the saved judge budget on a tighter final estimate).

To align with AA's evaluation setup, we will set the number of stages to 2. The first stage uses all reference models passed in the config; the second stage uses a subset of M reference models whose ELOs are closest to the stage-one estimate. The exact number of tasks in each stage (T) and the number of reference models in the second stage (M) will be determined experimentally.
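The control flow described here can be sketched as a short loop. This is a hypothetical outline, not the PR's implementation: `run_stages`, the `plan` format, and the callable signatures are assumptions, and uniform task sampling stands in for sampling from the task-distribution JSON file.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (reference name, evaluated model's score: 1.0 win, 0.5 tie, 0.0 loss)
Battle = Tuple[str, float]


def run_stages(
    task_ids: Sequence[str],
    ref_elos: Dict[str, float],
    plan: Sequence[Tuple[int, Optional[int]]],
    judge_stage: Callable[[List[str], List[str]], List[Battle]],
    fit_elo: Callable[[List[Battle], Dict[str, float]], float],
    seed: int = 0,
) -> List[dict]:
    """Run each (num_tasks, num_refs) stage; num_refs=None means all references."""
    rng = random.Random(seed)
    estimate: Optional[float] = None
    history: List[dict] = []
    for num_tasks, num_refs in plan:
        names = sorted(ref_elos)
        if estimate is not None and num_refs is not None:
            # Keep the references whose known ELO is closest to the estimate.
            closest = sorted(names, key=lambda n: abs(ref_elos[n] - estimate))
            names = sorted(closest[:num_refs])
        # Simplification: uniform sampling instead of a task distribution.
        tasks = rng.sample(list(task_ids), num_tasks)
        battles = judge_stage(tasks, names)
        # Each stage is fit from that stage's battles only.
        estimate = fit_elo(battles, ref_elos)
        history.append({"tasks": tasks, "refs": names, "elo": estimate})
    return history
```

A plan of `[(1, None), (2, 2)]` would mirror the smoke test configuration: all references on one task, then the two closest references on two tasks.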

Smoke Test Results:

I tried a small smoke test with two stages:
Stage 1: All models on 1 sampled task
Stage 2: Closest 2 models on 2 sampled tasks

I evaluated the nemotron3 ultra checkpoint, which has an ELO of 1168 on the v2 benchmark. This small smoke test estimated 1134, about 34 points lower.

[multistage-elo] built task distribution over ['occupation'] from /lustre/fs1/portfolios/llmservice/projects/llmservice_nemotron_ultra/users/vadams/Gym/benchmarks/gdpval/data/gdpval_benchmark.jsonl -> /lustre/fs1/portfolios/llmservice/projects/llmservice_nemotron_ultra/users/vadams/Gym/resources_servers/gdpval/data/distributions/occupation_distribution.json
[multistage-elo] planned 2 stage(s); tasks per stage: [1, 2]
[multistage-elo] stage 1/2: 1 task(s) vs 5 ref(s) ['glm51', 'kimi_k25', 'minimax_m27', 'nemotron3_ultra_ga', 'qwen35_397b'] (prior ELO: n/a)
[multistage-elo]   judged 2/2 (task 7151c60a-d4cb-4fc4…)
[multistage-elo] stage 1/2 done: eval ELO = 1208.1 (fit over 5 ref(s))
[multistage-elo] stage 2/2: 2 task(s) vs 2 ref(s) ['minimax_m27', 'nemotron3_ultra_ga'] (prior ELO: 1208.1)
[multistage-elo]   judged 4/4 (task 3f625cb2-f40e-4ead…)
[multistage-elo] stage 2/2 done: eval ELO = 1133.8 (fit over 2 ref(s))
Wrote ELO summary (2 stages, final_eval_elo=1133.8313742888708) to elo_smoketest.json

Signed-off-by: Virginia Wu <vadams@nvidia.com>

copy-pr-bot · 2026-06-26T00:39:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


vadam5 added 2 commits June 25, 2026 17:06

added task distribution calculation code for gdpval and other datasets

996d9e5

Signed-off-by: Virginia Wu <vadams@nvidia.com>

made task_distribution default to occupation column

9057f17

Signed-off-by: Virginia Wu <vadams@nvidia.com>

vadam5 and others added 5 commits June 25, 2026 17:44

added occupation distribution

e05d8e0

Signed-off-by: Virginia Wu <vadams@nvidia.com>

Merge branch 'main' into vadams/gdpval-multistage-sampling

df0962e

shouldn't include occupation data file in repo

846441e

Signed-off-by: Virginia Wu <vadams@nvidia.com>

added multistage elo estimation

85ef58b

Signed-off-by: Virginia Wu <vadams@nvidia.com>

Merge branch 'main' into vadams/gdpval-multistage-sampling

b392ed9

vadam5 marked this pull request as ready for review June 26, 2026 06:22

vadam5 requested a review from agronskiy June 26, 2026 06:23

vadam5 and others added 5 commits June 26, 2026 17:06

multistage-elo E2E smoke test works

af7f344

Signed-off-by: Virginia Wu <vadams@nvidia.com>

Merge branch 'vadams/gdpval-multistage-sampling' of github.com:NVIDIA-NeMo/Gym into vadams/gdpval-multistage-sampling

125aa5b

Signed-off-by: Virginia Wu <vadams@nvidia.com>

trimmed white space

b11f446

Signed-off-by: Virginia Wu <vadams@nvidia.com>

removed smoked test yaml

ce40b2f

Signed-off-by: Virginia Wu <vadams@nvidia.com>

Merge branch 'main' into vadams/gdpval-multistage-sampling

669bab5

vadam5 wants to merge 12 commits into main from vadams/gdpval-multistage-sampling
