feat: [GDPval-AA v2 Updates 6 / n] - Multi-Stage Task Sampling #1746

Open: vadam5 wants to merge 12 commits into main from vadams/gdpval-multistage-sampling

vadam5 (Contributor) commented Jun 26, 2026

Implements multi-stage adaptive ELO estimation to better approximate AA's ELO estimation strategy:

Grading: We sample pairwise matches between model submissions in two stages:

Balanced sampling: We first sample each model diversely, balancing exposure across tasks, judges, and opponents, to seed initial ratings.

Active sampling: After the initial phase, we transition to Elo-informed sampling that prioritizes pairings between models with similar ratings to derive the most information per comparison. We maintain balanced exposure of tasks within each model throughout the process.
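The active-sampling idea above can be illustrated with a minimal sketch. It assumes one plausible weighting rule, not necessarily the one AA or this PR uses: each pair is weighted by p * (1 - p), where p is the standard Elo expected score, so pairs with similar ratings are drawn more often. The function names and the model names in the usage lines are hypothetical.

```python
import itertools
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def pair_weights(ratings: dict[str, float]) -> dict[tuple[str, str], float]:
    """Weight each unordered pair by p * (1 - p), which peaks when ratings are equal.

    Assumption: this weighting is illustrative only; the PR does not specify one.
    """
    weights = {}
    for a, b in itertools.combinations(sorted(ratings), 2):
        p = expected_score(ratings[a], ratings[b])
        weights[(a, b)] = p * (1.0 - p)
    return weights


def sample_pair(ratings: dict[str, float], rng: random.Random) -> tuple[str, str]:
    """Draw one pair at random, favouring closely rated models."""
    weights = pair_weights(ratings)
    pairs = list(weights)
    return rng.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]


# Hypothetical ratings, for illustration only.
ratings = {"model_a": 1000.0, "model_b": 1010.0, "model_c": 1400.0}
pair = sample_pair(ratings, random.Random(0))
```

This sketch covers only the rating-proximity part; balancing task exposure within each model would need additional bookkeeping that is not shown here.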

GDPval-AA v2

Multi-stage adaptive ELO estimation for GDPval pairwise comparison.

Instead of comparing the evaluated model against every reference model on all
tasks, this runs a sequence of stages. Each stage:

  1. fixes a set of T tasks sampled from a task-distribution JSON file (see
    responses_api_agents.stirrup_agent.task_distribution),
  2. judges the evaluated model against a set of M reference models on those
    tasks (delegated to an injected judge_stage callable),
  3. fits an anchored Bradley-Terry MLE ELO from that stage's win/loss/tie
    battles (reusing comparison.calculate_mle_elo), and
  4. uses that estimate to choose the M references for the next stage.
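Step 3 can be sketched as a one-dimensional fit: the reference ELOs stay fixed (anchored) and only the evaluated model's rating is estimated. The block below is a minimal illustration under stated assumptions, not a reproduction of comparison.calculate_mle_elo, whose signature is not shown in this PR. It assumes a tie counts as half a win and that the rating is searched within hypothetical bounds of 0 to 3000; fit_anchored_elo is a hypothetical name.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def fit_anchored_elo(
    battles: list[tuple[float, float]],
    lo: float = 0.0,
    hi: float = 3000.0,
    tol: float = 1e-6,
) -> float:
    """Maximum-likelihood rating of one model against references with fixed ELOs.

    battles holds (reference_elo, score) pairs, where score is 1.0 for a win,
    0.5 for a tie (an assumption), and 0.0 for a loss of the evaluated model.
    """
    if not battles:
        raise ValueError("need at least one battle")

    def score_gap(rating: float) -> float:
        # Proportional to the log-likelihood gradient: observed minus expected score.
        return sum(s - expected_score(rating, ref) for ref, s in battles)

    # score_gap is strictly decreasing in rating, so bisection finds its root.
    # With only wins (or only losses) the MLE is unbounded, so clamp to the bounds.
    if score_gap(lo) <= 0.0:
        return lo
    if score_gap(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score_gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# One win and one loss against a 1200-rated reference gives an estimate near 1200.
estimate = fit_anchored_elo([(1200.0, 1.0), (1200.0, 0.0)])
```

With very few battles the estimate can swing widely or hit the bounds, which is consistent with the later stages using more tasks.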

Across stages, M typically shrinks (zooming in on references whose known
ELO is closest to the evaluated model's current estimate) while T grows
(spending the saved judge budget on a tighter final estimate).
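The "zooming in" step can be sketched as a nearest-neighbour selection on known ELOs. This is a minimal illustration; closest_references and the reference names and ratings below are hypothetical, and breaking distance ties by name is an assumption.

```python
def closest_references(ref_elos: dict[str, float], estimate: float, m: int) -> list[str]:
    """Return the m references whose known ELO is nearest to the current estimate."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # Sort by absolute distance to the estimate; break ties by name for determinism.
    ranked = sorted(ref_elos, key=lambda name: (abs(ref_elos[name] - estimate), name))
    return sorted(ranked[:m])


# Hypothetical reference ratings, for illustration only.
refs = {"ref_a": 1000.0, "ref_b": 1150.0, "ref_c": 1220.0, "ref_d": 1400.0}
next_stage_refs = closest_references(refs, estimate=1208.1, m=2)  # ['ref_b', 'ref_c']
```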

To align with AA's evaluation setup, we set the number of stages to 2. The first stage uses all reference models passed in the config. The second stage uses a subset of M reference models whose ELOs are closest to the stage-one estimate. The exact number of tasks per stage (T) and the number of second-stage reference models (M) will be determined experimentally.

Smoke Test Results:

I tried a small smoke test with two stages:
Stage 1: All models on 1 sampled task
Stage 2: Closest 2 models on 2 sampled tasks
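The two-stage smoke-test plan above can be written down as data. This is a hedged sketch: StagePlan and its fields are hypothetical and do not reflect the PR's actual config schema. It counts (task, reference) pairs per stage, which is not necessarily the same unit as the "judged N/N" counters in the log, since the PR does not define that unit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StagePlan:
    """One stage: how many tasks to sample and how many references to compare against."""

    num_tasks: int
    num_refs: Optional[int] = None  # None means "all configured references"


def refs_in_stage(stage: StagePlan, total_refs: int) -> int:
    """Number of references actually used, capped at the number configured."""
    return total_refs if stage.num_refs is None else min(stage.num_refs, total_refs)


def task_reference_pairs(plan: Sequence[StagePlan], total_refs: int) -> int:
    """Total (task, reference) pairs across all stages of the plan."""
    return sum(stage.num_tasks * refs_in_stage(stage, total_refs) for stage in plan)


# Stage 1: all models on 1 task. Stage 2: closest 2 models on 2 tasks.
SMOKE_TEST_PLAN = (StagePlan(num_tasks=1), StagePlan(num_tasks=2, num_refs=2))
pairs = task_reference_pairs(SMOKE_TEST_PLAN, total_refs=5)  # 1 * 5 + 2 * 2 = 9
```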

The evaluated model was the nemotron3 ultra checkpoint, which has an ELO of 1168 on the v2 benchmark. The smoke test estimated 1134 (1133.8 in the log below), about 34 points lower.

[multistage-elo] built task distribution over ['occupation'] from /lustre/fs1/portfolios/llmservice/projects/llmservice_nemotron_ultra/users/vadams/Gym/benchmarks/gdpval/data/gdpval_benchmark.jsonl -> /lustre/fs1/portfolios/llmservice/projects/llmservice_nemotron_ultra/users/vadams/Gym/resources_servers/gdpval/data/distributions/occupation_distribution.json
[multistage-elo] planned 2 stage(s); tasks per stage: [1, 2]
[multistage-elo] stage 1/2: 1 task(s) vs 5 ref(s) ['glm51', 'kimi_k25', 'minimax_m27', 'nemotron3_ultra_ga', 'qwen35_397b'] (prior ELO: n/a)
[multistage-elo]   judged 2/2 (task 7151c60a-d4cb-4fc4…)
[multistage-elo] stage 1/2 done: eval ELO = 1208.1 (fit over 5 ref(s))
[multistage-elo] stage 2/2: 2 task(s) vs 2 ref(s) ['minimax_m27', 'nemotron3_ultra_ga'] (prior ELO: 1208.1)
[multistage-elo]   judged 4/4 (task 3f625cb2-f40e-4ead…)
[multistage-elo] stage 2/2 done: eval ELO = 1133.8 (fit over 2 ref(s))
Wrote ELO summary (2 stages, final_eval_elo=1133.8313742888708) to elo_smoketest.json

vadam5 added 2 commits June 25, 2026 17:06
Signed-off-by: Virginia Wu <vadams@nvidia.com>
Signed-off-by: Virginia Wu <vadams@nvidia.com>
copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

vadam5 marked this pull request as ready for review June 26, 2026 06:22
vadam5 requested a review from agronskiy June 26, 2026 06:23
vadam5 and others added 5 commits June 26, 2026 17:06
Signed-off-by: Virginia Wu <vadams@nvidia.com>
…-NeMo/Gym into vadams/gdpval-multistage-sampling

Signed-off-by: Virginia Wu <vadams@nvidia.com>
Signed-off-by: Virginia Wu <vadams@nvidia.com>
Signed-off-by: Virginia Wu <vadams@nvidia.com>