feat(instruction_following): add loose-mode evaluation and compute_metrics by rodboev · Pull Request #1720 · NVIDIA-NeMo/Gym

rodboev · 2026-06-25T01:12:13Z

The instruction_following server currently evaluates each instruction in strict mode only and returns no aggregate metrics from compute_metrics(). As a result, rollouts_aggregate_metrics.json omits all four IFEval accuracy dimensions: prompt_strict_accuracy, instruction_strict_accuracy, prompt_loose_accuracy, and instruction_loose_accuracy.
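To make the four dimensions concrete, here is a minimal sketch of how they could be aggregated. The function name `aggregate_ifeval_accuracy` and the strict field name `follow_instruction_list` are assumptions made for illustration, and the input is a flat list of per-response results rather than the server's actual data shape. Prompt-level accuracy counts a response only when every instruction in it passes; instruction-level accuracy counts each instruction separately.

```python
from typing import Dict, List


def aggregate_ifeval_accuracy(results: List[Dict[str, List[bool]]]) -> Dict[str, float]:
    """Aggregate per-response instruction outcomes into four scores (0-100).

    Hypothetical helper: field names are assumed, not taken from the PR diff.
    """
    scores: Dict[str, float] = {}
    modes = (
        ("strict", "follow_instruction_list"),
        ("loose", "follow_instruction_list_loose"),
    )
    for mode, key in modes:
        # One boolean per instruction, pooled across all responses.
        per_instruction = [ok for r in results for ok in r[key]]
        # One boolean per response: did it follow every instruction?
        per_prompt = [all(r[key]) for r in results]
        scores[f"prompt_{mode}_accuracy"] = (
            100.0 * sum(per_prompt) / len(per_prompt) if per_prompt else 0.0
        )
        scores[f"instruction_{mode}_accuracy"] = (
            100.0 * sum(per_instruction) / len(per_instruction) if per_instruction else 0.0
        )
    return scores


results = [
    {"follow_instruction_list": [True, False], "follow_instruction_list_loose": [True, True]},
    {"follow_instruction_list": [True], "follow_instruction_list_loose": [True]},
]
scores = aggregate_ifeval_accuracy(results)
```

In this toy input, one of the two responses passes every instruction in strict mode and two of the three individual instructions pass, so the prompt-level and instruction-level strict scores differ even though both are computed from the same booleans.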

This PR adds loose-mode evaluation and the missing compute_metrics() override:

- Adds follow_all_instructions_loose and follow_instruction_list_loose to InstructionFollowingVerifyResponse
- Adds a _check_following_loose helper that uses a native loose API if available (check_following_loose / check_following(..., mode="loose")) and otherwise falls back to the IFEval perturbation set: the original text plus its without-first-line, without-last-line, and without-first-and-last-line variants, each duplicated with asterisks removed, matching the NeMo Skills convention
- Computes both strict and loose results from the same instruction instance in a single loop inside verify(), avoiding double instantiation
- Implements compute_metrics(tasks: List[List[Dict]]) on InstructionFollowingResourcesServer, matching the base class signature, and aggregates all four IFEval accuracy scores as percentages (0-100), consistent with compute_pass_majority_metrics
- Adds focused tests covering loose field presence, the loose >= strict invariant, and compute_metrics() arithmetic
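The fallback perturbation set described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `loose_variants` and `check_loose` are hypothetical names, and `checker` is a plain callable standing in for an instruction's strict check.

```python
from typing import Callable, List


def loose_variants(response: str) -> List[str]:
    """Build the eight loose-mode variants of a response."""
    lines = response.split("\n")
    base = [
        response,                        # original text
        "\n".join(lines[1:]).strip(),    # without first line
        "\n".join(lines[:-1]).strip(),   # without last line
        "\n".join(lines[1:-1]).strip(),  # without first and last line
    ]
    # Each variant is duplicated with asterisks (markdown emphasis) removed.
    return base + [v.replace("*", "") for v in base]


def check_loose(response: str, checker: Callable[[str], bool]) -> bool:
    """Pass if any non-blank variant satisfies the strict checker."""
    return any(v.strip() and checker(v) for v in loose_variants(response))


# Stand-in for an instruction's strict check (assumption for this sketch).
def starts_with_answer(text: str) -> bool:
    return text.startswith("Answer:")


response = "Sure, here you go:\nAnswer: 42"
strict_ok = starts_with_answer(response)
loose_ok = check_loose(response, starts_with_answer)
bold_ok = check_loose("**Answer: 42**", starts_with_answer)
```

Because the original text is always one of the variants, any non-blank response that passes the strict check also passes the loose check, which is the loose >= strict invariant the tests cover.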

No new dependencies. verifiable_instructions is already in requirements.txt.

Closes #1160

Checklist

Targeted tests pass: uv run pytest resources_servers/instruction_following/tests/test_app.py -x -v
Touched-file lint passes: ruff check --config pyproject.toml resources_servers/instruction_following/app.py resources_servers/instruction_following/tests/test_app.py
Touched-file format check passes: ruff format --config pyproject.toml --check resources_servers/instruction_following/app.py resources_servers/instruction_following/tests/test_app.py
No changes to benchmark configs, data files, or nemo_gym core
DCO sign-off on all commits

Targeted tests were not run locally; upstream CI is the source of truth.

copy-pr-bot · 2026-06-25T01:12:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rodboev added 3 commits June 24, 2026 20:50

feat(instruction_following): add loose-mode evaluation and compute_me…

180a485

…trics Signed-off-by: Rod Boev <rod.boev@gmail.com>

fix(instruction_following): align loose perturbation set with IFEval/…

194c031

…Skills spec Signed-off-by: Rod Boev <rod.boev@gmail.com>

fix(instruction_following): scale compute_metrics to 0-100 to match G…

4105620

…ym convention Signed-off-by: Rod Boev <rod.boev@gmail.com>

nemo-automation-bot (bot) added the community-request label (issue reported or requested by someone from the community) on Jun 25, 2026

Keep loose-mode evaluation branch consistent with repo formatting

26746b5

Signed-off-by: Rod Boev <rod.boev@gmail.com>

svcnvidia-nemo-ci added the waiting-on-maintainers label (waiting on maintainers to respond) on Jun 27, 2026

feat(instruction_following): add loose-mode evaluation and compute_metrics #1720
rodboev wants to merge 4 commits into NVIDIA-NeMo:main from rodboev:pr/instruction-following-loose-mode

2 participants