
feat(instruction_following): add loose-mode evaluation and compute_metrics #1720

Open
rodboev wants to merge 4 commits into NVIDIA-NeMo:main from rodboev:pr/instruction-following-loose-mode

rodboev commented Jun 25, 2026

The instruction_following server currently evaluates each instruction in strict mode only and returns no aggregate metrics from compute_metrics(). As a result, rollouts_aggregate_metrics.json omits all four IFEval accuracy dimensions: prompt_strict_accuracy, instruction_strict_accuracy, prompt_loose_accuracy, and instruction_loose_accuracy.

This PR adds loose-mode evaluation and the missing compute_metrics() override:

  • Adds follow_all_instructions_loose and follow_instruction_list_loose to InstructionFollowingVerifyResponse
  • Adds a _check_following_loose helper that uses a native loose API when one is available (check_following_loose or check_following(..., mode="loose")) and otherwise falls back to the IFEval perturbation set: the original text plus its without-first-line, without-last-line, and without-first-and-last-line variants, each also tried with asterisks removed (eight candidates in total), matching the NeMo Skills convention
  • Computes both the strict and loose results from the same instruction instance in a single loop inside verify(), so each instruction is instantiated only once
  • Implements compute_metrics(tasks: List[List[Dict]]) on InstructionFollowingResourcesServer, matching the base class signature, and aggregates all four IFEval accuracy scores as percentages (0-100), consistent with compute_pass_majority_metrics
  • Adds focused tests covering loose field presence, the loose >= strict invariant, and compute_metrics() arithmetic
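
For reviewers, the fallback path and the loose >= strict invariant can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `loose_variants` and `follows_loosely` are hypothetical names, and `check` stands in for an instruction's strict check_following method. The empty-variant guard is an assumption borrowed from the usual IFEval convention.

```python
from typing import Callable, List


def loose_variants(response: str) -> List[str]:
    """Build the eight candidate texts used by the loose fallback (hypothetical helper)."""
    lines = response.split("\n")
    without_first = "\n".join(lines[1:]).strip()
    without_last = "\n".join(lines[:-1]).strip()
    without_both = "\n".join(lines[1:-1]).strip()
    base = [response, without_first, without_last, without_both]
    # Each variant is also tried with markdown asterisks removed.
    return base + [variant.replace("*", "") for variant in base]


def follows_loosely(check: Callable[[str], bool], response: str) -> bool:
    """Pass if any non-empty variant satisfies the strict check.

    Skipping empty variants is an assumption, not something this PR states.
    """
    return any(v.strip() and check(v) for v in loose_variants(response))


# Example: the instruction requires the response to end with "done."
def ends_with_done(text: str) -> bool:
    return text.endswith("done.")


response = "Sure, here you go:\nThe task is **done**.\nLet me know if you need more."
strict = ends_with_done(response)
loose = follows_loosely(ends_with_done, response)
```

Because the unmodified response is always among the candidates, a strict pass implies a loose pass, which is the invariant the new tests assert.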

No new dependencies. verifiable_instructions is already in requirements.txt.
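
The aggregation arithmetic behind compute_metrics() can be illustrated with a standalone sketch. It assumes each rollout dict exposes the verify-response fields directly, and that the strict fields are named follow_all_instructions and follow_instruction_list (inferred from the loose field names above). The function name `aggregate_ifeval_metrics` is hypothetical, and the actual override may read these values from a different structure.

```python
from typing import Dict, List


def aggregate_ifeval_metrics(tasks: List[List[Dict]]) -> Dict[str, float]:
    """Aggregate the four IFEval accuracies as percentages (0-100)."""
    # Flatten tasks -> rollouts; every rollout counts as one prompt attempt.
    rollouts = [rollout for task in tasks for rollout in task]

    def pct(hits: int, total: int) -> float:
        # Guard against empty input instead of dividing by zero.
        return 100.0 * hits / total if total else 0.0

    metrics: Dict[str, float] = {}
    for mode, suffix in (("strict", ""), ("loose", "_loose")):
        # Field names are assumptions for this sketch.
        prompt_flags = [bool(r[f"follow_all_instructions{suffix}"]) for r in rollouts]
        instruction_flags = [
            bool(flag) for r in rollouts for flag in r[f"follow_instruction_list{suffix}"]
        ]
        metrics[f"prompt_{mode}_accuracy"] = pct(sum(prompt_flags), len(prompt_flags))
        metrics[f"instruction_{mode}_accuracy"] = pct(
            sum(instruction_flags), len(instruction_flags)
        )
    return metrics
```

Prompt-level accuracy counts a rollout only if every instruction in it passed, while instruction-level accuracy pools all individual instruction results across rollouts.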

Closes #1160

Checklist

  • Targeted tests pass: uv run pytest resources_servers/instruction_following/tests/test_app.py -x -v
  • Touched-file lint passes: ruff check --config pyproject.toml resources_servers/instruction_following/app.py resources_servers/instruction_following/tests/test_app.py
  • Touched-file format check passes: ruff format --config pyproject.toml --check resources_servers/instruction_following/app.py resources_servers/instruction_following/tests/test_app.py
  • No changes to benchmark configs, data files, or nemo_gym core
  • DCO sign-off on all commits

Targeted tests were not run locally; upstream CI is the source of truth.

rodboev added 3 commits June 24, 2026 20:50
…trics

Signed-off-by: Rod Boev <rod.boev@gmail.com>
…Skills spec

Signed-off-by: Rod Boev <rod.boev@gmail.com>
…ym convention

Signed-off-by: Rod Boev <rod.boev@gmail.com>

copy-pr-bot Bot commented Jun 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

nemo-automation-bot Bot added the community-request label Jun 25, 2026
Signed-off-by: Rod Boev <rod.boev@gmail.com>
svcnvidia-nemo-ci added the waiting-on-maintainers label Jun 27, 2026

Labels

community-request: Issue reported or requested by someone from the community
waiting-on-maintainers: Waiting on maintainers to respond

Development

Successfully merging this pull request may close these issues.

Loose-mode evaluation is not implemented in instruction_following
