fix(audio): harden AudioBench and MMAU evaluation by pzelasko · Pull Request #1485 · NVIDIA-NeMo/Skills

pzelasko · 2026-06-15T15:22:59Z

Migration note

Supersedes #1475. Recreated from the NVIDIA-NeMo/Skills branch codex/pr1443-audiobench-mmau-fixes so repository CI can run.

Summary

Split out from #1443.

This PR contains AudioBench, MMAU-Pro, and NVEmbed robustness fixes that are independent of the AppTek benchmark.

What Changed

- Reworked AudioBench manifest generation so combined and per-dataset manifests resolve audio paths correctly.
- Added defensive handling for malformed nested AudioBench audio metadata during manifest rewriting.
- Enabled soft-fail generation for:
  - AudioBench non-judge
  - MMAU-Pro closed-form
- Made AudioBench preparation fail fast for unsupported dataset names and avoid swallowing dataset-level failures.
- Normalized MMAU-Pro audio paths for generated manifests.
- Added audio evaluator text coercion for loose string/dict/list/None fields.
- Handled null MMAU-Pro samples as failed samples instead of aborting evaluation.
- Handled empty NVEmbed generations as failed samples instead of raising.
- Hardened NVEmbed judge execution:
  - validates required input/output arguments
  - pins runtime numpy<2
  - installs task dependencies into an isolated target directory
  - passes shell-safe script arguments
  - supports SLURM account and judge-container overrides
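
As a rough illustration of the fail-fast item above, the sketch below rejects unknown dataset names before any work starts and lets per-dataset exceptions propagate. The names `SUPPORTED_DATASETS`, `validate_datasets`, `prepare_all`, and the placeholder dataset names are hypothetical and are not taken from `prepare.py`.

```python
# Hypothetical registry; the real AudioBench dataset list lives in prepare.py.
SUPPORTED_DATASETS = {"dataset_a", "dataset_b", "dataset_c"}


def validate_datasets(requested: list[str]) -> list[str]:
    """Raise immediately if any requested dataset name is unknown."""
    unknown = sorted(set(requested) - SUPPORTED_DATASETS)
    if unknown:
        raise ValueError(
            f"Unsupported dataset(s): {unknown}. Supported: {sorted(SUPPORTED_DATASETS)}"
        )
    return requested


def prepare_all(requested: list[str], prepare_one) -> None:
    """Prepare each dataset, letting any dataset-level failure propagate."""
    for name in validate_datasets(requested):
        # Deliberately no try/except: an exception aborts the run
        # instead of being logged and skipped.
        prepare_one(name)
```

Validating the whole request up front means a typo in the last dataset name is reported before the first dataset is downloaded or processed.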

Relationship To #1443

This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.

Validation

- python -m py_compile on changed Python files.
- Direct AudioBench manifest rewrite smoke test with malformed nested audio entries.
- CI lint/DCO/copyright checks are passing.

Summary by CodeRabbit

New Features
- Audio generation now supports soft-fail mode so invalid/over-limit requests are handled gracefully.
Bug Fixes
- Evaluations are more robust to null samples and nested/non-string generation or reference text.
- Manifest creation now writes inference-friendly relative audio paths and combines results by category.
- MMAU-Pro formatting now rewrites audio paths for correct downstream access.
Chores / Improvements
- NVEmbed judging task setup now validates required inputs more strictly and improves job command execution.
Tests
- Added NVEmbed coverage for nested-generation coercion and the empty/soft-fail path.

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai · 2026-06-15T15:36:17Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b56274d-0535-4fcf-8a28-894da47e20f7

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9c0e5 and 58d877c.

📒 Files selected for processing (2)

nemo_skills/evaluation/evaluator/nvembed_judge.py
tests/test_nvembed_judge.py

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/evaluation/evaluator/nvembed_judge.py

📝 Walkthrough

Adds ++server.enable_soft_fail=true to AudioBench and MMAU-Pro generation configs, enabling audio generation to fail gracefully within chunks. Switches manifest audio paths from absolute to relative form, introduces make_dataset_manifest_entry for path rewriting in dataset-specific manifests, and adds combined per-category manifest assembly in AudioBench. Hardens audio evaluators against null/non-string fields and refactors the NVEmbed judge pipeline task builder with new account and container parameters and shell-safe command construction.
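
The manifest path handling described in the walkthrough can be sketched roughly as follows. This is a simplified illustration, not the code in `prepare.py`: the flat `audio_path` field, the category sets, and the sample file names are assumptions, and the real entries may carry nested audio metadata. Only the function names and the `audio/{dataset_name}/{filename}` and `../` conventions come from the review summary.

```python
import copy

# Hypothetical category membership, for illustration only.
JUDGE_DATASETS = {"dataset_a"}
NONJUDGE_DATASETS = {"dataset_b"}


def create_manifest_entry(dataset_name: str, filename: str) -> dict:
    """Build an entry whose audio path is relative rather than absolute."""
    return {"dataset": dataset_name, "audio_path": f"audio/{dataset_name}/{filename}"}


def make_dataset_manifest_entry(entry: dict) -> dict:
    """Copy an entry and prefix '../' for a manifest stored one directory deeper."""
    rewritten = copy.deepcopy(entry)
    rewritten["audio_path"] = "../" + entry["audio_path"]
    return rewritten


combined_entries = {"judge": [], "nonjudge": []}
for dataset_name, filename in [("dataset_a", "0.wav"), ("dataset_b", "1.wav")]:
    entry = create_manifest_entry(dataset_name, filename)
    category = "judge" if dataset_name in JUDGE_DATASETS else "nonjudge"
    # The combined per-category manifest keeps the raw relative path.
    combined_entries[category].append(entry)
    # The dataset-specific manifest gets the '../' prefix.
    per_dataset_entry = make_dataset_manifest_entry(entry)
```

Copying the entry before rewriting keeps the combined manifest and the per-dataset manifest from sharing one mutable dict.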

Changes

Audio Evaluation Robustness and Manifest Path Fixes

| Layer | File(s) | Summary |
| --- | --- | --- |
| Soft-fail generation config constants | `nemo_skills/dataset/audiobench/nonjudge/__init__.py`, `nemo_skills/dataset/mmau-pro/closed_form/__init__.py` | `GENERATION_ARGS` extended with `++server.enable_soft_fail=true` in both AudioBench nonjudge and MMAU-Pro closed_form modules; `EVAL_ARGS` removed from MMAU-Pro closed_form. |
| Relative audio path rewriting in manifest preparation | `nemo_skills/dataset/audiobench/prepare.py`, `nemo_skills/dataset/mmau-pro/prepare.py` | `create_manifest_entry` switches to relative `audio/{dataset_name}/{filename}` paths; new `make_dataset_manifest_entry` adds `../` prefix for dataset-specific jsonl writes. MMAU-Pro `format_entry()` similarly prefixes `../` to `audio_path` values in formatted output. |
| Combined AudioBench manifest assembly | `nemo_skills/dataset/audiobench/prepare.py` | `main()` initializes `combined_entries` dict with judge/nonjudge accumulators, routes each dataset by category membership, and writes combined `{split}.jsonl` manifests per category with raw relative paths. |
| Evaluator robustness against null/non-string fields | `nemo_skills/evaluation/evaluator/audio.py`, `nemo_skills/evaluation/evaluator/mmau_pro.py`, `nemo_skills/evaluation/evaluator/nvembed_judge.py`, `tests/test_nvembed_judge.py` | `coerce_text_field` helper added to audio evaluator for safe normalization of dict/list/string generation and expected_answer fields. None-sample guard added to `evaluate_instruction_following_sample`. NVEmbed evaluator adds `_coerce_text()` helper, records failure metadata (`nvembed_error="empty_generation"`) for empty or non-string generations instead of raising, and pip invocation updated to use `sys.executable -m pip`. Unit tests verify nested generation payload coercion and soft-fail behavior. |
| NVEmbed judge pipeline task builder refactor | `nemo_skills/pipeline/judges/nvembed_judge.py` | `create_judge_tasks` gains `account` and `judge_container` parameters; output_dir validation now enforced. Run command rebuilt with staged `script_args` list, validates/enforces input_dir requirement, constructs per-job pip install into temp directory with `--target`, sets PYTHONPATH, applies `shlex.quote` for safe argument embedding, and uses `judge_container` with vllm fallback. |
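
The last row describes a quoted run command with an isolated `--target` install. A minimal sketch of that pattern, under stated assumptions, is shown below: `build_run_cmd`, the `--input-dir`/`--output-dir` flag names, and the module name `hypothetical_judge_script` are illustrative placeholders, not the names used in `nemo_skills/pipeline/judges/nvembed_judge.py`. `shlex.quote` and pip's `--target` option are standard.

```python
import shlex


def build_run_cmd(input_dir: str, output_dir: str, extra_packages: list[str]) -> str:
    """Assemble a shell command with every dynamic value quoted."""
    if not input_dir or not output_dir:
        raise ValueError("input_dir and output_dir are required")

    # Stage arguments as a list first, then quote each one exactly once.
    script_args = ["--input-dir", input_dir, "--output-dir", output_dir, "--skip-install"]
    quoted_args = " ".join(shlex.quote(arg) for arg in script_args)
    # Requirement specifiers such as "numpy<2" contain shell metacharacters.
    quoted_pkgs = " ".join(shlex.quote(pkg) for pkg in extra_packages)

    return (
        'DEPS_DIR="$(mktemp -d)" && '
        f'python -m pip install --target "$DEPS_DIR" {quoted_pkgs} && '
        f'PYTHONPATH="$DEPS_DIR:$PYTHONPATH" python -m hypothetical_judge_script {quoted_args}'
    )
```

Without quoting, a specifier like `numpy<2` would be parsed by the shell as an input redirection from a file named `2`, and a path containing spaces would split into several arguments.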

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

Jorjeous

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(audio): harden AudioBench and MMAU evaluation' accurately summarizes the main changes: hardening audio evaluation components with robustness fixes for AudioBench, MMAU-Pro, and NVEmbed handling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 82.35% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (1)

nemo_skills/pipeline/judges/nvembed_judge.py (1)

129-138: 💤 Low value

Numpy version mismatch with evaluator script.

The pipeline installs numpy==1.26.4 (line 133), but the evaluator script nemo_skills/evaluation/evaluator/nvembed_judge.py (line 51) specifies numpy<2. While --skip-install prevents the script's version from being used, the duplicated version specifications create a maintenance burden.

Consider extracting shared package versions to a single source of truth, or at minimum add a comment noting the evaluator script's version should be kept in sync.
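
One way the "single source of truth" option could look is sketched below. The constant name `NVEMBED_NUMPY_SPEC`, the helper `pip_install_cmd`, and the target directory are hypothetical; where such a constant would live in the repository is left open.

```python
import sys
from typing import Optional

# Hypothetical shared constant; both call sites would import it
# instead of hardcoding their own numpy specifier.
NVEMBED_NUMPY_SPEC = "numpy==1.26.4"


def pip_install_cmd(packages: list[str], target_dir: Optional[str] = None) -> list[str]:
    """Return an argv list for pip, optionally installing into an isolated directory."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if target_dir is not None:
        cmd += ["--target", target_dir]
    return cmd + packages


# Evaluator-side install and pipeline-side install now share one pin.
evaluator_cmd = pip_install_cmd([NVEMBED_NUMPY_SPEC])
pipeline_cmd = pip_install_cmd([NVEMBED_NUMPY_SPEC], target_dir="/tmp/nvembed-deps")
```

The lighter alternative named in the comment, a sync note next to each pin, avoids a new import dependency between the pipeline builder and the evaluator script.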
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 129 - 138, Extract
the numpy version specification (currently hardcoded as numpy==1.26.4 in the
run_cmd string) to a shared source of truth such as a requirements file or
constants module that can be imported by both the pipeline builder and the
evaluator script, or alternatively add clear comments in both the run_cmd
construction and the evaluator script's requirements (line 51) indicating that
the numpy version specifications must be kept in sync to avoid maintenance
issues and ensure consistency.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 19b0cc8c-7f39-4bef-93b4-344189218f9d

📥 Commits

Reviewing files that changed from the base of the PR and between da85a88 and 8264676.

📒 Files selected for processing (8)

nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/prepare.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/evaluator/mmau_pro.py
nemo_skills/evaluation/evaluator/nvembed_judge.py
nemo_skills/pipeline/judges/nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/evaluation/evaluator/nvembed_judge.py (1)

136-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Non-string generation values are being incorrectly downgraded to empty_generation.

On Line 137, any non-string generation becomes "", so dict/list payloads with valid text get marked as failed on Line 143. This creates deterministic false negatives and diverges from the coercion pattern used in nemo_skills/evaluation/evaluator/audio.py.

Suggested fix

+def _coerce_text(value: Any) -> str:
+    if value is None:
+        return ""
+    if isinstance(value, str):
+        return value.strip()
+    if isinstance(value, dict):
+        for key in ("text", "expected_answer", "answer", "transcript", "reference", "generation"):
+            if key in value and value[key] is not None:
+                return _coerce_text(value[key])
+        return " ".join(_coerce_text(v) for v in value.values() if v is not None).strip()
+    if isinstance(value, (list, tuple)):
+        return " ".join(_coerce_text(v) for v in value if v is not None).strip()
+    return str(value).strip()
+
 def evaluate_sample_with_nvembed(sample: dict[str, Any], model_name: str = "nvidia/NV-Embed-v2") -> dict[str, Any]:
@@
-    generation_value = sample.get("generation", "")
-    generation = generation_value.strip() if isinstance(generation_value, str) else ""
+    generation = _coerce_text(sample.get("generation", ""))

Also applies to: 141-152

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 136 - 138,
The `generation` value handling is incorrectly converting non-string payloads
(such as dicts or lists) to empty strings, which causes valid text content to be
marked as failed. Instead of simply converting non-string values to empty
string, implement proper coercion logic that extracts the actual text content
from non-string `generation` values, similar to the pattern used in the audio.py
evaluator. This ensures that dict/list payloads with valid text are preserved
rather than being discarded as empty values.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4fe1dbce-60f0-4709-a5f4-1eb76687ac54

📥 Commits

Reviewing files that changed from the base of the PR and between 8264676 and ea07ba8.

📒 Files selected for processing (2)

nemo_skills/evaluation/evaluator/nvembed_judge.py
nemo_skills/pipeline/judges/nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai

🧹 Nitpick comments (2)

nemo_skills/evaluation/evaluator/nvembed_judge.py (1)

153-155: ⚡ Quick win

Use direct key access for required sample fields (choices, expected_answer).

choices and expected_answer are treated as required immediately below, so use sample["choices"] / sample["expected_answer"] instead of .get(...) to fail loudly on malformed records rather than normalizing to empty defaults.
Proposed patch
-    choices = sample.get("choices", [])
-    expected_answer = sample.get("expected_answer", "")
+    choices = sample["choices"]
+    expected_answer = sample["expected_answer"]
As per coding guidelines, "Don't use .get for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data."
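
The difference between the two access styles can be seen in a small standalone example. The sample record below is invented for illustration and deliberately omits `expected_answer`.

```python
# A malformed record: the required "expected_answer" key is missing.
sample = {"generation": "B", "choices": ["A", "B"]}

# .get() hides the problem: the comparison simply evaluates to False,
# so the sample is scored as wrong instead of being flagged as malformed.
silently_wrong = sample.get("expected_answer", "") == sample["generation"]

# Direct access surfaces the problem at the point of use.
try:
    expected = sample["expected_answer"]
except KeyError as err:
    missing_key = err.args[0]
    print(f"malformed sample, missing key: {missing_key}")
```

With `.get`, the malformed record would lower the reported accuracy without any error being raised.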
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 153 - 155,
The code is using .get() with default values for the dictionary keys "choices"
and "expected_answer" in the sample dictionary, which masks missing or malformed
data by silently returning empty defaults. Since these fields are required and
used immediately after, replace the .get() calls with direct dictionary access
using sample["choices"] and sample["expected_answer"] respectively. This will
cause a clear KeyError to be raised if these required keys are missing, making
data quality issues explicit rather than hiding them with empty defaults.
Source: Coding guidelines

tests/test_nvembed_judge.py (1)
18-40: ⚡ Quick win

Add a test for the new empty_generation branch.

This test covers nested coercion success well, but the newly introduced soft-fail path (empty coerced generation) should also be asserted to prevent regressions (nvembed_confidence=0.0, is_correct=False, nvembed_error="empty_generation").
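
A test along these lines might look like the sketch below. To keep it self-contained, it exercises `evaluate_sample_stub`, a hypothetical stand-in that reproduces only the empty-generation branch; the real test would call `evaluate_sample_with_nvembed`, and the shape of the returned dict beyond the three asserted fields is an assumption.

```python
from typing import Any


def evaluate_sample_stub(sample: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in covering only the empty-generation branch."""
    generation = sample.get("generation")
    text = generation.strip() if isinstance(generation, str) else ""
    if not text:
        # Soft-fail: record the failure instead of raising.
        return {
            **sample,
            "nvembed_confidence": 0.0,
            "is_correct": False,
            "nvembed_error": "empty_generation",
        }
    raise NotImplementedError("embedding-based scoring is out of scope for this sketch")


def test_empty_generation_soft_fails():
    sample = {"generation": "   ", "choices": ["A", "B"], "expected_answer": "A"}
    result = evaluate_sample_stub(sample)
    assert result["nvembed_confidence"] == 0.0
    assert result["is_correct"] is False
    assert result["nvembed_error"] == "empty_generation"
```

A whitespace-only generation is used so the test also covers stripping, not just a missing field.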
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_nvembed_judge.py` around lines 18 - 40, Add a new test function
that verifies the soft-fail behavior when the generation coerces to an empty
string. Create a test similar to test_nvembed_coerces_nested_generation_text but
with a sample where the generation field (whether nested or flat) results in
empty text after coercion. Call evaluate_sample_with_nvembed with this sample
and assert that the result contains nvembed_confidence equal to 0.0, is_correct
equal to False, and nvembed_error equal to the string "empty_generation" to
ensure the empty_generation branch is properly tested and prevent regressions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b567320-845b-43cf-bd91-bc6a69ac7676

📥 Commits

Reviewing files that changed from the base of the PR and between ea07ba8 and 5d9c0e5.

📒 Files selected for processing (2)

nemo_skills/evaluation/evaluator/nvembed_judge.py
tests/test_nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

fix(audio): harden AudioBench and MMAU evaluation

8264676

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

This was referenced Jun 15, 2026

feat(speech): add AppTek Call-Center Dialogues ASR benchmark #1486

Open

fix(audio): harden AudioBench and MMAU evaluation #1475

Closed

coderabbitai Bot reviewed Jun 15, 2026

chore: document nvembed numpy pin

ea07ba8

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai Bot reviewed Jun 15, 2026

fix: coerce nvembed generation text

5d9c0e5

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai Bot reviewed Jun 15, 2026

test: cover empty nvembed generation

58d877c

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko wants to merge 4 commits into main from codex/pr1443-audiobench-mmau-fixes
