fix(audio): harden AudioBench and MMAU evaluation #1485

Open
pzelasko wants to merge 4 commits into main from codex/pr1443-audiobench-mmau-fixes

pzelasko commented Jun 15, 2026

Collaborator

Migration note

Supersedes #1475. Recreated from the NVIDIA-NeMo/Skills branch codex/pr1443-audiobench-mmau-fixes so repository CI can run.

Summary

Split out from #1443.

This PR contains AudioBench, MMAU-Pro, and NVEmbed robustness fixes that are independent of the AppTek benchmark.

What Changed

  • Reworked AudioBench manifest generation so combined and per-dataset manifests resolve audio paths correctly.
  • Added defensive handling for malformed nested AudioBench audio metadata during manifest rewriting.
  • Enabled soft-fail generation for:
    • AudioBench non-judge
    • MMAU-Pro closed-form
  • Made AudioBench preparation fail fast for unsupported dataset names and avoid swallowing dataset-level failures.
  • Normalized MMAU-Pro audio paths for generated manifests.
  • Added audio evaluator text coercion for loose string/dict/list/None fields.
  • Handled null MMAU-Pro samples as failed samples instead of aborting evaluation.
  • Handled empty NVEmbed generations as failed samples instead of raising.
  • Hardened NVEmbed judge execution:
    • validates required input/output arguments
    • pins runtime numpy<2
    • installs task dependencies into an isolated target directory
    • passes shell-safe script arguments
    • supports SLURM account and judge-container overrides
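
To illustrate the NVEmbed judge hardening items above, here is a minimal sketch of shell-safe command construction with an isolated dependency directory. It is not the code from this PR: `build_judge_command`, the `--input-dir`/`--output-dir` flags, and the `DEPS_DIR` shell variable are hypothetical names chosen for the example. Only `shlex.quote`, `pip install --target`, and `PYTHONPATH` are standard.

```python
import shlex


def build_judge_command(script_path, input_dir, output_dir, extra_args=()):
    """Return a shell command string in which every script argument is quoted."""
    # Validate required arguments up front instead of failing inside the job.
    if not input_dir:
        raise ValueError("input_dir is required")
    if not output_dir:
        raise ValueError("output_dir is required")

    # Hypothetical flag names; the real script's CLI may differ.
    script_args = ["--input-dir", input_dir, "--output-dir", output_dir, *extra_args]
    quoted_args = " ".join(shlex.quote(str(arg)) for arg in script_args)

    # Install dependencies into a throwaway directory and expose it via PYTHONPATH,
    # so the container's own site-packages are left untouched.
    return (
        'DEPS_DIR="$(mktemp -d)" && '
        'python -m pip install --target "$DEPS_DIR" "numpy<2" && '
        'PYTHONPATH="$DEPS_DIR${PYTHONPATH:+:$PYTHONPATH}" '
        + f"python {shlex.quote(script_path)} {quoted_args}"
    )
```

Quoting each argument separately means a path containing spaces or shell metacharacters reaches the script as a single argument rather than being split or interpreted by the shell.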

Relationship To #1443

This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.

Validation

  • python -m py_compile on changed Python files.
  • Direct AudioBench manifest rewrite smoke test with malformed nested audio entries.
  • CI lint/DCO/copyright checks are passing.

Summary by CodeRabbit

  • New Features
    • Audio generation now supports soft-fail mode so invalid/over-limit requests are handled gracefully.
  • Bug Fixes
    • Evaluations are more robust to null samples and nested/non-string generation or reference text.
    • Manifest creation now writes inference-friendly relative audio paths and combines results by category.
    • MMAU-Pro formatting now rewrites audio paths for correct downstream access.
  • Chores / Improvements
    • NVEmbed judging task setup now validates required inputs more strictly and improves job command execution.
  • Tests
    • Added NVEmbed coverage for nested-generation coercion and the empty/soft-fail path.

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai Bot commented Jun 15, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2b56274d-0535-4fcf-8a28-894da47e20f7

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9c0e5 and 58d877c.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/evaluator/nvembed_judge.py
  • tests/test_nvembed_judge.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/evaluation/evaluator/nvembed_judge.py

📝 Walkthrough

Adds ++server.enable_soft_fail=true to AudioBench and MMAU-Pro generation configs, enabling audio generation to fail gracefully within chunks. Switches manifest audio paths from absolute to relative form, introduces make_dataset_manifest_entry for path rewriting in dataset-specific manifests, and adds combined per-category manifest assembly in AudioBench. Hardens audio evaluators against null/non-string fields and refactors the NVEmbed judge pipeline task builder with new account and container parameters and shell-safe command construction.

Changes

Audio Evaluation Robustness and Manifest Path Fixes

Layer: soft-fail generation config constants
File(s): nemo_skills/dataset/audiobench/nonjudge/__init__.py, nemo_skills/dataset/mmau-pro/closed_form/__init__.py
Summary: GENERATION_ARGS extended with ++server.enable_soft_fail=true in both AudioBench nonjudge and MMAU-Pro closed_form modules; EVAL_ARGS removed from MMAU-Pro closed_form.

Layer: Relative audio path rewriting in manifest preparation
File(s): nemo_skills/dataset/audiobench/prepare.py, nemo_skills/dataset/mmau-pro/prepare.py
Summary: create_manifest_entry switches to relative audio/{dataset_name}/{filename} paths; new make_dataset_manifest_entry adds ../ prefix for dataset-specific jsonl writes. MMAU-Pro format_entry() similarly prefixes ../ to audio_path values in formatted output.

Layer: Combined AudioBench manifest assembly
File(s): nemo_skills/dataset/audiobench/prepare.py
Summary: main() initializes combined_entries dict with judge/nonjudge accumulators, routes each dataset by category membership, and writes combined {split}.jsonl manifests per category with raw relative paths.

Layer: Evaluator robustness against null/non-string fields
File(s): nemo_skills/evaluation/evaluator/audio.py, nemo_skills/evaluation/evaluator/mmau_pro.py, nemo_skills/evaluation/evaluator/nvembed_judge.py, tests/test_nvembed_judge.py
Summary: coerce_text_field helper added to audio evaluator for safe normalization of dict/list/string generation and expected_answer fields. None-sample guard added to evaluate_instruction_following_sample. NVEmbed evaluator adds _coerce_text() helper, records failure metadata (nvembed_error="empty_generation") for empty or non-string generations instead of raising, and pip invocation updated to use sys.executable -m pip. Unit tests verify nested generation payload coercion and soft-fail behavior.

Layer: NVEmbed judge pipeline task builder refactor
File(s): nemo_skills/pipeline/judges/nvembed_judge.py
Summary: create_judge_tasks gains account and judge_container parameters; output_dir validation now enforced. Run command rebuilt with staged script_args list, validates/enforces input_dir requirement, constructs per-job pip install into temp directory with --target, sets PYTHONPATH, applies shlex.quote for safe argument embedding, and uses judge_container with vllm fallback.
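
The relative-path rewriting described in the summary above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the function names are borrowed from the summary, but the signatures and the entry schema (an `audio` dict with a `path` key) are hypothetical.

```python
import copy
from pathlib import PurePosixPath


def create_manifest_entry(dataset_name, filename, question):
    """Build an entry whose audio path is relative: audio/{dataset_name}/{filename}."""
    audio_path = str(PurePosixPath("audio") / dataset_name / filename)
    # Assumed schema for this sketch only.
    return {"question": question, "audio": {"path": audio_path}}


def make_dataset_manifest_entry(entry):
    """Return a copy of the entry with '../' prefixed for a dataset-specific jsonl.

    Malformed audio metadata (missing, None, or a non-string path) is left
    untouched rather than raising.
    """
    rewritten = copy.deepcopy(entry)
    audio = rewritten.get("audio")
    if isinstance(audio, dict) and isinstance(audio.get("path"), str):
        audio["path"] = "../" + audio["path"]
    return rewritten
```

Working on a deep copy keeps the original entry usable for a combined manifest that needs the unprefixed path, while the per-dataset manifest gets the `../` form.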

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Jorjeous
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'fix(audio): harden AudioBench and MMAU evaluation' accurately summarizes the main changes: hardening audio evaluation components with robustness fixes for AudioBench, MMAU-Pro, and NVEmbed handling.
Docstring Coverage | ✅ Passed | Docstring coverage is 82.35% which is sufficient. The required threshold is 80.00%.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)

129-138: 💤 Low value

Numpy version mismatch with evaluator script.

The pipeline installs numpy==1.26.4 (line 133), but the evaluator script nemo_skills/evaluation/evaluator/nvembed_judge.py (line 51) specifies numpy<2. While --skip-install prevents the script's own requirement from being installed, the version is now specified in two places, which creates a maintenance burden.

Consider extracting shared package versions to a single source of truth, or at minimum add a comment noting the evaluator script's version should be kept in sync.
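
One way to read the "single source of truth" suggestion is a shared constant that both the pipeline builder and the evaluator script import. The sketch below is hypothetical: `NVEMBED_JUDGE_REQUIREMENTS` and `pip_install_argv` are made-up names, and the `numpy<2` value is only a placeholder, since which pin to keep is the author's decision.

```python
import sys

# Hypothetical shared constant; in practice it would live in one module
# imported by both the pipeline builder and the evaluator script.
NVEMBED_JUDGE_REQUIREMENTS = ("numpy<2",)


def pip_install_argv(requirements=NVEMBED_JUDGE_REQUIREMENTS, target_dir=None):
    """Build the argv for a pip install, optionally into an isolated directory."""
    argv = [sys.executable, "-m", "pip", "install"]
    if target_dir is not None:
        argv += ["--target", target_dir]
    argv += list(requirements)
    return argv
```

With this shape, changing the pin is a one-line edit and both call sites pick it up.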

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 129 - 138, Extract
the numpy version specification (currently hardcoded as numpy==1.26.4 in the
run_cmd string) to a shared source of truth such as a requirements file or
constants module that can be imported by both the pipeline builder and the
evaluator script, or alternatively add clear comments in both the run_cmd
construction and the evaluator script's requirements (line 51) indicating that
the numpy version specifications must be kept in sync to avoid maintenance
issues and ensure consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 19b0cc8c-7f39-4bef-93b4-344189218f9d

📥 Commits

Reviewing files that changed from the base of the PR and between da85a88 and 8264676.

📒 Files selected for processing (8)
  • nemo_skills/dataset/audiobench/nonjudge/__init__.py
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/evaluator/mmau_pro.py
  • nemo_skills/evaluation/evaluator/nvembed_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/nvembed_judge.py (1)

136-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Non-string generation values are being incorrectly downgraded to empty_generation.

On Line 137, any non-string generation becomes "", so dict/list payloads with valid text get marked as failed on Line 143. This creates deterministic false negatives and diverges from the coercion pattern used in nemo_skills/evaluation/evaluator/audio.py.

Suggested fix
+def _coerce_text(value: Any) -> str:
+    if value is None:
+        return ""
+    if isinstance(value, str):
+        return value.strip()
+    if isinstance(value, dict):
+        for key in ("text", "expected_answer", "answer", "transcript", "reference", "generation"):
+            if key in value and value[key] is not None:
+                return _coerce_text(value[key])
+        return " ".join(_coerce_text(v) for v in value.values() if v is not None).strip()
+    if isinstance(value, (list, tuple)):
+        return " ".join(_coerce_text(v) for v in value if v is not None).strip()
+    return str(value).strip()
+
 def evaluate_sample_with_nvembed(sample: dict[str, Any], model_name: str = "nvidia/NV-Embed-v2") -> dict[str, Any]:
@@
-    generation_value = sample.get("generation", "")
-    generation = generation_value.strip() if isinstance(generation_value, str) else ""
+    generation = _coerce_text(sample.get("generation", ""))

Also applies to: 141-152

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 136 - 138,
The `generation` value handling is incorrectly converting non-string payloads
(such as dicts or lists) to empty strings, which causes valid text content to be
marked as failed. Instead of simply converting non-string values to empty
string, implement proper coercion logic that extracts the actual text content
from non-string `generation` values, similar to the pattern used in the audio.py
evaluator. This ensures that dict/list payloads with valid text are preserved
rather than being discarded as empty values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4fe1dbce-60f0-4709-a5f4-1eb76687ac54

📥 Commits

Reviewing files that changed from the base of the PR and between 8264676 and ea07ba8.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/evaluator/nvembed_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/nvembed_judge.py (1)

153-155: ⚡ Quick win

Use direct key access for required sample fields (choices, expected_answer).

choices and expected_answer are treated as required immediately below, so use sample["choices"] / sample["expected_answer"] instead of .get(...) to fail loudly on malformed records rather than normalizing to empty defaults.

Proposed patch
-    choices = sample.get("choices", [])
-    expected_answer = sample.get("expected_answer", "")
+    choices = sample["choices"]
+    expected_answer = sample["expected_answer"]

As per coding guidelines, "Don't use .get for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data."
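
A small sketch of the difference the guideline is pointing at. `read_required_fields` is a hypothetical helper written for this example, not a function in the repository.

```python
def read_required_fields(sample: dict) -> tuple[list, str]:
    """Read required fields with direct access so malformed records fail loudly."""
    return sample["choices"], sample["expected_answer"]


good = {"choices": ["cat", "dog"], "expected_answer": "cat"}
bad = {"choices": ["cat", "dog"]}  # expected_answer is missing

print(read_required_fields(good))

# With .get the malformed record would silently produce an empty answer:
print(repr(bad.get("expected_answer", "")))

# With direct access the same record raises a KeyError naming the missing key:
try:
    read_required_fields(bad)
except KeyError as err:
    print(f"malformed sample, missing key: {err}")
```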

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 153 - 155,
The code is using .get() with default values for the dictionary keys "choices"
and "expected_answer" in the sample dictionary, which masks missing or malformed
data by silently returning empty defaults. Since these fields are required and
used immediately after, replace the .get() calls with direct dictionary access
using sample["choices"] and sample["expected_answer"] respectively. This will
cause a clear KeyError to be raised if these required keys are missing, making
data quality issues explicit rather than hiding them with empty defaults.

Source: Coding guidelines

tests/test_nvembed_judge.py (1)

18-40: ⚡ Quick win

Add a test for the new empty_generation branch.

This test covers nested coercion success well, but the newly introduced soft-fail path (empty coerced generation) should also be asserted to prevent regressions (nvembed_confidence=0.0, is_correct=False, nvembed_error="empty_generation").
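
The shape of such a test could look like the sketch below. Because the real `evaluate_sample_with_nvembed` cannot be imported here, the example uses `evaluate_sample_stub`, a hypothetical stand-in that models only the soft-fail branch described in this comment. An actual test would call the real evaluator instead.

```python
def evaluate_sample_stub(sample):
    """Hypothetical stand-in for the evaluator; models only the soft-fail branch.

    The real evaluator coerces nested payloads first; this stub handles plain
    strings and None only.
    """
    generation = sample.get("generation")
    text = generation.strip() if isinstance(generation, str) else ""
    result = dict(sample)
    if not text:
        # Record a failed sample instead of raising.
        result.update(
            nvembed_confidence=0.0,
            is_correct=False,
            nvembed_error="empty_generation",
        )
    return result


def test_empty_generation_soft_fails():
    sample = {"generation": "   ", "choices": ["a", "b"], "expected_answer": "a"}
    result = evaluate_sample_stub(sample)
    assert result["nvembed_confidence"] == 0.0
    assert result["is_correct"] is False
    assert result["nvembed_error"] == "empty_generation"
```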

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_nvembed_judge.py` around lines 18 - 40, Add a new test function
that verifies the soft-fail behavior when the generation coerces to an empty
string. Create a test similar to test_nvembed_coerces_nested_generation_text but
with a sample where the generation field (whether nested or flat) results in
empty text after coercion. Call evaluate_sample_with_nvembed with this sample
and assert that the result contains nvembed_confidence equal to 0.0, is_correct
equal to False, and nvembed_error equal to the string "empty_generation" to
ensure the empty_generation branch is properly tested and prevent regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b567320-845b-43cf-bd91-bc6a69ac7676

📥 Commits

Reviewing files that changed from the base of the PR and between ea07ba8 and 5d9c0e5.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/evaluator/nvembed_judge.py
  • tests/test_nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>