fix(audio): harden AudioBench and MMAU evaluation by pzelasko · Pull Request #1475 · NVIDIA-NeMo/Skills

pzelasko · 2026-06-04T17:14:50Z

Summary

Split out from #1443.

This PR contains AudioBench, MMAU-Pro, and NVEmbed robustness fixes that are independent of the AppTek benchmark.

What Changed

Reworked AudioBench manifest generation so combined and per-dataset manifests resolve audio paths correctly.
Added defensive handling for malformed nested AudioBench audio metadata during manifest rewriting.
Enabled soft-fail generation for:
- AudioBench non-judge
- MMAU-Pro closed-form
Made AudioBench preparation fail fast for unsupported dataset names and avoid swallowing dataset-level failures.
Normalized MMAU-Pro audio paths for generated manifests.
Added audio evaluator text coercion for loose string/dict/list/None fields.
Handled null MMAU-Pro samples as failed samples instead of aborting evaluation.
Handled empty NVEmbed generations as failed samples instead of raising.
Hardened NVEmbed judge execution:
- validates required input/output arguments
- pins runtime numpy<2
- installs task dependencies into an isolated target directory
- passes shell-safe script arguments
- supports SLURM account and judge-container overrides

Relationship To #1443

This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.

Validation

python -m py_compile on changed Python files.
Direct AudioBench manifest rewrite smoke test with malformed nested audio entries.
CI lint/DCO/copyright checks are passing.

coderabbitai · 2026-06-04T17:21:03Z

Walkthrough

This PR hardens audio dataset preparation and evaluation against missing, malformed, or structured input fields. It restructures AudioBench manifests to use relative paths, enables soft-fail mode across AudioBench and MMAU‑Pro configurations, and refactors the NVEmbed judge pipeline to manage dependencies separately for each SLURM job.

Changes

Audio Robustness and Manifest Path Restructuring

Layer / File(s)	Summary
Soft-fail configuration across datasets `nemo_skills/dataset/audiobench/nonjudge/__init__.py`, `nemo_skills/dataset/mmau-pro/closed_form/__init__.py`	`GENERATION_ARGS` extended with `server.enable_soft_fail=true` in both AudioBench non-judge and MMAU‑Pro closed-form. Removed `EVAL_ARGS` from MMAU‑Pro closed_form.
AudioBench manifest path restructuring `nemo_skills/dataset/audiobench/prepare.py`	Refactored manifest generation: extract dict `instruction["text"]`, emit dataset-scoped `audio/{dataset_name}/{audio_filename}`, add `make_dataset_manifest_entry` to deep-copy and rewrite `audio/...` -> `../audio/...`, serialize rewritten entries when writing per-dataset files, and accumulate category manifests written to `output_dir/{category}/{split}.jsonl`.
MMAU‑Pro audio path normalization `nemo_skills/dataset/mmau-pro/prepare.py`	`format_entry` now prefixes audio paths with `../` in `formatted_entry["audio_path"]` and builds `user_message["audio"]` or `user_message["audios"]` from the normalized paths.
Audio evaluator coercion `nemo_skills/evaluation/evaluator/audio.py`	Added `coerce_text_field` to normalize None/string/dict/list generation and reference fields to plain text; `evaluate_sample` now uses it for both generation and expected answer.
MMAU‑Pro null-sample guard `nemo_skills/evaluation/evaluator/mmau_pro.py`	`evaluate_instruction_following_sample` returns `{"is_correct": False, "error": "null_sample"}` and logs a warning when given `None` instead of proceeding.
NVEmbed evaluator hardening `nemo_skills/evaluation/evaluator/nvembed_judge.py`	Pinned `numpy<2` in runtime install; empty/blank `generation` is handled by returning an incorrect sample with `nvembed_error: "empty_generation"` and zero confidence instead of raising.
NVEmbed judge pipeline refactor `nemo_skills/pipeline/judges/nvembed_judge.py`	`create_judge_tasks` signature adds `account` and `judge_container`; command construction now builds `script_args`, installs pinned deps into a job-scoped temp dir, sets `PYTHONPATH` and offline env vars, runs `nvembed_judge.py` with shell-quoted args, and adjusts task flags and container selection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

AlexGrinch

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main objective: hardening AudioBench and MMAU evaluation with multiple robustness fixes across files and modules.
Docstring Coverage	✅ Passed	Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/evaluation/evaluator/nvembed_judge.py (1)

123-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard generation before .strip() so null values don’t abort evaluation.

If a record has "generation": null, the .strip() call runs before your new empty-generation fallback and raises AttributeError, aborting the file instead of soft-failing the sample.

Suggested fix

-    generation = sample.get("generation", "").strip()
+    generation_value = sample.get("generation", "")
+    generation = generation_value.strip() if isinstance(generation_value, str) else ""

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 123 - 138,
The code calls .strip() on generation without guarding for null, so when
sample.get("generation") returns None it raises AttributeError; change the
assignment to guard and normalize generation first (e.g., read raw_generation =
sample.get("generation"); set generation = (raw_generation or "").strip() or
explicitly if raw_generation is None then generation = ""), then proceed with
the existing empty-generation branch that updates sample keys
("nvembed_matched_choice", "nvembed_confidence", "is_correct", "nvembed_error");
update the logic around the generation variable in nvembed_judge.py so any None
is converted to an empty string before calling .strip().

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 159-160: Replace the silent fallback that uses
instruction.get("text") with a fail-fast access so malformed dicts don't get
masked: inside the isinstance(instruction, dict) branch, access the required key
directly (instruction = instruction["text"]) so a KeyError surfaces for missing
keys, and optionally wrap that access to raise a clear ValueError (including the
offending dict or index) if you prefer a more descriptive error; ensure the
resulting instruction is a str before continuing.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Line 45: Run the code formatter ruff on the file that contains the pip install
list literal starting with [sys.executable, "-m", "pip", "install", "-q",
"numpy<2", "datasets", ...] to apply ruff-format fixes, then stage and commit
the resulting changes so the pre-commit/CI lint job passes. Ensure you only run
the formatter (ruff format) and include the updated file in your commit.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 117-126: The file fails the ruff formatter; run the ruff formatter
on nemo_skills/pipeline/judges/nvembed_judge.py (or run pre-commit) and commit
the reformatted output so the assignment to run_cmd and surrounding code are
properly formatted; specifically reformat the module where run_cmd is defined
(search for the run_cmd assignment in nvembed_judge.py) and commit the changes
to unblock the CI lint job.
- Around line 103-110: The code currently uses
judge_pipeline_args.get("input_dir") and str(...) which can pass "None" as a
path; change to require and validate the key by using direct access
judge_pipeline_args["input_dir"] (or explicitly check "input_dir" in
judge_pipeline_args) when input_file is None, and if missing or None raise a
clear exception (e.g. ValueError) before building script_args; update the branch
that calls script_args.extend(["--input-dir", str(input_dir)]) to assume a valid
path only after this validation so a missing input_dir fails fast in
nvembed_judge.py.

---

Outside diff comments:
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 123-138: The code calls .strip() on generation without guarding
for null, so when sample.get("generation") returns None it raises
AttributeError; change the assignment to guard and normalize generation first
(e.g., read raw_generation = sample.get("generation"); set generation =
(raw_generation or "").strip() or explicitly if raw_generation is None then
generation = ""), then proceed with the existing empty-generation branch that
updates sample keys ("nvembed_matched_choice", "nvembed_confidence",
"is_correct", "nvembed_error"); update the logic around the generation variable
in nvembed_judge.py so any None is converted to an empty string before calling
.strip().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e2dd68c-54b6-4a40-9560-a1f4e34ba34f

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and 3939682.

📒 Files selected for processing (8)

nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/prepare.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/evaluator/mmau_pro.py
nemo_skills/evaluation/evaluator/nvembed_judge.py
nemo_skills/pipeline/judges/nvembed_judge.py

pzelasko · 2026-06-04T17:31:31Z

@coderabbitai review\n\nPushed fixes for the AudioBench/NVEmbed comments: fail-fast instruction handling, null-safe generation handling, input_dir validation, and ruff-format changes. Please re-review the current head.

coderabbitai · 2026-06-04T17:31:43Z

@pzelasko Sure! I'll re-review the current head with the pushed fixes in mind.

🧠 Learnings used

Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_skills/pipeline/judges/nvembed_judge.py (1)
80-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use direct access for output_dir to fail fast if missing.

Similar to the input_dir fix below, output_dir is required (the evaluator's --output-dir is required=True). Using .get() without validation could pass "None" as a path string if the key is missing.
Suggested fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
+    try:
+        output_dir_path = judge_pipeline_args["output_dir"]
+    except KeyError as exc:
+        raise ValueError("NVEmbed judge requires judge_pipeline_args['output_dir']") from exc
As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` at line 80, The code currently
uses judge_pipeline_args.get("output_dir") which can silently return None;
change this to direct access judge_pipeline_args["output_dir"] so the program
fails fast if the required key is missing (update the assignment to
output_dir_path = judge_pipeline_args["output_dir"] and remove any downstream
logic that treats None as a valid path).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 216-235: The helper make_dataset_manifest_entry currently accesses
nested audio["path"] without confirming the keys/types, which can raise KeyError
on malformed messages; update the checks in the loop so you first verify message
contains an "audio" mapping and that "path" exists (e.g., if "audio" in message
and isinstance(message["audio"], dict) and "path" in message["audio"]) before
rewriting, and likewise iterate message.get("audios", []) but only call rewrite
on items that are dicts and contain "path" (e.g., for audio in
message.get("audios", []): if isinstance(audio, dict) and "path" in audio:
audio["path"] = rewrite(audio["path"])). This preserves behavior for valid
entries and safely skips malformed nested audio entries in
make_dataset_manifest_entry.

---

Outside diff comments:
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Line 80: The code currently uses judge_pipeline_args.get("output_dir") which
can silently return None; change this to direct access
judge_pipeline_args["output_dir"] so the program fails fast if the required key
is missing (update the assignment to output_dir_path =
judge_pipeline_args["output_dir"] and remove any downstream logic that treats
None as a valid path).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c440e519-0b43-4e04-aff0-90108b6082e5

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and d9bdff0.

📒 Files selected for processing (8)

nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/prepare.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/evaluator/mmau_pro.py
nemo_skills/evaluation/evaluator/nvembed_judge.py
nemo_skills/pipeline/judges/nvembed_judge.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko · 2026-06-04T17:46:32Z

@coderabbitai review

Pushed fixes for the latest AudioBench/NVEmbed review: nested audio path rewrites now guard key/type access, and NVEmbed output_dir fails fast when missing or null. Please re-review the current head.

coderabbitai · 2026-06-04T17:46:45Z

@pzelasko Sure! I'll re-review the current head with the nested audio path guarding and NVEmbed output_dir fail-fast fixes in mind.

🧠 Learnings used

Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 613-618: The unsupported-dataset check needs to be moved into
process_dataset(): inside the process_dataset(dataset_name, ...) function
validate dataset_name against JUDGE_DATASETS and NONJUDGE_DATASETS and raise
ValueError for unsupported names (use dataset_name in the error message, not the
undefined name variable), instead of letting callers fall through to
category="unknown"; update any code paths that previously relied on the CLI-only
check so process_dataset always fails fast for invalid dataset names.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 134-150: The code is silently treating missing or non-string
generation values as empty outputs; in the nvembed_judge logic, replace
sample.get("generation", "") with direct access sample["generation"] (so missing
keys raise) and explicitly validate the type: if not
isinstance(sample["generation"], str) raise or mark as malformed (e.g., set
"nvembed_error":"malformed_generation") and only set
nvembed_error:"empty_generation" when the string .strip() is empty; keep the
rest of the return behavior (nvembed_matched_choice, nvembed_confidence,
is_correct) unchanged so a truly blank string is handled as a valid empty
prediction but bad/missing data fails fast.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 80-85: The current validation only rejects None for
judge_pipeline_args["output_dir"] (and similarly for input_file/input_dir around
lines 111-123); update these checks to also reject empty or all-whitespace
strings so "" isn't treated as Path("."); specifically, after retrieving
output_dir_path (and input_file/input_dir), treat values as invalid if they are
None or if str(value).strip() == "" (or simply falsy after stripping) and raise
the same ValueError mentioning the exact parameter (e.g., "NVEmbed judge
requires a non-null, non-empty output_dir"); apply the same fix to the
input_file and input_dir validation blocks so blank strings are rejected at
config validation time.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 96a3b996-0f45-48e0-828d-7c623d4a0d49

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and 8264676.

📒 Files selected for processing (8)

nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/audiobench/prepare.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/prepare.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/evaluator/mmau_pro.py
nemo_skills/evaluation/evaluator/nvembed_judge.py
nemo_skills/pipeline/judges/nvembed_judge.py

coderabbitai · 2026-06-04T17:57:29Z

+        if dataset_name in JUDGE_DATASETS:
+            category = "judge"
+        elif dataset_name in NONJUDGE_DATASETS:
+            category = "nonjudge"
+        else:
+            raise ValueError(f"Unsupported AudioBench dataset name: {name}")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move the unsupported-dataset check into process_dataset().

This fail-fast only protects the CLI path right now. Direct callers can still pass an unsupported name into process_dataset(), which falls back to category = "unknown" and writes manifests under an unsupported layout.

Suggested fix

def process_dataset( dataset_name: str, output_dir: Path, save_audio: bool = True, split: str = "test", max_samples: int = -1, ) -> tuple[int, List[Dict]]: """Process a single AudioBench dataset. @@ Returns: Tuple of (num_samples, manifest_entries) """ + if dataset_name not in JUDGE_DATASETS and dataset_name not in NONJUDGE_DATASETS: + normalized_name = f"{dataset_name}_test" + if normalized_name in JUDGE_DATASETS or normalized_name in NONJUDGE_DATASETS: + dataset_name = normalized_name + else: + raise ValueError(f"Unsupported AudioBench dataset name: {dataset_name}") + print(f"\n{'=' * 60}") print(f"Processing: {dataset_name}") print(f"{'=' * 60}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/audiobench/prepare.py` around lines 613 - 618, The unsupported-dataset check needs to be moved into process_dataset(): inside the process_dataset(dataset_name, ...) function validate dataset_name against JUDGE_DATASETS and NONJUDGE_DATASETS and raise ValueError for unsupported names (use dataset_name in the error message, not the undefined name variable), instead of letting callers fall through to category="unknown"; update any code paths that previously relied on the CLI-only check so process_dataset always fails fast for invalid dataset names.

coderabbitai · 2026-06-04T17:57:29Z

+    generation_value = sample.get("generation", "")
+    generation = generation_value.strip() if isinstance(generation_value, str) else ""
    choices = sample.get("choices", [])
    expected_answer = sample.get("expected_answer", "")

-    # Fail fast if data is malformed - this indicates a pipeline error
+    # Empty model outputs are valid failed predictions. Keep malformed-data
+    # checks strict, but do not let one blank generation abort the whole eval.
    if not generation:
-        raise ValueError(
-            f"Sample missing generation field or has empty generation. Sample ID: {sample.get('id', 'unknown')}"
+        sample.update(
+            {
+                "nvembed_matched_choice": "",
+                "nvembed_confidence": 0.0,
+                "is_correct": False,
+                "nvembed_error": "empty_generation",
+            }
        )
+        return sample


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep missing or malformed generation values as data errors.

sample.get("generation", "") and the non-string fallback now treat a missing key or structured value exactly like a blank model output. That hides bad records and can skew evaluation metrics instead of failing fast. Read the key directly, validate that it is a string, and only return nvembed_error="empty_generation" after a real string strips to empty.

Suggested fix

- generation_value = sample.get("generation", "") - generation = generation_value.strip() if isinstance(generation_value, str) else "" + generation_value = sample["generation"] + if not isinstance(generation_value, str): + raise TypeError(f"Sample generation must be a string. Sample ID: {sample.get('id', 'unknown')}") + generation = generation_value.strip()

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 134 - 150, The code is silently treating missing or non-string generation values as empty outputs; in the nvembed_judge logic, replace sample.get("generation", "") with direct access sample["generation"] (so missing keys raise) and explicitly validate the type: if not isinstance(sample["generation"], str) raise or mark as malformed (e.g., set "nvembed_error":"malformed_generation") and only set nvembed_error:"empty_generation" when the string .strip() is empty; keep the rest of the return behavior (nvembed_matched_choice, nvembed_confidence, is_correct) unchanged so a truly blank string is handled as a valid empty prediction but bad/missing data fails fast.

coderabbitai · 2026-06-04T17:57:29Z

+    try:
+        output_dir_path = judge_pipeline_args["output_dir"]
+    except KeyError as exc:
+        raise ValueError("NVEmbed judge requires judge_pipeline_args['output_dir']") from exc
+    if output_dir_path is None:
+        raise ValueError("NVEmbed judge requires a non-null output_dir")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject blank path values here too.

These checks only reject None. If input_file, input_dir, or output_dir is "", the evaluator later treats it as Path("") == ".", so the job can read the working directory or fail with a misleading copy error instead of stopping at config validation.

Suggested fix

- if output_dir_path is None: + if output_dir_path is None or not str(output_dir_path).strip(): raise ValueError("NVEmbed judge requires a non-null output_dir") @@ - if input_dir is None: + if input_dir is None or not str(input_dir).strip(): raise ValueError("NVEmbed judge requires a non-null input_dir when input_file is unset") script_args.extend(["--input-dir", str(input_dir)]) script_args.extend(["--num-seeds", str(num_seeds)]) else: + if not str(input_file).strip(): + raise ValueError("NVEmbed judge requires a non-empty input_file when provided") script_args.extend(["--input-file", str(input_file)])

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 111-123

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 80 - 85, The current validation only rejects None for judge_pipeline_args["output_dir"] (and similarly for input_file/input_dir around lines 111-123); update these checks to also reject empty or all-whitespace strings so "" isn't treated as Path("."); specifically, after retrieving output_dir_path (and input_file/input_dir), treat values as invalid if they are None or if str(value).strip() == "" (or simply falsy after stripping) and raise the same ValueError mentioning the exact parameter (e.g., "NVEmbed judge requires a non-null, non-empty output_dir"); apply the same fix to the input_file and input_dir validation blocks so blank strings are rejected at config validation time.

pzelasko · 2026-06-15T15:23:46Z

Recreated as #1485 from the same-repo branch codex/pr1443-audiobench-mmau-fixes so repository CI can run. Closing this fork-based PR to keep the review queue focused on the compliant replacement.

pzelasko mentioned this pull request Jun 4, 2026

feat(speech): add AppTek Call-Center Dialogues ASR benchmark #1443

Closed

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread nemo_skills/dataset/audiobench/prepare.py Outdated

Comment thread nemo_skills/evaluation/evaluator/nvembed_judge.py Outdated

Comment thread nemo_skills/pipeline/judges/nvembed_judge.py

Comment thread nemo_skills/pipeline/judges/nvembed_judge.py

pzelasko force-pushed the codex/pr1443-audiobench-mmau-fixes branch from 3939682 to d9bdff0 Compare June 4, 2026 17:27

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread nemo_skills/dataset/audiobench/prepare.py

fix(audio): harden AudioBench and MMAU evaluation

8264676

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko force-pushed the codex/pr1443-audiobench-mmau-fixes branch from d9bdff0 to 8264676 Compare June 4, 2026 17:45

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

pzelasko mentioned this pull request Jun 15, 2026

fix(audio): harden AudioBench and MMAU evaluation #1485

Open

pzelasko closed this Jun 15, 2026

Uh oh!

Conversation

pzelasko commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

Relationship To #1443

Validation

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pzelasko commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pzelasko commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

pzelasko commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pzelasko commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading