fix(audio): harden AudioBench and MMAU evaluation #1475

Closed
pzelasko wants to merge 1 commit into NVIDIA-NeMo:main from pzelasko:codex/pr1443-audiobench-mmau-fixes
@pzelasko pzelasko commented Jun 4, 2026

Collaborator

Summary

Split out from #1443.

This PR contains AudioBench, MMAU-Pro, and NVEmbed robustness fixes that are independent of the AppTek benchmark.

What Changed

  • Reworked AudioBench manifest generation so combined and per-dataset manifests resolve audio paths correctly.
  • Added defensive handling for malformed nested AudioBench audio metadata during manifest rewriting.
  • Enabled soft-fail generation for:
    • AudioBench non-judge
    • MMAU-Pro closed-form
  • Made AudioBench preparation fail fast for unsupported dataset names and avoid swallowing dataset-level failures.
  • Normalized MMAU-Pro audio paths for generated manifests.
  • Added audio evaluator text coercion for loose string/dict/list/None fields.
  • Handled null MMAU-Pro samples as failed samples instead of aborting evaluation.
  • Handled empty NVEmbed generations as failed samples instead of raising.
  • Hardened NVEmbed judge execution:
    • validates required input/output arguments
    • pins runtime numpy<2
    • installs task dependencies into an isolated target directory
    • passes shell-safe script arguments
    • supports SLURM account and judge-container overrides
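
As an illustration of the "shell-safe script arguments" item, here is a minimal sketch that assumes the command is assembled as a list of tokens and quoted with the standard library's `shlex`. The helper name `build_judge_command` and the example paths are hypothetical and are not taken from this PR's code.

```python
import shlex


def build_judge_command(script_path: str, script_args: list[str]) -> str:
    """Join a script path and its arguments into one shell-safe command string."""
    # shlex.join quotes every token that needs it, so spaces or shell
    # metacharacters inside a path cannot split the argument or start
    # a second command.
    return shlex.join(["python", script_path, *script_args])


# Hypothetical arguments, including a path with a space and a semicolon.
cmd = build_judge_command(
    "nvembed_judge.py",
    ["--input-dir", "/data/run 1; echo oops", "--num-seeds", "3"],
)

# Splitting the command the way a POSIX shell would recovers the tokens.
tokens = shlex.split(cmd)
```

Building the list first and quoting once at the end keeps quoting decisions in one place, which is usually easier to review than quoting each interpolated value by hand.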

Relationship To #1443

This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.

Validation

  • python -m py_compile on changed Python files.
  • Direct AudioBench manifest rewrite smoke test with malformed nested audio entries.
  • CI lint/DCO/copyright checks are passing.

coderabbitai Bot commented Jun 4, 2026

Contributor

Walkthrough

This PR hardens audio dataset preparation and evaluation against missing, malformed, or structured input fields. It restructures AudioBench manifests to use relative paths, enables soft-fail mode across AudioBench and MMAU‑Pro configurations, and refactors the NVEmbed judge pipeline to manage dependencies separately for each SLURM job.
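
To make "structured input fields" concrete, the following is a rough sketch of text coercion for loose None/string/dict/list values. It is not the PR's `coerce_text_field`; the function name `coerce_text` and the specific rules (preferring a `"text"` key, joining list items with spaces) are assumptions chosen for illustration.

```python
from typing import Any


def coerce_text(value: Any) -> str:
    """Flatten None, str, dict, or list values into plain text (illustrative rules)."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        # Assumption: prefer a "text" key when present, else fall back to str().
        return coerce_text(value["text"]) if "text" in value else str(value)
    if isinstance(value, (list, tuple)):
        # Coerce each item and join the non-empty results with single spaces.
        parts = (coerce_text(item) for item in value)
        return " ".join(part for part in parts if part)
    return str(value)
```

With rules like these, an evaluator can compare a generation and a reference as strings without first checking which shape each field arrived in.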

Changes

Audio Robustness and Manifest Path Restructuring

| Layer / File(s) | Summary |
|---|---|
| Soft-fail configuration across datasets: nemo_skills/dataset/audiobench/nonjudge/__init__.py, nemo_skills/dataset/mmau-pro/closed_form/__init__.py | GENERATION_ARGS extended with server.enable_soft_fail=true in both AudioBench non-judge and MMAU‑Pro closed-form. Removed EVAL_ARGS from MMAU‑Pro closed_form. |
| AudioBench manifest path restructuring: nemo_skills/dataset/audiobench/prepare.py | Refactored manifest generation: extract dict instruction["text"], emit dataset-scoped audio/{dataset_name}/{audio_filename}, add make_dataset_manifest_entry to deep-copy and rewrite audio/... -> ../audio/..., serialize rewritten entries when writing per-dataset files, and accumulate category manifests written to output_dir/{category}/{split}.jsonl. |
| MMAU‑Pro audio path normalization: nemo_skills/dataset/mmau-pro/prepare.py | format_entry now prefixes audio paths with ../ in formatted_entry["audio_path"] and builds user_message["audio"] or user_message["audios"] from the normalized paths. |
| Audio evaluator coercion: nemo_skills/evaluation/evaluator/audio.py | Added coerce_text_field to normalize None/string/dict/list generation and reference fields to plain text; evaluate_sample now uses it for both generation and expected answer. |
| MMAU‑Pro null-sample guard: nemo_skills/evaluation/evaluator/mmau_pro.py | evaluate_instruction_following_sample returns {"is_correct": False, "error": "null_sample"} and logs a warning when given None instead of proceeding. |
| NVEmbed evaluator hardening: nemo_skills/evaluation/evaluator/nvembed_judge.py | Pinned numpy<2 in runtime install; empty/blank generation is handled by returning an incorrect sample with nvembed_error: "empty_generation" and zero confidence instead of raising. |
| NVEmbed judge pipeline refactor: nemo_skills/pipeline/judges/nvembed_judge.py | create_judge_tasks signature adds account and judge_container; command construction now builds script_args, installs pinned deps into a job-scoped temp dir, sets PYTHONPATH and offline env vars, runs nvembed_judge.py with shell-quoted args, and adjusts task flags and container selection. |
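
The manifest path rewrite summarized for prepare.py can be sketched roughly as follows. This is not the PR's `make_dataset_manifest_entry`: the function name `rewrite_audio_paths` and the assumed entry layout (a `"messages"` list whose items may carry `"audio"` or `"audios"`) are illustrative guesses based only on the descriptions in this thread.

```python
import copy
from typing import Any


def rewrite_audio_paths(entry: dict[str, Any]) -> dict[str, Any]:
    """Return a deep copy of entry with audio/... paths rewritten to ../audio/..."""

    def rewrite(path: Any) -> Any:
        # Only touch string paths that use the dataset-scoped "audio/" prefix.
        if isinstance(path, str) and path.startswith("audio/"):
            return "../" + path
        return path

    rewritten = copy.deepcopy(entry)  # never mutate the combined-manifest entry
    messages = rewritten.get("messages")
    if not isinstance(messages, list):
        return rewritten
    for message in messages:
        if not isinstance(message, dict):
            continue  # skip malformed messages instead of raising
        audio = message.get("audio")
        if isinstance(audio, dict) and "path" in audio:
            audio["path"] = rewrite(audio["path"])
        audios = message.get("audios")
        if isinstance(audios, list):
            for item in audios:
                if isinstance(item, dict) and "path" in item:
                    item["path"] = rewrite(item["path"])
    return rewritten
```

Working on a deep copy means the same in-memory entry can be written once to a combined manifest with `audio/...` paths and once to a per-dataset manifest with `../audio/...` paths.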

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • AlexGrinch
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main objective: hardening AudioBench and MMAU evaluation with multiple robustness fixes across files and modules. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/nvembed_judge.py (1)

123-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard generation before .strip() so null values don’t abort evaluation.

If a record has "generation": null, the .strip() call runs before your new empty-generation fallback and raises AttributeError, aborting the file instead of soft-failing the sample.

Suggested fix
-    generation = sample.get("generation", "").strip()
+    generation_value = sample.get("generation", "")
+    generation = generation_value.strip() if isinstance(generation_value, str) else ""
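
The failure mode behind this comment is easy to reproduce: the default passed to `dict.get` is used only when the key is absent, not when it is present with a `None` value. The sketch below uses a hypothetical helper name, `normalize_generation`, and mirrors the guard in the suggested fix.

```python
def normalize_generation(sample: dict) -> str:
    """Return the stripped generation, treating None and non-strings as empty."""
    raw = sample.get("generation", "")
    # The "" default applies only to a missing key; an explicit None is
    # returned as-is, so the type check must come before calling .strip().
    return raw.strip() if isinstance(raw, str) else ""


null_sample = {"id": "a", "generation": None}
blank_sample = {"id": "b", "generation": "   "}
ok_sample = {"id": "c", "generation": " answer "}

results = [normalize_generation(s) for s in (null_sample, blank_sample, ok_sample)]
```

Here `results` holds two empty strings followed by `"answer"`, so both the null and the blank record reach the empty-generation branch instead of raising.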
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 123 - 138,
The code calls .strip() on generation without guarding for null, so when
sample.get("generation") returns None it raises AttributeError; change the
assignment to guard and normalize generation first (e.g., read raw_generation =
sample.get("generation"); set generation = (raw_generation or "").strip() or
explicitly if raw_generation is None then generation = ""), then proceed with
the existing empty-generation branch that updates sample keys
("nvembed_matched_choice", "nvembed_confidence", "is_correct", "nvembed_error");
update the logic around the generation variable in nvembed_judge.py so any None
is converted to an empty string before calling .strip().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 159-160: Replace the silent fallback that uses
instruction.get("text") with a fail-fast access so malformed dicts don't get
masked: inside the isinstance(instruction, dict) branch, access the required key
directly (instruction = instruction["text"]) so a KeyError surfaces for missing
keys, and optionally wrap that access to raise a clear ValueError (including the
offending dict or index) if you prefer a more descriptive error; ensure the
resulting instruction is a str before continuing.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Line 45: Run the code formatter ruff on the file that contains the pip install
list literal starting with [sys.executable, "-m", "pip", "install", "-q",
"numpy<2", "datasets", ...] to apply ruff-format fixes, then stage and commit
the resulting changes so the pre-commit/CI lint job passes. Ensure you only run
the formatter (ruff format) and include the updated file in your commit.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 117-126: The file fails the ruff formatter; run the ruff formatter
on nemo_skills/pipeline/judges/nvembed_judge.py (or run pre-commit) and commit
the reformatted output so the assignment to run_cmd and surrounding code are
properly formatted; specifically reformat the module where run_cmd is defined
(search for the run_cmd assignment in nvembed_judge.py) and commit the changes
to unblock the CI lint job.
- Around line 103-110: The code currently uses
judge_pipeline_args.get("input_dir") and str(...) which can pass "None" as a
path; change to require and validate the key by using direct access
judge_pipeline_args["input_dir"] (or explicitly check "input_dir" in
judge_pipeline_args) when input_file is None, and if missing or None raise a
clear exception (e.g. ValueError) before building script_args; update the branch
that calls script_args.extend(["--input-dir", str(input_dir)]) to assume a valid
path only after this validation so a missing input_dir fails fast in
nvembed_judge.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e2dd68c-54b6-4a40-9560-a1f4e34ba34f

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and 3939682.

📒 Files selected for processing (8)
  • nemo_skills/dataset/audiobench/nonjudge/__init__.py
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/evaluator/mmau_pro.py
  • nemo_skills/evaluation/evaluator/nvembed_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py

Comment thread nemo_skills/dataset/audiobench/prepare.py Outdated
Comment thread nemo_skills/evaluation/evaluator/nvembed_judge.py Outdated
Comment thread nemo_skills/pipeline/judges/nvembed_judge.py
Comment thread nemo_skills/pipeline/judges/nvembed_judge.py
@pzelasko pzelasko force-pushed the codex/pr1443-audiobench-mmau-fixes branch from 3939682 to d9bdff0 on June 4, 2026, 17:27
pzelasko commented Jun 4, 2026

Collaborator Author

@coderabbitai review

Pushed fixes for the AudioBench/NVEmbed comments: fail-fast instruction handling, null-safe generation handling, input_dir validation, and ruff-format changes. Please re-review the current head.

coderabbitai Bot commented Jun 4, 2026

Contributor

@pzelasko Sure! I'll re-review the current head with the pushed fixes in mind.


🧠 Learnings used
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)

80-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use direct access for output_dir to fail fast if missing.

Similar to the input_dir fix below, output_dir is required (the evaluator's --output-dir is required=True). Using .get() without validation could pass "None" as a path string if the key is missing.

Suggested fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
+    try:
+        output_dir_path = judge_pipeline_args["output_dir"]
+    except KeyError as exc:
+        raise ValueError("NVEmbed judge requires judge_pipeline_args['output_dir']") from exc

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
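
A small sketch of why `.get()` is risky for a required argument: `str(None)` is the literal string `"None"`, which then looks like a valid relative path. The helper `require_arg` below is hypothetical and only illustrates the fail-fast pattern from the suggested fix.

```python
from typing import Any, Mapping


def require_arg(args: Mapping[str, Any], key: str) -> Any:
    """Return args[key], raising ValueError when it is missing or None."""
    try:
        value = args[key]
    except KeyError as exc:
        raise ValueError(f"NVEmbed judge requires judge_pipeline_args[{key!r}]") from exc
    if value is None:
        raise ValueError(f"NVEmbed judge requires a non-null {key}")
    return value


# The silent failure being avoided: a missing key turns into the path "None".
silent_path = str({}.get("output_dir"))

# The fail-fast alternative on a valid (hypothetical) argument dict.
output_dir = require_arg({"output_dir": "/results/judge"}, "output_dir")
```

Raising `ValueError` with the argument name gives the person launching the job a message that points at the configuration rather than at a later file-copy step.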

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` at line 80, The code currently
uses judge_pipeline_args.get("output_dir") which can silently return None;
change this to direct access judge_pipeline_args["output_dir"] so the program
fails fast if the required key is missing (update the assignment to
output_dir_path = judge_pipeline_args["output_dir"] and remove any downstream
logic that treats None as a valid path).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 216-235: The helper make_dataset_manifest_entry currently accesses
nested audio["path"] without confirming the keys/types, which can raise KeyError
on malformed messages; update the checks in the loop so you first verify message
contains an "audio" mapping and that "path" exists (e.g., if "audio" in message
and isinstance(message["audio"], dict) and "path" in message["audio"]) before
rewriting, and likewise iterate message.get("audios", []) but only call rewrite
on items that are dicts and contain "path" (e.g., for audio in
message.get("audios", []): if isinstance(audio, dict) and "path" in audio:
audio["path"] = rewrite(audio["path"])). This preserves behavior for valid
entries and safely skips malformed nested audio entries in
make_dataset_manifest_entry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c440e519-0b43-4e04-aff0-90108b6082e5

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and d9bdff0.

Comment thread nemo_skills/dataset/audiobench/prepare.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
@pzelasko pzelasko force-pushed the codex/pr1443-audiobench-mmau-fixes branch from d9bdff0 to 8264676 on June 4, 2026, 17:45
pzelasko commented Jun 4, 2026

Collaborator Author

@coderabbitai review

Pushed fixes for the latest AudioBench/NVEmbed review: nested audio path rewrites now guard key/type access, and NVEmbed output_dir fails fast when missing or null. Please re-review the current head.

coderabbitai Bot commented Jun 4, 2026

Contributor

@pzelasko Sure! I'll re-review the current head with the nested audio path guarding and NVEmbed output_dir fail-fast fixes in mind.


✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 613-618: The unsupported-dataset check needs to be moved into
process_dataset(): inside the process_dataset(dataset_name, ...) function
validate dataset_name against JUDGE_DATASETS and NONJUDGE_DATASETS and raise
ValueError for unsupported names (use dataset_name in the error message, not the
undefined name variable), instead of letting callers fall through to
category="unknown"; update any code paths that previously relied on the CLI-only
check so process_dataset always fails fast for invalid dataset names.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 134-150: The code is silently treating missing or non-string
generation values as empty outputs; in the nvembed_judge logic, replace
sample.get("generation", "") with direct access sample["generation"] (so missing
keys raise) and explicitly validate the type: if not
isinstance(sample["generation"], str) raise or mark as malformed (e.g., set
"nvembed_error":"malformed_generation") and only set
nvembed_error:"empty_generation" when the string .strip() is empty; keep the
rest of the return behavior (nvembed_matched_choice, nvembed_confidence,
is_correct) unchanged so a truly blank string is handled as a valid empty
prediction but bad/missing data fails fast.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 80-85: The current validation only rejects None for
judge_pipeline_args["output_dir"] (and similarly for input_file/input_dir around
lines 111-123); update these checks to also reject empty or all-whitespace
strings so "" isn't treated as Path("."); specifically, after retrieving
output_dir_path (and input_file/input_dir), treat values as invalid if they are
None or if str(value).strip() == "" (or simply falsy after stripping) and raise
the same ValueError mentioning the exact parameter (e.g., "NVEmbed judge
requires a non-null, non-empty output_dir"); apply the same fix to the
input_file and input_dir validation blocks so blank strings are rejected at
config validation time.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 96a3b996-0f45-48e0-828d-7c623d4a0d49

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and 8264676.

Comment on lines +613 to +618
if dataset_name in JUDGE_DATASETS:
    category = "judge"
elif dataset_name in NONJUDGE_DATASETS:
    category = "nonjudge"
else:
    raise ValueError(f"Unsupported AudioBench dataset name: {name}")

Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move the unsupported-dataset check into process_dataset().

This fail-fast only protects the CLI path right now. Direct callers can still pass an unsupported name into process_dataset(), which falls back to category = "unknown" and writes manifests under an unsupported layout.

Suggested fix
 def process_dataset(
     dataset_name: str,
     output_dir: Path,
     save_audio: bool = True,
     split: str = "test",
     max_samples: int = -1,
 ) -> tuple[int, List[Dict]]:
     """Process a single AudioBench dataset.
@@
     Returns:
         Tuple of (num_samples, manifest_entries)
     """
+    if dataset_name not in JUDGE_DATASETS and dataset_name not in NONJUDGE_DATASETS:
+        normalized_name = f"{dataset_name}_test"
+        if normalized_name in JUDGE_DATASETS or normalized_name in NONJUDGE_DATASETS:
+            dataset_name = normalized_name
+        else:
+            raise ValueError(f"Unsupported AudioBench dataset name: {dataset_name}")
+
     print(f"\n{'=' * 60}")
     print(f"Processing: {dataset_name}")
     print(f"{'=' * 60}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/audiobench/prepare.py` around lines 613 - 618, The
unsupported-dataset check needs to be moved into process_dataset(): inside the
process_dataset(dataset_name, ...) function validate dataset_name against
JUDGE_DATASETS and NONJUDGE_DATASETS and raise ValueError for unsupported names
(use dataset_name in the error message, not the undefined name variable),
instead of letting callers fall through to category="unknown"; update any code
paths that previously relied on the CLI-only check so process_dataset always
fails fast for invalid dataset names.
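
The validation requested here could look roughly like the sketch below. The two registries are placeholders with made-up dataset names (the real `JUDGE_DATASETS` and `NONJUDGE_DATASETS` contents are not shown in this thread), and `resolve_dataset` is a hypothetical helper; the `_test` suffix fallback follows the suggested fix above.

```python
# Placeholder registries; the real sets live in prepare.py and are not shown here.
JUDGE_DATASETS = {"example_qa_test"}
NONJUDGE_DATASETS = {"example_asr_test"}


def resolve_dataset(dataset_name: str) -> tuple[str, str]:
    """Return (canonical_name, category) or raise for unsupported names."""
    # Try the name as given first, then with the "_test" suffix appended.
    for candidate in (dataset_name, f"{dataset_name}_test"):
        if candidate in JUDGE_DATASETS:
            return candidate, "judge"
        if candidate in NONJUDGE_DATASETS:
            return candidate, "nonjudge"
    # No "unknown" category: unsupported names stop here for every caller.
    raise ValueError(f"Unsupported AudioBench dataset name: {dataset_name}")
```

Because the check lives in the function that every code path goes through, direct callers and the CLI get the same error for an unsupported name.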

Comment on lines +134 to +150
+    generation_value = sample.get("generation", "")
+    generation = generation_value.strip() if isinstance(generation_value, str) else ""
     choices = sample.get("choices", [])
     expected_answer = sample.get("expected_answer", "")

-    # Fail fast if data is malformed - this indicates a pipeline error
+    # Empty model outputs are valid failed predictions. Keep malformed-data
+    # checks strict, but do not let one blank generation abort the whole eval.
     if not generation:
-        raise ValueError(
-            f"Sample missing generation field or has empty generation. Sample ID: {sample.get('id', 'unknown')}"
+        sample.update(
+            {
+                "nvembed_matched_choice": "",
+                "nvembed_confidence": 0.0,
+                "is_correct": False,
+                "nvembed_error": "empty_generation",
+            }
         )
+        return sample

Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep missing or malformed generation values as data errors.

sample.get("generation", "") and the non-string fallback now treat a missing key or structured value exactly like a blank model output. That hides bad records and can skew evaluation metrics instead of failing fast. Read the key directly, validate that it is a string, and only return nvembed_error="empty_generation" after a real string strips to empty.

Suggested fix
-    generation_value = sample.get("generation", "")
-    generation = generation_value.strip() if isinstance(generation_value, str) else ""
+    generation_value = sample["generation"]
+    if not isinstance(generation_value, str):
+        raise TypeError(f"Sample generation must be a string. Sample ID: {sample.get('id', 'unknown')}")
+    generation = generation_value.strip()

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 134 - 150,
The code is silently treating missing or non-string generation values as empty
outputs; in the nvembed_judge logic, replace sample.get("generation", "") with
direct access sample["generation"] (so missing keys raise) and explicitly
validate the type: if not isinstance(sample["generation"], str) raise or mark as
malformed (e.g., set "nvembed_error":"malformed_generation") and only set
nvembed_error:"empty_generation" when the string .strip() is empty; keep the
rest of the return behavior (nvembed_matched_choice, nvembed_confidence,
is_correct) unchanged so a truly blank string is handled as a valid empty
prediction but bad/missing data fails fast.
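
The distinction this comment draws (blank output is a failed prediction, while missing or non-string data is a pipeline error) can be sketched as below. `check_generation` is a hypothetical helper rather than the evaluator's actual function; the result keys are the ones quoted in this review.

```python
from typing import Any, Optional


def check_generation(sample: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a soft-fail result for blank output, or None to continue scoring."""
    # Direct access: a missing key raises KeyError instead of looking blank.
    value = sample["generation"]
    if not isinstance(value, str):
        raise TypeError(
            f"Sample generation must be a string, got {type(value).__name__}. "
            f"Sample ID: {sample.get('id', 'unknown')}"
        )
    if value.strip():
        return None  # non-empty text: the caller proceeds with normal scoring
    # A real string that strips to empty is a valid, incorrect prediction.
    return {
        "nvembed_matched_choice": "",
        "nvembed_confidence": 0.0,
        "is_correct": False,
        "nvembed_error": "empty_generation",
    }
```

Under this split, only genuinely blank model outputs lower the accuracy figure; malformed records surface as exceptions and never enter the metrics.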

Comment on lines +80 to +85
try:
output_dir_path = judge_pipeline_args["output_dir"]
except KeyError as exc:
raise ValueError("NVEmbed judge requires judge_pipeline_args['output_dir']") from exc
if output_dir_path is None:
raise ValueError("NVEmbed judge requires a non-null output_dir")

Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject blank path values here too.

These checks only reject None. If input_file, input_dir, or output_dir is "", the evaluator later builds Path(""), which is equivalent to Path("."), so the job can read the working directory or fail with a misleading copy error instead of stopping at config validation.

Suggested fix
-    if output_dir_path is None:
+    if output_dir_path is None or not str(output_dir_path).strip():
         raise ValueError("NVEmbed judge requires a non-null output_dir")
@@
-        if input_dir is None:
+        if input_dir is None or not str(input_dir).strip():
             raise ValueError("NVEmbed judge requires a non-null input_dir when input_file is unset")
         script_args.extend(["--input-dir", str(input_dir)])
         script_args.extend(["--num-seeds", str(num_seeds)])
     else:
+        if not str(input_file).strip():
+            raise ValueError("NVEmbed judge requires a non-empty input_file when provided")
         script_args.extend(["--input-file", str(input_file)])

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 111-123

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 80 - 85, The
current validation only rejects None for judge_pipeline_args["output_dir"] (and
similarly for input_file/input_dir around lines 111-123); update these checks to
also reject empty or all-whitespace strings so "" isn't treated as Path(".");
specifically, after retrieving output_dir_path (and input_file/input_dir), treat
values as invalid if they are None or if str(value).strip() == "" (or simply
falsy after stripping) and raise the same ValueError mentioning the exact
parameter (e.g., "NVEmbed judge requires a non-null, non-empty output_dir");
apply the same fix to the input_file and input_dir validation blocks so blank
strings are rejected at config validation time.
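
The `pathlib` behavior behind this comment, together with one way to reject blank values, is sketched below. `require_path` is a hypothetical helper; only the error wording is borrowed from the prompt above.

```python
from pathlib import Path
from typing import Any


def require_path(name: str, value: Any) -> Path:
    """Return value as a Path, rejecting None, empty, and whitespace-only input."""
    if value is None or not str(value).strip():
        raise ValueError(f"NVEmbed judge requires a non-null, non-empty {name}")
    return Path(str(value))


# Why the check matters: an empty string silently becomes the current directory.
empty_is_cwd = Path("") == Path(".")

# A normal (hypothetical) value passes through unchanged.
output_dir = require_path("output_dir", "/results/judge")
```

Checking `str(value).strip()` covers both `""` and whitespace-only strings, which a plain `is None` test lets through.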

Collaborator Author

Recreated as #1485 from the same-repo branch codex/pr1443-audiobench-mmau-fixes so repository CI can run. Closing this fork-based PR to keep the review queue focused on the compliant replacement.

@pzelasko pzelasko closed this Jun 15, 2026