fix(audio): harden AudioBench and MMAU evaluation#1475
Conversation
WalkthroughThis PR hardens audio dataset preparation and evaluation against missing, malformed, or structured input fields. It restructures AudioBench manifests to use relative paths, enables soft-fail mode across AudioBench and MMAU‑Pro configurations, and refactors the NVEmbed judge pipeline to manage dependencies separately for each SLURM job. ChangesAudio Robustness and Manifest Path Restructuring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/nvembed_judge.py (1)
123-138:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard
generationbefore.strip()so null values don’t abort evaluation.If a record has
"generation": null, the.strip()call runs before your new empty-generation fallback and raisesAttributeError, aborting the file instead of soft-failing the sample.Suggested fix
- generation = sample.get("generation", "").strip() + generation_value = sample.get("generation", "") + generation = generation_value.strip() if isinstance(generation_value, str) else ""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 123 - 138, The code calls .strip() on generation without guarding for null, so when sample.get("generation") returns None it raises AttributeError; change the assignment to guard and normalize generation first (e.g., read raw_generation = sample.get("generation"); set generation = (raw_generation or "").strip() or explicitly if raw_generation is None then generation = ""), then proceed with the existing empty-generation branch that updates sample keys ("nvembed_matched_choice", "nvembed_confidence", "is_correct", "nvembed_error"); update the logic around the generation variable in nvembed_judge.py so any None is converted to an empty string before calling .strip().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 159-160: Replace the silent fallback that uses
instruction.get("text") with a fail-fast access so malformed dicts don't get
masked: inside the isinstance(instruction, dict) branch, access the required key
directly (instruction = instruction["text"]) so a KeyError surfaces for missing
keys, and optionally wrap that access to raise a clear ValueError (including the
offending dict or index) if you prefer a more descriptive error; ensure the
resulting instruction is a str before continuing.
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Line 45: Run the code formatter ruff on the file that contains the pip install
list literal starting with [sys.executable, "-m", "pip", "install", "-q",
"numpy<2", "datasets", ...] to apply ruff-format fixes, then stage and commit
the resulting changes so the pre-commit/CI lint job passes. Ensure you only run
the formatter (ruff format) and include the updated file in your commit.
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 117-126: The file fails the ruff formatter; run the ruff formatter
on nemo_skills/pipeline/judges/nvembed_judge.py (or run pre-commit) and commit
the reformatted output so the assignment to run_cmd and surrounding code are
properly formatted; specifically reformat the module where run_cmd is defined
(search for the run_cmd assignment in nvembed_judge.py) and commit the changes
to unblock the CI lint job.
- Around line 103-110: The code currently uses
judge_pipeline_args.get("input_dir") and str(...) which can pass "None" as a
path; change to require and validate the key by using direct access
judge_pipeline_args["input_dir"] (or explicitly check "input_dir" in
judge_pipeline_args) when input_file is None, and if missing or None raise a
clear exception (e.g. ValueError) before building script_args; update the branch
that calls script_args.extend(["--input-dir", str(input_dir)]) to assume a valid
path only after this validation so a missing input_dir fails fast in
nvembed_judge.py.
---
Outside diff comments:
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 123-138: The code calls .strip() on generation without guarding
for null, so when sample.get("generation") returns None it raises
AttributeError; change the assignment to guard and normalize generation first
(e.g., read raw_generation = sample.get("generation"); set generation =
(raw_generation or "").strip() or explicitly if raw_generation is None then
generation = ""), then proceed with the existing empty-generation branch that
updates sample keys ("nvembed_matched_choice", "nvembed_confidence",
"is_correct", "nvembed_error"); update the logic around the generation variable
in nvembed_judge.py so any None is converted to an empty string before calling
.strip().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8e2dd68c-54b6-4a40-9560-a1f4e34ba34f
📒 Files selected for processing (8)
nemo_skills/dataset/audiobench/nonjudge/__init__.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/mmau-pro/closed_form/__init__.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/evaluation/evaluator/audio.pynemo_skills/evaluation/evaluator/mmau_pro.pynemo_skills/evaluation/evaluator/nvembed_judge.pynemo_skills/pipeline/judges/nvembed_judge.py
3939682 to
d9bdff0
Compare
|
@coderabbitai review\n\nPushed fixes for the AudioBench/NVEmbed comments: fail-fast instruction handling, null-safe generation handling, input_dir validation, and ruff-format changes. Please re-review the current head. |
|
🧠 Learnings used✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)
80-80:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse direct access for
output_dirto fail fast if missing.Similar to the
input_dirfix below,output_diris required (the evaluator's--output-dirisrequired=True). Using.get()without validation could pass"None"as a path string if the key is missing.Suggested fix
- output_dir_path = judge_pipeline_args.get("output_dir") + try: + output_dir_path = judge_pipeline_args["output_dir"] + except KeyError as exc: + raise ValueError("NVEmbed judge requires judge_pipeline_args['output_dir']") from excAs per coding guidelines, "Don't use
.get()for accessing dictionary keys if the code expects them to be present; use direct accessdata[key_name]to fail with a clear error instead of silently corrupting data".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/judges/nvembed_judge.py` at line 80, The code currently uses judge_pipeline_args.get("output_dir") which can silently return None; change this to direct access judge_pipeline_args["output_dir"] so the program fails fast if the required key is missing (update the assignment to output_dir_path = judge_pipeline_args["output_dir"] and remove any downstream logic that treats None as a valid path).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 216-235: The helper make_dataset_manifest_entry currently accesses
nested audio["path"] without confirming the keys/types, which can raise KeyError
on malformed messages; update the checks in the loop so you first verify message
contains an "audio" mapping and that "path" exists (e.g., if "audio" in message
and isinstance(message["audio"], dict) and "path" in message["audio"]) before
rewriting, and likewise iterate message.get("audios", []) but only call rewrite
on items that are dicts and contain "path" (e.g., for audio in
message.get("audios", []): if isinstance(audio, dict) and "path" in audio:
audio["path"] = rewrite(audio["path"])). This preserves behavior for valid
entries and safely skips malformed nested audio entries in
make_dataset_manifest_entry.
---
Outside diff comments:
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Line 80: The code currently uses judge_pipeline_args.get("output_dir") which
can silently return None; change this to direct access
judge_pipeline_args["output_dir"] so the program fails fast if the required key
is missing (update the assignment to output_dir_path =
judge_pipeline_args["output_dir"] and remove any downstream logic that treats
None as a valid path).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c440e519-0b43-4e04-aff0-90108b6082e5
📒 Files selected for processing (8)
nemo_skills/dataset/audiobench/nonjudge/__init__.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/mmau-pro/closed_form/__init__.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/evaluation/evaluator/audio.pynemo_skills/evaluation/evaluator/mmau_pro.pynemo_skills/evaluation/evaluator/nvembed_judge.pynemo_skills/pipeline/judges/nvembed_judge.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
d9bdff0 to
8264676
Compare
|
@coderabbitai review Pushed fixes for the latest AudioBench/NVEmbed review: nested audio path rewrites now guard key/type access, and NVEmbed output_dir fails fast when missing or null. Please re-review the current head. |
|
🧠 Learnings used✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/dataset/audiobench/prepare.py`:
- Around line 613-618: The unsupported-dataset check needs to be moved into
process_dataset(): inside the process_dataset(dataset_name, ...) function
validate dataset_name against JUDGE_DATASETS and NONJUDGE_DATASETS and raise
ValueError for unsupported names (use dataset_name in the error message, not the
undefined name variable), instead of letting callers fall through to
category="unknown"; update any code paths that previously relied on the CLI-only
check so process_dataset always fails fast for invalid dataset names.
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 134-150: The code is silently treating missing or non-string
generation values as empty outputs; in the nvembed_judge logic, replace
sample.get("generation", "") with direct access sample["generation"] (so missing
keys raise) and explicitly validate the type: if not
isinstance(sample["generation"], str) raise or mark as malformed (e.g., set
"nvembed_error":"malformed_generation") and only set
nvembed_error:"empty_generation" when the string .strip() is empty; keep the
rest of the return behavior (nvembed_matched_choice, nvembed_confidence,
is_correct) unchanged so a truly blank string is handled as a valid empty
prediction but bad/missing data fails fast.
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 80-85: The current validation only rejects None for
judge_pipeline_args["output_dir"] (and similarly for input_file/input_dir around
lines 111-123); update these checks to also reject empty or all-whitespace
strings so "" isn't treated as Path("."); specifically, after retrieving
output_dir_path (and input_file/input_dir), treat values as invalid if they are
None or if str(value).strip() == "" (or simply falsy after stripping) and raise
the same ValueError mentioning the exact parameter (e.g., "NVEmbed judge
requires a non-null, non-empty output_dir"); apply the same fix to the
input_file and input_dir validation blocks so blank strings are rejected at
config validation time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 96a3b996-0f45-48e0-828d-7c623d4a0d49
📒 Files selected for processing (8)
nemo_skills/dataset/audiobench/nonjudge/__init__.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/mmau-pro/closed_form/__init__.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/evaluation/evaluator/audio.pynemo_skills/evaluation/evaluator/mmau_pro.pynemo_skills/evaluation/evaluator/nvembed_judge.pynemo_skills/pipeline/judges/nvembed_judge.py
| if dataset_name in JUDGE_DATASETS: | ||
| category = "judge" | ||
| elif dataset_name in NONJUDGE_DATASETS: | ||
| category = "nonjudge" | ||
| else: | ||
| raise ValueError(f"Unsupported AudioBench dataset name: {name}") |
There was a problem hiding this comment.
Move the unsupported-dataset check into process_dataset().
This fail-fast only protects the CLI path right now. Direct callers can still pass an unsupported name into process_dataset(), which falls back to category = "unknown" and writes manifests under an unsupported layout.
Suggested fix
def process_dataset(
dataset_name: str,
output_dir: Path,
save_audio: bool = True,
split: str = "test",
max_samples: int = -1,
) -> tuple[int, List[Dict]]:
"""Process a single AudioBench dataset.
@@
Returns:
Tuple of (num_samples, manifest_entries)
"""
+ if dataset_name not in JUDGE_DATASETS and dataset_name not in NONJUDGE_DATASETS:
+ normalized_name = f"{dataset_name}_test"
+ if normalized_name in JUDGE_DATASETS or normalized_name in NONJUDGE_DATASETS:
+ dataset_name = normalized_name
+ else:
+ raise ValueError(f"Unsupported AudioBench dataset name: {dataset_name}")
+
print(f"\n{'=' * 60}")
print(f"Processing: {dataset_name}")
print(f"{'=' * 60}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/audiobench/prepare.py` around lines 613 - 618, The
unsupported-dataset check needs to be moved into process_dataset(): inside the
process_dataset(dataset_name, ...) function validate dataset_name against
JUDGE_DATASETS and NONJUDGE_DATASETS and raise ValueError for unsupported names
(use dataset_name in the error message, not the undefined name variable),
instead of letting callers fall through to category="unknown"; update any code
paths that previously relied on the CLI-only check so process_dataset always
fails fast for invalid dataset names.
| generation_value = sample.get("generation", "") | ||
| generation = generation_value.strip() if isinstance(generation_value, str) else "" | ||
| choices = sample.get("choices", []) | ||
| expected_answer = sample.get("expected_answer", "") | ||
|
|
||
| # Fail fast if data is malformed - this indicates a pipeline error | ||
| # Empty model outputs are valid failed predictions. Keep malformed-data | ||
| # checks strict, but do not let one blank generation abort the whole eval. | ||
| if not generation: | ||
| raise ValueError( | ||
| f"Sample missing generation field or has empty generation. Sample ID: {sample.get('id', 'unknown')}" | ||
| sample.update( | ||
| { | ||
| "nvembed_matched_choice": "", | ||
| "nvembed_confidence": 0.0, | ||
| "is_correct": False, | ||
| "nvembed_error": "empty_generation", | ||
| } | ||
| ) | ||
| return sample |
There was a problem hiding this comment.
Keep missing or malformed generation values as data errors.
sample.get("generation", "") and the non-string fallback now treat a missing key or structured value exactly like a blank model output. That hides bad records and can skew evaluation metrics instead of failing fast. Read the key directly, validate that it is a string, and only return nvembed_error="empty_generation" after a real string strips to empty.
Suggested fix
- generation_value = sample.get("generation", "")
- generation = generation_value.strip() if isinstance(generation_value, str) else ""
+ generation_value = sample["generation"]
+ if not isinstance(generation_value, str):
+ raise TypeError(f"Sample generation must be a string. Sample ID: {sample.get('id', 'unknown')}")
+ generation = generation_value.strip()As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 134 - 150,
The code is silently treating missing or non-string generation values as empty
outputs; in the nvembed_judge logic, replace sample.get("generation", "") with
direct access sample["generation"] (so missing keys raise) and explicitly
validate the type: if not isinstance(sample["generation"], str) raise or mark as
malformed (e.g., set "nvembed_error":"malformed_generation") and only set
nvembed_error:"empty_generation" when the string .strip() is empty; keep the
rest of the return behavior (nvembed_matched_choice, nvembed_confidence,
is_correct) unchanged so a truly blank string is handled as a valid empty
prediction but bad/missing data fails fast.
| try: | ||
| output_dir_path = judge_pipeline_args["output_dir"] | ||
| except KeyError as exc: | ||
| raise ValueError("NVEmbed judge requires judge_pipeline_args['output_dir']") from exc | ||
| if output_dir_path is None: | ||
| raise ValueError("NVEmbed judge requires a non-null output_dir") |
There was a problem hiding this comment.
Reject blank path values here too.
These checks only reject None. If input_file, input_dir, or output_dir is "", the evaluator later treats it as Path("") == ".", so the job can read the working directory or fail with a misleading copy error instead of stopping at config validation.
Suggested fix
- if output_dir_path is None:
+ if output_dir_path is None or not str(output_dir_path).strip():
raise ValueError("NVEmbed judge requires a non-null output_dir")
@@
- if input_dir is None:
+ if input_dir is None or not str(input_dir).strip():
raise ValueError("NVEmbed judge requires a non-null input_dir when input_file is unset")
script_args.extend(["--input-dir", str(input_dir)])
script_args.extend(["--num-seeds", str(num_seeds)])
else:
+ if not str(input_file).strip():
+ raise ValueError("NVEmbed judge requires a non-empty input_file when provided")
script_args.extend(["--input-file", str(input_file)])As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
Also applies to: 111-123
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 80 - 85, The
current validation only rejects None for judge_pipeline_args["output_dir"] (and
similarly for input_file/input_dir around lines 111-123); update these checks to
also reject empty or all-whitespace strings so "" isn't treated as Path(".");
specifically, after retrieving output_dir_path (and input_file/input_dir), treat
values as invalid if they are None or if str(value).strip() == "" (or simply
falsy after stripping) and raise the same ValueError mentioning the exact
parameter (e.g., "NVEmbed judge requires a non-null, non-empty output_dir");
apply the same fix to the input_file and input_dir validation blocks so blank
strings are rejected at config validation time.
|
Recreated as #1485 from the same-repo branch |
Summary
Split out from #1443.
This PR contains AudioBench, MMAU-Pro, and NVEmbed robustness fixes that are independent of the AppTek benchmark.
What Changed
Nonefields.numpy<2Relationship To #1443
This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.
Validation
python -m py_compileon changed Python files.