fix(audio): harden AudioBench and MMAU evaluation#1485
Conversation
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds ChangesAudio Evaluation Robustness and Manifest Path Fixes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)
129-138: 💤 Low valueNumpy version mismatch with evaluator script.
The pipeline installs
numpy==1.26.4(line 133), butnvembed_judge.pyline 51 specifiesnumpy<2. While--skip-installprevents the script's version from being used, this creates a maintenance burden with duplicated version specifications.Consider extracting shared package versions to a single source of truth, or at minimum add a comment noting the evaluator script's version should be kept in sync.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 129 - 138, Extract the numpy version specification (currently hardcoded as numpy==1.26.4 in the run_cmd string) to a shared source of truth such as a requirements file or constants module that can be imported by both the pipeline builder and the evaluator script, or alternatively add clear comments in both the run_cmd construction and the evaluator script's requirements (line 51) indicating that the numpy version specifications must be kept in sync to avoid maintenance issues and ensure consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 129-138: Extract the numpy version specification (currently
hardcoded as numpy==1.26.4 in the run_cmd string) to a shared source of truth
such as a requirements file or constants module that can be imported by both the
pipeline builder and the evaluator script, or alternatively add clear comments
in both the run_cmd construction and the evaluator script's requirements (line
51) indicating that the numpy version specifications must be kept in sync to
avoid maintenance issues and ensure consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19b0cc8c-7f39-4bef-93b4-344189218f9d
📒 Files selected for processing (8)
nemo_skills/dataset/audiobench/nonjudge/__init__.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/mmau-pro/closed_form/__init__.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/evaluation/evaluator/audio.pynemo_skills/evaluation/evaluator/mmau_pro.pynemo_skills/evaluation/evaluator/nvembed_judge.pynemo_skills/pipeline/judges/nvembed_judge.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/nvembed_judge.py (1)
136-138:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNon-string
generationvalues are being incorrectly downgraded toempty_generation.On Line 137, any non-string
generationbecomes"", so dict/list payloads with valid text get marked as failed on Line 143. This creates deterministic false negatives and diverges from the coercion pattern used innemo_skills/evaluation/evaluator/audio.py.Suggested fix
+def _coerce_text(value: Any) -> str: + if value is None: + return "" + if isinstance(value, str): + return value.strip() + if isinstance(value, dict): + for key in ("text", "expected_answer", "answer", "transcript", "reference", "generation"): + if key in value and value[key] is not None: + return _coerce_text(value[key]) + return " ".join(_coerce_text(v) for v in value.values() if v is not None).strip() + if isinstance(value, (list, tuple)): + return " ".join(_coerce_text(v) for v in value if v is not None).strip() + return str(value).strip() + def evaluate_sample_with_nvembed(sample: dict[str, Any], model_name: str = "nvidia/NV-Embed-v2") -> dict[str, Any]: @@ - generation_value = sample.get("generation", "") - generation = generation_value.strip() if isinstance(generation_value, str) else "" + generation = _coerce_text(sample.get("generation", ""))Also applies to: 141-152
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 136 - 138, The `generation` value handling is incorrectly converting non-string payloads (such as dicts or lists) to empty strings, which causes valid text content to be marked as failed. Instead of simply converting non-string values to empty string, implement proper coercion logic that extracts the actual text content from non-string `generation` values, similar to the pattern used in the audio.py evaluator. This ensures that dict/list payloads with valid text are preserved rather than being discarded as empty values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 136-138: The `generation` value handling is incorrectly converting
non-string payloads (such as dicts or lists) to empty strings, which causes
valid text content to be marked as failed. Instead of simply converting
non-string values to empty string, implement proper coercion logic that extracts
the actual text content from non-string `generation` values, similar to the
pattern used in the audio.py evaluator. This ensures that dict/list payloads
with valid text are preserved rather than being discarded as empty values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4fe1dbce-60f0-4709-a5f4-1eb76687ac54
📒 Files selected for processing (2)
nemo_skills/evaluation/evaluator/nvembed_judge.pynemo_skills/pipeline/judges/nvembed_judge.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/nvembed_judge.py (1)
153-155: ⚡ Quick winUse direct key access for required sample fields (
choices,expected_answer).
choicesandexpected_answerare treated as required immediately below, so usesample["choices"]/sample["expected_answer"]instead of.get(...)to fail loudly on malformed records rather than normalizing to empty defaults.Proposed patch
- choices = sample.get("choices", []) - expected_answer = sample.get("expected_answer", "") + choices = sample["choices"] + expected_answer = sample["expected_answer"]As per coding guidelines, "Don't use
.getfor accessing dictionary keys if the code expects them to be present; use direct accessdata[key_name]to fail with a clear error instead of silently corrupting data."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/evaluation/evaluator/nvembed_judge.py` around lines 153 - 155, The code is using .get() with default values for the dictionary keys "choices" and "expected_answer" in the sample dictionary, which masks missing or malformed data by silently returning empty defaults. Since these fields are required and used immediately after, replace the .get() calls with direct dictionary access using sample["choices"] and sample["expected_answer"] respectively. This will cause a clear KeyError to be raised if these required keys are missing, making data quality issues explicit rather than hiding them with empty defaults.Source: Coding guidelines
tests/test_nvembed_judge.py (1)
18-40: ⚡ Quick winAdd a test for the new
empty_generationbranch.This test covers nested coercion success well, but the newly introduced soft-fail path (empty coerced generation) should also be asserted to prevent regressions (
nvembed_confidence=0.0,is_correct=False,nvembed_error="empty_generation").🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_nvembed_judge.py` around lines 18 - 40, Add a new test function that verifies the soft-fail behavior when the generation coerces to an empty string. Create a test similar to test_nvembed_coerces_nested_generation_text but with a sample where the generation field (whether nested or flat) results in empty text after coercion. Call evaluate_sample_with_nvembed with this sample and assert that the result contains nvembed_confidence equal to 0.0, is_correct equal to False, and nvembed_error equal to the string "empty_generation" to ensure the empty_generation branch is properly tested and prevent regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@nemo_skills/evaluation/evaluator/nvembed_judge.py`:
- Around line 153-155: The code is using .get() with default values for the
dictionary keys "choices" and "expected_answer" in the sample dictionary, which
masks missing or malformed data by silently returning empty defaults. Since
these fields are required and used immediately after, replace the .get() calls
with direct dictionary access using sample["choices"] and
sample["expected_answer"] respectively. This will cause a clear KeyError to be
raised if these required keys are missing, making data quality issues explicit
rather than hiding them with empty defaults.
In `@tests/test_nvembed_judge.py`:
- Around line 18-40: Add a new test function that verifies the soft-fail
behavior when the generation coerces to an empty string. Create a test similar
to test_nvembed_coerces_nested_generation_text but with a sample where the
generation field (whether nested or flat) results in empty text after coercion.
Call evaluate_sample_with_nvembed with this sample and assert that the result
contains nvembed_confidence equal to 0.0, is_correct equal to False, and
nvembed_error equal to the string "empty_generation" to ensure the
empty_generation branch is properly tested and prevent regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6b567320-845b-43cf-bd91-bc6a69ac7676
📒 Files selected for processing (2)
nemo_skills/evaluation/evaluator/nvembed_judge.pytests/test_nvembed_judge.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Migration note
Supersedes #1475. Recreated from the
NVIDIA-NeMo/Skillsbranchcodex/pr1443-audiobench-mmau-fixesso repository CI can run.Summary
Split out from #1443.
This PR contains AudioBench, MMAU-Pro, and NVEmbed robustness fixes that are independent of the AppTek benchmark.
What Changed
Nonefields.numpy<2Relationship To #1443
This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.
Validation
python -m py_compileon changed Python files.Summary by CodeRabbit