Standardize audio manifest container paths#1489
Conversation
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (12)
🚧 Files skipped from review as they are similar to previous changes (7)
📝 WalkthroughWalkthroughIntroduces three shared helpers ( ChangesConfigurable in-container audio root for dataset prepare scripts
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
967f6ba to
07fd5ec
Compare
Add shared helpers (build_container_audio_path, get_container_audio_root) and thread --audio-prefix through the audio benchmark prepare scripts so prepared JSONL manifests use in-container paths rooted at /data (overridable via --audio-prefix or NEMO_SKILLS_AUDIO_ROOT) instead of hardcoded /dataset paths. Document the prepare/eval mount contract and add tests covering the path helpers and per-benchmark manifest paths. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
07fd5ec to
4bf283e
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
nemo_skills/dataset/fleurs/prepare.py (1)
224-231: ⚡ Quick winCache source-locale rows in ST collection to avoid repeated full loads.
Line 230 reloads and re-parses the entire source split for every
(src_locale, tgt_locale)pair. Caching source rows once per locale will significantly reduce prep time and redundant I/O.Suggested refactor
def _collect_st_records(...): pairs = build_translation_pairs(languages) target_cache: dict[str, dict[int, dict]] = {} + source_cache: dict[str, list[dict]] = {} def get_target_index(locale: str) -> dict[int, dict]: if locale not in target_cache: target_cache[locale] = index_by_id(load_fleurs(locale, split, local_dir=local_dir)) return target_cache[locale] + + def get_source_rows(locale: str) -> list[dict]: + if locale not in source_cache: + source_cache[locale] = load_fleurs(locale, split, local_dir=local_dir) + return source_cache[locale] records: list[dict] = [] for src_locale, tgt_locale in pairs: ... - for source_row in tqdm(load_fleurs(src_locale, split, local_dir=local_dir), desc=tag): + for source_row in tqdm(get_source_rows(src_locale), desc=tag): ...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/fleurs/prepare.py` around lines 224 - 231, The current code calls load_fleurs for every (src_locale, tgt_locale) pair in the pairs loop, which means the same source locale gets reloaded and re-parsed multiple times if it appears in multiple pairs. To fix this, create a cache dictionary before the pairs loop to store the loaded source rows indexed by src_locale, then check this cache inside the pairs loop before calling load_fleurs. If the src_locale is already in the cache, use the cached rows; otherwise, load it once and store it in the cache for subsequent pairs that use the same source locale. This will eliminate the redundant I/O and parsing operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/evaluation/speech-audio.md`:
- Line 776: The fenced code blocks at line 776 and line 814 in
docs/evaluation/speech-audio.md are missing language identifiers, which violates
markdownlint rule MD040. Add the language specifier "text" to each opening fence
marker by changing the bare ``` to ```text at both locations to satisfy the
linting requirement.
- Around line 672-676: The `--audio-prefix` example in the custom audio path
prefix documentation is inconsistent with the convention documented earlier in
the file. According to the established convention, `--audio-prefix` should be
the global root path (e.g., `/data`), not the full dataset-specific path. Update
the example command to pass `--audio-prefix /data` instead of `--audio-prefix
/data/contextual-earnings22` to align with the documented convention.
In `@nemo_skills/dataset/fleurs/prepare.py`:
- Around line 231-233: The code silently drops ST samples when target rows are
missing by using target_by_id.get(source_row["id"]) followed by a continue
statement, which can hide data mismatches. Replace the .get() call with direct
dictionary access using target_by_id[source_row["id"]] and remove the if
target_row is None check and continue statement. This will cause a KeyError to
be raised immediately if an expected target row is missing, failing fast with a
clear error instead of silently corrupting the manifest by dropping samples.
---
Nitpick comments:
In `@nemo_skills/dataset/fleurs/prepare.py`:
- Around line 224-231: The current code calls load_fleurs for every (src_locale,
tgt_locale) pair in the pairs loop, which means the same source locale gets
reloaded and re-parsed multiple times if it appears in multiple pairs. To fix
this, create a cache dictionary before the pairs loop to store the loaded source
rows indexed by src_locale, then check this cache inside the pairs loop before
calling load_fleurs. If the src_locale is already in the cache, use the cached
rows; otherwise, load it once and store it in the cache for subsequent pairs
that use the same source locale. This will eliminate the redundant I/O and
parsing operations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7916e5b1-b420-4172-8869-4157afc6d48f
📒 Files selected for processing (12)
docs/evaluation/speech-audio.mdnemo_skills/dataset/asr-leaderboard/prepare.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/contextasr-bench/prepare.pynemo_skills/dataset/covost2/prepare.pynemo_skills/dataset/fleurs/prepare.pynemo_skills/dataset/librispeech-pc/prepare.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/dataset/musan/prepare.pynemo_skills/dataset/numb3rs/prepare.pynemo_skills/dataset/utils.pytests/test_audio_path_prefix.py
🚧 Files skipped from review as they are similar to previous changes (7)
- nemo_skills/dataset/librispeech-pc/prepare.py
- nemo_skills/dataset/contextasr-bench/prepare.py
- nemo_skills/dataset/utils.py
- nemo_skills/dataset/musan/prepare.py
- nemo_skills/dataset/audiobench/prepare.py
- nemo_skills/dataset/numb3rs/prepare.py
- tests/test_audio_path_prefix.py
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 3
🧹 Nitpick comments (1)
nemo_skills/dataset/fleurs/prepare.py (1)
224-231: ⚡ Quick winCache source-locale rows in ST collection to avoid repeated full loads.
Line 230 reloads and re-parses the entire source split for every
(src_locale, tgt_locale)pair. Caching source rows once per locale will significantly reduce prep time and redundant I/O.Suggested refactor
def _collect_st_records(...): pairs = build_translation_pairs(languages) target_cache: dict[str, dict[int, dict]] = {} + source_cache: dict[str, list[dict]] = {} def get_target_index(locale: str) -> dict[int, dict]: if locale not in target_cache: target_cache[locale] = index_by_id(load_fleurs(locale, split, local_dir=local_dir)) return target_cache[locale] + + def get_source_rows(locale: str) -> list[dict]: + if locale not in source_cache: + source_cache[locale] = load_fleurs(locale, split, local_dir=local_dir) + return source_cache[locale] records: list[dict] = [] for src_locale, tgt_locale in pairs: ... - for source_row in tqdm(load_fleurs(src_locale, split, local_dir=local_dir), desc=tag): + for source_row in tqdm(get_source_rows(src_locale), desc=tag): ...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/fleurs/prepare.py` around lines 224 - 231, The current code calls load_fleurs for every (src_locale, tgt_locale) pair in the pairs loop, which means the same source locale gets reloaded and re-parsed multiple times if it appears in multiple pairs. To fix this, create a cache dictionary before the pairs loop to store the loaded source rows indexed by src_locale, then check this cache inside the pairs loop before calling load_fleurs. If the src_locale is already in the cache, use the cached rows; otherwise, load it once and store it in the cache for subsequent pairs that use the same source locale. This will eliminate the redundant I/O and parsing operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/evaluation/speech-audio.md`:
- Line 776: The fenced code blocks at line 776 and line 814 in
docs/evaluation/speech-audio.md are missing language identifiers, which violates
markdownlint rule MD040. Add the language specifier "text" to each opening fence
marker by changing the bare ``` to ```text at both locations to satisfy the
linting requirement.
- Around line 672-676: The `--audio-prefix` example in the custom audio path
prefix documentation is inconsistent with the convention documented earlier in
the file. According to the established convention, `--audio-prefix` should be
the global root path (e.g., `/data`), not the full dataset-specific path. Update
the example command to pass `--audio-prefix /data` instead of `--audio-prefix
/data/contextual-earnings22` to align with the documented convention.
In `@nemo_skills/dataset/fleurs/prepare.py`:
- Around line 231-233: The code silently drops ST samples when target rows are
missing by using target_by_id.get(source_row["id"]) followed by a continue
statement, which can hide data mismatches. Replace the .get() call with direct
dictionary access using target_by_id[source_row["id"]] and remove the if
target_row is None check and continue statement. This will cause a KeyError to
be raised immediately if an expected target row is missing, failing fast with a
clear error instead of silently corrupting the manifest by dropping samples.
---
Nitpick comments:
In `@nemo_skills/dataset/fleurs/prepare.py`:
- Around line 224-231: The current code calls load_fleurs for every (src_locale,
tgt_locale) pair in the pairs loop, which means the same source locale gets
reloaded and re-parsed multiple times if it appears in multiple pairs. To fix
this, create a cache dictionary before the pairs loop to store the loaded source
rows indexed by src_locale, then check this cache inside the pairs loop before
calling load_fleurs. If the src_locale is already in the cache, use the cached
rows; otherwise, load it once and store it in the cache for subsequent pairs
that use the same source locale. This will eliminate the redundant I/O and
parsing operations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7916e5b1-b420-4172-8869-4157afc6d48f
📒 Files selected for processing (12)
docs/evaluation/speech-audio.mdnemo_skills/dataset/asr-leaderboard/prepare.pynemo_skills/dataset/audiobench/prepare.pynemo_skills/dataset/contextasr-bench/prepare.pynemo_skills/dataset/covost2/prepare.pynemo_skills/dataset/fleurs/prepare.pynemo_skills/dataset/librispeech-pc/prepare.pynemo_skills/dataset/mmau-pro/prepare.pynemo_skills/dataset/musan/prepare.pynemo_skills/dataset/numb3rs/prepare.pynemo_skills/dataset/utils.pytests/test_audio_path_prefix.py
🚧 Files skipped from review as they are similar to previous changes (7)
- nemo_skills/dataset/librispeech-pc/prepare.py
- nemo_skills/dataset/contextasr-bench/prepare.py
- nemo_skills/dataset/utils.py
- nemo_skills/dataset/musan/prepare.py
- nemo_skills/dataset/audiobench/prepare.py
- nemo_skills/dataset/numb3rs/prepare.py
- tests/test_audio_path_prefix.py
🛑 Comments failed to post (3)
docs/evaluation/speech-audio.md (2)
672-676:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix inconsistent
--audio-prefixexample for Contextual Earnings-22.Line 675 contradicts the convention documented in Line 48-Line 51. If
--audio-prefixis global root, this example should pass/data, not/data/contextual-earnings22.Suggested doc fix
-ns prepare_data contextual-earnings22 --data_dir=/path/to/contextual-earnings22 --audio-prefix /data/contextual-earnings22 +ns prepare_data contextual-earnings22 --data_dir=/path/to/contextual-earnings22 --audio-prefix /data🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/evaluation/speech-audio.md` around lines 672 - 676, The `--audio-prefix` example in the custom audio path prefix documentation is inconsistent with the convention documented earlier in the file. According to the established convention, `--audio-prefix` should be the global root path (e.g., `/data`), not the full dataset-specific path. Update the example command to pass `--audio-prefix /data` instead of `--audio-prefix /data/contextual-earnings22` to align with the documented convention.
776-776:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd language identifiers to fenced code blocks to satisfy markdownlint.
Line 776 and Line 814 open fenced blocks without a language specifier (MD040).
Suggested doc fix
-``` +```text <data_dir>/covost2/ asr/test.jsonl # one record per (lang, audio) for transcription st/test.jsonl # one record per (src→tgt, audio) for translation audio/<lang>/<split>/...wav-
+text
<data_dir>/fleurs/
asr/test.jsonl # one record per (locale, audio) for transcription
st/test.jsonl # one record per (src→tgt, audio) for translation
audio//...wavAlso applies to: 814-814
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 776-776: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/evaluation/speech-audio.md` at line 776, The fenced code blocks at line 776 and line 814 in docs/evaluation/speech-audio.md are missing language identifiers, which violates markdownlint rule MD040. Add the language specifier "text" to each opening fence marker by changing the bare ``` to ```text at both locations to satisfy the linting requirement.Source: Linters/SAST tools
nemo_skills/dataset/fleurs/prepare.py (1)
231-233:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t silently drop ST samples on missing target rows.
Line 231 uses
.get()and thencontinue, which can hide data mismatches and produce incomplete manifests without failing fast.Suggested fix
- target_row = target_by_id.get(source_row["id"]) - if target_row is None: - continue + target_row = target_by_id[source_row["id"]]As per coding guidelines, “Don't use
.getfor accessing dictionary keys if the code expects them to be present… let the code fail with a clear error instead of silently corrupting data,” and “Don't catch exceptions when we don't expect them to be normally raised.”📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.target_row = target_by_id[source_row["id"]]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/fleurs/prepare.py` around lines 231 - 233, The code silently drops ST samples when target rows are missing by using target_by_id.get(source_row["id"]) followed by a continue statement, which can hide data mismatches. Replace the .get() call with direct dictionary access using target_by_id[source_row["id"]] and remove the if target_row is None check and continue statement. This will cause a KeyError to be raised immediately if an expected target row is missing, failing fast with a clear error instead of silently corrupting the manifest by dropping samples.Source: Coding guidelines
Summary
/databy default.--audio-prefixthrough audio benchmark prepare scripts so manifests consistently use/data/<benchmark>/...instead of mixed/datasetor host paths./data.Behavior change
contextasr-bench --audio-prefixnow means the global in-container audio root. For example, use--audio-prefix /data; the script appendscontextasr-benchitself.Known existing issue / TODO
numb3rs/prepare.pyissue: the currentnvidia/Numb3rstest split rows do not have acategorykey, so full real-data Numb3rs prep fails withKeyError: 'category'. This is unrelated to the audio-path change and remains tracked as a separate TODO; this PR only covers the path convention.Test plan
python -m pytest tests/test_audio_path_prefix.py -q(13 passed)python -m compileall -qon changed Python filesgit diff --check/datawith no stale/datasetpaths.ruff/pre-commitlocally because neither command is installed in the active environment.Made with Cursor
Summary by CodeRabbit
New Features
--audio-prefixacross multiple dataset preparation tools to control the in-container audio root (default:/data), ensuring generated JSONL audio paths follow the selected prefix.--data_dirto control where outputs are written.Documentation
--audio-prefix=/data.Tests