
Standardize audio manifest container paths#1489

Open
DongjiGao wants to merge 1 commit into
mainfrom
standardize-audio-manifest-paths

@DongjiGao DongjiGao commented Jun 15, 2026

Collaborator

Recreated from an in-repo branch so CI can run on it (fork PRs do not receive CI secrets). Replaces #1437.

Summary

  • Add shared helpers for writing prepared audio manifests with in-container paths rooted at /data by default.
  • Thread --audio-prefix through audio benchmark prepare scripts so manifests consistently use /data/<benchmark>/... instead of mixed /dataset or host paths.
  • Document the prepare/eval mount contract and update stale examples to use /data.

Behavior change

  • contextasr-bench --audio-prefix now means the global in-container audio root. For example, use --audio-prefix /data; the script appends contextasr-bench itself.
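
The new semantics can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper named `scoped_prefix`; the PR's actual `resolve_audio_prefix()` is not shown in this conversation and may differ in name, signature, and edge-case handling.

```python
import posixpath

BENCHMARK = "contextasr-bench"


def scoped_prefix(audio_prefix: str = "/data") -> str:
    """Append the benchmark name to a global in-container audio root.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    # Drop trailing slashes so "/data" and "/data/" give the same result;
    # fall back to "/" if the prefix was just slashes.
    root = audio_prefix.rstrip("/") or "/"
    return posixpath.join(root, BENCHMARK)


print(scoped_prefix("/data"))
```

Under this reading, callers pass only the global root and never the benchmark directory itself, which is what keeps manifests from different benchmarks consistent.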

Known existing issue / TODO

  • Real Draco prep surfaced a pre-existing numb3rs/prepare.py issue: the current nvidia/Numb3rs test split rows do not have a category key, so full real-data Numb3rs prep fails with KeyError: 'category'. This is unrelated to the audio-path change and remains tracked as a separate TODO; this PR only covers the path convention.

Test plan

  • python -m pytest tests/test_audio_path_prefix.py -q (13 passed)
  • python -m compileall -q on changed Python files
  • git diff --check
  • Real prepared-data checks on Draco for completed benchmarks: ASR leaderboard, CoVoST2, and MMAU-Pro paths resolve under /data with no stale /dataset paths.
  • Not run: ruff / pre-commit locally because neither command is installed in the active environment.
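
A check along the lines of the "no stale /dataset paths" item could look like the sketch below. It is schema-agnostic on purpose: it scans raw JSONL lines for the `/dataset/` substring rather than assuming a particular field name. The function name `find_stale_lines` and the `audio` key in the demo records are assumptions made for illustration, not names taken from the PR.

```python
import json
import tempfile
from pathlib import Path


def find_stale_lines(manifest_path, stale="/dataset/"):
    """Return 1-based line numbers of non-empty lines that contain `stale`."""
    hits = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.strip() and stale in line:
                hits.append(lineno)
    return hits


# Demo on a throwaway manifest; the "audio" field name is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "test.jsonl"
    records = [
        {"audio": "/data/covost2/a.wav"},
        {"audio": "/dataset/covost2/b.wav"},
    ]
    manifest.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")
    stale_demo = find_stale_lines(manifest)

print(stale_demo)  # the second record still uses the old prefix
```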

Made with Cursor

Summary by CodeRabbit

  • New Features

    • Added --audio-prefix across multiple dataset preparation tools to control the in-container audio root (default: /data), ensuring generated JSONL audio paths follow the selected prefix.
    • Enhanced ASR Leaderboard preparation with --data_dir to control where outputs are written.
  • Documentation

    • Updated speech evaluation documentation to standardize the in-container audio-root convention, including examples and --audio-prefix=/data.
  • Tests

    • Added automated coverage to validate audio-path prefixing across dataset modules and end-to-end JSONL outputs.

coderabbitai Bot commented Jun 15, 2026

Contributor

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7916e5b1-b420-4172-8869-4157afc6d48f

📥 Commits

Reviewing files that changed from the base of the PR and between 07fd5ec and 4bf283e.

📒 Files selected for processing (12)
  • docs/evaluation/speech-audio.md
  • nemo_skills/dataset/asr-leaderboard/prepare.py
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/contextasr-bench/prepare.py
  • nemo_skills/dataset/covost2/prepare.py
  • nemo_skills/dataset/fleurs/prepare.py
  • nemo_skills/dataset/librispeech-pc/prepare.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/dataset/musan/prepare.py
  • nemo_skills/dataset/numb3rs/prepare.py
  • nemo_skills/dataset/utils.py
  • tests/test_audio_path_prefix.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • nemo_skills/dataset/librispeech-pc/prepare.py
  • nemo_skills/dataset/contextasr-bench/prepare.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/dataset/musan/prepare.py
  • nemo_skills/dataset/audiobench/prepare.py
  • nemo_skills/dataset/numb3rs/prepare.py
  • tests/test_audio_path_prefix.py

📝 Walkthrough

Introduces a shared constant (DEFAULT_CONTAINER_AUDIO_ROOT) and two helper functions (get_container_audio_root, build_container_audio_path) in nemo_skills/dataset/utils.py, and propagates a configurable audio_root/--audio-prefix parameter through nine dataset prepare scripts (asr-leaderboard, audiobench, contextasr-bench, covost2, fleurs, librispeech-pc, mmau-pro, musan, numb3rs), replacing hardcoded /dataset/... paths with /data-rooted paths. A new test file validates path-prefix behavior across all scripts, and the documentation examples are updated accordingly.

Changes

Configurable in-container audio root for dataset prepare scripts

| Layer | File(s) | Summary |
|---|---|---|
| Shared audio-path utilities | nemo_skills/dataset/utils.py | Adds DEFAULT_CONTAINER_AUDIO_ROOT = "/data", get_container_audio_root (resolves from CLI arg, NEMO_SKILLS_AUDIO_ROOT env, or default), and build_container_audio_path (joins root, benchmark name, and path parts). |
| ASR Leaderboard audio root | nemo_skills/dataset/asr-leaderboard/prepare.py | Adds audio_root parameter to format_entry and prepare_dataset, replaces hardcoded audio path with build_container_audio_path(...), and exposes --audio-prefix and --data_dir CLI arguments. |
| Audiobench audio root | nemo_skills/dataset/audiobench/prepare.py | Adds audio_root parameter to create_manifest_entry and process_dataset, replaces hardcoded audio path construction with build_container_audio_path(...), and adds --audio-prefix CLI argument. |
| ContextASR-Bench audio root with benchmark-scoped prefix | nemo_skills/dataset/contextasr-bench/prepare.py | Introduces resolve_audio_prefix() helper to compute benchmark-specific container path prefix, updates --audio-prefix help text to document container-global semantics, and separates audio verification (host path) from JSONL writing (container path). |
| CoVoST2 audio root | nemo_skills/dataset/covost2/prepare.py | Updates get_container_audio_path to accept audio_root parameter, extends prepare_covost2 with audio_root argument, and adds --audio-prefix CLI option with root computation. |
| FLEURS audio root | nemo_skills/dataset/fleurs/prepare.py | Updates get_container_audio_path to accept audio_root, extends ASR and ST record collectors with audio_root, extends prepare_fleurs with audio_root argument, and adds --audio-prefix CLI argument. |
| LibriSpeech-PC audio root | nemo_skills/dataset/librispeech-pc/prepare.py | Extends process_split with audio_root parameter, replaces inline container path construction with build_container_audio_path(...), and adds --audio-prefix CLI argument. |
| MMAU-Pro audio root with path normalization | nemo_skills/dataset/mmau-pro/prepare.py | Adds _normalize_audio_path helper to rewrite dataset audio paths into configured container paths, extends format_entry with audio_root parameter, and adds --audio-prefix CLI argument. |
| MUSAN audio root | nemo_skills/dataset/musan/prepare.py | Updates create_manifest_entry to use build_container_audio_path(...), extends process_category_from_files and process_category with audio_root parameter, and adds --audio-prefix CLI argument. |
| Numb3rs audio root | nemo_skills/dataset/numb3rs/prepare.py | Refactors save_audio_and_format_entry and prepare_category to accept audio_root parameter, replaces hardcoded audio prefix with build_container_audio_path(...), and adds --audio-prefix CLI argument. |
| Test suite and documentation | tests/test_audio_path_prefix.py, docs/evaluation/speech-audio.md | Adds 14 test functions (unit and end-to-end with monkeypatched loaders) validating no stale /dataset/ substrings and correct /data-rooted audio paths for all datasets. Documentation gains "Audio path convention" subsection and updates all /dataset references to /data. |
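
The shared utilities described in this walkthrough could look roughly like the sketch below. The three names come from the PR summary, but the signatures, the precedence order (CLI value, then `NEMO_SKILLS_AUDIO_ROOT`, then the default), and the trailing-slash handling are assumptions; the actual code in nemo_skills/dataset/utils.py is not reproduced in this conversation.

```python
import os
import posixpath

DEFAULT_CONTAINER_AUDIO_ROOT = "/data"


def get_container_audio_root(cli_value=None, env=None):
    """Resolve the audio root: CLI value, then env var, then default (assumed order)."""
    env = os.environ if env is None else env
    return cli_value or env.get("NEMO_SKILLS_AUDIO_ROOT") or DEFAULT_CONTAINER_AUDIO_ROOT


def build_container_audio_path(audio_root, benchmark, *parts):
    """Join root, benchmark name, and path parts with POSIX separators."""
    # Container paths are POSIX paths regardless of the host OS, hence posixpath.
    return posixpath.join(audio_root.rstrip("/") or "/", benchmark, *parts)


root = get_container_audio_root(cli_value=None, env={})
print(build_container_audio_path(root, "fleurs", "en_us", "sample.wav"))
```

Passing `env` explicitly is a testing convenience in this sketch; it lets the resolution order be exercised without touching the real process environment.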

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.31% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Standardize audio manifest container paths' clearly summarizes the main change: standardizing how audio manifests use container paths across multiple dataset preparation scripts. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@DongjiGao DongjiGao closed this Jun 15, 2026
@DongjiGao DongjiGao reopened this Jun 15, 2026
@DongjiGao DongjiGao force-pushed the standardize-audio-manifest-paths branch from 967f6ba to 07fd5ec Compare June 15, 2026 19:05
Add shared helpers (build_container_audio_path, get_container_audio_root) and thread --audio-prefix through the audio benchmark prepare scripts so prepared JSONL manifests use in-container paths rooted at /data (overridable via --audio-prefix or NEMO_SKILLS_AUDIO_ROOT) instead of hardcoded /dataset paths.

Document the prepare/eval mount contract and add tests covering the path helpers and per-benchmark manifest paths.

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
@DongjiGao DongjiGao force-pushed the standardize-audio-manifest-paths branch from 07fd5ec to 4bf283e Compare June 15, 2026 20:02

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
nemo_skills/dataset/fleurs/prepare.py (1)

224-231: ⚡ Quick win

Cache source-locale rows in ST collection to avoid repeated full loads.

Line 230 reloads and re-parses the entire source split for every (src_locale, tgt_locale) pair. Caching source rows once per locale will significantly reduce prep time and redundant I/O.

Suggested refactor
 def _collect_st_records(...):
     pairs = build_translation_pairs(languages)
     target_cache: dict[str, dict[int, dict]] = {}
+    source_cache: dict[str, list[dict]] = {}

     def get_target_index(locale: str) -> dict[int, dict]:
         if locale not in target_cache:
             target_cache[locale] = index_by_id(load_fleurs(locale, split, local_dir=local_dir))
         return target_cache[locale]
+
+    def get_source_rows(locale: str) -> list[dict]:
+        if locale not in source_cache:
+            source_cache[locale] = load_fleurs(locale, split, local_dir=local_dir)
+        return source_cache[locale]

     records: list[dict] = []
     for src_locale, tgt_locale in pairs:
         ...
-        for source_row in tqdm(load_fleurs(src_locale, split, local_dir=local_dir), desc=tag):
+        for source_row in tqdm(get_source_rows(src_locale), desc=tag):
             ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/fleurs/prepare.py` around lines 224 - 231, The current
code calls load_fleurs for every (src_locale, tgt_locale) pair in the pairs
loop, which means the same source locale gets reloaded and re-parsed multiple
times if it appears in multiple pairs. To fix this, create a cache dictionary
before the pairs loop to store the loaded source rows indexed by src_locale,
then check this cache inside the pairs loop before calling load_fleurs. If the
src_locale is already in the cache, use the cached rows; otherwise, load it once
and store it in the cache for subsequent pairs that use the same source locale.
This will eliminate the redundant I/O and parsing operations.
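
The effect of the suggested cache can be seen in a self-contained sketch. `load_rows` below is a stand-in that only counts calls; it is not the real `load_fleurs`, and the locale pairs are made up for illustration.

```python
from collections import Counter

load_calls = Counter()


def load_rows(locale):
    """Stand-in for an expensive loader; records how often each locale is loaded."""
    load_calls[locale] += 1
    return [{"id": i, "locale": locale} for i in range(3)]


source_cache = {}


def get_source_rows(locale):
    """Load a locale's rows on first use and reuse them afterwards."""
    if locale not in source_cache:
        source_cache[locale] = load_rows(locale)
    return source_cache[locale]


# "en_us" is the source of two pairs but should be loaded only once.
pairs = [("en_us", "de_de"), ("en_us", "fr_fr"), ("de_de", "en_us")]
for src_locale, _tgt_locale in pairs:
    for _row in get_source_rows(src_locale):
        pass

print(dict(load_calls))
```

The trade-off is memory: cached rows stay resident for the whole run. If the real loader returned a one-shot iterator rather than a list, it would also need to be materialized (for example with `list(...)`) before caching.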
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/evaluation/speech-audio.md`:
- Line 776: The fenced code blocks at line 776 and line 814 in
docs/evaluation/speech-audio.md are missing language identifiers, which violates
markdownlint rule MD040. Add the language specifier "text" to each opening fence
marker by changing the bare ``` to ```text at both locations to satisfy the
linting requirement.
- Around line 672-676: The `--audio-prefix` example in the custom audio path
prefix documentation is inconsistent with the convention documented earlier in
the file. According to the established convention, `--audio-prefix` should be
the global root path (e.g., `/data`), not the full dataset-specific path. Update
the example command to pass `--audio-prefix /data` instead of `--audio-prefix
/data/contextual-earnings22` to align with the documented convention.

In `@nemo_skills/dataset/fleurs/prepare.py`:
- Around line 231-233: The code silently drops ST samples when target rows are
missing by using target_by_id.get(source_row["id"]) followed by a continue
statement, which can hide data mismatches. Replace the .get() call with direct
dictionary access using target_by_id[source_row["id"]] and remove the if
target_row is None check and continue statement. This will cause a KeyError to
be raised immediately if an expected target row is missing, failing fast with a
clear error instead of silently corrupting the manifest by dropping samples.

@coderabbitai coderabbitai Bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

🛑 Comments failed to post (3)
docs/evaluation/speech-audio.md (2)

672-676: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix inconsistent --audio-prefix example for Contextual Earnings-22.

Line 675 contradicts the convention documented in lines 48-51. If --audio-prefix is the global root, this example should pass /data, not /data/contextual-earnings22.

Suggested doc fix
-ns prepare_data contextual-earnings22 --data_dir=/path/to/contextual-earnings22 --audio-prefix /data/contextual-earnings22
+ns prepare_data contextual-earnings22 --data_dir=/path/to/contextual-earnings22 --audio-prefix /data
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/evaluation/speech-audio.md` around lines 672 - 676, The `--audio-prefix`
example in the custom audio path prefix documentation is inconsistent with the
convention documented earlier in the file. According to the established
convention, `--audio-prefix` should be the global root path (e.g., `/data`), not
the full dataset-specific path. Update the example command to pass
`--audio-prefix /data` instead of `--audio-prefix /data/contextual-earnings22`
to align with the documented convention.

776-776: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks to satisfy markdownlint.

Line 776 and Line 814 open fenced blocks without a language specifier (MD040).

Suggested doc fix
-```
+```text
 <data_dir>/covost2/
     asr/test.jsonl   # one record per (lang, audio) for transcription
     st/test.jsonl    # one record per (src→tgt, audio) for translation
     audio/<lang>/<split>/...wav

-```
+```text
 <data_dir>/fleurs/
     asr/test.jsonl   # one record per (locale, audio) for transcription
     st/test.jsonl    # one record per (src→tgt, audio) for translation
     audio//...wav

Also applies to: 814-814

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 776-776: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/evaluation/speech-audio.md` at line 776, The fenced code blocks at line
776 and line 814 in docs/evaluation/speech-audio.md are missing language
identifiers, which violates markdownlint rule MD040. Add the language specifier
"text" to each opening fence marker by changing the bare ``` to ```text at both
locations to satisfy the linting requirement.

Source: Linters/SAST tools

nemo_skills/dataset/fleurs/prepare.py (1)

231-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently drop ST samples on missing target rows.

Line 231 uses .get() and then continue, which can hide data mismatches and produce incomplete manifests without failing fast.

Suggested fix
-            target_row = target_by_id.get(source_row["id"])
-            if target_row is None:
-                continue
+            target_row = target_by_id[source_row["id"]]

As per coding guidelines, “Don't use .get for accessing dictionary keys if the code expects them to be present… let the code fail with a clear error instead of silently corrupting data,” and “Don't catch exceptions when we don't expect them to be normally raised.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            target_row = target_by_id[source_row["id"]]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/fleurs/prepare.py` around lines 231 - 233, The code
silently drops ST samples when target rows are missing by using
target_by_id.get(source_row["id"]) followed by a continue statement, which can
hide data mismatches. Replace the .get() call with direct dictionary access
using target_by_id[source_row["id"]] and remove the if target_row is None check
and continue statement. This will cause a KeyError to be raised immediately if
an expected target row is missing, failing fast with a clear error instead of
silently corrupting the manifest by dropping samples.

Source: Coding guidelines
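
The difference between the two access styles is easy to demonstrate on toy data. The dictionaries below are invented for illustration and do not reflect the real FLEURS row layout.

```python
target_by_id = {1: {"text": "eins"}, 2: {"text": "zwei"}}
source_rows = [{"id": 1}, {"id": 2}, {"id": 3}]  # id 3 has no target row

# Lenient access: the row with id 3 disappears without any signal.
lenient = [row["id"] for row in source_rows if target_by_id.get(row["id"]) is not None]

# Strict access: the first missing id raises KeyError and names the culprit.
missing_id = None
try:
    strict = [target_by_id[row["id"]] for row in source_rows]
except KeyError as err:
    missing_id = err.args[0]

print(lenient, missing_id)
```

The `try`/`except` here exists only to make the demo runnable end to end; the review's point is that production code should let the `KeyError` propagate.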
