Add LibriSpeechMix SOT benchmark by vmendelev · Pull Request #1412 · NVIDIA-NeMo/Skills

vmendelev · 2026-05-01T17:16:09Z

Summary

Adds a LibriSpeechMix SOT benchmark group for evaluating multispeaker ASR models with serialized speaker tags:

librispeechmix-sot.{under20s,over20s}-{dev-clean,test-clean}-{1mix,2mix,3mix}
dataset preparation from the official LibriSpeechMix JSONL lists plus LibriSpeech/OpenSLR audio
explicit duration split at 20.0 seconds (under20s <= 20.0, over20s > 20.0)
reference SOT text with [s0], [s1], etc., RTTM supervision, and SegLST sidecars
corpus-level no-PnC cpWER evaluator/metrics with speaker-permutation matching and raw S/I/D counts
optional MeetEval sidecars when meeteval-wer is installed
RTMT-ASR, Multitalker Parakeet, and SALM/Canary-Qwen generation wiring through NeMo-Skills/serve_unified
persistent in-process multitalker_parakeet unified backend that loads Multitalker Parakeet ASR and Sortformer diarization once

The Multitalker baseline defaults to:

ASR: nvidia/multitalker-parakeet-streaming-0.6b-v1
diarization: nvidia/diar_streaming_sortformer_4spk-v2.1

Testing

Rebased on current github/main (7dcb0a67).

Local focused tests:

python -m pytest tests/test_librispeechmix_sot.py tests/test_vllm_audio.py tests/test_unified_server_audio_parser.py tests/test_configs.py

Result: 37 passed in 18.54s.

Compile checks:

python -m py_compile \
  nemo_skills/dataset/librispeechmix-sot/prepare.py \
  nemo_skills/dataset/librispeechmix-sot/librispeechmix_sot_score.py \
  nemo_skills/evaluation/evaluator/librispeechmix_sot.py \
  nemo_skills/evaluation/metrics/librispeechmix_sot_utils.py \
  nemo_skills/evaluation/metrics/librispeechmix_sot_metrics.py \
  nemo_skills/inference/librispeechmix_rtmtasr.py \
  nemo_skills/inference/librispeechmix_multitalker_parakeet.py \
  recipes/multimodal/server/backends/salm_backend.py \
  recipes/multimodal/server/backends/multitalker_parakeet_backend.py

Result: passed.

Also ran:

git diff --check: passed
prepare/evaluator smoke on a tiny LibriSpeechMix sample: passed
persistent Multitalker Parakeet tiny DRACO smoke: passed, generated [s0] SOT output and debug_info.persistent_inprocess: true

Cluster Results

Full under-20s LibriSpeechMix test-clean runs completed on DRACO OCI via Symphony job:

job id: issue81-full-multitalker-20260501-000558
status: succeeded, exit code 0
full report: Symphony issue 81 artifact .symphony/artifacts/issue81-final-report.md
job log: Symphony issue 81 artifact .symphony/artifacts/jobs/issue81-full-multitalker-20260501-000558/job.log

Each run used --num-chunks 16, --server-batch-size 2, --mt-batch-size 2, and --max-concurrent-requests 2.

Remote output roots:

DRACO Lustre issue-specific baseline root: issue81_librispeechmix_sot/baselines/multitalker-parakeet/
split subdirectories: under20s-test-clean-1mix-full, under20s-test-clean-2mix-full, under20s-test-clean-3mix-full

Local copied metrics:

.symphony/artifacts/issue81-baselines/multitalker-parakeet/under20s-test-clean-1mix-full/metrics.json
.symphony/artifacts/issue81-baselines/multitalker-parakeet/under20s-test-clean-2mix-full/metrics.json
.symphony/artifacts/issue81-baselines/multitalker-parakeet/under20s-test-clean-3mix-full/metrics.json

Multitalker Parakeet cpWER:

Split	Rows	cpWER	Errors	S	I	D	Ref words
under20s-test-clean-1mix	2620	3.62	1921	1007	220	694	53120
under20s-test-clean-2mix	2620	8.19	8705	2938	901	4866	106240
under20s-test-clean-3mix	2620	15.71	25039	6392	2766	15881	159360
aggregate	7860	11.19	35665	10337	3887	21441	318720

Notes

meeteval-wer was not installed in the local scoring venv, so optional MeetEval average/per-record JSON was skipped. The scorer still wrote SegLST sidecars and NeMo cpWER metrics.
RTMT-ASR En Canary-v2 and Canary-Qwen paths are wired but were not run as full baselines in this PR.
Over-20s manifests are implemented/preparable, but no over-20s model inference was run.

coderabbitai · 2026-05-01T17:29:15Z

📝 Walkthrough

Walkthrough

Adds LibriSpeechMix SOT (speaker-of-the-talker), a multispeaker ASR benchmark with evaluation framework. Includes dataset preparation, cpWER metrics/utilities, evaluator implementation, inference wrappers for RTMT-ASR and Multitalker Parakeet models, unified server backends, and comprehensive tests.

Changes

Cohort / File(s)	Summary
Documentation `docs/evaluation/speech-audio.md`	Introduces LibriSpeechMix SOT benchmark specification, including supported IDs, duration splits, preparation commands, scoring metrics (corpus-level cpWER with speaker permutation matching), reference/hypothesis selection logic, SegLST artifact generation, and MeetEval integration.
Dataset Group Configuration `nemo_skills/dataset/librispeechmix-sot/__init__.py`	Declares benchmark group with scoring module, constructs `BENCHMARKS` mapping from duration/LSM splits and mixture counts.
Dataset Scoring Module `nemo_skills/dataset/librispeechmix-sot/librispeechmix_sot_score.py`	Implements score aggregation: partitions benchmarks into named groups (`all`, `under20s`, `over20s`, `test_clean`, `dev_clean`), aggregates cpWER metrics per group and eval mode, computes corpus-level cpWER from raw error counts.
Dataset Preparation `nemo_skills/dataset/librispeechmix-sot/prepare.py`	Resolves LibriSpeech/LibriSpeechMix data, buckets examples into under/over 20s duration splits, generates SOT text with stable speaker tags, creates RTTM and SegLST sidecar files, optionally materializes mixed audio with per-speaker delays.
Dataset Variant Configurations `nemo_skills/dataset/librispeechmix-sot/{under,over}20s-{dev,test}-clean-{1,2,3}mix/__init__.py`	Twelve configuration modules (12 variants) each specifying metrics type, evaluation/generation arguments (OpenAI prompt format, audio enabled), inference module path, and evaluation split.
Evaluation Metrics `nemo_skills/evaluation/metrics/librispeechmix_sot_metrics.py`, `librispeechmix_sot_utils.py`	Implements `LibriSpeechMixSOTMetrics` for pass@k aggregation and cpWER calculation; utilities provide speaker-stream parsing, edit-distance computation, minimum-permutation cpWER, SegLST conversion, and MeetEval artifact writing.
Evaluator Implementation `nemo_skills/evaluation/evaluator/librispeechmix_sot.py`	Defines evaluator config/class: selects hypothesis from configurable prediction keys, computes cpWER and error breakdowns per sample, augments JSONL with results, optionally emits MeetEval artifacts.
Evaluator Registry `nemo_skills/evaluation/evaluator/__init__.py`	Registers `librispeechmix_sot` evaluator in lazy-import registry.
Metrics Registry `nemo_skills/evaluation/metrics/map_metrics.py`	Adds `librispeechmix_sot` entry to `METRICS_MAP`.
Inference Wrappers `nemo_skills/inference/librispeechmix_multitalker_parakeet.py`, `librispeechmix_rtmtasr.py`	RTMT-ASR wrapper: executes NeMo transcribe script, normalizes JSONL output. Multitalker Parakeet wrapper: runs streaming pipeline, normalizes segments into speaker-tagged text per session.
Unified Server Backends `recipes/multimodal/server/backends/multitalker_parakeet_backend.py`, `salm_backend.py`, `__init__.py`	Multitalker Parakeet: in-process streaming diarization/ASR with persistent pipeline and segment aggregation. SALM: loads pretrained model, batches audio with per-request prompts, handles warmup and temporary file cleanup. Backend registry extended for both.
Tests `tests/test_librispeechmix_sot.py`, `tests/test_datasets.py`	Comprehensive test suite: dataset registration, duration bucketing, RTTM/speaker-tag/SOT-text generation, preparation manifest writing, metric aggregation, evaluator hypothesis selection, inference CLI command generation, backend in-process execution, segment aggregation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.48% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding a new LibriSpeechMix SOT benchmark to the codebase. It directly reflects the primary objective and is specific enough to be meaningful.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch vmendelev/issue81-librispeechmix-sot

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai · 2026-05-01T17:29:27Z

📝 Walkthrough

Walkthrough

Adds LibriSpeechMix SOT (speaker-of-the-talker), a multispeaker ASR benchmark with evaluation framework. Includes dataset preparation, cpWER metrics/utilities, evaluator implementation, inference wrappers for RTMT-ASR and Multitalker Parakeet models, unified server backends, and comprehensive tests.

Changes

Cohort / File(s)	Summary
Documentation `docs/evaluation/speech-audio.md`	Introduces LibriSpeechMix SOT benchmark specification, including supported IDs, duration splits, preparation commands, scoring metrics (corpus-level cpWER with speaker permutation matching), reference/hypothesis selection logic, SegLST artifact generation, and MeetEval integration.
Dataset Group Configuration `nemo_skills/dataset/librispeechmix-sot/__init__.py`	Declares benchmark group with scoring module, constructs `BENCHMARKS` mapping from duration/LSM splits and mixture counts.
Dataset Scoring Module `nemo_skills/dataset/librispeechmix-sot/librispeechmix_sot_score.py`	Implements score aggregation: partitions benchmarks into named groups (`all`, `under20s`, `over20s`, `test_clean`, `dev_clean`), aggregates cpWER metrics per group and eval mode, computes corpus-level cpWER from raw error counts.
Dataset Preparation `nemo_skills/dataset/librispeechmix-sot/prepare.py`	Resolves LibriSpeech/LibriSpeechMix data, buckets examples into under/over 20s duration splits, generates SOT text with stable speaker tags, creates RTTM and SegLST sidecar files, optionally materializes mixed audio with per-speaker delays.
Dataset Variant Configurations `nemo_skills/dataset/librispeechmix-sot/{under,over}20s-{dev,test}-clean-{1,2,3}mix/__init__.py`	Twelve configuration modules (12 variants) each specifying metrics type, evaluation/generation arguments (OpenAI prompt format, audio enabled), inference module path, and evaluation split.
Evaluation Metrics `nemo_skills/evaluation/metrics/librispeechmix_sot_metrics.py`, `librispeechmix_sot_utils.py`	Implements `LibriSpeechMixSOTMetrics` for pass@k aggregation and cpWER calculation; utilities provide speaker-stream parsing, edit-distance computation, minimum-permutation cpWER, SegLST conversion, and MeetEval artifact writing.
Evaluator Implementation `nemo_skills/evaluation/evaluator/librispeechmix_sot.py`	Defines evaluator config/class: selects hypothesis from configurable prediction keys, computes cpWER and error breakdowns per sample, augments JSONL with results, optionally emits MeetEval artifacts.
Evaluator Registry `nemo_skills/evaluation/evaluator/__init__.py`	Registers `librispeechmix_sot` evaluator in lazy-import registry.
Metrics Registry `nemo_skills/evaluation/metrics/map_metrics.py`	Adds `librispeechmix_sot` entry to `METRICS_MAP`.
Inference Wrappers `nemo_skills/inference/librispeechmix_multitalker_parakeet.py`, `librispeechmix_rtmtasr.py`	RTMT-ASR wrapper: executes NeMo transcribe script, normalizes JSONL output. Multitalker Parakeet wrapper: runs streaming pipeline, normalizes segments into speaker-tagged text per session.
Unified Server Backends `recipes/multimodal/server/backends/multitalker_parakeet_backend.py`, `salm_backend.py`, `__init__.py`	Multitalker Parakeet: in-process streaming diarization/ASR with persistent pipeline and segment aggregation. SALM: loads pretrained model, batches audio with per-request prompts, handles warmup and temporary file cleanup. Backend registry extended for both.
Tests `tests/test_librispeechmix_sot.py`, `tests/test_datasets.py`	Comprehensive test suite: dataset registration, duration bucketing, RTTM/speaker-tag/SOT-text generation, preparation manifest writing, metric aggregation, evaluator hypothesis selection, inference CLI command generation, backend in-process execution, segment aggregation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.48% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding a new LibriSpeechMix SOT benchmark to the codebase. It directly reflects the primary objective and is specific enough to be meaningful.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch vmendelev/issue81-librispeechmix-sot

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (4)

nemo_skills/evaluation/evaluator/__init__.py (1)
56-56: Consider adding this benchmark to Slurm CI coverage.

Given the new evaluator + cpWER path, cluster test coverage would reduce regression risk (especially across all split/mix variants).

Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/__init__.py` at line 56, Add the new
librispeechmix_sot benchmark to Slurm CI so the LibrispeechMixSOTEvaluator and
its cpWER evaluation path are exercised: update the CI test matrix/job
configuration to include the "librispeechmix_sot" benchmark identifier and
ensure the job runs the same split/mix variants used by
LibrispeechMixSOTEvaluator (and invokes the cpWER path), so cluster tests cover
this evaluator and prevent regressions.
tests/test_datasets.py (1)
61-61: ⚡ Quick win

Please add/update benchmark docs for the new dataset entry.

Now that librispeechmix-sot is registered, add corresponding docs with run/eval example commands and expected model results to keep benchmark usage discoverable.

Based on learnings: "When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_datasets.py` at line 61, Add documentation for the newly
registered dataset "librispeechmix-sot": update the benchmarks/docs page that
lists datasets and include a short section for librispeechmix-sot containing (1)
example CLI commands to run training/evaluation (match the project's existing
command style used for other datasets), (2) example eval flags and dataset
splits (e.g., "test"), and (3) expected model results (report metrics and the
tested model names used to produce them) so users can reproduce and compare;
ensure the section follows the same format/structure as other dataset entries in
the benchmarks docs.
nemo_skills/dataset/librispeechmix-sot/__init__.py (1)
7-20: Consider adding Slurm coverage for the new benchmark matrix.

This group now expands to 12 benchmark slices, so a representative Slurm smoke for at least one under-20s and one over-20s slice would help keep the new eval/metrics path exercised.

Based on learnings: this benchmark adds nontrivial evaluation/metrics logic, so adding it to Slurm smoke coverage is worth considering.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/librispeechmix-sot/__init__.py` around lines 7 - 20, The
benchmark matrix defined by _DURATION_SPLITS, _LSM_SPLITS, _MIXES and expanded
into BENCHMARKS now creates 12 slices and we should add Slurm smoke coverage for
at least one under-20s and one over-20s slice; update your Slurm smoke config
(or the smoke list used by CI) to include two representative keys such as
"librispeechmix-sot.under20s-test-clean-1mix" and
"librispeechmix-sot.over20s-test-clean-1mix" (or equivalent dev-clean variants)
so the new evaluation/metrics path exercised by SCORE_MODULE
("nemo_skills.dataset.librispeechmix-sot.librispeechmix_sot_score") runs in CI.
tests/test_librispeechmix_sot.py (1)
50-330: Consider covering one LibriSpeechMix SOT split in the slurm benchmark suite.

The local coverage here is strong, but this benchmark also depends on cluster-only paths like dataset staging and backend/evaluator wiring. Adding one small LibriSpeechMix SOT slurm job would help keep the full benchmark path exercised.

Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_librispeechmix_sot.py` around lines 50 - 330, Add a lightweight
LibriSpeechMix SOT slurm test entry so the full benchmark path (staging,
backend/evaluator wiring, metrics) is exercised: add "librispeechmix-sot" (or a
specific split like "librispeechmix-sot.under20s-test-clean-1mix") to the slurm
benchmark test matrix (the variable/list that defines slurm jobs, e.g.,
SLURM_BENCHMARKS or similar), configure it to use a tiny sample (max_samples=1
or a small manifest) and minimal resources, and enable any dataset
staging/backends flags required for evaluation (so RTMTASR/MultitalkerParakeet
backends and LibriSpeech staging are invoked). Ensure the new job references the
same metric/evaluator pipeline exercised in tests (LibriSpeechMixSOTMetrics /
LibriSpeechMixSOTEvaluator) so cluster-only code paths run in CI.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speech-audio.md`:
- Around line 411-458: The docs/evaluation/speech-audio.md section
"LibriSpeechMix SOT" is missing an expected-results snippet for the validated
baselines; add a short "Expected results" subsection under that section showing
corpus-level no-PnC cpWER numbers for the tested baseline(s) (e.g., results for
the recommended generation_module(s) such as
nemo_skills.inference.librispeechmix_rtmtasr and
nemo_skills.inference.librispeechmix_multitalker_parakeet), include the exact
eval command(s) users can run (matching the examples already present like the ns
eval command) and list the cpWER value(s) and any notes (e.g., whether MeetEval
was used or --normalizer chime8 applied) so readers can sanity-check fresh runs;
place the snippet near the prep/run examples in the LibriSpeechMix SOT section
and label it "Expected results" for clarity.

In `@nemo_skills/dataset/librispeechmix-sot/librispeechmix_sot_score.py`:
- Around line 18-27: The loop that aggregates counts (iterating over
benchmarks.values(), using eval_mode and metrics keys) must fail fast instead of
defaulting missing fields to zero: replace metrics =
benchmark_metrics.get(eval_mode) with direct indexing (metrics =
benchmark_metrics[eval_mode]) and replace all metrics.get("cpwer_*", 0) accesses
with direct indexing (e.g., metrics["cpwer_errors"],
metrics["cpwer_substitutions"], metrics["cpwer_insertions"],
metrics["cpwer_deletions"], metrics["cpwer_ref_words"], metrics["num_entries"])
so a KeyError surfaces on malformed input; if you want a clearer message wrap in
a try/except and re-raise a ValueError that includes the benchmark identifier
and eval_mode for debugging.

In `@nemo_skills/dataset/librispeechmix-sot/prepare.py`:
- Around line 123-125: The mixed_duration function (and the other places that
zip parallel row arrays) silently truncate malformed rows; validate that the
parallel lists have equal lengths before zipping (e.g., assert
len(row["delays"]) == len(row["durations"]) and raise a descriptive ValueError
including an identifier from row like row.get("id") or the row itself) or switch
to zip(..., strict=True) to force an error on mismatch; apply the same
length-check or strict zip change to the other zip usages referenced (the blocks
around lines 166-175 and 223-231) so malformed JSONL entries fail fast with a
clear message.

In `@nemo_skills/evaluation/evaluator/librispeechmix_sot.py`:
- Around line 34-40: In _get_hypothesis, don't silently return "" when
user-supplied prediction_keys are wrong: if self.config.prediction_keys is set,
iterate those keys and if none are found raise a ValueError (or custom
exception) identifying the missing keys and the data_point (e.g., include an id
or the keys tried); if self.config.prediction_keys is not set, keep the existing
fallback behavior using the default keys ["pred_text","generation"] and return
"" only as a last resort. Update the logic in method _get_hypothesis of
librispeechmix_sot.py to validate and fail fast when configured prediction_keys
are invalid.

In `@nemo_skills/evaluation/metrics/librispeechmix_sot_metrics.py`:
- Around line 31-44: The update() logic currently prefers "generation" over
"pred_text", which is inconsistent with
LibriSpeechMixSOTEvaluator._get_hypothesis(); change the predicted_answers list
comprehension in update() to prefer pred_text first (use p.get("pred_text") or
p.get("generation") or None) and also when accumulating cpwer counts for the
representative item, derive the chosen hypothesis consistently (e.g., chosen =
first.get("pred_text") or first.get("generation")) so the pass@k/no-answer
bookkeeping matches the hypothesis used for cpWER aggregation.

In `@nemo_skills/evaluation/metrics/librispeechmix_sot_utils.py`:
- Around line 108-140: The current exhaustive permutation search over
itertools.permutations(range(size)) can blow up for large size when hallucinated
hypothesis speakers appear; modify the logic around ref_items/hyp_items/size to
guard against this by (a) defining a MAX_SPEAKER_ASSIGNMENT constant (e.g., 6)
and if size > MAX_SPEAKER_ASSIGNMENT avoid full permutation, and (b) implement a
fallback assignment algorithm: build a cost matrix using _counts_for_text
(errors or substitutions) for each ref/hyp pair and run a linear assignment
(Hungarian) solver like scipy.optimize.linear_sum_assignment or a greedy
nearest-match heuristic to produce assignment, totals and cpwer; keep the same
returned structure
(best/candidate/assignment/reference_streams/hypothesis_streams) so callers of
the function are unchanged.

In `@nemo_skills/inference/librispeechmix_multitalker_parakeet.py`:
- Around line 143-145: The current loop silently collapses entries with missing
session identifiers into a synthetic "sample" bucket; change the logic in the
entries processing loop to require one of the keys "session_id", "recording_id",
or "audio_id" to be present on each entry, retrieving it via direct key access
(not .get) and raising a clear exception (e.g., ValueError or KeyError) when
none are found; set session_id = entry[found_key] and then append to
by_session[session_id] so missing/renamed session ids fail fast instead of
merging into a single "sample" bucket.

In `@nemo_skills/inference/librispeechmix_rtmtasr.py`:
- Around line 79-80: The code currently hardcodes langid="en" while accepting
source_lang/target_lang, so either derive langid from the incoming config (e.g.,
use source_lang or target_lang when present and map to the expected token like
"en"/"fr") or validate and raise an explicit error if a non-"en" language is
passed; update the RTMT-ASR invocation where "langid=en" and
"gt_lang_attr_name=target_lang" are composed to replace the hardcoded literal
with the resolved/validated langid and ensure callers cannot pass unused
source_lang/target_lang values (throw an exception or return a clear validation
error when unsupported languages are supplied).
- Around line 105-108: Currently the code silently normalizes missing prediction
fields into empty strings by using row.get(...) which masks schema problems;
change it to require a supported prediction field and fail loudly: check row for
the presence of "pred_text" or "generation", raise a ValueError (or KeyError)
with a clear message if neither exists, and then set row["pred_text"] =
row["pred_text"] if present else row["generation"] and ensure row["generation"]
is set from the present field — replace uses of
row.get("pred_text")/row.get("generation") with direct indexing and an explicit
presence check to surface incompatible/malformed rows.

In `@recipes/multimodal/server/backends/multitalker_parakeet_backend.py`:
- Around line 98-102: The current deserialization logic in the
MultitalkerParakeetBackend class (the method that uses cls.__dataclass_fields__
to build cls(...)) silently moves unknown keys into extra_config; instead,
validate the input dict by computing unknown_keys = set(d.keys()) - known -
{"extra_config"} and if unknown_keys is non-empty raise a TypeError (or
ValueError) listing the offending keys so typos are rejected, then construct the
dataclass using only known keys (and handle extra_config only if explicitly
provided). Update the method (the classmethod that returns cls(...)) to perform
this check before instantiating the class.

In `@recipes/multimodal/server/backends/salm_backend.py`:
- Around line 166-175: The current loop builds a single prompts list for all
valid requests and calls self._model.generate once, ignoring
self.salm_config.batch_size; change this to respect batch_size by chunking the
prompts/temp_paths/valid_indices into slices of size
self.salm_config.batch_size, for each chunk compute a chunk_max_new_tokens
(start from self.config.max_new_tokens and take the max with any
request.max_new_tokens in that slice), call
self._model.generate(prompts=chunk_prompts, max_new_tokens=chunk_max_new_tokens)
for each chunk, and concatenate the returned output_ids in order; ensure you
update references to prompts, temp_paths and valid_indices used in the original
loop (e.g., the loop that builds prompts and the single generate() call) so
oversized batches are processed in bounded-size model calls rather than one big
call.
- Around line 121-122: The warmup exception handling in load_model()/SALM
backend currently swallows all exceptions (except Exception) which can hide
fatal incompatibilities; change the except block in the warmup routine so that
only documented/benign exceptions are caught (e.g., the specific warmup-related
exception type your backend raises) and logged, and re-raise any other
exceptions so load_model() fails fast; alternatively, log the error and
immediately raise the caught exception after logging (preserving stack) instead
of silently returning success. Ensure you reference the warmup call inside
load_model()/the SALM backend’s warmup block and adjust the except Exception as
written to either a specific exception class or to re-raise after
logger.warning.

---

Nitpick comments:
In `@nemo_skills/dataset/librispeechmix-sot/__init__.py`:
- Around line 7-20: The benchmark matrix defined by _DURATION_SPLITS,
_LSM_SPLITS, _MIXES and expanded into BENCHMARKS now creates 12 slices and we
should add Slurm smoke coverage for at least one under-20s and one over-20s
slice; update your Slurm smoke config (or the smoke list used by CI) to include
two representative keys such as "librispeechmix-sot.under20s-test-clean-1mix"
and "librispeechmix-sot.over20s-test-clean-1mix" (or equivalent dev-clean
variants) so the new evaluation/metrics path exercised by SCORE_MODULE
("nemo_skills.dataset.librispeechmix-sot.librispeechmix_sot_score") runs in CI.

In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Line 56: Add the new librispeechmix_sot benchmark to Slurm CI so the
LibrispeechMixSOTEvaluator and its cpWER evaluation path are exercised: update
the CI test matrix/job configuration to include the "librispeechmix_sot"
benchmark identifier and ensure the job runs the same split/mix variants used by
LibrispeechMixSOTEvaluator (and invokes the cpWER path), so cluster tests cover
this evaluator and prevent regressions.

In `@tests/test_datasets.py`:
- Line 61: Add documentation for the newly registered dataset
"librispeechmix-sot": update the benchmarks/docs page that lists datasets and
include a short section for librispeechmix-sot containing (1) example CLI
commands to run training/evaluation (match the project's existing command style
used for other datasets), (2) example eval flags and dataset splits (e.g.,
"test"), and (3) expected model results (report metrics and the tested model
names used to produce them) so users can reproduce and compare; ensure the
section follows the same format/structure as other dataset entries in the
benchmarks docs.

In `@tests/test_librispeechmix_sot.py`:
- Around line 50-330: Add a lightweight LibriSpeechMix SOT slurm test entry so
the full benchmark path (staging, backend/evaluator wiring, metrics) is
exercised: add "librispeechmix-sot" (or a specific split like
"librispeechmix-sot.under20s-test-clean-1mix") to the slurm benchmark test
matrix (the variable/list that defines slurm jobs, e.g., SLURM_BENCHMARKS or
similar), configure it to use a tiny sample (max_samples=1 or a small manifest)
and minimal resources, and enable any dataset staging/backends flags required
for evaluation (so RTMTASR/MultitalkerParakeet backends and LibriSpeech staging
are invoked). Ensure the new job references the same metric/evaluator pipeline
exercised in tests (LibriSpeechMixSOTMetrics / LibriSpeechMixSOTEvaluator) so
cluster-only code paths run in CI.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4dab57b5-55a4-4f9b-aec1-98dc44c0922d

📥 Commits

Reviewing files that changed from the base of the PR and between 7dcb0a6 and e6a3726.

📒 Files selected for processing (28)

docs/evaluation/speech-audio.md
nemo_skills/dataset/librispeechmix-sot/__init__.py
nemo_skills/dataset/librispeechmix-sot/librispeechmix_sot_score.py
nemo_skills/dataset/librispeechmix-sot/over20s-dev-clean-1mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/over20s-dev-clean-2mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/over20s-dev-clean-3mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/over20s-test-clean-1mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/over20s-test-clean-2mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/over20s-test-clean-3mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/prepare.py
nemo_skills/dataset/librispeechmix-sot/under20s-dev-clean-1mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/under20s-dev-clean-2mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/under20s-dev-clean-3mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/under20s-test-clean-1mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/under20s-test-clean-2mix/__init__.py
nemo_skills/dataset/librispeechmix-sot/under20s-test-clean-3mix/__init__.py
nemo_skills/evaluation/evaluator/__init__.py
nemo_skills/evaluation/evaluator/librispeechmix_sot.py
nemo_skills/evaluation/metrics/librispeechmix_sot_metrics.py
nemo_skills/evaluation/metrics/librispeechmix_sot_utils.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/inference/librispeechmix_multitalker_parakeet.py
nemo_skills/inference/librispeechmix_rtmtasr.py
recipes/multimodal/server/backends/__init__.py
recipes/multimodal/server/backends/multitalker_parakeet_backend.py
recipes/multimodal/server/backends/salm_backend.py
tests/test_datasets.py
tests/test_librispeechmix_sot.py

coderabbitai · 2026-05-01T17:29:28Z

+## LibriSpeechMix SOT
+
+LibriSpeechMix SOT evaluates multispeaker ASR with serialized speaker tags. It uses the official [LibriSpeechMix](https://github.com/NaoyukiKanda/LibriSpeechMix) JSONL lists and LibriSpeech/OpenSLR audio, then writes NeMo SOT manifests with `[s0]`, `[s1]`, etc. reference text, RTTM supervision, and reference SegLST files.
+
+Supported benchmark variants:
+
+- `librispeechmix-sot.under20s-test-clean-1mix`
+- `librispeechmix-sot.under20s-test-clean-2mix`
+- `librispeechmix-sot.under20s-test-clean-3mix`
+- `librispeechmix-sot.over20s-test-clean-1mix`
+- `librispeechmix-sot.over20s-test-clean-2mix`
+- `librispeechmix-sot.over20s-test-clean-3mix`
+
+Dev smoke variants are also available with `dev-clean` in place of `test-clean`. The duration split is explicit: `under20s` contains records with mixed duration `<= 20.0` seconds, and `over20s` contains records with duration `> 20.0` seconds.
+
+Prepare the required test-clean variants:
+
+```bash
+ns prepare_data librispeechmix-sot --cluster=local --data_dir=/path/to/ns-data --splits test-clean --mixes 1mix 2mix 3mix
+```
+
+For a tiny local smoke using a pre-cloned LibriSpeechMix checkout:
+
+```bash
+python -m nemo_skills.dataset.prepare librispeechmix-sot \
+    --data_dir=/path/to/lsm-sot-data \
+    --librispeechmix-dir=/path/to/LibriSpeechMix \
+    --splits test-clean \
+    --mixes 1mix \
+    --max-samples 2 \
+    --no-audio
+```
+
+The primary metric is corpus-level no-PnC cpWER. The scorer reads reference SOT text from `text`, reads hypotheses from `pred_text` when available and otherwise `generation`, splits both by `[sN]` tags, removes punctuation/case, and applies minimum speaker permutation matching. It also writes `*_ref.seglst.json` and `*_hyp.seglst.json`; when `meeteval-wer` is installed it additionally runs MeetEval cpWER with `--normalizer chime8 --partial`.
+
+True speaker-attributed baselines should use a SOT-capable generation module, for example:
+
+```bash
+ns eval \
+    --benchmarks=librispeechmix-sot.under20s-test-clean-1mix \
+    --generation_module=nemo_skills.inference.librispeechmix_rtmtasr \
+    --server_type=none \
+    --server_gpus=0 \
+    --extra_arguments="++model_path=/path/to/MSCanary-v2.nemo ++spk_supervision=rttm ++nemo_root=/path/to/NeMo"
+```
+
+`nemo_skills.inference.librispeechmix_multitalker_parakeet` wraps the streaming Multitalker Parakeet path. The generic SALM backend can run `nvidia/canary-qwen-2.5b`, but the public model card documents generic ASR output, not SOT speaker-tagged output, so treat it as a generic ASR baseline unless a run proves it emits `[sN]` speaker-attributed text.
+


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add an expected-results snippet for the tested baseline(s).

This section has good prep/run coverage, but it never shows the cpWER numbers users should expect from the benchmark runs that were already validated in this PR. That makes it hard to sanity-check a fresh setup.

As per coding guidelines: "When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/evaluation/speech-audio.md` around lines 411 - 458, The docs/evaluation/speech-audio.md section "LibriSpeechMix SOT" is missing an expected-results snippet for the validated baselines; add a short "Expected results" subsection under that section showing corpus-level no-PnC cpWER numbers for the tested baseline(s) (e.g., results for the recommended generation_module(s) such as nemo_skills.inference.librispeechmix_rtmtasr and nemo_skills.inference.librispeechmix_multitalker_parakeet), include the exact eval command(s) users can run (matching the examples already present like the ns eval command) and list the cpWER value(s) and any notes (e.g., whether MeetEval was used or --normalizer chime8 applied) so readers can sanity-check fresh runs; place the snippet near the prep/run examples in the LibriSpeechMix SOT section and label it "Expected results" for clarity.

coderabbitai · 2026-05-01T17:29:28Z

+    for benchmark_metrics in benchmarks.values():
+        metrics = benchmark_metrics.get(eval_mode)
+        if not metrics:
+            continue
+        total_entries += metrics.get("num_entries", 0)
+        errors += metrics.get("cpwer_errors", 0)
+        substitutions += metrics.get("cpwer_substitutions", 0)
+        insertions += metrics.get("cpwer_insertions", 0)
+        deletions += metrics.get("cpwer_deletions", 0)
+        ref_words += metrics.get("cpwer_ref_words", 0)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on missing cpWER aggregate fields.

These raw counts are produced explicitly by LibriSpeechMixSOTMetrics.get_metrics(). Treating a missing eval_mode or count key as zero here will silently corrupt the rolled-up score instead of surfacing a bad input shape.

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix-sot/librispeechmix_sot_score.py` around lines 18 - 27, The loop that aggregates counts (iterating over benchmarks.values(), using eval_mode and metrics keys) must fail fast instead of defaulting missing fields to zero: replace metrics = benchmark_metrics.get(eval_mode) with direct indexing (metrics = benchmark_metrics[eval_mode]) and replace all metrics.get("cpwer_*", 0) accesses with direct indexing (e.g., metrics["cpwer_errors"], metrics["cpwer_substitutions"], metrics["cpwer_insertions"], metrics["cpwer_deletions"], metrics["cpwer_ref_words"], metrics["num_entries"]) so a KeyError surfaces on malformed input; if you want a clearer message wrap in a try/except and re-raise a ValueError that includes the benchmark identifier and eval_mode for debugging.

coderabbitai · 2026-05-01T17:29:28Z

+def mixed_duration(row: dict) -> float:
+    """Return mixed utterance duration from source delays and durations."""
+    return float(max(delay + duration for delay, duration in zip(row["delays"], row["durations"])))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate row field lengths before zipping speaker-aligned lists.

zip() will silently truncate malformed LibriSpeechMix rows, so a bad JSONL entry can corrupt duration bucketing, RTTM/SegLST sidecars, and the mixed WAV without any error. Please validate the parallel arrays once or use strict=True on each zip(...).

Also applies to: 166-175, 223-231

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 125-125: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/librispeechmix-sot/prepare.py` around lines 123 - 125, The mixed_duration function (and the other places that zip parallel row arrays) silently truncate malformed rows; validate that the parallel lists have equal lengths before zipping (e.g., assert len(row["delays"]) == len(row["durations"]) and raise a descriptive ValueError including an identifier from row like row.get("id") or the row itself) or switch to zip(..., strict=True) to force an error on mismatch; apply the same length-check or strict zip change to the other zip usages referenced (the blocks around lines 166-175 and 223-231) so malformed JSONL entries fail fast with a clear message.

coderabbitai · 2026-05-01T17:29:28Z

+    def _get_hypothesis(self, data_point: dict[str, Any]) -> str:
+        keys = self.config.prediction_keys or ["pred_text", "generation"]
+        for key in keys:
+            value = data_point.get(key)
+            if value is not None:
+                return str(value)
+        return ""


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Invalid prediction_keys should fail, not score as empty text.

With the current fallback, a typo in prediction_keys quietly returns "", and cpwer() turns every sample into a full-deletion error. Raise once none of the configured keys are present in the record.

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/librispeechmix_sot.py` around lines 34 - 40, In _get_hypothesis, don't silently return "" when user-supplied prediction_keys are wrong: if self.config.prediction_keys is set, iterate those keys and if none are found raise a ValueError (or custom exception) identifying the missing keys and the data_point (e.g., include an id or the keys tried); if self.config.prediction_keys is not set, keep the existing fallback behavior using the default keys ["pred_text","generation"] and return "" only as a last resort. Update the logic in method _get_hypothesis of librispeechmix_sot.py to validate and fail fast when configured prediction_keys are invalid.

coderabbitai · 2026-05-01T17:29:28Z

+    def update(self, predictions: list[dict]) -> None:
+        super().update(predictions)
+        predicted_answers = [p.get("generation") or p.get("pred_text") or None for p in predictions]
+        self._compute_pass_at_k(predictions=predictions, predicted_answers=predicted_answers)
+        if len(predictions) > 1:
+            self._compute_majority_at_k(predictions=predictions, predicted_answers=predicted_answers)
+
+        first = predictions[0]
+        for mode in ["pass@1", "pass@1[avg-of-1]"]:
+            self.eval_dict[mode]["cpwer_errors"] += first.get("cpwer_errors", 0)
+            self.eval_dict[mode]["cpwer_substitutions"] += first.get("cpwer_substitutions", 0)
+            self.eval_dict[mode]["cpwer_insertions"] += first.get("cpwer_insertions", 0)
+            self.eval_dict[mode]["cpwer_deletions"] += first.get("cpwer_deletions", 0)
+            self.eval_dict[mode]["cpwer_ref_words"] += first.get("cpwer_ref_words", 0)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prefer pred_text over generation here too.

LibriSpeechMixSOTEvaluator._get_hypothesis() picks pred_text first, but update() reverses that order. If both fields exist and differ, pass@k/no-answer bookkeeping can drift from the cpWER counts accumulated from the evaluated hypothesis.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/librispeechmix_sot_metrics.py` around lines 31 - 44, The update() logic currently prefers "generation" over "pred_text", which is inconsistent with LibriSpeechMixSOTEvaluator._get_hypothesis(); change the predicted_answers list comprehension in update() to prefer pred_text first (use p.get("pred_text") or p.get("generation") or None) and also when accumulating cpwer counts for the representative item, derive the chosen hypothesis consistently (e.g., chosen = first.get("pred_text") or first.get("generation")) so the pass@k/no-answer bookkeeping matches the hypothesis used for cpWER aggregation.

coderabbitai · 2026-05-01T17:29:28Z

+        "langid=en",
+        "gt_lang_attr_name=target_lang",


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Validate or honor non-English language settings.

langid is hardcoded to en, so callers can pass source_lang / target_lang values that the actual RTMT-ASR invocation ignores. Either derive langid from the config or reject anything but English here so unsupported settings fail loudly.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/librispeechmix_rtmtasr.py` around lines 79 - 80, The code currently hardcodes langid="en" while accepting source_lang/target_lang, so either derive langid from the incoming config (e.g., use source_lang or target_lang when present and map to the expected token like "en"/"fr") or validate and raise an explicit error if a non-"en" language is passed; update the RTMT-ASR invocation where "langid=en" and "gt_lang_attr_name=target_lang" are composed to replace the hardcoded literal with the resolved/validated langid and ensure callers cannot pass unused source_lang/target_lang values (throw an exception or return a clear validation error when unsupported languages are supplied).

coderabbitai · 2026-05-01T17:29:28Z

+            row = json.loads(line)
+            pred_text = row.get("pred_text") or row.get("generation") or ""
+            row["pred_text"] = pred_text
+            row["generation"] = row.get("generation") or pred_text


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't normalize missing prediction fields into empty hypotheses.

If the RTMT-ASR script changes schema or emits a bad row, this rewrites it as an empty hypothesis and turns the failure into deletion-heavy cpWER instead of surfacing the incompatibility. Require a supported prediction field here and raise if it is absent.

Suggested fix

- pred_text = row.get("pred_text") or row.get("generation") or "" + pred_text = row.get("pred_text") or row.get("generation") + if not pred_text: + raise ValueError(f"Missing prediction text in RTMT-ASR output row: {row}") row["pred_text"] = pred_text row["generation"] = row.get("generation") or pred_text

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/librispeechmix_rtmtasr.py` around lines 105 - 108, Currently the code silently normalizes missing prediction fields into empty strings by using row.get(...) which masks schema problems; change it to require a supported prediction field and fail loudly: check row for the presence of "pred_text" or "generation", raise a ValueError (or KeyError) with a clear message if neither exists, and then set row["pred_text"] = row["pred_text"] if present else row["generation"] and ensure row["generation"] is set from the present field — replace uses of row.get("pred_text")/row.get("generation") with direct indexing and an explicit presence check to surface incompatible/malformed rows.

coderabbitai · 2026-05-01T17:29:28Z

+        known = {field.name for field in cls.__dataclass_fields__.values()}
+        return cls(
+            **{k: v for k, v in d.items() if k in known and k != "extra_config"},
+            extra_config={k: v for k, v in d.items() if k not in known},
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unknown backend config keys instead of ignoring them.

Anything outside known is moved into extra_config and then never applied by this backend. A misspelled option will look accepted but won't affect inference.

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/multitalker_parakeet_backend.py` around lines 98 - 102, The current deserialization logic in the MultitalkerParakeetBackend class (the method that uses cls.__dataclass_fields__ to build cls(...)) silently moves unknown keys into extra_config; instead, validate the input dict by computing unknown_keys = set(d.keys()) - known - {"extra_config"} and if unknown_keys is non-empty raise a TypeError (or ValueError) listing the offending keys so typos are rejected, then construct the dataclass using only known keys (and handle extra_config only if explicitly provided). Update the method (the classmethod that returns cls(...)) to perform this check before instantiating the class.

coderabbitai · 2026-05-01T17:29:28Z

+        except Exception as exc:
+            logger.warning("SALM warmup skipped due to error: %s", exc)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't hide warmup failures behind a warning.

Warmup is the only place that verifies this backend can actually build prompts and call generate(). Swallowing every exception here means an incompatible model can report as loaded and only fail on the first real request. Let unexpected failures abort load_model(), or narrow the catch to a documented benign case.

As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 121-121: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/salm_backend.py` around lines 121 - 122, The warmup exception handling in load_model()/SALM backend currently swallows all exceptions (except Exception) which can hide fatal incompatibilities; change the except block in the warmup routine so that only documented/benign exceptions are caught (e.g., the specific warmup-related exception type your backend raises) and logged, and re-raise any other exceptions so load_model() fails fast; alternatively, log the error and immediately raise the caught exception after logging (preserving stack) instead of silently returning success. Ensure you reference the warmup call inside load_model()/the SALM backend’s warmup block and adjust the except Exception as written to either a specific exception class or to re-raise after logger.warning.

coderabbitai · 2026-05-01T17:29:28Z

+                max_new_tokens = self.config.max_new_tokens
+                audio_tag = self._model.audio_locator_tag
+                for out_idx, path in enumerate(temp_paths):
+                    request = requests[valid_indices[out_idx]]
+                    user_prompt = request.user_prompt or request.extra_params.get("user_prompt", self.salm_config.user_prompt)
+                    prompts.append([{"role": "user", "content": f"{user_prompt} {audio_tag}", "audio": [path]}])
+                    if request.max_new_tokens:
+                        max_new_tokens = max(max_new_tokens, request.max_new_tokens)
+
+                output_ids = self._model.generate(prompts=prompts, max_new_tokens=max_new_tokens)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor batch_size instead of sending the whole request list at once.

This backend exposes batch_size, but generate() still submits every valid request in one model call. Large request bursts can still OOM or spike latency even when the backend is configured smaller. Chunk prompts by self.salm_config.batch_size, or reject oversized batches explicitly.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/salm_backend.py` around lines 166 - 175, The current loop builds a single prompts list for all valid requests and calls self._model.generate once, ignoring self.salm_config.batch_size; change this to respect batch_size by chunking the prompts/temp_paths/valid_indices into slices of size self.salm_config.batch_size, for each chunk compute a chunk_max_new_tokens (start from self.config.max_new_tokens and take the max with any request.max_new_tokens in that slice), call self._model.generate(prompts=chunk_prompts, max_new_tokens=chunk_max_new_tokens) for each chunk, and concatenate the returned output_ids in order; ensure you update references to prompts, temp_paths and valid_indices used in the original loop (e.g., the loop that builds prompts and the single generate() call) so oversized batches are processed in bounded-size model calls rather than one big call.

Add LibriSpeechMix SOT benchmark

e6a3726

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

		except Exception as exc:
		logger.warning("SALM warmup skipped due to error: %s", exc)

Uh oh!

Conversation

vmendelev commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Cluster Results

Notes

Uh oh!

coderabbitai Bot commented May 1, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot commented May 1, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vmendelev commented May 1, 2026 •

edited

Loading