fix(asr): clamp diarization cluster count to max_num_speakers#15835

Open
vprosoho wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
vprosoho:fix/diarization-clustering-respect-max-num-speakers

Conversation

@vprosoho

What does this PR do ?

Fixes speaker over-counting in clustering diarization for short sessions, where the final number of clusters could exceed the configured max_num_speakers.

Collection: ASR (speaker diarization / clustering)

Changelog

  • nemo/collections/asr/parts/utils/offline_clustering.py: in SpeakerClustering.forward_unit_infer, clamp the selected cluster count with n_clusters = min(n_clusters, max_num_speakers).
  • tests/collections/speaker_tasks/utils/test_diar_utils.py: add test_offline_speaker_clustering_enhanced_count_respects_max_num_speakers_cpu, a CPU unit test that passes an enhanced speaker count larger than max_num_speakers and checks that the number of output clusters is capped at max_num_speakers.
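
To make the first changelog item concrete, here is a minimal sketch of the selection-then-clamp logic. It is not the NeMo source: the function name choose_n_clusters and the use of plain integers are assumptions made for illustration, and only the final min(...) line mirrors the change described above.

```python
from typing import Optional


def choose_n_clusters(
    est_num_of_spk: int,
    max_num_speakers: int,
    oracle_num_speakers: int = -1,
    est_num_of_spk_enhanced: Optional[int] = None,
) -> int:
    """Hypothetical helper (not a NeMo API): pick a cluster count, then clamp it."""
    if oracle_num_speakers > 0:
        # Oracle path: the caller supplies the speaker count.
        n_clusters = oracle_num_speakers
    elif est_num_of_spk_enhanced is not None and est_num_of_spk_enhanced > 0:
        # Enhanced-count path: the estimate may exceed max_num_speakers.
        n_clusters = est_num_of_spk_enhanced
    else:
        # Standard estimation path.
        n_clusters = est_num_of_spk
    # The clamp: never return more clusters than the configured limit.
    return min(n_clusters, max_num_speakers)


if __name__ == "__main__":
    # An enhanced estimate of 8 with a limit of 2 is clamped to 2.
    print(choose_n_clusters(est_num_of_spk=2, max_num_speakers=2, est_num_of_spk_enhanced=8))
```

When the selected count is already within the limit, the clamp leaves it unchanged, which is why the fix only affects the enhanced-count path.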

Usage

No usage change. The API and configuration are unchanged; the only behavioral difference is that the number of output clusters no longer exceeds max_num_speakers.

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

These appear to be the right reviewers, judging by the git history:
cc @tango4j and @nithinraok
Apologies if not!

Additional Information

vprosoho added 2 commits June 26, 2026 11:39

For short sessions, SpeakerClustering.forward_infer estimates the speaker
count via getEnhancedSpeakerCount(), which constructs NMESC with
max_num_speakers=emb.shape[0] (the number of embedding segments) instead of
the configured max_num_speakers. The resulting est_num_of_spk_enhanced is
then consumed in forward_unit_infer without re-applying the limit, so a
short audio file can be clustered into more speakers than max_num_speakers
allows.

Clamp n_clusters to max_num_speakers after the speaker count is selected.
This is a no-op for the oracle and standard NME estimation paths (both
already bounded by max_num_speakers) and fixes the over-counting that can
occur on the enhanced-count path.

Signed-off-by: Vadym Prokopov <vprokopov@sohosquared.com>

getEnhancedSpeakerCount estimates the speaker count with
max_num_speakers=emb.shape[0], so for short sessions est_num_of_spk_enhanced
can exceed the requested max_num_speakers. Add a CPU unit test that calls
SpeakerClustering.forward_unit_infer with an enhanced count larger than
max_num_speakers and asserts the number of output clusters is capped at
max_num_speakers. Fails before the clamp fix (returns 8 clusters), passes
after (capped at 2/3).

Signed-off-by: Vadym Prokopov <vprokopov@sohosquared.com>
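
The shape of the regression test described in the second commit can be sketched as follows. This is a self-contained illustration, not the test added to test_diar_utils.py: toy_cluster_labels is a hypothetical stand-in for the clustering call, and the segment count of 16 is an arbitrary assumption. Only the enhanced count of 8 and the limits of 2 and 3 come from the commit message.

```python
from typing import List


def toy_cluster_labels(num_segments: int, enhanced_count: int, max_num_speakers: int) -> List[int]:
    """Hypothetical stand-in for a clustering call: NOT NeMo's implementation."""
    n_clusters = min(enhanced_count, max_num_speakers)  # the clamp under test
    # Round-robin assignment so that every one of the n_clusters labels is used.
    return [i % n_clusters for i in range(num_segments)]


def check_cap_is_respected() -> None:
    enhanced_count = 8  # deliberately larger than every max_num_speakers below
    for max_num_speakers in (2, 3):
        labels = toy_cluster_labels(16, enhanced_count, max_num_speakers)
        # The number of distinct output clusters must not exceed the limit.
        assert len(set(labels)) <= max_num_speakers


if __name__ == "__main__":
    check_cap_is_respected()
    print("cluster count is capped")
```

Removing the min(...) line from the stand-in makes the check fail with 8 distinct labels, which matches the before-and-after behavior the commit message reports.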
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
