Real-time meeting transcription with fast speaker identification using NVIDIA NeMo.
- Fast Speaker Identification: 1-2 second latency (2-3x faster than traditional approaches)
- Hybrid Diarization: Combines fast TitaNet embeddings with accurate MSDD refinement
- Self-Correcting Labels: Automatically improves speaker assignments over time
- Live Transcription: Real-time speech-to-text with Canary ASR
- GPU-Accelerated: Optimized for NVIDIA GPUs
# Install dependencies
pip install -r requirements.txt
# Or using uv (recommended)
uv pip install -r requirements.txt

# Use default settings
python main.py
# Or specify GPU device
python main.py --device cuda:0

Example Output:
[001.0–003.0] (Speaker_0): hello everyone
[003.0–005.0] (Speaker_0): welcome to the meeting
[005.5–007.5] (Speaker_1): thanks for having me
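
If you want to post-process the transcript, the lines above can be parsed with a small regular expression. This is a minimal sketch that assumes the output always follows the `[start–end] (Speaker_N): text` shape shown in the example; `Segment` and `parse_line` are hypothetical helper names, not part of main.py.

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # label such as "Speaker_0"
    text: str      # transcribed words


# Accept either an en dash (as in the example output) or a plain hyphen
# between the two timestamps.
LINE_RE = re.compile(
    r"^\[(\d+(?:\.\d+)?)[–-](\d+(?:\.\d+)?)\]\s+\((\w+)\):\s*(.*)$"
)


def parse_line(line: str) -> Optional[Segment]:
    """Parse one transcript line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    return Segment(float(start), float(end), speaker, text)
```

For example, `parse_line("[005.5–007.5] (Speaker_1): thanks for having me")` yields a `Segment` with `start=5.5`, `end=7.5`, and `speaker="Speaker_1"`.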
Key settings in Config class:
# Fast diarization
embed_window_seconds = 1.5 # Embedding window
embed_hop_seconds = 0.75 # Embedding frequency
speaker_similarity_threshold = 0.60 # New speaker threshold

See inline comments in main.py for full details.
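
To illustrate how the window and hop settings interact, here is a rough sketch of the sliding windows they imply. It assumes each embedding is computed over a full `embed_window_seconds` span and that a new window starts every `embed_hop_seconds`; the function `embedding_windows` is a hypothetical helper and may not match how main.py actually slices audio.

```python
from typing import List, Tuple


def embedding_windows(
    duration_s: float,
    window_s: float = 1.5,   # mirrors embed_window_seconds
    hop_s: float = 0.75,     # mirrors embed_hop_seconds
) -> List[Tuple[float, float]]:
    """Return (start, end) times of every full window that fits in the audio."""
    windows = []
    i = 0
    # Small epsilon guards against floating-point rounding at the boundary.
    while i * hop_s + window_s <= duration_s + 1e-9:
        start = i * hop_s
        windows.append((start, start + window_s))
        i += 1
    return windows
```

With the defaults, consecutive windows overlap by 50%, so 3 seconds of audio produce three windows: 0.0 to 1.5, 0.75 to 2.25, and 1.5 to 3.0. Under this assumption, the first window cannot be completed until 1.5 seconds of audio have arrived, which is in line with the 1-2 second labeling latency quoted above.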
| Metric | Value |
|---|---|
| Time to first speaker label | 1-2 seconds ⚡ |
| Time to new speaker label | 1-2 seconds ⚡ |
| Initial accuracy (fast path) | 80-85% |
| Final accuracy (after MSDD) | 90-95% |
| GPU memory required | 8-10GB |
nemo/
├── main.py # Enhanced live diarization
├── cli.py # CLI version with file output
├── requirements.txt # Python dependencies
├── README.md # This file
└── summary.md # Optimization recommendations
- CUDA out of memory: Reduce batch size to 32
- Too many speakers: Increase threshold to 0.65
- Speakers merged: Lower threshold to 0.55
python test_fast_diarization.py # Test components
python main.py # Go live!