Describe the bug
model.transcribe() mutates shared encoder state via freeze()/unfreeze() calls in _transcribe_on_begin()/_transcribe_on_end(), causing ValueError and AttributeError crashes when called concurrently from multiple threads.
Two distinct failure modes observed (98 crashes analyzed):
1. ValueError (97/98 crashes):
ValueError: Cannot unfreeze partially without first freezing the module with `freeze()`
Full traceback:
model.transcribe([file_path], timestamps=True)
→ nemo/collections/asr/parts/mixins/transcription.py:410 _transcribe_on_end()
→ nemo/collections/asr/models/rnnt_models.py:1035 super()._transcribe_on_end()
→ nemo/collections/asr/parts/mixins/transcription.py:795 self.encoder.unfreeze(partial=True)
→ nemo/core/classes/module.py:114 raise ValueError(...)
2. AttributeError (1/98 crashes):
AttributeError: 'ConformerEncoder' object has no attribute '_frozen_grad_map'
Same path but at module.py:124 — _frozen_grad_map deleted by another thread mid-iteration.
Race condition mechanism:
- Thread A calls model.transcribe() → _transcribe_on_begin() calls self.encoder.freeze() → sets _frozen_grad_map
- Thread B calls model.transcribe() concurrently → also calls self.encoder.freeze() → overwrites _frozen_grad_map
- Thread A finishes → _transcribe_on_end() calls self.encoder.unfreeze(partial=True) → finds inconsistent state → ValueError
The AttributeError variant occurs when Thread B's unfreeze() deletes _frozen_grad_map while Thread A is reading it.
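The ValueError path can be replayed deterministically without threads or a GPU. The sketch below uses a hypothetical stand-in class, FakeEncoder, which is not NeMo's code. It only mimics the behavior described above: freeze() stores _frozen_grad_map, and unfreeze(partial=True) requires the attribute and deletes it afterwards. It shows one interleaving consistent with the observed error message, where whichever call reaches unfreeze() second finds the map already gone.

```python
class FakeEncoder:
    """Hypothetical stand-in for the encoder; NOT NeMo's implementation."""

    def __init__(self):
        # One fake parameter with its requires_grad flag.
        self.requires_grad = {"w": True}

    def freeze(self):
        # Record the current flags on the shared object, then disable them.
        self._frozen_grad_map = dict(self.requires_grad)
        for name in self.requires_grad:
            self.requires_grad[name] = False

    def unfreeze(self, partial=False):
        if partial and not hasattr(self, "_frozen_grad_map"):
            raise ValueError(
                "Cannot unfreeze partially without first freezing the module with `freeze()`"
            )
        for name in self.requires_grad:
            self.requires_grad[name] = self._frozen_grad_map[name] if partial else True
        # Clean up the shared attribute.
        if hasattr(self, "_frozen_grad_map"):
            del self._frozen_grad_map


def replay_interleaving():
    """Replay the two-thread schedule sequentially on one shared encoder."""
    encoder = FakeEncoder()
    encoder.freeze()                    # Thread A: _transcribe_on_begin()
    encoder.freeze()                    # Thread B: _transcribe_on_begin()
    encoder.unfreeze(partial=True)      # first _transcribe_on_end() deletes the map
    try:
        encoder.unfreeze(partial=True)  # second _transcribe_on_end() finds no map
    except ValueError as exc:
        return str(exc)
    return None


print(replay_interleaving())
```

Because the state lives on the shared encoder object rather than on the individual call, any two overlapping transcribe() calls can hit this schedule.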
Question for NeMo team: Is the freeze()/unfreeze() cycle in the transcribe path intentional behavior, or unintended? In #13988, @nithinraok asked "@titu1994 what is the reason behind unfreeze in transcription mixin?" — this was never answered.
Steps/Code to reproduce bug
Minimal self-contained reproduction:
import threading
import wave
import struct
import nemo.collections.asr as nemo_asr

# 1. Load model
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
model = model.to(device="cuda")
model.eval()

# 2. Generate test audio (5s, 16kHz mono)
with wave.open("/tmp/test.wav", "w") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<80000h", *([0] * 80000)))

# 3. Launch concurrent transcriptions
errors = []

def worker(tid):
    try:
        model.transcribe(["/tmp/test.wav"], timestamps=True)
        print(f"Thread {tid}: OK")
    except Exception as e:
        errors.append((tid, type(e).__name__, str(e)))
        print(f"Thread {tid}: {type(e).__name__}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"\nResult: {len(errors)}/10 crashed")
# Expected: 0/10 crashed
# Actual: ~7-8/10 crashed (75-83% failure rate)
Quantified results from stress test on GKE (125 requests total):
| Concurrency | Requests | Success | Failures | Error Rate |
|---|---|---|---|---|
| 1 (serial) | 5 | 5 | 0 | 0% |
| 5 | 20 | 5 | 15 | 75% |
| 33 | 100 | 17 | 83 | 83% |
Crashes begin at just 2 concurrent threads. Error rate stabilizes at ~75-83% regardless of concurrency level.
Expected behavior
Concurrent model.transcribe() calls return valid results when the model is in eval mode (model.eval()).
Environment overview (please complete the following information)
- Environment location: GKE (Google Kubernetes Engine), also reproduced on bare-metal
- Method of NeMo install: pip install nemo_toolkit[asr]
- Serving framework: FastAPI + uvicorn (single worker process, thread pool executor for concurrent requests)
Environment details
- OS: Ubuntu 22.04 (GKE node)
- PyTorch: 2.7.0+cu128
- Python: 3.11
- NeMo: 2.3.0 (also confirmed on 2.6.0)
- CUDA: 12.8
Additional context
- GPU: NVIDIA L4 (24GB VRAM)
- Model tested: nvidia/parakeet-tdt-0.6b-v3 (EncDecRNNTBPEModel). Also reproduced on parakeet-tdt-0.6b-v2 and parakeet-rnnt-1.1b.
- Production impact: 33 concurrent requests caused 2,103 HTTP 500 errors in ~10 minutes in our deployment.
- Zero CUDA errors observed — all 98 failures are Python-level state corruption (no OOM, no GPU crashes, no pod restarts).
NeMo code locations involved:
- nemo/collections/asr/parts/mixins/transcription.py — _transcribe_on_begin() calls encoder.freeze(), _transcribe_on_end() calls encoder.unfreeze(partial=True)
- nemo/core/classes/module.py — freeze() mutates _frozen_grad_map, unfreeze() reads/deletes it
Production incident reference:
This bug caused a production outage in Omi (open-source AI wearable, 13k+ stars), serving parakeet-tdt-0.6b-v3 on GKE with L4 GPU via FastAPI:
- PR: BasedHardware/omi#7653 — Parakeet ASR production deployment
- Incident report: PR comment — 33 concurrent /v2/transcribe requests caused 2,103 HTTP 500 errors in ~10 minutes (2026-06-08 06:15 UTC)
- Incident timeline: T+1h checkpoint — routing enabled at 05:30, crash at 06:15, rollback at 06:20, hotfix deployed at 06:28
- Root cause confirmation: NeMo research comment — confirmed model.transcribe() thread-safety as root cause
- Serialization fix: commit a540a76 — added threading.Semaphore(1) to serialize all model access
- Post-fix verification: T+24h final — 7,005 requests served, 0% error rate after serialization
- Tracking issue: BasedHardware/omi#7651
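For anyone hitting the same crash, the serialization workaround can be sketched roughly as follows. This is a minimal illustration of the threading.Semaphore(1) approach mentioned above, not the actual code from commit a540a76. The SerializedTranscriber name is hypothetical, and the wrapper assumes only that the wrapped object exposes a transcribe() method.

```python
import threading


class SerializedTranscriber:
    """Hypothetical wrapper that lets only one transcribe() call run at a time."""

    def __init__(self, model):
        self._model = model
        # A semaphore with a single permit behaves like a mutex here.
        self._gate = threading.Semaphore(1)

    def transcribe(self, *args, **kwargs):
        # Concurrent callers block here until the current call has finished,
        # so the freeze()/unfreeze() pair is never interleaved.
        with self._gate:
            return self._model.transcribe(*args, **kwargs)
```

Usage would look like `safe = SerializedTranscriber(model)` followed by `safe.transcribe([file_path], timestamps=True)` from each request handler. The trade-off is that requests no longer overlap on the GPU, so throughput is bounded by single-request latency.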
Stress test reproduction (dev environment):
After the production incident, we reproduced the crash on a dev GKE cluster by removing the semaphore and sending concurrent requests to the same model/NeMo version/GPU configuration:
- 125 total requests across 3 phases (serial baseline, concurrency=5, concurrency=33)
- 98/120 concurrent requests crashed (0/5 serial requests crashed)
- Pod logs: 97× ValueError + 1× AttributeError, zero CUDA errors, zero OOM, zero pod restarts
Related issues: #13988 (closed without fix), #5755 (memory leak, potentially related), #15423 (CUDA graph corruption)