model.transcribe() is not thread-safe: encoder.freeze()/unfreeze() race causes ValueError under concurrent inference

**Describe the bug**

`model.transcribe()` mutates shared encoder state via `freeze()`/`unfreeze()` calls in `_transcribe_on_begin()`/`_transcribe_on_end()`, causing `ValueError` and `AttributeError` crashes when called concurrently from multiple threads.

Two distinct failure modes observed (98 crashes analyzed):

**1. ValueError (97/98 crashes):**
```
ValueError: Cannot unfreeze partially without first freezing the module with `freeze()`
```

Full traceback:
```
model.transcribe([file_path], timestamps=True)
  → nemo/collections/asr/parts/mixins/transcription.py:410  _transcribe_on_end()
    → nemo/collections/asr/models/rnnt_models.py:1035       super()._transcribe_on_end()
      → nemo/collections/asr/parts/mixins/transcription.py:795  self.encoder.unfreeze(partial=True)
        → nemo/core/classes/module.py:114  raise ValueError(...)
```

**2. AttributeError (1/98 crashes):**
```
AttributeError: 'ConformerEncoder' object has no attribute '_frozen_grad_map'
```
Same call path, but the failure occurs at `module.py:124`: another thread deleted `_frozen_grad_map` mid-iteration.

**Race condition mechanism:**
1. Thread A calls `model.transcribe()` → `_transcribe_on_begin()` calls `self.encoder.freeze()` → sets `_frozen_grad_map`
2. Thread B calls `model.transcribe()` concurrently → also calls `self.encoder.freeze()` → overwrites `_frozen_grad_map`
3. Thread B finishes first → `_transcribe_on_end()` calls `self.encoder.unfreeze(partial=True)` → restores the flags and deletes `_frozen_grad_map`
4. Thread A finishes → `_transcribe_on_end()` calls `self.encoder.unfreeze(partial=True)` → `_frozen_grad_map` no longer exists → **ValueError**

The `AttributeError` variant occurs when Thread B's `unfreeze()` deletes `_frozen_grad_map` while Thread A is reading it.
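The interleaving can be replayed deterministically without threads. The sketch below uses a toy class, not NeMo's code: it assumes only what the error message and the description above suggest, namely that `freeze()` stores a map, and that `unfreeze(partial=True)` raises when the map is missing and deletes it after restoring the flags. `ToyEncoder` and its fields are hypothetical names.

```python
class ToyEncoder:
    """Toy stand-in for a module with freeze()/unfreeze(); not NeMo's implementation."""

    def __init__(self):
        self.requires_grad = {"weight": True, "bias": True}

    def freeze(self):
        # Remember the current flags on the shared object, then disable gradients.
        self._frozen_grad_map = dict(self.requires_grad)
        for name in self.requires_grad:
            self.requires_grad[name] = False

    def unfreeze(self, partial=False):
        if partial and not hasattr(self, "_frozen_grad_map"):
            raise ValueError("Cannot unfreeze partially without first freezing")
        for name in self.requires_grad:
            self.requires_grad[name] = self._frozen_grad_map[name] if partial else True
        # The map is shared state: deleting it affects every in-flight call.
        if hasattr(self, "_frozen_grad_map"):
            del self._frozen_grad_map


encoder = ToyEncoder()
encoder.freeze()                    # call A begins
encoder.freeze()                    # call B begins and overwrites the map
encoder.unfreeze(partial=True)      # first call to finish deletes the map
try:
    encoder.unfreeze(partial=True)  # second call to finish finds no map
    outcome = "ok"
except ValueError as exc:
    outcome = f"ValueError: {exc}"
print(outcome)
```

Whichever call finishes second hits the missing map, which matches the `ValueError` text in the traceback.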

**Question for NeMo team:** Is the `freeze()`/`unfreeze()` cycle in the transcribe path intentional behavior, or unintended? In #13988, @nithinraok asked "@titu1994 what is the reason behind unfreeze in transcription mixin?" — this was never answered.

**Steps/Code to reproduce bug**

Minimal self-contained reproduction:

```python
import threading
import wave
import struct
import nemo.collections.asr as nemo_asr

# 1. Load model
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
model = model.to(device="cuda")
model.eval()

# 2. Generate test audio (5s, 16kHz mono)
with wave.open("/tmp/test.wav", "w") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<80000h", *([0] * 80000)))

# 3. Launch concurrent transcriptions
errors = []

def worker(tid):
    try:
        model.transcribe(["/tmp/test.wav"], timestamps=True)
        print(f"Thread {tid}: OK")
    except Exception as e:
        errors.append((tid, type(e).__name__, str(e)))
        print(f"Thread {tid}: {type(e).__name__}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"\nResult: {len(errors)}/10 crashed")
# Expected: 0/10 crashed
# Actual:   ~7-8/10 crashed (75-83% failure rate)
```

Quantified results from stress test on GKE (125 requests total):

| Concurrency | Requests | Success | Failures | Error Rate |
|-------------|----------|---------|----------|------------|
| 1 (serial)  | 5        | 5       | 0        | **0%**     |
| 5           | 20       | 5       | 15       | **75%**    |
| 33          | 100      | 17      | 83       | **83%**    |

Crashes begin at just 2 concurrent threads. The error rate was 75% at concurrency 5 and 83% at concurrency 33, so it rises only slightly as concurrency increases.
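For reference, the error rates follow directly from the table. This short calculation uses only the numbers reported above; the variable names are ours.

```python
# (requests, failures) per concurrency level, copied from the table above.
phases = {1: (5, 0), 5: (20, 15), 33: (100, 83)}

rates = {level: failures / requests for level, (requests, failures) in phases.items()}
for level, rate in rates.items():
    print(f"concurrency {level}: {rate:.0%} failed")

# Combined rate over the two concurrent phases only (serial baseline excluded).
concurrent_requests = sum(req for level, (req, _) in phases.items() if level > 1)
concurrent_failures = sum(fail for level, (_, fail) in phases.items() if level > 1)
overall = concurrent_failures / concurrent_requests
print(f"overall: {concurrent_failures}/{concurrent_requests} = {overall:.1%}")
```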

**Expected behavior**

Concurrent `model.transcribe()` calls return valid results when the model is in eval mode (`model.eval()`).

**Environment overview (please complete the following information)**

- Environment location: GKE (Google Kubernetes Engine), also reproduced on bare-metal
- Method of NeMo install: `pip install nemo_toolkit[asr]`
- Serving framework: FastAPI + uvicorn (single worker process, thread pool executor for concurrent requests)

**Environment details**

- OS: Ubuntu 22.04 (GKE node)
- PyTorch: 2.7.0+cu128
- Python: 3.11
- NeMo: 2.3.0 (also confirmed on 2.6.0)
- CUDA: 12.8

**Additional context**

- GPU: NVIDIA L4 (24GB VRAM)
- Model tested: `nvidia/parakeet-tdt-0.6b-v3` (EncDecRNNTBPEModel). Also reproduced on `parakeet-tdt-0.6b-v2` and `parakeet-rnnt-1.1b`.
- Production impact: 33 concurrent requests caused 2,103 HTTP 500 errors in ~10 minutes in our deployment.
- Zero CUDA errors observed — all 98 failures are Python-level state corruption (no OOM, no GPU crashes, no pod restarts).

**NeMo code locations involved:**
1. `nemo/collections/asr/parts/mixins/transcription.py` — `_transcribe_on_begin()` calls `encoder.freeze()`, `_transcribe_on_end()` calls `encoder.unfreeze(partial=True)`
2. `nemo/core/classes/module.py` — `freeze()` mutates `_frozen_grad_map`, `unfreeze()` reads/deletes it
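One possible direction for a fix, offered only as a sketch: reference-count the freeze so that the first concurrent call freezes the encoder and the last one to leave unfreezes it. `FreezeGuard` and `RecordingModule` are hypothetical names for illustration; nothing here is existing NeMo API, and a real fix might instead skip the freeze/unfreeze cycle entirely during inference.

```python
import threading
from contextlib import contextmanager


class FreezeGuard:
    """Hypothetical helper: freeze on first entry, unfreeze on last exit."""

    def __init__(self, module):
        self._module = module
        self._lock = threading.Lock()
        self._active = 0  # number of callers currently inside frozen()

    @contextmanager
    def frozen(self):
        with self._lock:
            if self._active == 0:
                self._module.freeze()
            self._active += 1
        try:
            yield  # the lock is not held while the caller runs inference
        finally:
            with self._lock:
                self._active -= 1
                if self._active == 0:
                    self._module.unfreeze(partial=True)


class RecordingModule:
    """Stand-in that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def freeze(self):
        self.calls.append("freeze")

    def unfreeze(self, partial=False):
        self.calls.append("unfreeze")


module = RecordingModule()
guard = FreezeGuard(module)
with guard.frozen():        # first caller: freezes
    with guard.frozen():    # overlapping caller: no second freeze
        pass
print(module.calls)
```

With overlapping callers the module sees exactly one `freeze()` and one `unfreeze()`, so no caller can delete state that another still needs.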

**Production incident reference:**

This bug caused a production outage in [Omi](https://github.com/BasedHardware/omi) (open-source AI wearable, 13k+ stars), serving `parakeet-tdt-0.6b-v3` on GKE with L4 GPU via FastAPI:

- **PR**: [BasedHardware/omi#7653](https://github.com/BasedHardware/omi/pull/7653) — Parakeet ASR production deployment
- **Incident report**: [PR comment](https://github.com/BasedHardware/omi/pull/7653#issuecomment-4645944074) — 33 concurrent `/v2/transcribe` requests caused 2,103 HTTP 500 errors in ~10 minutes (2026-06-08 06:15 UTC)
- **Incident timeline**: [T+1h checkpoint](https://github.com/BasedHardware/omi/pull/7653#issuecomment-4646075967) — routing enabled at 05:30, crash at 06:15, rollback at 06:20, hotfix deployed at 06:28
- **Root cause confirmation**: [NeMo research comment](https://github.com/BasedHardware/omi/pull/7653#issuecomment-4646107007) — confirmed `model.transcribe()` thread-safety as root cause
- **Serialization fix**: [commit a540a76](https://github.com/BasedHardware/omi/commit/a540a76b1dc5c231999135f68505e0c91b47ca86) — added `threading.Semaphore(1)` to serialize all model access
- **Post-fix verification**: [T+24h final](https://github.com/BasedHardware/omi/pull/7653#issuecomment-4647953215) — 7,005 requests served, 0% error rate after serialization
- **Tracking issue**: [BasedHardware/omi#7651](https://github.com/BasedHardware/omi/issues/7651)
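The serialization workaround referenced above can be sketched as a thin wrapper. This is a minimal illustration of the `threading.Semaphore(1)` approach, not the code from the linked commit; `SerializedTranscriber` and `FakeModel` are hypothetical names, and `FakeModel` only stands in for a loaded NeMo model so the example runs without a GPU.

```python
import threading
import time


class SerializedTranscriber:
    """Allow only one transcribe() call at a time on a shared model."""

    def __init__(self, model):
        self._model = model
        self._gate = threading.Semaphore(1)

    def transcribe(self, *args, **kwargs):
        with self._gate:
            return self._model.transcribe(*args, **kwargs)


class FakeModel:
    """Stand-in that records how many transcribe() calls overlap."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0
        self._stats_lock = threading.Lock()

    def transcribe(self, paths, **kwargs):
        with self._stats_lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
        time.sleep(0.01)  # pretend to do inference
        with self._stats_lock:
            self.in_flight -= 1
        return [f"transcript of {p}" for p in paths]


fake = FakeModel()
safe = SerializedTranscriber(fake)
threads = [threading.Thread(target=safe.transcribe, args=(["a.wav"],)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"max overlapping calls: {fake.max_in_flight}")
```

The trade-off is throughput: requests queue behind the semaphore, so this avoids the crash but does not make `transcribe()` itself safe to call concurrently.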

**Stress test reproduction (dev environment):**

After the production incident, we reproduced the crash on a dev GKE cluster by removing the semaphore and sending concurrent requests to the same model/NeMo version/GPU configuration:

- 125 total requests across 3 phases (serial baseline, concurrency=5, concurrency=33)
- 98/120 concurrent requests crashed (0/5 serial requests crashed)
- Pod logs: 97× ValueError + 1× AttributeError, zero CUDA errors, zero OOM, zero pod restarts

**Related issues:** #13988 (closed without fix), #5755 (memory leak, potentially related), #15423 (CUDA graph corruption)
