
model.transcribe() is not thread-safe: encoder.freeze()/unfreeze() race causes ValueError under concurrent inference #15771

Description

@beastoin

Describe the bug

model.transcribe() mutates shared encoder state via freeze()/unfreeze() calls in _transcribe_on_begin()/_transcribe_on_end(), causing ValueError and AttributeError crashes when called concurrently from multiple threads.

Two distinct failure modes observed (98 crashes analyzed):

1. ValueError (97/98 crashes):

ValueError: Cannot unfreeze partially without first freezing the module with `freeze()`

Full traceback:

model.transcribe([file_path], timestamps=True)
  → nemo/collections/asr/parts/mixins/transcription.py:410  _transcribe_on_end()
    → nemo/collections/asr/models/rnnt_models.py:1035       super()._transcribe_on_end()
      → nemo/collections/asr/parts/mixins/transcription.py:795  self.encoder.unfreeze(partial=True)
        → nemo/core/classes/module.py:114  raise ValueError(...)

2. AttributeError (1/98 crashes):

AttributeError: 'ConformerEncoder' object has no attribute '_frozen_grad_map'

Same path, but raised at module.py:124: _frozen_grad_map was deleted by another thread mid-iteration.

Race condition mechanism:

  1. Thread A calls model.transcribe() → _transcribe_on_begin() calls self.encoder.freeze() → sets _frozen_grad_map
  2. Thread B calls model.transcribe() concurrently → also calls self.encoder.freeze() → overwrites _frozen_grad_map
  3. Thread A finishes → _transcribe_on_end() calls self.encoder.unfreeze(partial=True) → finds inconsistent state → ValueError

The AttributeError variant occurs when Thread B's unfreeze() deletes _frozen_grad_map while Thread A is reading it.
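The interleaving above can be replayed deterministically without NeMo or threads. The sketch below is a toy model of the pattern, not NeMo's actual code: `ToyModule` and `replay_interleaving` are hypothetical names, and the bodies of `freeze()`/`unfreeze()` are assumptions based only on the behavior described in this issue (freeze writes `_frozen_grad_map`, unfreeze reads and deletes it).

```python
class ToyModule:
    """Hypothetical stand-in for a module whose freeze state lives on the instance."""

    def __init__(self):
        self.requires_grad = {"weight": True, "bias": True}

    def freeze(self):
        # Assumed behavior: snapshot the current flags, then disable gradients.
        self._frozen_grad_map = dict(self.requires_grad)
        for name in self.requires_grad:
            self.requires_grad[name] = False

    def unfreeze(self, partial=False):
        # Assumed behavior: partial unfreeze needs the snapshot taken by freeze().
        if partial and not hasattr(self, "_frozen_grad_map"):
            raise ValueError(
                "Cannot unfreeze partially without first freezing the module with `freeze()`"
            )
        for name in self.requires_grad:
            self.requires_grad[name] = self._frozen_grad_map[name] if partial else True
        # The snapshot is shared by every caller, so deleting it affects all of them.
        if hasattr(self, "_frozen_grad_map"):
            del self._frozen_grad_map


def replay_interleaving():
    """Replay one schedule of two callers A and B sharing the same encoder."""
    encoder = ToyModule()
    encoder.freeze()                    # A: begin transcription
    encoder.freeze()                    # B: begin transcription, overwrites the snapshot
    encoder.unfreeze(partial=True)      # B: end transcription, deletes the snapshot
    try:
        encoder.unfreeze(partial=True)  # A: end transcription, snapshot is gone
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay_interleaving())
```

Because the snapshot is per-module rather than per-call, any schedule in which one caller finishes while another is still in flight leaves the second caller without the state it expects.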

Question for NeMo team: Is the freeze()/unfreeze() cycle in the transcribe path intentional behavior, or unintended? In #13988, @nithinraok asked "@titu1994 what is the reason behind unfreeze in transcription mixin?" — this was never answered.

Steps/Code to reproduce bug

Minimal self-contained reproduction:

import threading
import wave
import struct
import nemo.collections.asr as nemo_asr

# 1. Load model
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
model = model.to(device="cuda")
model.eval()

# 2. Generate test audio (5s, 16kHz mono)
with wave.open("/tmp/test.wav", "w") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<80000h", *([0] * 80000)))

# 3. Launch concurrent transcriptions
errors = []

def worker(tid):
    try:
        model.transcribe(["/tmp/test.wav"], timestamps=True)
        print(f"Thread {tid}: OK")
    except Exception as e:
        errors.append((tid, type(e).__name__, str(e)))
        print(f"Thread {tid}: {type(e).__name__}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"\nResult: {len(errors)}/10 crashed")
# Expected: 0/10 crashed
# Actual:   ~7-8/10 crashed (75-83% failure rate)

Quantified results from stress test on GKE (125 requests total):

Concurrency   Requests   Success   Failures   Error rate
1 (serial)    5          5         0          0%
5             20         5         15         75%
33            100        17        83         83%

Crashes begin at just 2 concurrent threads. Error rate stabilizes at ~75-83% regardless of concurrency level.
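As a quick cross-check of the figures above, the per-phase and pooled error rates can be recomputed from the raw counts. The variable and function names here are illustrative only; the numbers are taken from the stress-test results in this report.

```python
# (concurrency label, requests, failures) from the stress-test results
phases = [
    ("1 (serial)", 5, 0),
    ("5", 20, 15),
    ("33", 100, 83),
]


def error_rate(requests, failures):
    """Return the failure percentage for one phase."""
    return 100.0 * failures / requests


rates = {label: error_rate(req, fail) for label, req, fail in phases}

# Pool only the concurrent phases (everything except the serial baseline).
concurrent = phases[1:]
pooled_requests = sum(req for _, req, _ in concurrent)
pooled_failures = sum(fail for _, _, fail in concurrent)
pooled_rate = error_rate(pooled_requests, pooled_failures)

if __name__ == "__main__":
    for label, rate in rates.items():
        print(f"concurrency {label}: {rate:.1f}%")
    print(f"pooled concurrent: {pooled_failures}/{pooled_requests} = {pooled_rate:.1f}%")
```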

Expected behavior

Concurrent model.transcribe() calls return valid results when the model is in eval mode (model.eval()).

Environment overview

  • Environment location: GKE (Google Kubernetes Engine), also reproduced on bare-metal
  • Method of NeMo install: pip install nemo_toolkit[asr]
  • Serving framework: FastAPI + uvicorn (single worker process, thread pool executor for concurrent requests)

Environment details

  • OS: Ubuntu 22.04 (GKE node)
  • PyTorch: 2.7.0+cu128
  • Python: 3.11
  • NeMo: 2.3.0 (also confirmed on 2.6.0)
  • CUDA: 12.8

Additional context

  • GPU: NVIDIA L4 (24GB VRAM)
  • Model tested: nvidia/parakeet-tdt-0.6b-v3 (EncDecRNNTBPEModel). Also reproduced on parakeet-tdt-0.6b-v2 and parakeet-rnnt-1.1b.
  • Production impact: 33 concurrent requests caused 2,103 HTTP 500 errors in ~10 minutes in our deployment.
  • Zero CUDA errors observed — all 98 failures are Python-level state corruption (no OOM, no GPU crashes, no pod restarts).

NeMo code locations involved:

  1. nemo/collections/asr/parts/mixins/transcription.py: _transcribe_on_begin() calls encoder.freeze(), _transcribe_on_end() calls encoder.unfreeze(partial=True)
  2. nemo/core/classes/module.py: freeze() mutates _frozen_grad_map, unfreeze() reads/deletes it

Production incident reference:

This bug caused a production outage in Omi (open-source AI wearable, 13k+ stars), serving parakeet-tdt-0.6b-v3 on GKE with L4 GPU via FastAPI:

  • PR: BasedHardware/omi#7653 — Parakeet ASR production deployment
  • Incident report: PR comment — 33 concurrent /v2/transcribe requests caused 2,103 HTTP 500 errors in ~10 minutes (2026-06-08 06:15 UTC)
  • Incident timeline: T+1h checkpoint — routing enabled at 05:30, crash at 06:15, rollback at 06:20, hotfix deployed at 06:28
  • Root cause confirmation: NeMo research comment — confirmed model.transcribe() thread-safety as root cause
  • Serialization fix: commit a540a76 — added threading.Semaphore(1) to serialize all model access
  • Post-fix verification: T+24h final — 7,005 requests served, 0% error rate after serialization
  • Tracking issue: BasedHardware/omi#7651
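The serialization workaround listed above can be sketched roughly as follows. This is not the code from commit a540a76; `SerializedTranscriber`, `FakeModel`, and `run_demo` are hypothetical names, and `FakeModel` is a stub standing in for the NeMo model so the example runs without a GPU. The only idea carried over from the fix is guarding every model call with `threading.Semaphore(1)`.

```python
import threading
import time


class SerializedTranscriber:
    """Wrap a model so that only one transcribe() call runs at a time."""

    def __init__(self, model):
        self._model = model
        self._gate = threading.Semaphore(1)

    def transcribe(self, *args, **kwargs):
        # Every caller waits here until the previous call has fully finished.
        with self._gate:
            return self._model.transcribe(*args, **kwargs)


class FakeModel:
    """Stub model that records how many transcribe() calls overlap."""

    def __init__(self):
        self._guard = threading.Lock()
        self.active = 0
        self.max_active = 0

    def transcribe(self, paths, timestamps=False):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate inference time
        with self._guard:
            self.active -= 1
        return [f"transcript of {p}" for p in paths]


def run_demo(num_threads=8):
    """Call the wrapped stub from several threads and collect the results."""
    fake = FakeModel()
    safe = SerializedTranscriber(fake)
    results = [None] * num_threads

    def worker(i):
        results[i] = safe.transcribe(["/tmp/test.wav"], timestamps=True)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fake, results


if __name__ == "__main__":
    fake, results = run_demo()
    print(f"max overlapping calls: {fake.max_active}, results: {len(results)}")
```

The trade-off is throughput: requests queue behind one another, so the GPU handles a single transcription at a time even when more requests are pending.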

Stress test reproduction (dev environment):

After the production incident, we reproduced the crash on a dev GKE cluster by removing the semaphore and sending concurrent requests to the same model/NeMo version/GPU configuration:

  • 125 total requests across 3 phases (serial baseline, concurrency=5, concurrency=33)
  • 98/120 concurrent requests crashed (0/5 serial requests crashed)
  • Pod logs: 97× ValueError + 1× AttributeError, zero CUDA errors, zero OOM, zero pod restarts

Related issues: #13988 (closed without fix), #5755 (memory leak, potentially related), #15423 (CUDA graph corruption)
