Describe the bug
nemo/core/utils/cuda_python_utils.py::run_nvrtc() fetches the NVRTC compile log with:
err, size = nvrtc.nvrtcGetProgramLogSize(prog)
buf = b" " * size
(err,) = nvrtc.nvrtcGetProgramLog(prog, buf)
On a clean compile the log is empty, so size == 1. CPython's bytes-repeat returns the original object for * 1, and (when the module is byte-compiled from source) the b" " literal is the interned, immortal 1-byte singleton shared by the whole process. nvrtcGetProgramLog then writes its C-string NUL terminator straight into that object's buffer.
Result: every 1-byte bytes of value 0x20 in the entire process becomes b'\x00', permanently — bytes([32]), 1-byte slices, byte-wise stream reads, and every b" " literal (they all alias one cached object).
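The aliasing step can be checked on its own, without NVRTC and without mutating anything. The sketch below is illustrative only: `repeat_aliases_source` is a hypothetical helper name, and the identity result is a CPython implementation detail rather than a language guarantee.

```python
def repeat_aliases_source(n: int) -> bool:
    """Return True when `b" " * n` is the very same object as the operand."""
    space = b" "
    # n is a runtime value here, so the compiler cannot constant-fold the product.
    buf = space * n
    return buf is space


# n == 1 is what nvrtcGetProgramLogSize reports for an empty log.
print("n=1 aliases operand:", repeat_aliases_source(1))
print("n=2 aliases operand:", repeat_aliases_source(2))
```

On CPython the first check reports an alias and the second does not, which is why only the empty-log case writes into a shared object.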
Steps/Code to reproduce bug
Self-contained — no GPU device, no driver, no NeMo import needed (verified in a plain python:3.12-slim container):
pip install cuda-python nvidia-cuda-nvrtc
# reproduce.py
from cuda.bindings import nvrtc
err, prog = nvrtc.nvrtcCreateProgram(b'extern "C" __global__ void k(){}\n', b"k.cu", 0, [], [])
(err,) = nvrtc.nvrtcCompileProgram(prog, 0, [])
err, size = nvrtc.nvrtcGetProgramLogSize(prog)
print("logSize:", size) # 1 (empty log, just the NUL terminator)
print("before :", bytes([32])) # b' '
# --- exact idiom from nemo/core/utils/cuda_python_utils.py::run_nvrtc ---
buf = b" " * size
(err,) = nvrtc.nvrtcGetProgramLog(prog, buf)
print("after :", bytes([32])) # b'\x00' <- process-wide corruption
assert bytes([32])[0] == 32, "interned b' ' singleton corrupted!"
Output:
logSize: 1
before : b' '
after : b'\x00'
AssertionError: interned b' ' singleton corrupted!
(Note: a check like bytes([32]) == b" " cannot detect the corruption — the literal is the same corrupted object — hence the integer comparison.)
End-to-end trigger inside NeMo: any TDT/RNNT (and batched-CTC/MALSD) decode path that enables CUDA-graphs conditional nodes, e.g. ASRModel.restore_from("parakeet-tdt-0.6b-v3.nemo").transcribe([wav]) on a host whose driver reports CUDA >= 12.6.
Expected behavior
Fetching the NVRTC log must not mutate a shared immutable object: bytes([32]) must still be b' ' afterwards. run_nvrtc should use a writable buffer, or at minimum never pass a bytes object of length <= 1.
Notably, NVIDIA's own cuda.core already avoids this exact trap: its Program implementation guards the NVRTC log fetch with if logsize > 1: (so the buffer can never alias the 1-byte singleton; present since the very first cuda.py prototype commit, NVIDIA/cuda-python@3c7f0e17) and the newer NVVM paths use bytearray(logsize).
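A minimal sketch of a guarded log fetch along those lines, not a proposed patch. `fetch_compile_log` and its two callable parameters are hypothetical names introduced here so the logic can be shown without importing cuda-python; in `run_nvrtc` they would wrap `nvrtc.nvrtcGetProgramLogSize(prog)` and `nvrtc.nvrtcGetProgramLog(prog, buf)`. Whether the binding accepts a `bytearray` is an assumption; if it only accepts `bytes`, the `size > 1` guard alone is enough, because `b" " * size` with `size > 1` is a freshly allocated object.

```python
from typing import Callable


def fetch_compile_log(
    get_log_size: Callable[[], int],
    get_log: Callable[[bytearray], None],
) -> str:
    """Return the compile log as text, or "" when the log is empty."""
    size = get_log_size()
    # A size of 1 means "only the NUL terminator": skip the fetch entirely,
    # so no 1-byte buffer is ever handed to the C library.
    if size <= 1:
        return ""
    # Fresh, writable, zero-filled buffer owned by this call alone.
    # Assumption: the real binding accepts a bytearray for its char* argument.
    buf = bytearray(size)
    get_log(buf)
    # Drop the trailing NUL terminator before decoding.
    return bytes(buf).rstrip(b"\x00").decode("utf-8", errors="replace")


# Stand-ins for the NVRTC calls, used only to exercise the helper.
def fake_size() -> int:
    return 6


def fake_log(buf: bytearray) -> None:
    buf[:6] = b"hello\x00"


print(repr(fetch_compile_log(fake_size, fake_log)))
```

The early return keeps the empty-log case away from the C call altogether, which is the same property the `if logsize > 1:` guard provides.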
Environment overview (please complete the following information)
- Environment location: Cloud, AWS EC2 g4dn.xlarge (Tesla T4), Docker container based on nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04; host AMI al2023-ami-ecs-gpu-hvm (NVIDIA driver 580.159.03 = CUDA 13.0)
- Method of NeMo install: uv pip install (nemo-toolkit[asr]==2.6.0). Note: the installer affects how visible this bug is; see Additional context.
- The minimal reproducer above also runs in a plain docker run -it python:3.12-slim with only the two pip wheels, no GPU.
Environment details
- OS: Ubuntu 22.04 (container) on Amazon Linux 2023 (host)
- Python: 3.12.13
- PyTorch: 2.4.0
- nemo-toolkit: 2.6.0 (the code is unchanged on main as of 2026-06)
- cuda-python / cuda-bindings: 13.0.3; nvidia-cuda-nvrtc: 13.0.88
Additional context
- GPU: Tesla T4 (but no GPU is required for the reproducer; NVRTC is host-side).
- The b" " * size idiom originates from the official cuda-python examples and is still taught today: cuda_bindings/examples/extra/jit_program.py#L84-L88 (the exact nvrtcGetProgramLogSize -> b" " * log_size -> nvrtcGetProgramLog sequence that run_nvrtc mirrors) and the cuda-bindings overview tutorial (ptx = b" " * ptxSize). The binding itself (nvrtc.pyx, def nvrtcGetProgramLog(prog, char* log)) is a zero-copy passthrough: Cython converts the bytes to the object's internal char* and the C library writes through it. Other projects that copied the sample therefore likely carry the same latent bug.
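- When the module is loaded from a .pyc, marshal materializes the constant as a private object (bypassing the 1-byte cache), so NVRTC merely NULs NeMo's own module constant. uv does not pre-compile by default, so the first import after a uv install compiles from source and takes the process-wide path. The same reproducer, run first from source and then from its .pyc: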
# python3 /tmp/reproduce.py
bytes([32]) before : b' '
compile OK, logSize: 1 (1 == empty log, just the NUL terminator)
bytes([32]) after : b'\x00'
RESULT: process-wide bytes singleton b' ' is now b'\x00' (irreversible)
# python3 -m py_compile /tmp/reproduce.py
# ll /tmp/__pycache__/reproduce.cpython-312.pyc
-rw-r--r--. 1 root root 4454 Jun 11 09:51 /tmp/__pycache__/reproduce.cpython-312.pyc
# python3 /tmp/__pycache__/reproduce.cpython-312.pyc
bytes([32]) before : b' '
compile OK, logSize: 1 (1 == empty log, just the NUL terminator)
bytes([32]) after : b' '
RESULT: singleton intact
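Since the outcome depends on whether the module is compiled from source or loaded from cached bytecode, it can help to check which case applies before reproducing. The sketch below is illustrative only: `has_cached_bytecode` is a hypothetical helper, and it only reports whether a cache file exists at the default location, not whether the interpreter will consider it valid and actually load it.

```python
import importlib.util
import os


def has_cached_bytecode(source_path: str) -> bool:
    """Return True if a .pyc for source_path exists in the default cache location."""
    # cache_from_source maps e.g. pkg/mod.py to pkg/__pycache__/mod.cpython-312.pyc
    pyc_path = importlib.util.cache_from_source(source_path)
    return os.path.exists(pyc_path)


# Example: inspect this very file when it is run as a script.
if __name__ == "__main__":
    print("cached bytecode present:", has_cached_bytecode(os.path.abspath(__file__)))
```

Pointing it at the installed nemo/core/utils/cuda_python_utils.py would indicate whether the next import is likely to compile from source.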