run_nvrtc() corrupts CPython's interned 1-byte bytes singleton b' ' when the NVRTC compile log is empty #15790

Description

@wrchen-ef

Describe the bug

nemo/core/utils/cuda_python_utils.py::run_nvrtc() fetches the NVRTC compile log with:

err, size = nvrtc.nvrtcGetProgramLogSize(prog)
buf = b" " * size
(err,) = nvrtc.nvrtcGetProgramLog(prog, buf)

On a clean compile the log is empty, so size == 1. CPython's bytes-repeat returns the original object for * 1, and (when the module is byte-compiled from source) the b" " literal is the interned, immortal 1-byte singleton shared by the whole process. nvrtcGetProgramLog then writes its C-string NUL terminator straight into that object's buffer.

Result: every 1-byte bytes object of value 0x20 in the entire process becomes b'\x00', permanently: bytes([32]), 1-byte slices, byte-wise stream reads, and every b" " literal (they all alias one cached object).
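The aliasing described above can be observed without NVRTC at all. The following is a minimal sketch that only inspects object identity and never writes to any buffer; the helper names are made up for illustration, and the results reflect CPython implementation details rather than language guarantees.

```python
import sys


def repeat_once_aliases(value: bytes, count: int = 1) -> bool:
    """Return True if value * count hands back the very same object."""
    # count is a runtime variable, so the compiler cannot constant-fold the repeat.
    return (value * count) is value


def one_byte_values_alias() -> bool:
    """Return True if two independently built 1-byte bytes share one object."""
    return bytes([32]) is bytes([32])


if __name__ == "__main__":
    print(sys.implementation.name, sys.version_info[:2])
    print("b' ' * 1 is the same object:", repeat_once_aliases(b" "))
    print("b' ' * 2 is the same object:", repeat_once_aliases(b" ", 2))
    print("bytes([32]) is bytes([32])  :", one_byte_values_alias())
```

If the first and third checks report True, a write through the buffer of `b" " * 1` lands in an object that the rest of the process also sees.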

Steps/Code to reproduce bug

Self-contained — no GPU device, no driver, no NeMo import needed (verified in a plain python:3.12-slim container):

pip install cuda-python nvidia-cuda-nvrtc
# reproduce.py
from cuda.bindings import nvrtc

err, prog = nvrtc.nvrtcCreateProgram(b'extern "C" __global__ void k(){}\n', b"k.cu", 0, [], [])
(err,) = nvrtc.nvrtcCompileProgram(prog, 0, [])
err, size = nvrtc.nvrtcGetProgramLogSize(prog)
print("logSize:", size)                  # 1  (empty log, just the NUL terminator)
print("before :", bytes([32]))           # b' '

# --- exact idiom from nemo/core/utils/cuda_python_utils.py::run_nvrtc ---
buf = b" " * size
(err,) = nvrtc.nvrtcGetProgramLog(prog, buf)

print("after  :", bytes([32]))           # b'\x00'   <- process-wide corruption
assert bytes([32])[0] == 32, "interned b' ' singleton corrupted!"

Output:

logSize: 1
before : b' '
after  : b'\x00'
AssertionError: interned b' ' singleton corrupted!

(Note: a check like bytes([32]) == b" " cannot detect the corruption — the literal is the same corrupted object — hence the integer comparison.)
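For anyone who wants to add a guard to a test suite, here is a small sketch of the integer-based check the note describes. Both function names are hypothetical. The second helper simply extends the same idea to all 256 single-byte values, on the assumption that each of them may be served from a cache.

```python
def space_singleton_intact() -> bool:
    """Check the 1-byte value 0x20 by its integer content.

    Comparing against a b" " literal could compare the object with an
    alias of itself, which stays equal even after corruption.
    """
    return bytes([32])[0] == 0x20


def corrupted_single_bytes() -> list[int]:
    """Return every byte value whose 1-byte object no longer holds that value."""
    return [value for value in range(256) if bytes([value])[0] != value]


if __name__ == "__main__":
    print("b' ' intact:", space_singleton_intact())
    print("corrupted values:", corrupted_single_bytes())
```

In a healthy process the list is empty; after the reproducer above it would be expected to contain 32.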

End-to-end trigger inside NeMo: any TDT/RNNT (and batched-CTC/MALSD) decode path that enables CUDA-graphs conditional nodes, e.g. ASRModel.restore_from("parakeet-tdt-0.6b-v3.nemo").transcribe([wav]) on a host whose driver reports CUDA >= 12.6.

Expected behavior

Fetching the NVRTC log must not mutate a shared immutable object: bytes([32]) must still be b' ' afterwards. run_nvrtc() should pass a writable buffer, or at minimum never pass a bytes object of length <= 1.

Notably, NVIDIA's own cuda.core already avoids this exact trap. Its Program implementation guards the NVRTC log fetch with if logsize > 1:, so the buffer can never alias the 1-byte singleton; the guard has been present since the very first cuda.py prototype commit (NVIDIA/cuda-python@3c7f0e17). The newer NVVM paths use bytearray(logsize).
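One possible shape for the fix, following the size guard described above, is sketched here. This is not NeMo's or cuda.core's actual code: `fetch_log` and its `write_log` parameter are hypothetical, and `write_log` stands in for something like `lambda buf: nvrtc.nvrtcGetProgramLog(prog, buf)` that does its own error handling.

```python
from typing import Callable


def fetch_log(size: int, write_log: Callable[[bytes], object]) -> str:
    """Fetch a NUL-terminated log without handing out a shared 1-byte object.

    size      -- value reported by the size query (it includes the NUL)
    write_log -- hypothetical callable that fills the buffer it is given
    """
    if size <= 1:
        # Empty log: only the terminator would be written, so skip the call.
        return ""
    # size > 1 here, so the repeat builds a new object instead of
    # returning the original 1-byte operand.
    buf = b" " * size
    write_log(buf)
    # Keep everything before the first NUL terminator.
    return buf.split(b"\x00", 1)[0].decode(errors="replace")
```

With this guard an empty log never reaches the C call, and for longer logs the buffer is private to the caller. Passing a bytearray instead would avoid writing into a bytes object altogether, assuming the binding accepts one for this function.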

Environment overview (please complete the following information)

  • Environment location: Cloud — AWS EC2 g4dn.xlarge (Tesla T4), Docker container based on nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04; host AMI al2023-ami-ecs-gpu-hvm (NVIDIA driver 580.159.03 = CUDA 13.0)
  • Method of NeMo install: uv pip install (nemo-toolkit[asr]==2.6.0). Note: the installer matters for visibility of this bug — see Additional context.
  • The minimal reproducer above also runs in a plain docker run -it python:3.12-slim with only the two pip wheels, no GPU.

Environment details

  • OS: Ubuntu 22.04 (container) on Amazon Linux 2023 (host)
  • Python: 3.12.13
  • PyTorch: 2.4.0
  • nemo-toolkit: 2.6.0 (the code is unchanged on main as of 2026-06)
  • cuda-python / cuda-bindings: 13.0.3; nvidia-cuda-nvrtc: 13.0.88

Additional context

  • GPU: Tesla T4 (but no GPU is required for the reproducer; NVRTC is host-side).
  • The b" " * size idiom originates from the official cuda-python examples and is still taught today: cuda_bindings/examples/extra/jit_program.py#L84-L88 (the exact nvrtcGetProgramLogSize -> b" " * log_size -> nvrtcGetProgramLog sequence run_nvrtc mirrors) and the cuda-bindings overview tutorial (ptx = b" " * ptxSize). The binding itself (nvrtc.pyx, def nvrtcGetProgramLog(prog, char* log)) is a zero-copy passthrough — Cython converts the bytes to the object's internal char* and the C library writes through it — so other projects copying the sample likely carry the same latent bug.
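  • When the module is loaded from a .pyc, marshal materializes the constant as a private object (bypassing the 1-byte cache), so NVRTC merely NULs NeMo's own module constant. uv does not pre-compile by default, so a freshly installed module is compiled from source on first import and the write lands in the shared singleton. The same reproducer, run from source and then from its .pyc: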
# python3 /tmp/reproduce.py 
bytes([32]) before : b' '
compile OK, logSize: 1  (1 == empty log, just the NUL terminator)
bytes([32]) after  : b'\x00'
RESULT: process-wide bytes singleton b' ' is now b'\x00' (irreversible)
# python3 -m py_compile /tmp/reproduce.py
# ll /tmp/__pycache__/reproduce.cpython-312.pyc 
-rw-r--r--. 1 root root 4454 Jun 11 09:51 /tmp/__pycache__/reproduce.cpython-312.pyc
# python3 /tmp/__pycache__/reproduce.cpython-312.pyc
bytes([32]) before : b' '
compile OK, logSize: 1  (1 == empty log, just the NUL terminator)
bytes([32]) after  : b' '
RESULT: singleton intact
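The source-versus-.pyc difference can also be probed without NVRTC. The sketch below only compares object identity after a marshal round-trip; the function name is made up, and the answer is an implementation detail that may differ between CPython versions, so no particular result is asserted here.

```python
import marshal


def marshal_roundtrip_is_private(value: bytes = b" ") -> bool:
    """Return True if a marshal round-trip of a 1-byte bytes yields an
    object distinct from the one bytes([...]) returns for the same value."""
    if len(value) != 1:
        raise ValueError("expected a 1-byte bytes object")
    loaded = marshal.loads(marshal.dumps(value))
    # The value must survive the round-trip regardless of identity.
    if loaded != value:
        raise AssertionError("marshal round-trip changed the value")
    return loaded is not bytes([value[0]])


if __name__ == "__main__":
    print("marshal-loaded b' ' is a private object:", marshal_roundtrip_is_private())
```

A True result would be consistent with the transcript above, where the run from the .pyc leaves bytes([32]) intact.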
