`run_nvrtc()` corrupts CPython's interned 1-byte bytes singleton `b' '` when the NVRTC compile log is empty

**Describe the bug**

`nemo/core/utils/cuda_python_utils.py::run_nvrtc()` fetches the NVRTC compile log with:

```python
err, size = nvrtc.nvrtcGetProgramLogSize(prog)
buf = b" " * size
(err,) = nvrtc.nvrtcGetProgramLog(prog, buf)
```

On a clean compile the log is empty, so `size == 1`. CPython's bytes-repeat returns the original object for `* 1`, and (when the module is byte-compiled from source) the `b" "` literal is the **interned, immortal 1-byte singleton** shared by the whole process. `nvrtcGetProgramLog` then writes its C-string NUL terminator straight into that object's buffer.

Result: **every 1-byte `bytes` of value `0x20` in the entire process becomes `b'\x00'`, permanently** — `bytes([32])`, 1-byte slices, byte-wise stream reads, and every `b" "` literal (they all alias one cached object).
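
The aliasing described above can be observed without NVRTC. The following is a minimal sketch that only inspects object identity and never writes to any buffer. It relies on CPython implementation details (the `* 1` shortcut in bytes-repeat), and the helper name `repeat_aliases_original` is made up for illustration.

```python
def repeat_aliases_original(n: int) -> bool:
    """Return True if `b" " * n` is the very same object as its operand."""
    base = b" "
    # n is a runtime value, so the compiler cannot constant-fold the product.
    return (base * n) is base


if __name__ == "__main__":
    # On CPython a repeat count of 1 is expected to hand back the operand itself,
    # which is what lets a C-level write reach the shared object.
    print("size == 1 aliases the literal:", repeat_aliases_original(1))
    # Any larger count allocates a fresh bytes object.
    print("size == 2 aliases the literal:", repeat_aliases_original(2))
```

If the first line prints `True`, a buffer built with `b" " * size` for `size == 1` is not a private copy, which is the precondition for the corruption reported here.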

**Steps/Code to reproduce bug**

Self-contained — no GPU device, no driver, no NeMo import needed (verified in a plain `python:3.12-slim` container):

```bash
pip install cuda-python nvidia-cuda-nvrtc
```

```python
# reproduce.py
from cuda.bindings import nvrtc

err, prog = nvrtc.nvrtcCreateProgram(b'extern "C" __global__ void k(){}\n', b"k.cu", 0, [], [])
(err,) = nvrtc.nvrtcCompileProgram(prog, 0, [])
err, size = nvrtc.nvrtcGetProgramLogSize(prog)
print("logSize:", size)                  # 1  (empty log, just the NUL terminator)
print("before :", bytes([32]))           # b' '

# --- exact idiom from nemo/core/utils/cuda_python_utils.py::run_nvrtc ---
buf = b" " * size
(err,) = nvrtc.nvrtcGetProgramLog(prog, buf)

print("after  :", bytes([32]))           # b'\x00'   <- process-wide corruption
assert bytes([32])[0] == 32, "interned b' ' singleton corrupted!"
```

Output:

```text
logSize: 1
before : b' '
after  : b'\x00'
AssertionError: interned b' ' singleton corrupted!
```

(Note: a check like `bytes([32]) == b" "` cannot detect the corruption — the literal is the same corrupted object — hence the integer comparison.)

End-to-end trigger inside NeMo: any TDT/RNNT (and batched-CTC/MALSD) decode path that enables CUDA-graphs conditional nodes, e.g. `ASRModel.restore_from("parakeet-tdt-0.6b-v3.nemo").transcribe([wav])` on a host whose driver reports CUDA >= 12.6.

**Expected behavior**

Fetching the NVRTC log must not mutate a shared immutable object: `bytes([32])` must still be `b' '` afterwards. `run_nvrtc()` should write into a writable buffer, or at minimum never pass a `bytes` of length <= 1 to `nvrtcGetProgramLog`.

Notably, NVIDIA's own `cuda.core` already avoids this exact trap: its `Program` implementation guards the NVRTC log fetch with `if logsize > 1:` (so the buffer can never alias the 1-byte singleton; present since the very first `cuda.py` prototype commit, NVIDIA/cuda-python@3c7f0e17), and the newer NVVM paths use `bytearray(logsize)`.
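
One possible shape for the fix, modeled on the `if logsize > 1:` guard mentioned above, is sketched here. This is not the actual `cuda.core` or NeMo code: `read_program_log` and its two callable parameters are hypothetical stand-ins for thin wrappers around `nvrtcGetProgramLogSize` and `nvrtcGetProgramLog`, used so the sketch runs without the CUDA wheels.

```python
from typing import Callable


def read_program_log(
    get_log_size: Callable[[], int],
    get_log: Callable[[bytes], None],
) -> str:
    """Return the compile log as text, or "" when the log is empty.

    get_log_size() -> size in bytes, including the trailing NUL (hypothetical wrapper)
    get_log(buf)   -> fills buf in place, as the C library would (hypothetical wrapper)
    """
    size = get_log_size()
    if size <= 1:
        # Empty log: only the NUL terminator would be written, so skip the
        # fetch and never hand a 1-byte bytes object to the C library.
        return ""
    # size >= 2 here, so the repeat allocates a new object instead of
    # returning the b" " literal itself.
    buf = b" " * size
    get_log(buf)
    return buf.rstrip(b"\x00").decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in callables: an empty log (size 1) must not trigger a fetch.
    fetched = []
    print(repr(read_program_log(lambda: 1, fetched.append)))
    print("fetch calls:", len(fetched))
```

The guard keeps the existing `b" " * size` idiom for non-empty logs while removing the one size that aliases a shared object. Writing through a `bytes` object is still formally outside what Python guarantees, so a writable buffer remains the cleaner option where the binding accepts one.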

**Environment overview (please complete the following information)**

- Environment location: Cloud — AWS EC2 g4dn.xlarge (Tesla T4), Docker container based on `nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04`; host AMI `al2023-ami-ecs-gpu-hvm` (NVIDIA driver 580.159.03 = CUDA 13.0)
- Method of NeMo install: `uv pip install` (nemo-toolkit[asr]==2.6.0). Note: the installer matters for visibility of this bug — see Additional context.
- The minimal reproducer above also runs in a plain `docker run -it python:3.12-slim` with only the two pip wheels, no GPU.


**Environment details**

- OS: Ubuntu 22.04 (container) on Amazon Linux 2023 (host)
- Python: 3.12.13
- PyTorch: 2.4.0
- nemo-toolkit: 2.6.0 (the code is unchanged on `main` as of 2026-06)
- cuda-python / cuda-bindings: 13.0.3; nvidia-cuda-nvrtc: 13.0.88

**Additional context**
- GPU: Tesla T4 (but no GPU is required for the reproducer; NVRTC is host-side).
- The `b" " * size` idiom originates from the official cuda-python examples and is still taught today: [cuda_bindings/examples/extra/jit_program.py#L84-L88](https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/examples/extra/jit_program.py) (the exact `nvrtcGetProgramLogSize` -> `b" " * log_size` -> `nvrtcGetProgramLog` sequence `run_nvrtc` mirrors) and the [cuda-bindings overview tutorial](https://nvidia.github.io/cuda-python/cuda-bindings/latest/overview.html) (`ptx = b" " * ptxSize`). The binding itself ([nvrtc.pyx, `def nvrtcGetProgramLog(prog, char* log)`](https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/cuda/bindings/nvrtc.pyx)) is a zero-copy passthrough — Cython converts the `bytes` to the object's internal `char*` and the C library writes through it — so other projects copying the sample likely carry the same latent bug.
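- Why the installer matters: when a module is loaded from a `.pyc`, marshal materializes the constant as a private object (bypassing the 1-byte cache), so NVRTC merely NULs NeMo's own module constant. `uv` does not byte-compile installed packages by default, so the module is compiled from source at import time and the literal is the shared singleton. The transcript below runs the same reproducer from source and then from its `.pyc`: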
```text
# python3 /tmp/reproduce.py 
bytes([32]) before : b' '
compile OK, logSize: 1  (1 == empty log, just the NUL terminator)
bytes([32]) after  : b'\x00'
RESULT: process-wide bytes singleton b' ' is now b'\x00' (irreversible)
# python3 -m py_compile /tmp/reproduce.py
# ll /tmp/__pycache__/reproduce.cpython-312.pyc 
-rw-r--r--. 1 root root 4454 Jun 11 09:51 /tmp/__pycache__/reproduce.cpython-312.pyc
# python3 /tmp/__pycache__/reproduce.cpython-312.pyc
bytes([32]) before : b' '
compile OK, logSize: 1  (1 == empty log, just the NUL terminator)
bytes([32]) after  : b' '
RESULT: singleton intact
```
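
The `.pyc` behaviour can be probed without NVRTC, since `.pyc` files store their code objects with `marshal`. The sketch below round-trips a 1-byte `bytes` through `marshal` and reports whether the result is the same object. The helper name `marshal_roundtrip_aliases` is made up for illustration, and the outcome is a CPython implementation detail that may differ between versions.

```python
import marshal


def marshal_roundtrip_aliases(value: bytes) -> bool:
    """Return True if unmarshalling `value` yields the very same object back."""
    restored = marshal.loads(marshal.dumps(value))
    # The contents always survive the round trip; only identity is in question.
    assert restored == value
    return restored is value


if __name__ == "__main__":
    literal = b" "
    # False means marshal built a private copy instead of reusing the cached
    # object, which matches the "singleton intact" run in the transcript above.
    print("round-tripped b' ' is the literal:", marshal_roundtrip_aliases(literal))
```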
