
cuda-link

Zero-copy GPU texture transfer between TouchDesigner and Python processes using CUDA IPC.

Overview

This component enables zero-copy GPU texture sharing between TouchDesigner and Python processes using CUDA Inter-Process Communication (IPC). It eliminates CPU memory copies for real-time AI pipelines, video processing, and other GPU-accelerated workflows.

Key Features

  • Zero-copy GPU transfer - Textures stay on GPU, no CPU memory copies
  • Bidirectional IPC - TD → Python (input capture) AND Python → TD (AI output display)
  • Low-overhead IPC - export_frame() 22–367 µs p50 (512×512 → 4K float32, EXPORT_SYNC=1); get_frame_numpy() D2H 0.18–5.7 ms p50 (PCIe 4.0 ~22–24 GB/s); IPC notification ~136–286 µs cross-process (see docs/BENCHMARKS.md)
  • Ring buffer architecture - N-slot pipeline prevents producer/consumer blocking
  • GPU-side synchronization - CUDA IPC events eliminate CPU polling
  • Triple output modes - PyTorch tensors (GPU, zero-copy), CuPy arrays (GPU, zero-copy), or numpy arrays (CPU, D2H copy)
  • Production-ready - Tested at 30+ FPS for hours, handles dynamic resolution changes

Performance

Measured on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. All Python-side.

| Operation | p50 | Notes |
| --- | --- | --- |
| export_frame() — 512×512 RGBA float32 | 22 µs | Standalone, EXPORT_SYNC=1; GPU D2D + stream_synchronize |
| export_frame() — 1080p RGBA float32 | 117 µs | Standalone, EXPORT_SYNC=1 |
| export_frame() — 4K RGBA float32 | 367 µs | Standalone, EXPORT_SYNC=1 |
| get_frame_numpy() D2H — 512×512 float32 | 0.18 ms | Standalone, ~22 GB/s |
| get_frame_numpy() D2H — 1080p float32 | 1.32 ms | Standalone, ~24 GB/s PCIe 4.0 |
| get_frame_numpy() D2H — 4K float32 | 5.7 ms | Standalone, ~21 GB/s PCIe 4.0 |
| get_frame() / get_frame_cupy() GPU | <5 µs | Zero-copy tensor/array view, no D2H |
| IPC notification latency | ~136–286 µs | Producer publish → consumer detect (cross-process) |
| Initialization | ~50–100 µs | One-time IPC handle opening |
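
To sanity-check these numbers on your own hardware, here is a minimal timing sketch (assuming a producer is already publishing under the Quick Start name "my_texture_ipc"; the importer API is shown in Quick Start below):

import statistics
import time
from cuda_link import CUDAIPCImporter

# Hypothetical smoke test: measure get_frame_numpy() p50 over 500 frames.
with CUDAIPCImporter(shm_name="my_texture_ipc") as importer:
    samples = []
    while len(samples) < 500:
        if not importer.is_ready():
            time.sleep(0.001)
            continue
        t0 = time.perf_counter()
        importer.get_frame_numpy()  # D2H copy to a CPU numpy array
        samples.append((time.perf_counter() - t0) * 1e3)
print(f"get_frame_numpy() p50: {statistics.median(samples):.2f} ms")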

Requirements

  • OS: Windows 10/11 (this implementation is Windows-only)
  • CUDA: 12.x (tested with 12.4)
  • GPU: NVIDIA GPU with CUDA compute capability 3.5+
  • TouchDesigner: 2022.x or later (for producer side)
  • Python: 3.9+ (for consumer side)
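
If PyTorch happens to be installed, a quick sanity check for the GPU requirement looks like this (a sketch, since torch is optional; any CUDA-aware library can report the same information):

import torch

# Verify a CUDA device is visible and meets the compute capability 3.5+ requirement.
assert torch.cuda.is_available(), "No CUDA-capable GPU visible"
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
assert (major, minor) >= (3, 5), "GPU below compute capability 3.5"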

Python Dependencies

Required: None (pure ctypes CUDA wrapper)

Optional:

  • torch>=2.0 - For zero-copy GPU tensor output (recommended for AI pipelines)
  • cupy-cuda12x>=12.0 - For zero-copy GPU array output (CuPy/JAX workflows)
  • numpy>=1.21 - For CPU array output (for OpenCV, etc.)

Quick Start

1. TouchDesigner Side (Exporter)

Option A: Use the .tox component (recommended)

  1. Drag CUDAIPCLink_v1.4.1.tox into your TD network
  2. Wire your source TOP into the component's In TOP
  3. Set Ipcmemname parameter (e.g., "my_texture_ipc")
  4. Enable Active toggle

The component displays its transfer state in the read-only Status custom parameter: "<W>x<H> <dtype> <ch>ch" during active transfer, "WARNING: ..." or "ERROR: ..." on faults, and "Idle" when inactive. A warning_emitter Script TOP inside the COMP also shows a local warning badge when the component is open. See td_exporter/HELP_DOC.md for per-parameter documentation.
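
For example, a status poll from a DAT callback inside TD might look like this (the operator path is hypothetical; adjust it to wherever the COMP sits in your network):

# Inside TouchDesigner, e.g. an Execute DAT. The op path below is illustrative.
link = op('/project1/CUDAIPCLink')
status = link.par.Status.eval()   # e.g. "1920x1080 float32 4ch", "Idle", "ERROR: ..."
if status.startswith(('WARNING', 'ERROR')):
    debug(f'cuda-link fault: {status}')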

Option B: Build from source

See docs/TOX_BUILD_GUIDE.md for step-by-step assembly.

2. Python Side (Importer)

Install the package

# Option A: Build wheel and install (recommended — portable, no source needed):
cd C:\path\to\CUDA_IPC
build_wheel.cmd                             # Builds dist\cuda_link-1.4.1-py3-none-any.whl

pip install "dist\cuda_link-1.4.1-py3-none-any.whl[torch]"   # PyTorch GPU tensors
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[cupy]"    # CuPy GPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[numpy]"   # NumPy CPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[all]"     # All output modes

# Option B: Editable install from source (for development — changes apply immediately):
pip install -e ".[torch]"
pip install -e ".[all]"

# From PyPI (coming soon):
# pip install cuda-link[torch]

Use in your Python script

from cuda_link import CUDAIPCImporter

# Initialize (use same name as TD's Ipcmemname parameter)
importer = CUDAIPCImporter(
    shm_name="my_texture_ipc",
    shape=(1080, 1920, 4),  # height, width, channels (RGBA) — or None for auto-detect
    dtype="float32",         # "float32", "float16", or "uint8" — or None for auto-detect
    debug=False,
    timeout_ms=5000.0,       # Wait up to 5s for producer to appear (default)
)

# Option 1: Get torch.Tensor (GPU, zero-copy)
if importer.is_ready():
    tensor = importer.get_frame()  # torch.Tensor on GPU, shape (1080, 1920, 4)
    # Use directly in AI model:
    # output = model(tensor)

# Option 2: Get numpy array (CPU, involves D2H copy)
if importer.is_ready():
    array = importer.get_frame_numpy()  # numpy.ndarray on CPU
    # Use in OpenCV, PIL, etc.:
    # cv2.imwrite("frame.png", array)

# Option 3: Get CuPy array (GPU, zero-copy)
if importer.is_ready():
    cupy_arr = importer.get_frame_cupy()  # cupy.ndarray on GPU
    # Use in CuPy/JAX workflows

# Context manager (recommended — ensures cleanup on exit)
with CUDAIPCImporter(shm_name="my_texture_ipc") as importer:
    for _ in range(100):
        tensor = importer.get_frame()

# Manual cleanup
importer.cleanup()

3. Python → TouchDesigner (AI Output)

Send AI-generated frames back to TD for display:

from cuda_link import CUDAIPCExporter

exporter = CUDAIPCExporter(
    shm_name="ai_output_ipc",   # Must match TD Receiver's Ipcmemname parameter
    height=512, width=512,
    channels=4, dtype="uint8",
    num_slots=2,                 # Ring buffer slots (double-buffering)
)
exporter.initialize()

# Export each AI-generated frame (~10–20 µs overhead at 512×512)
exporter.export_frame(
    gpu_ptr=output_tensor.data_ptr(),
    size=output_tensor.nbytes,
)

exporter.cleanup()
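
One caveat: export_frame() consumes a raw device pointer, so the tensor behind it must be contiguous, GPU-resident memory or the pointer/size pair will not describe the full frame. A defensive sketch:

# data_ptr() is only meaningful for a contiguous CUDA tensor.
frame = output_tensor if output_tensor.is_contiguous() else output_tensor.contiguous()
assert frame.is_cuda, "export_frame() expects GPU-resident memory"
exporter.export_frame(gpu_ptr=frame.data_ptr(), size=frame.nbytes)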

On the TD side, set CUDAIPCExtension Mode to Receiver with matching Ipcmemname.

Architecture

Direction A: TD (Producer) → Python (Consumer)
──────────────────────────────────────────────
CUDAIPCExtension facade
  └── TDSenderEngine               CUDAIPCImporter
        │ export_frame(top_op)       │ get_frame() / get_frame_numpy()
        │ cudaMemcpy D2D → ring buf  │ Waits on IPC event
        └─→ SharedMemory ←───────────┘

Direction B: Python (Producer) → TD (Consumer)
───────────────────────────────────────────────
CUDAIPCExporter                    CUDAIPCExtension facade
  │ export_frame(gpu_ptr, size)      └── TDReceiverEngine
  │ cudaMemcpy D2D → ring buf             │ import_frame(script_top)
  └─→ SharedMemory ←──────────────────────┘ copyCUDAMemory()

Both directions share the same v0.5.0 binary protocol.

The TD extension uses a facade-with-delegation pattern: CUDAIPCExtension (~300 LOC) holds either a TDSenderEngine or TDReceiverEngine and delegates all work to it. Mode switches replace the engine entirely — zero cross-mode state leak. All TouchDesigner runtime access (ownerComp.par.*, top.cudaMemory(), copyCUDAMemory()) goes through the TDHost/TOPHandle adapter seam, making the engine logic testable without a TD runtime.
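
A structural sketch of that pattern (class names follow the description above; the method bodies are stubs, not the shipped source):

class TDSenderEngine:
    def export_frame(self, top_op): ...     # D2D copy into the ring buffer
    def cleanup(self): ...

class TDReceiverEngine:
    def import_frame(self, script_top): ...  # copyCUDAMemory() into a Script TOP
    def cleanup(self): ...

class CUDAIPCExtension:
    """Facade: holds exactly one engine and delegates all work to it."""
    def __init__(self):
        self.engine = None

    def set_mode(self, mode: str) -> None:
        # A mode switch replaces the engine wholesale, so no state leaks across modes.
        if self.engine is not None:
            self.engine.cleanup()
        self.engine = TDSenderEngine() if mode == "Sender" else TDReceiverEngine()

    def export_frame(self, top_op):
        return self.engine.export_frame(top_op)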

Ring Buffer (3 Slots)

The system uses a 3-slot ring buffer to allow producer and consumer to work in parallel:

  • Slot 0: Producer writes frame N
  • Slot 1: Producer writes frame N+1 while consumer reads frame N
  • Slot 2: Producer writes frame N+2 while consumer reads frame N+1
  • Wraps back to Slot 0 for frame N+3

This prevents blocking - producer never waits for consumer, consumer is always 1 frame behind.
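
In code, slot selection reduces to modular arithmetic on the monotonically increasing write counter (a sketch using the write_idx field described in the protocol below):

NUM_SLOTS = 3

def producer_slot(frame_n: int) -> int:
    return frame_n % NUM_SLOTS          # frame N lands in slot N mod 3

def consumer_slot(frame_n: int) -> int:
    return (frame_n - 1) % NUM_SLOTS    # consumer trails the producer by one frame

for n in range(1, 5):
    print(f"frame {n}: producer writes slot {producer_slot(n)}, "
          f"consumer reads slot {consumer_slot(n)}")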

SharedMemory Protocol (433 bytes for 3 slots)

[0-3]     magic "CIPD" (4B)       - Protocol validation (0x43495044)
[4-11]    version (8B)             - Increments on TD re-initialization
[12-15]   num_slots (4B)           - Number of ring buffer slots (3)
[16-19]   write_idx (4B)           - Current write index (atomic counter)

Per slot (128 bytes each):
[20+slot*128 : 84+slot*128]   cudaIpcMemHandle_t (64B)  - GPU memory handle
[84+slot*128 : 148+slot*128]  cudaIpcEventHandle_t (64B) - GPU event handle

[20+NUM_SLOTS*128]        shutdown_flag (1B)   - Reasserted to 0 every frame; set to 1 on exit
[21+NUM_SLOTS*128]        metadata (20B)       - width/height/num_comps/dtype/buffer_size
[41+NUM_SLOTS*128]        timestamp (8B)       - Producer perf_counter() for latency

For 3 slots: 20 + (3 × 128) + 1 + 20 + 8 = 433 bytes
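
Expressed as Python offset arithmetic (a sketch mirroring the layout above, not the library's internal reader):

NUM_SLOTS = 3
HEADER = 20            # magic(4) + version(8) + num_slots(4) + write_idx(4)
SLOT_SIZE = 128        # cudaIpcMemHandle_t(64) + cudaIpcEventHandle_t(64)

def slot_offsets(slot):
    """Return (mem_handle_offset, event_handle_offset) for a slot."""
    base = HEADER + slot * SLOT_SIZE
    return (base, base + 64)

SHUTDOWN_OFF  = HEADER + NUM_SLOTS * SLOT_SIZE   # 404
METADATA_OFF  = SHUTDOWN_OFF + 1                 # 405 (20 bytes)
TIMESTAMP_OFF = METADATA_OFF + 20                # 425 (8 bytes)
TOTAL_SIZE    = TIMESTAMP_OFF + 8                # 433 bytes for 3 slots
assert TOTAL_SIZE == 433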

Documentation

  • docs/BENCHMARKS.md - full benchmark tables, per-resolution breakdowns, CUDA Graphs A/B comparison
  • docs/TOX_BUILD_GUIDE.md - step-by-step .tox assembly from td_exporter/ source
  • docs/PROFILING.md - GPU-timeline profiling (Nsight Systems / Nsight Compute / compute-sanitizer)
  • td_exporter/HELP_DOC.md - per-parameter documentation for the TD component
  • CHANGELOG.md - full version history

Testing

Run the full test suite:

cd C:\path\to\CUDA_IPC

# Protocol tests (no CUDA needed)
pytest tests/test_shm_protocol.py -v

# Unit tests (requires CUDA)
pytest tests/test_cuda_ipc_wrapper.py -v

# All tests
pytest tests/ -v

# Skip slow multi-process tests
pytest tests/ -v -m "not slow"

Benchmarks

All results on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. RGBA (4-channel) frames.

Key highlights:

  • export_frame() standalone — 22 µs p50 (512×512) → 367 µs (4K) with EXPORT_SYNC=1. CUDA Graphs saves <5% at these sizes (GPU D2D copy dominates).
  • get_frame_numpy() D2H — 0.18 ms p50 (512×512) → 5.7 ms (4K) at ~22–24 GB/s PCIe 4.0.
  • Full IPC roundtrip — IPC notification latency ~136–286 µs cross-process (resolution-independent signaling).
  • vs CPU SharedMemory — ~3.4× faster E2E at 1080p, ~2.1× at 512×512. Producer write 4–19× faster (no CPU transit). With get_frame() / get_frame_cupy() (zero-copy), the consumer read collapses to <5 µs.

Full tables, per-resolution breakdowns, and CUDA Graphs A/B comparison: docs/BENCHMARKS.md

Performance Tuning (env vars)

  • CUDALINK_USE_GRAPHS (default: 1): CUDA Graphs for export_frame() (Python-side CUDAIPCExporter). Collapses the stream_wait_event + memcpy_async + record_event triplet into a single cudaGraphLaunch, cutting WDDM kernel-mode transitions from 3 to 2 per frame. With EXPORT_SYNC=1 (default) the GPU D2D copy dominates wall-clock time and the net savings are small (<5% at 1080p on PCIe 4.0); see docs/BENCHMARKS.md for the measured A/B. Set to 0 to revert to the legacy stream path (e.g., if a driver version rejects graph capture).
  • CUDALINK_TD_USE_GRAPHS (default: 0): CUDA Graphs for the TouchDesigner-side CUDAIPCExtension Sender. Same mechanism as CUDALINK_USE_GRAPHS, gated independently because TD ships cudart64_110.dll and the per-frame cudaGraphExecMemcpyNodeSetParams1D API requires CUDA 11.3+. Auto-disabled on older runtimes (probed via cudaRuntimeGetVersion at initialize()). Off by default pending TD-side soak; flip to 1 to enable.
  • CUDALINK_D2H_STREAMS (default: 1): Number of parallel streams for the get_frame_numpy() D2H copy. Values 2/4 may help on PCIe 3.0 systems or GPUs with dual DMA engines; on PCIe 4.0 a single stream already saturates the bus (~23–24 GB/s). Check nvidia-smi -q | findstr "Async Engines" before tuning.
  • CUDALINK_EXPORT_SYNC (default: 1): Block the CPU on the IPC stream after each export_frame(). Default on — load-bearing for concurrent topologies (prevents the cycle-2 first-settle TDR cascade when a TD Sender shares a process with a TD Receiver; Phase 3.6 confirmed). Cost equals GPU D2D copy time: ~42 µs (512×512) to ~400 µs (4K) float32 RGBA on PCIe 4.0 (see bench_graphs.py). Set to 0 to opt out for low-latency single-producer async scenarios.
  • CUDALINK_ACTIVATION_BARRIER (default: 1): Python-lib side of the cross-process activation barrier (F9). Reads a tiny SHM counter on each export_frame() and skips publishing while a TD Sender is in its WDDM-saturating init window. No-op in single-pair topologies (the counter stays at 0); gracefully skipped if the SHM segment is absent. Set to 0 to opt out.
  • CUDALINK_TD_ACTIVATION_BARRIER (default: 1): TD-side counterpart of CUDALINK_ACTIVATION_BARRIER — increments the same SHM counter around Sender initialize() so the Python producer backs off. Same no-op / graceful-absence behaviour. Set to 0 to opt out.
  • CUDALINK_TD_PERSIST_STREAM (default: 1): Skip stream_destroy in TD Sender cleanup() so the IPC CUDA stream survives deactivate → reactivate cycles (F8). Free in single-pair (no deactivation ever happens); load-bearing in concurrent — without it, stream recreation on each reactivation collides with in-flight Receiver work, doubling first-settle post= latency (Phase 3.6 confirmed). Set to 0 to opt out.
  • CUDALINK_TD_STREAM_PRIO (default: normal): CUDA stream priority for the TD Sender's IPC stream. The default normal is safe for both single-pair and concurrent topologies — in single-pair only one stream exists per process so priority is moot; in concurrent, equal priorities prevent WDDM contention accumulation across reactivation cycles (high/high contention produces non-recovering cycle-3 shutdowns; Phase 3.6 Step C confirmed). Set to high only for explicit single-pair lowest-latency optimisation.
  • CUDALINK_EXPORT_FLUSH_PROBE (default: 1): Insert a non-blocking cudaStreamQuery(ipc_stream) after check_sticky_error when EXPORT_SYNC=0. Forces WDDM-deferred CUDA submissions to drain each frame, preventing Windows Task Manager's 3D-engine counter from inflating when true compute load (per NVML) is low. NVML readings are unchanged — purely cosmetic/observability. Set to 0 to disable.
  • CUDALINK_EXPORT_PROFILE (default: 0): Enable fine-grained per-region sub-timers in export_frame() and emit a [PROFILE] pre=…us interop=…us post=…us memcpy=…us record=…us sync=…us sticky=…us flush_probe=…us shm=…us unacc=…us line every 97 frames. Force-enables verbose_performance (TD) / debug (lib). Diagnostic-only; negligible overhead when on, zero when unset.
  • CUDALINK_NVTX (default: 0): Enable NVTX range annotations on top-level phase boundaries (cudalink.exporter.flush_probe, cudalink.receiver.import_frame, cudalink.receiver.event_wait, etc.) for Nsight Systems GPU timeline correlation. Zero overhead when off. Set to 1 before running any nsys capture; see docs/PROFILING.md.
  • CUDALINK_NVTX_VERBOSE (default: 0): Enable additional sub-operation NVTX ranges (sticky-error check, D2A copy submit, SHM header read) inside the top-level phase ranges. Only useful for deep per-frame breakdown captures; implies CUDALINK_NVTX=1.
  • CUDALINK_TD_GRAPHS_DEFERRED (default: 0): Defer CUDA Graph capture until after the second export_frame() call (TD Sender). Avoids a first-frame graph-capture stall in latency-sensitive topologies where the graph build cost would be visible. Only meaningful when CUDALINK_TD_USE_GRAPHS=1.
  • CUDALINK_TD_INIT_PACE (default: 0): Throttle the TD Sender init sequence to reduce WDDM saturation during activation windows (experimental). Adds a small sleep between consecutive CUDA API calls at initialize() time; useful when concurrent Sender+Receiver activation produces WDDM kernel-mode queue backpressure.
  • CUDALINK_TD_BARRIER_SETTLE_FRAMES (default: 30): Number of frames the TD activation-barrier counter remains armed after a Sender initialize() completes, giving the Python producer time to back off before publishing resumes. Increase if your Python producer's poll loop is slower than 30 frames at your target rate; decrease for tighter single-pair topologies.
  • CUDALINK_NVML (default: 0): Append NVML GPU telemetry (utilization %, clocks MHz, PCIe Tx/Rx MB/s, temperature °C, power W, throttle reasons) to the 97-frame periodic stats line emitted by CUDALINK_EXPORT_PROFILE. Requires nvidia-ml-py (pip install nvidia-ml-py). Zero overhead when off.
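
These knobs are read from the process environment. A hedged sketch of setting them in Python (whether each is read at import time or at first use is an implementation detail, so set them as early as possible):

import os

# Set tuning knobs before importing or initializing cuda-link.
os.environ["CUDALINK_EXPORT_SYNC"] = "0"     # opt out of per-frame CPU sync (single-producer async)
os.environ["CUDALINK_D2H_STREAMS"] = "2"     # try parallel D2H streams on a PCIe 3.0 system
os.environ["CUDALINK_EXPORT_PROFILE"] = "1"  # emit [PROFILE] sub-timers every 97 frames

from cuda_link import CUDAIPCExporter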

For GPU-timeline profiling (Nsight Systems / Nsight Compute / compute-sanitizer) see docs/PROFILING.md.

Troubleshooting

"SharedMemory not found"

Cause: Python importer started before TD exporter initialized.

Solution: Ensure TD's CUDAIPCExporter is active before starting Python process. If starting both together, use timeout_ms to give the producer time to initialize:

importer = CUDAIPCImporter(shm_name="my_project_ipc", timeout_ms=10000.0)  # Wait up to 10s

"CUDA IPC overhead unexpectedly high"

Cause: In standalone Python processes (WDDM), export_frame() with EXPORT_SYNC=1 typically measures 42–400 µs p50 (512×512 → 4K float32 RGBA, RTX 4090 / PCIe 4.0). Values 2–5× higher than these baselines may indicate GPU driver overhead, context contention, or PCIe bandwidth sharing with other D2H workloads.

Solution: Compare against the baseline numbers in docs/BENCHMARKS.md. Contributors with a local clone may reproduce using python benchmarks/bench_graphs.py (standalone) or python benchmarks/bench_sweep.py --quick (multiprocess).

"Version mismatch" or stale frames

Cause: TD re-exported IPC handles (network reset, resolution change).

Solution: The importer automatically detects version changes and re-opens handles. No action needed.

GPU memory leak

Cause: Importer not cleaned up properly.

Solution: Use the context manager pattern for automatic cleanup:

with CUDAIPCImporter(shm_name="my_project_ipc") as importer:
    # importer.cleanup() is called automatically on exit
    tensor = importer.get_frame()

Or call importer.cleanup() explicitly in a finally block.
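
For instance, the explicit variant (equivalent to the context manager above):

importer = CUDAIPCImporter(shm_name="my_project_ipc")
try:
    tensor = importer.get_frame()
finally:
    importer.cleanup()  # always runs, even if get_frame() raises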

Distribution

cuda-link uses a dual distribution model to support both use cases:

For Python Consumers (StreamDiffusion, AI/ML pipelines)

Method 1: Build wheel (recommended — portable, installs into any environment)

git clone https://github.com/forkni/cuda-link.git
cd cuda-link

# Run the build script (uses PEP 517 isolated build via python -m build)
build_wheel.cmd
# Output: dist\cuda_link-1.4.1-py3-none-any.whl  (~30 KB)

# Install into any Python environment — conda, venv, system Python, TouchDesigner Python:
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[torch]"   # PyTorch GPU tensors
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[cupy]"    # CuPy GPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[numpy]"   # NumPy CPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[all]"     # All output modes

# Force reinstall to update:
pip install --force-reinstall "dist\cuda_link-1.4.1-py3-none-any.whl[torch]"

The wheel is a self-contained archive — copy it anywhere and install without needing the source tree.

Method 2: Editable install from source (for development)

git clone https://github.com/forkni/cuda-link.git
cd cuda-link
pip install -e ".[torch]"   # Changes to src/cuda_link/ apply immediately, no rebuild needed
pip install -e ".[all]"     # All output modes

Method 3: From PyPI (coming soon)

# pip install cuda-link[torch]

Usage:

from cuda_link import CUDAIPCImporter

importer = CUDAIPCImporter(shm_name="my_project_ipc")
tensor = importer.get_frame()  # torch.Tensor, GPU zero-copy

The cuda-link package contains only the consumer-side Python code (src/cuda_link/). The TouchDesigner extension is distributed separately.

For TouchDesigner Integration

Option A: Use the .tox component (recommended)

Drag CUDAIPCLink_v1.4.1.tox into your TouchDesigner network from the project root.

Older versions: Previous .tox releases are available as downloadable assets on the GitHub Releases page — pick the tag matching the TouchDesigner build you target.

Option B: Build from source

Follow the manual build guide at docs/TOX_BUILD_GUIDE.md to assemble the .tox from td_exporter/ source files.

The TouchDesigner extension (td_exporter/) is not included in the pip package because it uses TD-specific APIs (parent(), op(), me, COMP-scoped imports) that cannot run outside TouchDesigner.

Use Cases

| Use Case | TD Side | Python Side |
| --- | --- | --- |
| TD → Python (StreamDiffusion, AI pipelines) | .tox Sender mode | pip install dist\cuda_link-*.whl[torch] |
| Python → TD (AI output display) | .tox Receiver mode | pip install dist\cuda_link-*.whl[torch] |
| TD → TD (two instances communicating) | .tox on both sides | Not needed |

Both sides communicate through the 433-byte SharedMemory protocol — zero import dependencies between TD and Python code.


Changelog

See CHANGELOG.md for the full history.


License

MIT License - See LICENSE file

Credits

Original implementation by Forkni (forkni@gmail.com). Extracted and refactored from the StreamDiffusionTD project.
