Zero-copy GPU texture transfer between TouchDesigner and Python processes using CUDA IPC.
This component enables zero-copy GPU texture sharing between TouchDesigner and Python processes using CUDA Inter-Process Communication (IPC). It eliminates CPU memory copies for real-time AI pipelines, video processing, and other GPU-accelerated workflows.
- Zero-copy GPU transfer - Textures stay on GPU, no CPU memory copies
- Bidirectional IPC - TD → Python (input capture) AND Python → TD (AI output display)
- Low-overhead IPC - `export_frame()` 22–367 µs p50 (512×512 → 4K float32, EXPORT_SYNC=1); `get_frame_numpy()` D2H 0.18–5.7 ms p50 (PCIe 4.0 ~22–24 GB/s); IPC notification ~136–286 µs cross-process (see docs/BENCHMARKS.md)
- Ring buffer architecture - N-slot pipeline prevents producer/consumer blocking
- GPU-side synchronization - CUDA IPC events eliminate CPU polling
- Triple output modes - PyTorch tensors (GPU, zero-copy), CuPy arrays (GPU, zero-copy), or numpy arrays (CPU, D2H copy)
- Production-ready - Tested at 30+ FPS for hours, handles dynamic resolution changes
Measured on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. All Python-side.
| Operation | p50 | Notes |
|---|---|---|
| `export_frame()` — 512×512 RGBA float32 | 22 µs | Standalone, EXPORT_SYNC=1; GPU D2D + stream_synchronize |
| `export_frame()` — 1080p RGBA float32 | 117 µs | Standalone, EXPORT_SYNC=1 |
| `export_frame()` — 4K RGBA float32 | 367 µs | Standalone, EXPORT_SYNC=1 |
| `get_frame_numpy()` D2H — 512×512 float32 | 0.18 ms | Standalone, ~22 GB/s |
| `get_frame_numpy()` D2H — 1080p float32 | 1.32 ms | Standalone, ~24 GB/s PCIe 4.0 |
| `get_frame_numpy()` D2H — 4K float32 | 5.7 ms | Standalone, ~21 GB/s PCIe 4.0 |
| `get_frame()` / `get_frame_cupy()` GPU | <5 µs | Zero-copy tensor/array view, no D2H |
| IPC notification latency | ~136–286 µs | Producer publish → consumer detect (cross-process) |
| Initialization | ~50–100 µs | One-time IPC handle opening |
- OS: Windows 10/11 (CUDA IPC is Windows-only)
- CUDA: 12.x (tested with 12.4)
- GPU: NVIDIA GPU with CUDA compute capability 3.5+
- TouchDesigner: 2022.x or later (for producer side)
- Python: 3.9+ (for consumer side)
Required: None (pure ctypes CUDA wrapper)
Optional:
- `torch>=2.0` - For zero-copy GPU tensor output (recommended for AI pipelines)
- `cupy-cuda12x>=12.0` - For zero-copy GPU array output (CuPy/JAX workflows)
- `numpy>=1.21` - For CPU array output (for OpenCV, etc.)
Option A: Use the .tox component (recommended)
1. Drag `CUDAIPCLink_v1.4.1.tox` into your TD network
2. Wire your source TOP to the `input` In TOP
3. Set the `Ipcmemname` parameter (e.g., `"my_texture_ipc"`)
4. Enable the `Active` toggle
The component displays its transfer state in the read-only Status custom parameter:
"<W>x<H> <dtype> <ch>ch" during active transfer, "WARNING: ..." or "ERROR: ..." on
faults, and "Idle" when inactive. A warning_emitter Script TOP inside the COMP also
shows a local warning badge when the component is open. See td_exporter/HELP_DOC.md
for per-parameter documentation.
Option B: Build from source
See docs/TOX_BUILD_GUIDE.md for step-by-step assembly.
```
# Option A: Build wheel and install (recommended — portable, no source needed):
cd C:\path\to\CUDA_IPC
build_wheel.cmd   # Builds dist\cuda_link-1.4.1-py3-none-any.whl
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[torch]"   # PyTorch GPU tensors
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[cupy]"    # CuPy GPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[numpy]"   # NumPy CPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[all]"     # All output modes

# Option B: Editable install from source (for development — changes apply immediately):
pip install -e ".[torch]"
pip install -e ".[all]"

# From PyPI (coming soon):
# pip install cuda-link[torch]
```

```python
from cuda_link import CUDAIPCImporter

# Initialize (use same name as TD's Ipcmemname parameter)
importer = CUDAIPCImporter(
    shm_name="my_texture_ipc",
    shape=(1080, 1920, 4),   # height, width, channels (RGBA) — or None for auto-detect
    dtype="float32",         # "float32", "float16", or "uint8" — or None for auto-detect
    debug=False,
    timeout_ms=5000.0,       # Wait up to 5s for producer to appear (default)
)

# Option 1: Get torch.Tensor (GPU, zero-copy)
if importer.is_ready():
    tensor = importer.get_frame()   # torch.Tensor on GPU, shape (1080, 1920, 4)
    # Use directly in AI model:
    # output = model(tensor)

# Option 2: Get numpy array (CPU, involves D2H copy)
if importer.is_ready():
    array = importer.get_frame_numpy()   # numpy.ndarray on CPU
    # Use in OpenCV, PIL, etc.:
    # cv2.imwrite("frame.png", array)

# Option 3: Get CuPy array (GPU, zero-copy)
if importer.is_ready():
    cupy_arr = importer.get_frame_cupy()   # cupy.ndarray on GPU
    # Use in CuPy/JAX workflows

# Context manager (recommended — ensures cleanup on exit)
with CUDAIPCImporter(shm_name="my_texture_ipc") as importer:
    for _ in range(100):
        tensor = importer.get_frame()

# Manual cleanup
importer.cleanup()
```

Send AI-generated frames back to TD for display:
```python
from cuda_link import CUDAIPCExporter

exporter = CUDAIPCExporter(
    shm_name="ai_output_ipc",   # Must match TD Receiver's Ipcmemname parameter
    height=512, width=512,
    channels=4, dtype="uint8",
    num_slots=2,                # Ring buffer slots (double-buffering)
)
exporter.initialize()

# Export each AI-generated frame (~10-20 µs overhead at 512x512)
exporter.export_frame(
    gpu_ptr=output_tensor.data_ptr(),
    size=output_tensor.nbytes,
)

exporter.cleanup()
```

On the TD side, set CUDAIPCExtension Mode to Receiver with matching Ipcmemname.
```
Direction A: TD (Producer) → Python (Consumer)
──────────────────────────────────────────────
CUDAIPCExtension facade
└── TDSenderEngine                   CUDAIPCImporter
    │ export_frame(top_op)           │ get_frame() / get_frame_numpy()
    │ cudaMemcpy D2D → ring buf      │ Waits on IPC event
    └─→ SharedMemory ←───────────────┘

Direction B: Python (Producer) → TD (Consumer)
──────────────────────────────────────────────
CUDAIPCExporter                      CUDAIPCExtension facade
│ export_frame(gpu_ptr, size)        └── TDReceiverEngine
│ cudaMemcpy D2D → ring buf              │ import_frame(script_top)
└─→ SharedMemory ←───────────────────────┘ copyCUDAMemory()
```
Both directions share the same v0.5.0 binary protocol.
The TD extension uses a facade-with-delegation pattern: CUDAIPCExtension (~300 LOC) holds either a TDSenderEngine or TDReceiverEngine and delegates all work to it. Mode switches replace the engine entirely — zero cross-mode state leak. All TouchDesigner runtime access (ownerComp.par.*, top.cudaMemory(), copyCUDAMemory()) goes through the TDHost/TOPHandle adapter seam, making the engine logic testable without a TD runtime.
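The delegation mechanism can be sketched in a few lines. This is an illustrative stand-in, not the actual source: `Facade`, `_SenderEngine`, and `_ReceiverEngine` are hypothetical names, and `process_frame` stands in for the real per-frame entry points.

```python
class _SenderEngine:
    """Illustrative stand-in for TDSenderEngine: owns all Sender-mode state."""
    def process_frame(self) -> str:
        return "sent"


class _ReceiverEngine:
    """Illustrative stand-in for TDReceiverEngine: owns all Receiver-mode state."""
    def process_frame(self) -> str:
        return "received"


class Facade:
    """Holds exactly one engine at a time and forwards every call to it."""
    def __init__(self, mode: str = "sender"):
        self._engine = None
        self.set_mode(mode)

    def set_mode(self, mode: str) -> None:
        # Replace the engine wholesale: discarding the old object is what
        # guarantees zero cross-mode state leak on a mode switch.
        self._engine = _SenderEngine() if mode == "sender" else _ReceiverEngine()

    def process_frame(self) -> str:
        # Pure delegation: the facade keeps no per-mode state of its own.
        return self._engine.process_frame()
```

Because the facade never mutates engine internals, each engine can be unit-tested in isolation behind the TDHost/TOPHandle seam, which is the property the paragraph above describes.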
The system uses a 3-slot ring buffer to allow producer and consumer to work in parallel:
- Slot 0: Producer writes frame N
- Slot 1: Producer writes frame N+1 while consumer reads frame N
- Slot 2: Producer writes frame N+2 while consumer reads frame N+1
- Wraps back to Slot 0 for frame N+3
This prevents blocking - producer never waits for consumer, consumer is always 1 frame behind.
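The slot schedule above reduces to modular indexing over the frame counter; a minimal sketch (helper names are illustrative, the real field is the `write_idx` counter in the shared header):

```python
NUM_SLOTS = 3

def writer_slot(frame_n: int, num_slots: int = NUM_SLOTS) -> int:
    """Slot the producer writes frame N into."""
    return frame_n % num_slots

def reader_slot(frame_n: int, num_slots: int = NUM_SLOTS) -> int:
    """Slot the consumer reads while frame N is being written: one behind."""
    return (frame_n - 1) % num_slots

# Frames 0..4 cycle through slots 0, 1, 2, 0, 1; the producer and the
# consumer are always one slot apart, so they never touch the same slot.
assert [writer_slot(n) for n in range(5)] == [0, 1, 2, 0, 1]
assert all(writer_slot(n) != reader_slot(n) for n in range(10))
```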
```
[0-3]    magic "CIPD" (4B)  - Protocol validation (0x43495044)
[4-11]   version (8B)       - Increments on TD re-initialization
[12-15]  num_slots (4B)     - Number of ring buffer slots (3)
[16-19]  write_idx (4B)     - Current write index (atomic counter)

Per slot (128 bytes each):
[20+slot*128 : 84+slot*128]   cudaIpcMemHandle_t (64B)   - GPU memory handle
[84+slot*128 : 148+slot*128]  cudaIpcEventHandle_t (64B) - GPU event handle

[20+NUM_SLOTS*128]  shutdown_flag (1B) - Reasserted to 0 every frame; set to 1 on exit
[21+NUM_SLOTS*128]  metadata (20B)     - width/height/num_comps/dtype/buffer_size
[41+NUM_SLOTS*128]  timestamp (8B)     - Producer perf_counter() for latency
```
For 3 slots: 20 + (3 × 128) + 1 + 20 + 8 = 433 bytes
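The layout maps directly onto Python's `struct` module. A sketch of a header parser under stated assumptions: the 20-byte metadata block is taken to be five little-endian uint32 fields and the timestamp a float64, which is inferred from the field sizes above, not confirmed against the source.

```python
import struct

NUM_SLOTS = 3
FIXED = "<4sQII"   # magic "CIPD", version, num_slots, write_idx -> 20 bytes
SLOT = "<64s64s"   # cudaIpcMemHandle_t + cudaIpcEventHandle_t   -> 128 bytes/slot
TAIL = "<B5Id"     # shutdown flag, 5 x uint32 metadata, float64 timestamp -> 29 bytes

# 20 + 3*128 + 29 = 433, matching the stated segment size for 3 slots.
TOTAL = struct.calcsize(FIXED) + NUM_SLOTS * struct.calcsize(SLOT) + struct.calcsize(TAIL)

def parse_header(buf: bytes) -> dict:
    """Decode the fixed fields and trailing metadata from a raw header."""
    magic, version, num_slots, write_idx = struct.unpack_from(FIXED, buf, 0)
    if magic != b"CIPD":
        raise ValueError("bad magic: not a CUDAIPCLink segment")
    tail_off = struct.calcsize(FIXED) + num_slots * struct.calcsize(SLOT)
    shutdown, w, h, comps, dtype, size, ts = struct.unpack_from(TAIL, buf, tail_off)
    return {
        "version": version, "write_idx": write_idx, "shutdown": bool(shutdown),
        "width": w, "height": h, "channels": comps,
        "dtype_code": dtype, "buffer_size": size, "timestamp": ts,
    }
```

The per-slot `cudaIpcMemHandle_t` / `cudaIpcEventHandle_t` blobs are opaque 64-byte handles meant for `cudaIpcOpenMemHandle` / `cudaIpcOpenEventHandle`, so the parser leaves them as raw bytes.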
- TOX Build Guide - Step-by-step .tox assembly in TouchDesigner
- Architecture - Protocol spec, ring buffer design, GPU sync
- Integration Examples - TD→PyTorch, TD→OpenCV, multi-stream
Run the full test suite:

```
cd C:\path\to\CUDA_IPC

# Protocol tests (no CUDA needed)
pytest tests/test_shm_protocol.py -v

# Unit tests (requires CUDA)
pytest tests/test_cuda_ipc_wrapper.py -v

# All tests
pytest tests/ -v

# Skip slow multi-process tests
pytest tests/ -v -m "not slow"
```

All results on RTX 4090 / PCIe 4.0 x16 / Windows 11 / driver 596.36. RGBA (4-channel) frames.
Key highlights:
- `export_frame()` standalone — 22 µs p50 (512×512) → 367 µs (4K) with EXPORT_SYNC=1. CUDA Graphs saves <5% at these sizes (GPU D2D copy dominates).
- `get_frame_numpy()` D2H — 0.18 ms p50 (512×512) → 5.7 ms (4K) at ~22–24 GB/s PCIe 4.0.
- Full IPC roundtrip — IPC notification latency ~136–286 µs cross-process (resolution-independent signaling).
- vs CPU SharedMemory — ~3.4× faster E2E at 1080p, ~2.1× at 512×512. Producer write 4–19× faster (no CPU transit). With `get_frame()` / `get_frame_cupy()` (zero-copy), the consumer read collapses to <5 µs.
Full tables, per-resolution breakdowns, and CUDA Graphs A/B comparison: docs/BENCHMARKS.md
| Variable | Default | Effect |
|---|---|---|
| `CUDALINK_USE_GRAPHS` | `1` | CUDA Graphs for export_frame() (Python-side CUDAIPCExporter). Collapses the stream_wait_event + memcpy_async + record_event triplet into a single cudaGraphLaunch, cutting WDDM kernel-mode transitions from 3 → 2 per frame. With EXPORT_SYNC=1 (default) the GPU D2D copy dominates wall-clock time and the net savings are small (<5% at 1080p on PCIe 4.0); see docs/BENCHMARKS.md for measured A/B. Set to 0 to revert to the legacy stream path (e.g., if a driver version rejects graph capture). |
| `CUDALINK_TD_USE_GRAPHS` | `0` | CUDA Graphs for the TouchDesigner-side CUDAIPCExtension Sender. Same mechanism as CUDALINK_USE_GRAPHS, gated independently because TD ships cudart64_110.dll and the per-frame cudaGraphExecMemcpyNodeSetParams1D API requires CUDA 11.3+. Auto-disabled on older runtimes (probed via cudaRuntimeGetVersion at initialize()). Off by default pending TD-side soak; flip to 1 to enable. |
| `CUDALINK_D2H_STREAMS` | `1` | Number of parallel streams for get_frame_numpy() D2H copy. Values 2/4 may help on PCIe 3.0 systems or GPUs with dual DMA engines; on PCIe 4.0 a single stream already saturates the bus (~23–24 GB/s). Check `nvidia-smi -q \| findstr "Async Engines"` before tuning. |
| `CUDALINK_EXPORT_SYNC` | `1` | Block CPU on the IPC stream after each export_frame(). Default on — load-bearing for concurrent topologies (prevents cycle-2 first-settle TDR cascade when a TD Sender shares a process with a TD Receiver, Phase 3.6 confirmed). Cost equals GPU D2D copy time: ~42 µs (512×512) → ~400 µs (4K) float32 RGBA on PCIe 4.0 (see bench_graphs.py). Set to 0 to opt out for low-latency single-producer async scenarios. |
| `CUDALINK_ACTIVATION_BARRIER` | `1` | Python-lib side of the cross-process activation barrier (F9). Reads a tiny SHM counter each export_frame() and skips publishing while a TD Sender is in its WDDM-saturating init window. No-op in single-pair topologies (counter stays at 0); gracefully skipped if the SHM segment is absent. Set to 0 to opt out. |
| `CUDALINK_TD_ACTIVATION_BARRIER` | `1` | TD-side counterpart of CUDALINK_ACTIVATION_BARRIER — increments the same SHM counter around Sender initialize() so the Python producer backs off. Same no-op / graceful-absence behaviour. Set to 0 to opt out. |
| `CUDALINK_TD_PERSIST_STREAM` | `1` | Skip stream_destroy in TD Sender cleanup() so the IPC CUDA stream survives deactivate→reactivate cycles (F8). Free in single-pair (no deactivation ever happens); load-bearing in concurrent — without it, stream recreate on each reactivation collides with in-flight Receiver work, doubling first-settle post= latency (Phase 3.6 confirmed). Set to 0 to opt out. |
| `CUDALINK_TD_STREAM_PRIO` | `normal` | CUDA stream priority for the TD Sender's IPC stream. Default normal is safe for both single-pair and concurrent topologies — in single-pair only one stream exists per process so priority is moot; in concurrent, equal priorities prevent WDDM contention accumulation across reactivation cycles (high/high contention produces non-recovering cycle-3 shutdowns, Phase 3.6 Step C confirmed). Set to high only for explicit single-pair lowest-latency optimisation. |
| `CUDALINK_EXPORT_FLUSH_PROBE` | `1` | Insert a non-blocking cudaStreamQuery(ipc_stream) after check_sticky_error when EXPORT_SYNC=0. Forces WDDM-deferred CUDA submissions to drain each frame, preventing Windows Task Manager's 3D-engine counter from inflating when true compute load (per NVML) is low. NVML readings are unchanged — purely cosmetic/observability. Set to 0 to disable. |
| `CUDALINK_EXPORT_PROFILE` | `0` | Enable fine-grained per-region sub-timers in export_frame() and emit a [PROFILE] pre=…us interop=…us post=…us memcpy=…us record=…us sync=…us sticky=…us flush_probe=…us shm=…us unacc=…us line every 97 frames. Force-enables verbose_performance (TD) / debug (lib). Diagnostic-only; negligible overhead when on, zero when unset. |
| `CUDALINK_NVTX` | `0` | Enable NVTX range annotations on top-level phase boundaries (cudalink.exporter.flush_probe, cudalink.receiver.import_frame, cudalink.receiver.event_wait, etc.) for Nsight Systems GPU timeline correlation. Zero overhead when off. Set to 1 before running any nsys capture; see docs/PROFILING.md. |
| `CUDALINK_NVTX_VERBOSE` | `0` | Enable additional sub-operation NVTX ranges (sticky-error check, D2A copy submit, SHM header read) inside the top-level phase ranges. Only useful for deep per-frame breakdown captures; implies CUDALINK_NVTX=1. |
| `CUDALINK_TD_GRAPHS_DEFERRED` | `0` | Defer CUDA Graph capture to after the second export_frame() call (TD Sender). Avoids a first-frame graph-capture stall in latency-sensitive topologies where the graph build cost would be visible. Only meaningful when CUDALINK_TD_USE_GRAPHS=1. |
| `CUDALINK_TD_INIT_PACE` | `0` | Throttle the TD Sender init sequence to reduce WDDM saturation during activation windows (experimental). Adds a small sleep between consecutive CUDA API calls at initialize() time; useful when concurrent Sender+Receiver activation produces WDDM kernel-mode queue backpressure. |
| `CUDALINK_TD_BARRIER_SETTLE_FRAMES` | `30` | Number of frames the TD activation barrier counter remains armed after a Sender initialize() completes, giving the Python producer time to back off before publishing resumes. Increase if your Python producer's poll loop is slower than 30 frames at your target rate; decrease for tighter single-pair topologies. |
| `CUDALINK_NVML` | `0` | Append NVML GPU telemetry (utilization %, clocks MHz, PCIe Tx/Rx MB/s, temperature °C, power W, throttle reasons) to the 97-frame periodic stats line emitted by CUDALINK_EXPORT_PROFILE. Requires nvidia-ml-py (pip install nvidia-ml-py). Zero overhead when off. |
For GPU-timeline profiling (Nsight Systems / Nsight Compute / compute-sanitizer) see docs/PROFILING.md.
Cause: Python importer started before TD exporter initialized.
Solution: Ensure TD's CUDAIPCExporter is active before starting Python process. If starting both together, use timeout_ms to give the producer time to initialize:
```python
importer = CUDAIPCImporter(shm_name="my_project_ipc", timeout_ms=10000.0)  # Wait up to 10s
```

Cause: In standalone Python processes (WDDM), export_frame() with EXPORT_SYNC=1 typically measures 42–400 µs p50 (512×512 → 4K float32 RGBA, RTX 4090 / PCIe 4.0). Values 2–5× higher than these baselines may indicate GPU driver overhead, context contention, or PCIe bandwidth sharing with other D2H workloads.
Solution: Compare against the baseline numbers in docs/BENCHMARKS.md. Contributors with a local clone may reproduce using python benchmarks/bench_graphs.py (standalone) or python benchmarks/bench_sweep.py --quick (multiprocess).
Cause: TD re-exported IPC handles (network reset, resolution change).
Solution: The importer automatically detects version changes and re-opens handles. No action needed.
Cause: Importer not cleaned up properly.
Solution: Use the context manager pattern for automatic cleanup:
```python
with CUDAIPCImporter(shm_name="my_project_ipc") as importer:
    # importer.cleanup() is called automatically on exit
    tensor = importer.get_frame()
```

Or call `importer.cleanup()` explicitly in a `finally` block.
cuda-link uses a dual distribution model to support both use cases:
```
git clone https://github.com/forkni/cuda-link.git
cd cuda-link

# Run the build script (uses PEP 517 isolated build via python -m build)
build_wheel.cmd
# Output: dist\cuda_link-1.4.1-py3-none-any.whl (~30 KB)

# Install into any Python environment — conda, venv, system Python, TouchDesigner Python:
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[torch]"   # PyTorch GPU tensors
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[cupy]"    # CuPy GPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[numpy]"   # NumPy CPU arrays
pip install "dist\cuda_link-1.4.1-py3-none-any.whl[all]"     # All output modes

# Force reinstall to update:
pip install --force-reinstall "dist\cuda_link-1.4.1-py3-none-any.whl[torch]"
```

The wheel is a self-contained archive — copy it anywhere and install without needing the source tree.
```
git clone https://github.com/forkni/cuda-link.git
cd cuda-link
pip install -e ".[torch]"   # Changes to src/cuda_link/ apply immediately, no rebuild needed
pip install -e ".[all]"     # All output modes

# pip install cuda-link[torch]
```

Usage:

```python
from cuda_link import CUDAIPCImporter

importer = CUDAIPCImporter(shm_name="my_project_ipc")
tensor = importer.get_frame()  # torch.Tensor, GPU zero-copy
```

The cuda-link package contains only the consumer-side Python code (src/cuda_link/). The TouchDesigner extension is distributed separately.
Option A: Use the .tox component (recommended)
Drag CUDAIPCLink_v1.4.1.tox into your TouchDesigner network from the project root.
Older versions: Previous `.tox` releases are available as downloadable assets on the GitHub Releases page — pick the tag matching the TouchDesigner build you target.
Option B: Build from source
Follow the manual build guide at docs/TOX_BUILD_GUIDE.md to assemble the .tox from td_exporter/ source files.
The TouchDesigner extension (td_exporter/) is not included in the pip package because it uses TD-specific APIs (parent(), op(), me, COMP-scoped imports) that cannot run outside TouchDesigner.
| Use Case | TD Side | Python Side |
|---|---|---|
| TD → Python (StreamDiffusion, AI pipelines) | .tox Sender mode | `pip install dist\cuda_link-*.whl[torch]` |
| Python → TD (AI output display) | .tox Receiver mode | `pip install dist\cuda_link-*.whl[torch]` |
| TD → TD (two instances communicating) | .tox on both sides | Not needed |
Both sides communicate through the 433-byte SharedMemory protocol — zero import dependencies between TD and Python code.
See CHANGELOG.md for the full history.
MIT License - See LICENSE file
Original implementation by Forkni (forkni@gmail.com). Extracted and refactored from the StreamDiffusionTD project.