eBPF agent and MCP server for GPU causal observability
Updated May 16, 2026 · Language: C
High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GeForce / RTX PRO (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). 200 tok/s decode on Qwen3.6-35B-A3B-NVFP4 MoE (RTX 5090).
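A back-of-envelope sketch (not from the repo) of why 200 tok/s single-stream decode is plausible on this hardware: MoE decode is typically memory-bandwidth bound, and the assumed numbers below (~3B active parameters per token from the "A3B" suffix, 4-bit NVFP4 weights, ~1.79 TB/s peak bandwidth on RTX 5090) are estimates, not figures from the project.

```python
# Hypothetical bandwidth-ceiling estimate for single-stream MoE decode.
# Ignores KV-cache reads, FP4 scale factors, and router overhead.
ACTIVE_PARAMS = 3e9       # "A3B": ~3B parameters active per token (assumption)
BYTES_PER_PARAM = 0.5     # NVFP4: 4-bit weights
BANDWIDTH = 1.792e12      # RTX 5090 peak memory bandwidth, bytes/s (spec sheet)

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM   # weight bytes read per token
ceiling = BANDWIDTH / bytes_per_token               # tokens/s if purely BW-bound
print(f"~{ceiling:.0f} tok/s bandwidth ceiling")
```

Under these assumptions the ceiling is roughly 1200 tok/s, so a measured 200 tok/s is well within the memory-bound envelope once real overheads are counted.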
Prefill performance study on Qwen2.5-7B using vLLM. Compares static vs mixed (bucketed) prefill under eager execution and CUDA Graphs, with controlled concurrency and real-world latency/throughput metrics.
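The static-vs-bucketed distinction above can be sketched as follows. This is an illustrative model, not code from the study: prompts are padded up to the nearest of a few fixed lengths so that one pre-captured CUDA graph per bucket can be replayed, trading padding overhead for graph reuse. The bucket sizes and function names here are hypothetical.

```python
# Hypothetical bucketed-prefill model: pad each prompt to the smallest
# bucket that fits it, so a CUDA graph captured per bucket can be reused.
BUCKETS = [128, 256, 512, 1024, 2048]  # assumed padded prefill lengths

def pick_bucket(prompt_len: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket length that fits the prompt."""
    for b in buckets:
        if prompt_len <= b:
            return b
    raise ValueError(f"prompt of {prompt_len} tokens exceeds largest bucket")

def padding_overhead(prompt_lens, buckets=BUCKETS) -> float:
    """Fraction of prefill compute spent on padding tokens."""
    padded = sum(pick_bucket(n, buckets) for n in prompt_lens)
    real = sum(prompt_lens)
    return 1.0 - real / padded

lens = [90, 300, 700]           # pad to 128, 512, 1024 respectively
print(round(padding_overhead(lens), 3))
```

Static prefill corresponds to a single bucket (maximum length, maximum padding); a mixed scheme with several buckets cuts the padding overhead while keeping a small, fixed set of graph shapes to capture.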
Optimized CSM-1B TTS pipeline for RTX 5090 (Blackwell sm_120). CUDA graph replay via patched HF Transformers. ~0.46x RTF. Topics: csm, text-to-speech, rtx-5090, blackwell, cuda-graphs, torch-compile, sesame, streaming, pytorch
GB10 inference port; see fork.md