Optimizing the SuperPoint keypoint detector for edge deployment using TensorRT. The pipeline is PyTorch → ONNX → TensorRT FP16, targeting NVIDIA Jetson Orin.
- Export SuperPoint PyTorch model to ONNX (`export_onnx.py`)
- Simplify ONNX graph with onnx-simplifier
- Compile TensorRT FP32 and FP16 engines (`build_engines.py`)
- Benchmark inference latency (`benchmark.py`)
Hardware: NVIDIA Tesla T4, CUDA 13.0, TensorRT 10.11
| Backend | Mean Latency | P99 Latency | FPS |
|---|---|---|---|
| PyTorch CUDA | 14.66 ms | 15.28 ms | 68 |
| TensorRT FP32 | 0.61 ms | 1.87 ms | 1642 |
| TensorRT FP16 | 0.45 ms | 1.42 ms | 2203 |
TensorRT FP16 achieves roughly a 32× speedup in mean latency over the PyTorch CUDA baseline (14.66 ms vs. 0.45 ms).
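The speedup and FPS figures follow directly from the mean latencies in the table. The short sketch below reproduces that arithmetic. The helper names `speedup` and `fps_from_latency` are illustrative and not part of the repository's scripts. Because the table's latencies are rounded to two decimals, FPS values recomputed this way may differ slightly from the FPS column.

```python
# Mean latencies in milliseconds, copied from the results table.
mean_latency_ms = {
    "PyTorch CUDA": 14.66,
    "TensorRT FP32": 0.61,
    "TensorRT FP16": 0.45,
}


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized backend is than the baseline."""
    return baseline_ms / optimized_ms


def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    baseline = mean_latency_ms["PyTorch CUDA"]
    for name, ms in mean_latency_ms.items():
        print(f"{name}: {speedup(baseline, ms):.1f}x vs baseline, "
              f"~{fps_from_latency(ms):.0f} FPS")
```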
pip install torch onnx onnx-simplifier tensorrt pycuda
python export_onnx.py # generates superpoint_simplified.onnx
python build_engines.py # generates fp32 and fp16 .engine files
python benchmark.py # runs latency benchmark
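For readers who want to see how mean latency, P99 latency, and FPS can be measured, here is a minimal timing harness using only the Python standard library. It is a sketch under assumptions, not the contents of `benchmark.py`: the function names, the warm-up and iteration counts, and the nearest-rank percentile definition are all chosen for illustration. The optional `sync` argument stands in for a device synchronization call (for example `torch.cuda.synchronize`), which is needed when the timed call only queues GPU work.

```python
import math
import time
from typing import Callable, Dict, List, Optional


def percentile(sorted_values: List[float], q: float) -> float:
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100.0))
    return sorted_values[rank - 1]


def benchmark(run_once: Callable[[], None],
              warmup: int = 20,
              iters: int = 200,
              sync: Optional[Callable[[], None]] = None) -> Dict[str, float]:
    """Time run_once() and return mean and P99 latency in ms, plus FPS."""
    for _ in range(warmup):              # warm-up runs are not timed
        run_once()
    if sync is not None:                 # drain warm-up work before timing
        sync()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        if sync is not None:             # wait for queued work before stopping the clock
            sync()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "p99_ms": percentile(latencies_ms, 99),
        "fps": 1000.0 / mean_ms,
    }


if __name__ == "__main__":
    # Stand-in CPU workload; replace with a real inference call.
    print(benchmark(lambda: sum(range(10_000))))
```

Timing each iteration separately, rather than timing the whole loop once, is what makes a tail statistic such as P99 available alongside the mean.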