A GPU-accelerated, pipelined image processing system for edge detection and visualization using CUDA. The project was structured to comply as much as possible with the proposed pipelined architecture, which is NOT the most optimized version (see full_parallel branch).
Language: C++17
IDE: Visual Studio 2022
Platform: Windows
- Overview
- Pipeline Architecture
- What Was Implemented
- What Was NOT Implemented
- Design Decisions & Optimizations
- Performance Metrics
- Project Structure
- Building & Running
- Future Work
This project implements a 5-stage GPU-accelerated pipeline for edge detection, designed to process multiple images concurrently using CUDA streams. The system detects edges using Gaussian smoothing followed by Laplacian edge detection, then visualizes results by overlaying colored edges on the original image.
Tick: 0 1 2 3 4 5 ...
Img0: [Load] [Gaussian] [Laplacian] [Binarize] [Visualize]
Img1: [Load] [Gaussian] [Laplacian] [Binarize] [Visualize]
Img2: [Load] [Gaussian] [Laplacian] [Binarize] ...
Each image is assigned its own CUDA stream, enabling concurrent execution of different pipeline stages across multiple images. This maximizes GPU utilization by overlapping:
- Memory transfers (HtoD / DtoH)
- Kernel execution (convolutions, binarization)
- CPU work (visualization overlay)
| Stage | Location | Description |
|---|---|---|
| 1. (Pre-Load) | CPU | All images loaded into pinned memory before pipeline starts |
| 2. Load | GPU | Async HtoD transfer via CUDA stream |
| 3. Gaussian Filter | GPU | 3x3 Gaussian convolution (σ=1) for noise reduction |
| 4. Laplacian Filter | GPU | 3x3 Laplacian convolution for edge detection |
| 5. Binarize | GPU | Threshold edge response + async DtoH transfer |
| 6. Visualize | CPU | Create colored edge overlay on original image |
| 7. (Save) | CPU | Batch save all output images at pipeline end |
| Requirement | Status | Notes |
|---|---|---|
| BMP image loading | ✅ | Supports any resolution via stb_image |
| GPU Gaussian filtering (3x3) | ✅ | Separable kernel, normalized weights |
| GPU Laplacian edge detection | ✅ | 4-connectivity kernel |
| GPU thresholding/binarization | ✅ | Configurable threshold value |
| Colored edge overlay | ✅ | Edges rendered in red on original |
| Pipelined execution | ✅ | Per-image CUDA streams |
| Per-stage timing | ✅ | CUDA events for GPU profiling |
| Output images saved | ✅ | BMP format in output folder |
| Per-image stats file | ✅ | Timing breakdown per stage |
| GPU utilization metrics | ✅ | Nsight analysis of the executable |
| Optimization | Status | Description |
|---|---|---|
| Pinned host memory | ✅ | cudaHostAlloc for faster DMA transfers |
| Async memory transfers | ✅ | cudaMemcpyAsync with streams |
| Double-buffered device memory | ✅ | Avoids in-place convolution race conditions |
| Pre-loading all images | ✅ | Eliminates disk I/O from pipeline hot path |
| Deferred image saving | ✅ | Batch save at end, not blocking pipeline |
| Stream synchronization | ✅ | Minimal syncs, only where required |
| Aspect | Status | Description |
|---|---|---|
| Modular stage functions | ✅ | Each stage is a separate function |
| Clear documentation | ✅ | Doxygen-style comments throughout |
| Error handling | ✅ | CUDA error checking macro ck() |
| Configurable parameters | ✅ | Threshold, kernel size as constants |
| Requirement | Status | Reason/Notes |
|---|---|---|
| Centroid detection | ❌ | Connected component analysis not implemented |
| Object counting | ❌ | Requires centroid/blob detection |
| Balanced stage timing | Convolutions are faster than transfers with this current dataset, but acceptable |
| Improvement | Status | Description |
|---|---|---|
| Threading for I/O | ❌ | Pre-load is sequential (could parallelize) |
| Shared memory convolution | ❌ | Using global memory (could optimize) |
| Threading image saving | ❌ | Sequential save at end (could parallelize) |
| Unit test framework | ❌ | More granular testing, using gtest or other tools |
Each image gets its own CUDA stream, allowing the GPU to interleave operations from different images. This is critical for hiding memory transfer latency behind kernel execution.
All images are loaded into CPU pinned memory before the pipeline starts. This eliminates disk I/O as a bottleneck during pipeline execution, at the cost of higher memory usage.
Convolution requires reading neighbor pixels while writing output.
The visualization stage prepares the overlay but does not write to disk. All images are saved in a batch at the end, keeping the pipeline CPU-bound work minimal. Overlapping and other operations related to this step are ran sequentially in the CPU, while the cuda streams work in the background, since I evaluated that using the GPU would have been less efficient.
Using cudaHostAlloc for host buffers enables DMA transfers that don't block the CPU and achieve higher bandwidth than pageable memory.
Preloading and deferred image saving was detached from the pipeline logic as the profilation shows how expensive they are in terms of time spent. This way I were able to overlap the execution of the stages better, as also shown in the Nsight report.
Per-stage timing is collected using CUDA events and reported at pipeline completion:
========================================
Pipeline Performance Statistics
========================================
Load+Transfer: X.XX ms avg (N images)
Gaussian: X.XX ms avg (N images)
Laplacian: X.XX ms avg (N images)
Binarization: X.XX ms avg (N images)
Save: X.XX ms avg (N images)
----------------------------------------
Total: X.XX ms avg per image
========================================
Individual image stats are written to output_images/output_image_N_stats.txt.
EdgeDetectionPipeline/
├── src/
│ ├── main.cpp # Pipeline orchestration and entry point
│ ├── kernel.cu # CUDA kernels (convolution, binarization)
│ └── bmpread.c # BMP file parsing (unused, using stb_image)
├── include/
│ ├── kernel.cuh # CUDA kernel declarations
│ ├── image_descriptor.hpp # Per-image data structure
│ ├── utils.hpp # I/O and utility functions
│ ├── thread_pool.hpp # Thread pool (available for future use)
│ ├── stb_image.h # Image loading library
│ └── stb_image_write.h # Image writing library
├── tests/
│ └── test_output_images.cpp # Output validation tests
├── output_images/ # Generated output images and stats
├── CMakeLists.txt # Build configuration
└── README.md # This document
- Visual Studio 2022 with C++ desktop development
- CUDA Toolkit 11.0+ (tested with 12.x)
- CMake 3.18+
# From project root
mkdir build
cd build
cmake ..
cmake --build . --config Release- Place input images (BMP format) in
input_images/folder - Run the executable:
cd build
.\Release\EdgeDetectionPipeline.exe- Output images appear in
output_images/
Edit constants in src/main.cpp:
constexpr uint8_t BINARIZE_THRESHOLD = 5; // Edge threshold (0-255)
const std::string input_folder = "..."; // Input path
const std::string output_folder = "..."; // Output path- Centroid Detection: Implement connected component labeling on GPU to identify objects and compute centroids
- GPU Utilization Metrics: Integrate CUPTI or use Nsight for detailed GPU occupancy reporting
- Memory Pooling: Pre-allocate device buffers for max image size to avoid per-image cudaMalloc
- Shared Memory Convolution: Use shared memory tiling for 2-3x faster convolutions
- Parallel I/O: Use thread pool for parallel image loading and saving
- Configurable Kernels: Support different kernel sizes and custom filter coefficients
Daniele Ferrario