Course: CS610 – High Performance Computing
Instructor: Prof. Swarnendu Biswas
Institute: IIT Kanpur
Duration: Aug 2025 – Dec 2025
This repository contains optimized CPU and GPU implementations of core numerical workloads, focusing on instruction-level, thread-level, and accelerator-based parallelism.
- File: matmul.cpp
- Optimized using:
  - AVX2 and SSE4 vector intrinsics
  - Loop unrolling and cache-friendly access
- Achieved up to 6× speedup over naive implementation
- File: gridsearch_original.cpp
- Reference serial implementation
- Used for correctness and performance comparison
- File: gridsearch_openmp.cpp
- Parallelized using OpenMP
- Achieved ~6.5× speedup on multi-core CPU
- File: convolution_gpu.cu
- GPU implementation using CUDA
- Achieved ~4× speedup compared to serial CPU version
- g++ (with OpenMP support)
- nvcc (CUDA Toolkit)
- Linux environment recommended
make