An interactive learning platform for the Programming Parallel Computers course (CS-E4580) at Aalto University.
This repo turns the course material into hands-on, visual, and interactive content — making parallel programming concepts easier to understand and experiment with.
🌐 Live platform → ahmedaltu.github.io/programming-parallel-computers
- Interactive visualizations — diagrams, memory layouts, and performance charts you can explore in the browser
- Code walkthroughs — step-by-step breakdowns of V0→V7 optimisations from the case study
- Assembly analysis — annotated assembly showing what the CPU actually executes
- Exercises — fully optimised solutions with documented reasoning
| Chapter | Topic | Key techniques |
|---|---|---|
| Chapter 1 | Role of parallelism — why and how | Moore's Law, latency vs throughput |
| Chapter 2 | CPU optimisation case study — 0.6% → ~100% peak | OpenMP, ILP, SIMD/AVX-512, register tiling, Z-order, prefetch |
| Chapter 3 | Multithreading | OpenMP memory model, false sharing, scheduling, atomics |
| Chapter 4 | GPU programming | CUDA V0→V4, coalescing, shared memory tiling, float4, occupancy, Nsight |
Fully optimised C++ implementation targeting the course grader (AVX-512):
- AVX-512 `float16_t` SIMD with 6×16 register-tiled kernel
- Z-order (Morton) tile traversal for cache locality
- Software prefetching
- Two-pass normalisation
- OpenMP parallelised over all tile pairs
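As a rough illustration of the Z-order traversal listed above, the sketch below interleaves the bits of a tile's row and column index to form a Morton key, then sorts the tiles by that key. Tiles that are close in the 2D grid end up close in the visiting order, which tends to keep recently used rows and columns in cache. The names `spread_bits`, `morton_key`, and `z_order_tiles` are hypothetical and not taken from this repo, and the real kernel may generate the order differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Spread the lower 16 bits of v so that bit i moves to bit 2*i.
inline std::uint32_t spread_bits(std::uint32_t v) {
    assert(v < (1u << 16));  // only 16-bit indices fit in a 32-bit key
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Morton key: row bits in the odd positions, column bits in the even ones.
inline std::uint32_t morton_key(std::uint32_t row, std::uint32_t col) {
    return (spread_bits(row) << 1) | spread_bits(col);
}

// Hypothetical helper: all (row, col) tiles of an n_tiles x n_tiles grid,
// listed in Z-order. A parallel loop could then walk this list in order.
inline std::vector<std::pair<int, int>> z_order_tiles(int n_tiles) {
    std::vector<std::pair<int, int>> tiles;
    tiles.reserve(static_cast<std::size_t>(n_tiles) * n_tiles);
    for (int row = 0; row < n_tiles; ++row) {
        for (int col = 0; col < n_tiles; ++col) {
            tiles.emplace_back(row, col);
        }
    }
    std::sort(tiles.begin(), tiles.end(),
              [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                  return morton_key(static_cast<std::uint32_t>(a.first),
                                    static_cast<std::uint32_t>(a.second)) <
                         morton_key(static_cast<std::uint32_t>(b.first),
                                    static_cast<std::uint32_t>(b.second));
              });
    return tiles;
}
```

For a 2×2 grid this visits (0,0), (0,1), (1,0), (1,1); for larger grids the same "Z" pattern repeats recursively within each quadrant.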
- CPU version: 2D prefix sums reducing O(nx²·ny²·w·h) → O(nx²·ny²), OpenMP parallelised
- GPU version: CUDA kernel with shared memory block reduction and prefix sums on device
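The complexity drop from O(nx²·ny²·w·h) to O(nx²·ny²) comes from answering each rectangle-sum query in constant time instead of re-adding every pixel inside the rectangle. A minimal sketch of that idea, assuming a single-channel image stored row by row in a `std::vector<double>`, is shown below. The `PrefixSum2D` type and its `rect_sum` method are hypothetical names for illustration, not the repo's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 2D prefix sums (summed-area table) with one row and column of zero padding.
// s[y][x] holds the sum of all data[j][i] with j < y and i < x.
struct PrefixSum2D {
    int ny;
    int nx;
    std::vector<double> s;  // (ny + 1) * (nx + 1) entries

    PrefixSum2D(const std::vector<double>& data, int ny_, int nx_)
        : ny(ny_), nx(nx_),
          s(static_cast<std::size_t>(ny_ + 1) * (nx_ + 1), 0.0) {
        assert(data.size() == static_cast<std::size_t>(ny) * nx);
        const int stride = nx + 1;
        for (int y = 0; y < ny; ++y) {
            for (int x = 0; x < nx; ++x) {
                s[(y + 1) * stride + (x + 1)] =
                    data[y * nx + x]
                    + s[y * stride + (x + 1)]   // block above
                    + s[(y + 1) * stride + x]   // block to the left
                    - s[y * stride + x];        // overlap counted twice
            }
        }
    }

    // Sum over the half-open rectangle [y0, y1) x [x0, x1) in O(1).
    double rect_sum(int y0, int x0, int y1, int x1) const {
        assert(0 <= y0 && y0 <= y1 && y1 <= ny);
        assert(0 <= x0 && x0 <= x1 && x1 <= nx);
        const int stride = nx + 1;
        return s[y1 * stride + x1] - s[y0 * stride + x1]
             - s[y1 * stride + x0] + s[y0 * stride + x0];
    }
};
```

Building the table costs O(nx·ny) once; after that, every candidate rectangle costs four lookups regardless of its width and height, which is what removes the w·h factor.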
| Layer | Tools |
|---|---|
| CPU parallelism | OpenMP, AVX-512 (float16_t, ZMM registers) |
| GPU parallelism | CUDA (nvcc), shared memory, float4 vectorised loads |
| Profiling | Nsight Systems, Nsight Compute, perf |
| Language | C++17, CUDA C++ |
| Platform | Aalto course grader (AVX-512), Maari GPU machines |
Built while studying the course — combining learning with building.