CUDA-AES Benchmark is a reproducible GPU AES benchmark suite for CUDA developers. It measures CUDA kernels for AES-128 and AES-256, compares them with an OpenSSL CPU baseline, records raw benchmark artifacts, and documents the correctness checks and methodology behind the numbers.
Use this project to study GPU AES performance with AES-128 and AES-256 keys across ECB, CBC, CFB, OFB, CTR, GCM, CCM, XTS-AES, AES-KW, and AES-KWP workloads. The repository is intended as a reproducible cryptography benchmark with raw artifacts and explicit scope notes, not as a source of unsupported speed claims.
This repository is benchmark and research software, not a production cryptography library.
Implemented in the canonical top-level build:
| Mode | AES-128 | AES-256 | Correctness tests | Benchmark rows | Notes |
|---|---|---|---|---|---|
| ECB | Yes | Yes | Yes | Yes | NIST-style known-answer coverage |
| CBC | Yes | Yes | Yes | Yes | Confidentiality-only feedback mode |
| CFB | Yes | Yes | Yes | Yes | CFB-128 full-block segment scope |
| OFB | Yes | Yes | Yes | Yes | Confidentiality-only chained keystream mode |
| CTR | Yes | Yes | Yes | Yes | 96-bit IV/counter helper in benchmark |
| GCM | Yes | Yes | Yes | Yes | 96-bit IV, empty AAD, full blocks |
| CCM | Yes | Yes | Yes | Yes | 96-bit nonce, empty AAD, 16-byte tag, full blocks |
| XTS-AES | Yes | Yes | Yes | Yes | Storage-sector mode, 16-byte sector tweak, full blocks |
| AES-KW | Yes | Yes | Yes | Yes | Key-wrap workload, 16-byte key data records |
| AES-KWP | Yes | Yes | Yes | Yes | Key-wrap-with-padding workload, 20-byte key data records |
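The record sizes in the AES-KW and AES-KWP rows determine how many bytes each wrap operation produces. The following is a minimal sketch of that arithmetic, assuming the workloads follow the standard key-wrap definitions (RFC 3394 for AES-KW, RFC 5649 for AES-KWP); the function names are hypothetical and are not taken from this repository.

```python
def kw_wrapped_len(key_data_len):
    """Output length in bytes for AES-KW (RFC 3394).

    The input must be a multiple of 8 bytes and at least 16 bytes;
    the output adds one 8-byte integrity block.
    """
    if key_data_len < 16 or key_data_len % 8 != 0:
        raise ValueError("AES-KW needs a multiple of 8 bytes, minimum 16")
    return key_data_len + 8


def kwp_wrapped_len(key_data_len):
    """Output length in bytes for AES-KWP (RFC 5649).

    The input is zero-padded up to a multiple of 8 bytes,
    then one 8-byte header block is added.
    """
    if key_data_len < 1:
        raise ValueError("AES-KWP needs at least 1 byte")
    padded = (key_data_len + 7) // 8 * 8
    return padded + 8


# Record sizes from the table: 16-byte AES-KW records, 20-byte AES-KWP records.
print(kw_wrapped_len(16), kwp_wrapped_len(20))  # prints "24 32"
```

The 20-byte AES-KWP record is not a multiple of 8, so it exercises the padding path that a 16-byte AES-KW record never reaches.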
Planned coverage includes distinct GMAC/CMAC authentication benchmarking.
Prerequisites:
- NVIDIA GPU with a CUDA-capable driver
- CUDA Toolkit with `nvcc`
- CMake 3.28 or newer
- CUDA-compatible host C++ compiler
- OpenSSL development package discoverable by CMake
Configure and build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release

You can also run the benchmark globally without manual compilation via npx, assuming you have CMake and the CUDA Toolkit installed:

npx cuda-aes-benchmark

On Windows, use a Visual Studio Developer Command Prompt or pass the host compiler explicitly:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_HOST_COMPILER=<path-to-cl.exe>
cmake --build build --config Release

Run correctness checks before interpreting benchmark output:

ctest --test-dir build --output-on-failure

Run a small reproducibility smoke benchmark:
./build/CudaProject --runs 1 --sizes 1048576 --bench-dir bench/smoke
python scripts/summarize_benchmarks.py bench/smoke/thr_gpu.csv bench/smoke/thr_cpu.csv -o bench/smoke/summary.md

Windows executable paths may use .\build\Release\CudaProject.exe depending on the generator.
The benchmark writes raw artifacts under bench/ by default, or under --bench-dir:
- `run_metadata.csv` records schema version, command line, run count, selected sizes, OS/compiler hints, CUDA runtime/driver versions, GPU name, compute capability, and clocks/persistence note.
- `thr_gpu.csv` records GPU rows with `timing_scope=kernel_only`; this is CUDA event timing around the kernel launch, not end-to-end application throughput.
- `thr_cpu.csv` records OpenSSL CPU baseline rows with `timing_scope=cpu_baseline`.
- `summary.md` is generated from raw CSV files by `scripts/summarize_benchmarks.py`.
Raw result columns use Phase 3 schema phase3.v1:
schema_version,benchmark_run_id,timing_scope,device,cipher,block_size,run_index,run_count,time_ms,GiB/s,operation,command_line
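The column list above can be checked mechanically before any analysis. The following is a minimal sketch, not the logic of `scripts/summarize_benchmarks.py`: it assumes the raw CSV files start with a header row in exactly this column order and that the `schema_version` field holds the string `phase3.v1`. The sample row, including the cipher label and device name, is invented for illustration.

```python
import csv
import io

# Column order copied from the schema line above.
PHASE3_COLUMNS = [
    "schema_version", "benchmark_run_id", "timing_scope", "device", "cipher",
    "block_size", "run_index", "run_count", "time_ms", "GiB/s", "operation",
    "command_line",
]


def load_rows(text):
    """Parse CSV text, check header and schema version, convert numeric fields."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != PHASE3_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        if row["schema_version"] != "phase3.v1":
            raise ValueError(f"unexpected schema_version: {row['schema_version']}")
        row["run_index"] = int(row["run_index"])
        row["run_count"] = int(row["run_count"])
        row["time_ms"] = float(row["time_ms"])
        row["GiB/s"] = float(row["GiB/s"])
        rows.append(row)
    return rows


# Hypothetical sample: every value in this row is invented for illustration.
SAMPLE = (
    ",".join(PHASE3_COLUMNS) + "\n"
    "phase3.v1,run-0001,kernel_only,ExampleGPU,aes128-ctr,256,0,1,0.5,1.95,encrypt,--runs 1\n"
)

rows = load_rows(SAMPLE)
print(rows[0]["cipher"], rows[0]["timing_scope"], rows[0]["GiB/s"])
```

Failing fast on a header or schema mismatch keeps rows from a different schema version from being mixed silently into one summary.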
- Documentation landing page - search-friendly CUDA AES benchmark documentation index
- Architecture - canonical source layout and runtime flow
- Correctness - KAT coverage, GCM scope, and verification limits
- Benchmark Methodology - reproducible run procedure, raw files, timing scope, and summary generation
- Results - how to package and interpret benchmark results
- Profiling - NVTX, Nsight, and PTX dump helpers
- Mode Matrix - implemented, tested, benchmarked, documented, and planned AES mode coverage
- Legacy Tezcan Implementation - provenance and non-canonical legacy implementation notes
Benchmark results are only meaningful after deterministic correctness tests pass. The GPU timing scope is currently kernel_only, which excludes allocation, host-to-device copy, device-to-host copy, output validation, and summary generation. Do not compare kernel-only rows against future end-to-end rows without preserving timing_scope.
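One way to enforce the `timing_scope` rule in analysis code is to refuse any ratio between rows whose scopes differ. The sketch below assumes rows are dictionaries keyed by the raw CSV column names; the function name `throughput_ratio`, the `end_to_end` scope label, and all numbers are hypothetical. It is meant for GPU-to-GPU comparisons, since GPU and CPU baseline rows carry different scope labels by design.

```python
def throughput_ratio(row_a, row_b):
    """Return row_a / row_b throughput, refusing rows with different timing scopes."""
    if row_a["timing_scope"] != row_b["timing_scope"]:
        raise ValueError(
            "timing_scope mismatch: "
            f"{row_a['timing_scope']} vs {row_b['timing_scope']}"
        )
    return row_a["GiB/s"] / row_b["GiB/s"]


# Hypothetical rows; the numbers are invented for illustration.
old_kernel = {"timing_scope": "kernel_only", "cipher": "aes128-ctr", "GiB/s": 2.0}
new_kernel = {"timing_scope": "kernel_only", "cipher": "aes128-ctr", "GiB/s": 4.0}
end_to_end = {"timing_scope": "end_to_end", "cipher": "aes128-ctr", "GiB/s": 1.0}

print(throughput_ratio(new_kernel, old_kernel))  # same scope: ratio is meaningful

try:
    throughput_ratio(new_kernel, end_to_end)     # different scopes: rejected
except ValueError as err:
    print("refused:", err)
```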
ECB, CBC, CFB, OFB, and CTR are confidentiality-only modes; they do not authenticate ciphertext. CBC, CFB, and OFB also have feedback dependencies, so their rows should not be interpreted as CTR-like parallel throughput.
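The feedback dependency can be illustrated without any GPU code. In the sketch below, `toy_block` is a hash-based stand-in for the AES block function; it is not AES and not a usable cipher, and all names are hypothetical. It only shows the data dependencies: CBC encryption needs the previous ciphertext block, while each CTR block depends only on its own index.

```python
import hashlib

BLOCK = 16


def toy_block(key, block):
    """Stand-in for AES block encryption. NOT AES; illustration only."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key, iv, blocks):
    out, prev = [], iv
    for p in blocks:  # sequential: block i needs ciphertext block i-1
        prev = toy_block(key, xor(p, prev))
        out.append(prev)
    return out


def ctr_encrypt(key, nonce, blocks, order=None):
    order = range(len(blocks)) if order is None else order
    out = [None] * len(blocks)
    for i in order:  # any order works: block i depends only on i
        counter = nonce + i.to_bytes(4, "big")  # 96-bit nonce + 32-bit counter
        out[i] = xor(blocks[i], toy_block(key, counter))
    return out


key, iv, nonce = b"k" * 16, bytes(16), bytes(12)
blocks = [bytes([i]) * BLOCK for i in range(4)]
changed = [b"\xff" * BLOCK] + blocks[1:]  # modify only the first block


def diff(a, b):
    """Indices of blocks that differ between two block lists."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print("CBC blocks affected:", diff(cbc_encrypt(key, iv, blocks), cbc_encrypt(key, iv, changed)))
print("CTR blocks affected:", diff(ctr_encrypt(key, nonce, blocks), ctr_encrypt(key, nonce, changed)))
print("CTR order-independent:",
      ctr_encrypt(key, nonce, blocks) == ctr_encrypt(key, nonce, blocks, order=[3, 2, 1, 0]))
```

Changing the first plaintext block alters every later CBC ciphertext block but only the first CTR block, which is why a single chained message cannot be spread across GPU threads the way CTR blocks can.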
Use repeated runs, fixed GPU clocks, persistence-mode notes, and a quiet system when comparing throughput numbers. Publish raw CSV files and generated summaries together.
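A simple way to judge whether repeated runs are stable enough to compare is to report the median alongside the spread. This is a minimal sketch; the function name, the sample throughput values, and the 5% noise threshold are assumptions chosen for illustration, not project policy.

```python
import statistics


def summarize_runs(gib_per_s):
    """Median, range, and relative spread of repeated throughput measurements."""
    median = statistics.median(gib_per_s)
    low, high = min(gib_per_s), max(gib_per_s)
    return {
        "runs": len(gib_per_s),
        "median": median,
        "min": low,
        "max": high,
        "rel_spread": (high - low) / median,  # range as a fraction of the median
    }


# Hypothetical GiB/s values from five runs of one cipher and size.
summary = summarize_runs([10.0, 10.5, 9.5, 10.0, 10.0])
print(summary)

NOISE_THRESHOLD = 0.05  # assumed cutoff, tune for your own system
if summary["rel_spread"] > NOISE_THRESHOLD:
    print("spread exceeds threshold; check clocks and background load before comparing")
```

The median is less sensitive than the mean to a single slow run caused by clock ramp-up or background activity.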
- `main.cu` - benchmark runner, CLI parsing, GPU launch orchestration, OpenSSL CPU comparison, CSV output, and debug routines
- `aes_common.h`, `aes_tables.cu` - shared AES declarations, constants, lookup tables, and key expansion helpers
- `aes128_*.cu`, `aes256_*.cu` - canonical AES kernel implementations
- `tests/kat_main.cu` - deterministic known-answer tests
- `scripts/summarize_benchmarks.py` - raw CSV to Markdown summary generator
- `docs/` - public documentation
- `v3/` - local experimental variant, not the canonical build target
- `cihangirTezcanAESimplementation/` - legacy/provenance implementation
- Runtime CMake/CTest verification in the current development shell is blocked until `nvcc` can find `cl.exe`.
- GCM coverage is limited to 96-bit IV, empty AAD, and full 16-byte blocks.
- CCM coverage is limited to 96-bit nonce, empty AAD, 16-byte tag, and full 16-byte blocks.
- XTS-AES coverage is limited to full 16-byte blocks with a 16-byte sector tweak; ciphertext stealing is not implemented.
- AES-KW and AES-KWP benchmark rows are GPU key-wrap workload rows. They are not bulk encryption throughput, and CPU baseline rows are not emitted for these modes yet.
- Partial-block behavior and non-empty AAD are not benchmarked in v1.
- CPU baseline rows are not a controlled CPU performance study.
- This project does not claim to be the fastest GPU AES implementation.
The v1 roadmap focuses on:
- Open-source documentation and governance
- Full practical AES mode coverage
- Discoverability for CUDA AES and GPU AES benchmark searches
- Versioned releases with reproducible raw benchmark artifacts