CUDA-AES Benchmark

CUDA-AES Benchmark is a reproducible GPU AES benchmark suite for CUDA developers. It measures CUDA kernels for AES-128 and AES-256, compares them with an OpenSSL CPU baseline, records raw benchmark artifacts, and documents the correctness checks and methodology behind the numbers.

Use this project to study GPU AES performance for AES-128 and AES-256 across ECB, CBC, CFB, OFB, CTR, GCM, CCM, XTS-AES, AES-KW, and AES-KWP workloads. The repository is intended as a reproducible cryptography benchmark with raw artifacts and explicit scope notes, not as a source of unsupported speed claims.

This repository is benchmark and research software, not a production cryptography library.

CUDA AES Benchmark Coverage

Implemented in the canonical top-level build:

Mode	AES-128	AES-256	Correctness tests	Benchmark rows	Notes
ECB	Yes	Yes	Yes	Yes	NIST-style known-answer coverage
CBC	Yes	Yes	Yes	Yes	Confidentiality-only feedback mode
CFB	Yes	Yes	Yes	Yes	CFB-128 full-block segment scope
OFB	Yes	Yes	Yes	Yes	Confidentiality-only chained keystream mode
CTR	Yes	Yes	Yes	Yes	96-bit IV/counter helper in benchmark
GCM	Yes	Yes	Yes	Yes	96-bit IV, empty AAD, full blocks
CCM	Yes	Yes	Yes	Yes	96-bit nonce, empty AAD, 16-byte tag, full blocks
XTS-AES	Yes	Yes	Yes	Yes	Storage-sector mode, 16-byte sector tweak, full blocks
AES-KW	Yes	Yes	Yes	Yes	Key-wrap workload, 16-byte key data records
AES-KWP	Yes	Yes	Yes	Yes	Key-wrap-with-padding workload, 20-byte key data records
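The record sizes in the AES-KW and AES-KWP rows follow from the key-wrap definitions in RFC 3394 and RFC 5649. The sketch below is illustrative only: the function names are hypothetical and it does not reflect how the CUDA kernels are written. It computes the wrapped output length and the number of serial AES block operations per record, which helps explain why these rows are per-record workloads rather than bulk throughput.

```python
SEMIBLOCK = 8  # key wrap operates on 64-bit (8-byte) semiblocks


def kw_wrapped_len(key_len):
    """AES-KW (RFC 3394): input is a multiple of 8 bytes, at least 16."""
    if key_len < 16 or key_len % SEMIBLOCK:
        raise ValueError("AES-KW input must be a multiple of 8 bytes, >= 16")
    return key_len + SEMIBLOCK  # one extra semiblock carries the integrity value


def kwp_wrapped_len(key_len):
    """AES-KWP (RFC 5649): pad to a multiple of 8 bytes, then add 8 bytes."""
    if key_len < 1:
        raise ValueError("AES-KWP input must be at least 1 byte")
    padded = -(-key_len // SEMIBLOCK) * SEMIBLOCK  # round up to a semiblock
    return padded + SEMIBLOCK


def wrap_aes_calls(wrapped_len):
    """Serial AES block operations needed to produce one wrapped record."""
    n = wrapped_len // SEMIBLOCK - 1  # number of plaintext semiblocks
    # A single-semiblock KWP input is one AES call; otherwise 6 passes over n.
    return 1 if n == 1 else 6 * n


if __name__ == "__main__":
    for name, length in (("AES-KW", kw_wrapped_len(16)), ("AES-KWP", kwp_wrapped_len(20))):
        print(name, "wrapped bytes:", length, "AES calls:", wrap_aes_calls(length))
```

Under these definitions a 16-byte AES-KW record wraps to 24 bytes and a 20-byte AES-KWP record wraps to 32 bytes, each requiring a chain of dependent AES calls.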

Planned coverage includes distinct GMAC/CMAC authentication benchmarking.

Quick Start

Prerequisites:

NVIDIA GPU with a CUDA-capable driver
CUDA Toolkit with nvcc
CMake 3.28 or newer
CUDA-compatible host C++ compiler
OpenSSL development package discoverable by CMake

Configure and build. Set CMAKE_CUDA_ARCHITECTURES to match your GPU; 86 targets compute capability 8.6:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release

On Windows, use a Visual Studio Developer Command Prompt or pass the host compiler explicitly:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_HOST_COMPILER=<path-to-cl.exe>
cmake --build build --config Release

Global NPX Execution (New in v2.0)

You can also run the benchmark through npx without configuring and building by hand, provided CMake and the CUDA Toolkit are installed:

npx cuda-aes-benchmark

Run correctness checks before interpreting benchmark output:

ctest --test-dir build --output-on-failure

Run a small reproducibility smoke benchmark:

./build/CudaProject --runs 1 --sizes 1048576 --bench-dir bench/smoke
python scripts/summarize_benchmarks.py bench/smoke/thr_gpu.csv bench/smoke/thr_cpu.csv -o bench/smoke/summary.md

Windows executable paths may use .\build\Release\CudaProject.exe depending on the generator.

Benchmark Artifacts

The benchmark writes raw artifacts under bench/ by default, or under --bench-dir:

run_metadata.csv records the schema version, command line, run count, selected sizes, OS/compiler hints, CUDA runtime/driver versions, GPU name, compute capability, and a clocks/persistence note.
thr_gpu.csv records GPU rows with timing_scope=kernel_only; this is CUDA event timing around the kernel launch, not end-to-end application throughput.
thr_cpu.csv records OpenSSL CPU baseline rows with timing_scope=cpu_baseline.
summary.md is generated from raw CSV files by scripts/summarize_benchmarks.py.

Raw result columns use Phase 3 schema phase3.v1:

schema_version,benchmark_run_id,timing_scope,device,cipher,block_size,run_index,run_count,time_ms,GiB/s,operation,command_line
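The column list above is enough to load the raw files with a standard CSV reader. The following is a minimal sketch, not the project's scripts/summarize_benchmarks.py. The function name load_rows is hypothetical, the sample row is a made-up placeholder rather than a measured result, and the sketch assumes that each file starts with this header line and that block_size, run_index, and run_count are integers.

```python
import csv
import io

# Column order copied from the schema line in this README.
EXPECTED_COLUMNS = [
    "schema_version", "benchmark_run_id", "timing_scope", "device",
    "cipher", "block_size", "run_index", "run_count", "time_ms",
    "GiB/s", "operation", "command_line",
]


def load_rows(handle, schema="phase3.v1"):
    """Read raw benchmark rows; reject unexpected headers or schema versions."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        if row["schema_version"] != schema:
            raise ValueError(f"unexpected schema: {row['schema_version']}")
        # Convert the numeric columns (assumed types); the rest stay strings.
        for name in ("block_size", "run_index", "run_count"):
            row[name] = int(row[name])
        for name in ("time_ms", "GiB/s"):
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Placeholder data for illustration only; these are NOT measured results,
# and the device, cipher, and operation labels are invented.
SAMPLE = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "phase3.v1,run-0,kernel_only,gpu,aes128_ctr,1048576,0,1,0.5,1.953125,encrypt,example\n"
)

if __name__ == "__main__":
    for row in load_rows(io.StringIO(SAMPLE)):
        print(row["timing_scope"], row["cipher"], row["block_size"], row["GiB/s"])
```

Rejecting an unexpected header or schema version up front keeps rows from different schema revisions from being mixed silently.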

Documentation

Documentation landing page - search-friendly CUDA AES benchmark documentation index
Architecture - canonical source layout and runtime flow
Correctness - KAT coverage, GCM scope, and verification limits
Benchmark Methodology - reproducible run procedure, raw files, timing scope, and summary generation
Results - how to package and interpret benchmark results
Profiling - NVTX, Nsight, and PTX dump helpers
Mode Matrix - implemented, tested, benchmarked, documented, and planned AES mode coverage
Legacy Tezcan Implementation - provenance and non-canonical legacy implementation notes

Contributing And Governance

Methodology Summary

Benchmark results are only meaningful after deterministic correctness tests pass. The GPU timing scope is currently kernel_only, which excludes allocation, host-to-device copy, device-to-host copy, output validation, and summary generation. Do not compare kernel-only rows against future end-to-end rows without preserving timing_scope.
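The timing-scope rule can be made mechanical when post-processing GPU rows. The sketch below is an assumption-laden illustration rather than the project's code: it assumes the GiB/s column is derived from bytes processed and time_ms using binary gibibytes, both function names are hypothetical, and "end_to_end" is an invented label for a scope that does not exist yet.

```python
GIB = 1024 ** 3  # bytes per GiB (binary gibibyte)


def gib_per_second(num_bytes, time_ms):
    """Convert a byte count and elapsed milliseconds into GiB/s."""
    if time_ms <= 0:
        raise ValueError("time_ms must be positive")
    return num_bytes * 1000.0 / (time_ms * GIB)


def gpu_throughput_ratio(row_a, row_b):
    """Return throughput of row_a over row_b, refusing to mix timing scopes."""
    if row_a["timing_scope"] != row_b["timing_scope"]:
        raise ValueError(
            f"timing_scope mismatch: {row_a['timing_scope']} vs {row_b['timing_scope']}"
        )
    return row_a["GiB/s"] / row_b["GiB/s"]


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    kernel = {"timing_scope": "kernel_only", "GiB/s": gib_per_second(1 << 20, 0.5)}
    # "end_to_end" is a hypothetical label for a future scope.
    e2e = {"timing_scope": "end_to_end", "GiB/s": gib_per_second(1 << 20, 2.0)}
    try:
        gpu_throughput_ratio(kernel, e2e)
    except ValueError as exc:
        print("refused:", exc)
```

The guard is meant for comparing GPU rows with each other; GPU-versus-CPU comparisons use the separate cpu_baseline scope and need their own interpretation.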

ECB, CBC, CFB, OFB, and CTR are confidentiality-only modes; they do not authenticate ciphertext. CBC, CFB, and OFB also have feedback dependencies, so their rows should not be interpreted as CTR-like parallel throughput.
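The difference between a feedback mode and a counter mode in the encryption direction can be shown with a toy model. This sketch deliberately uses a SHA-256-based stand-in instead of AES, so it is not secure and not the repository's implementation. The 12-byte nonce plus 4-byte big-endian counter layout is also an assumption, chosen only to echo the 96-bit IV note above.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def toy_block(key, block):
    """Stand-in for one block cipher call; NOT AES and not secure."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_like(key, iv, blocks):
    """Block i needs ciphertext block i-1, so encryption is inherently serial."""
    out, prev = [], iv
    for block in blocks:
        prev = toy_block(key, xor(block, prev))
        out.append(prev)
    return out


def ctr_like(key, nonce, blocks):
    """Block i needs only (nonce, i), so every iteration is independent."""
    out = []
    for i, block in enumerate(blocks):
        counter_block = nonce + i.to_bytes(4, "big")  # assumed layout
        out.append(xor(block, toy_block(key, counter_block)))
    return out


def changed_positions(a, b):
    """Indices of output blocks that differ between two runs."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    key, iv, nonce = b"k" * 16, b"\x00" * 16, b"\x00" * 12
    base = [bytes([i]) * BLOCK for i in range(4)]
    edited = [b"\xff" * BLOCK] + base[1:]  # change only the first block
    print("CBC-like:", changed_positions(cbc_like(key, iv, base), cbc_like(key, iv, edited)))
    print("CTR-like:", changed_positions(ctr_like(key, nonce, base), ctr_like(key, nonce, edited)))
```

Changing the first plaintext block should alter every output block in the chained variant but only the first block in the counter variant, which is why chained rows within one message cannot be spread across threads the way CTR rows can.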

Use repeated runs, fixed GPU clocks, persistence-mode notes, and a quiet system when comparing throughput numbers. Publish raw CSV files and generated summaries together.
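One way to act on the repeated-runs advice is to report a median together with a spread figure and to flag noisy configurations. The sketch below is a hedged illustration: the function names are hypothetical, the 5% threshold is an arbitrary example value rather than a project rule, and the input numbers are placeholders, not measurements.

```python
import statistics


def summarize_runs(gib_per_s):
    """Return run count, median, and relative spread for repeated runs."""
    if not gib_per_s:
        raise ValueError("need at least one run")
    median = statistics.median(gib_per_s)
    # Relative spread: (max - min) as a fraction of the median.
    spread = (max(gib_per_s) - min(gib_per_s)) / median
    return {"runs": len(gib_per_s), "median": median, "relative_spread": spread}


def is_noisy(summary, threshold=0.05):
    """Flag a configuration whose runs disagree by more than the threshold."""
    return summary["relative_spread"] > threshold


if __name__ == "__main__":
    # Placeholder values for illustration only.
    summary = summarize_runs([10.0, 10.0, 12.0, 8.0])
    print(summary, "noisy" if is_noisy(summary) else "stable")
```

A single run has zero spread by construction, so a --runs 1 smoke benchmark says nothing about run-to-run stability.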

Repository Layout

main.cu - benchmark runner, CLI parsing, GPU launch orchestration, OpenSSL CPU comparison, CSV output, and debug routines
aes_common.h, aes_tables.cu - shared AES declarations, constants, lookup tables, and key expansion helpers
aes128_*.cu, aes256_*.cu - canonical AES kernel implementations
tests/kat_main.cu - deterministic known-answer tests
scripts/summarize_benchmarks.py - raw CSV to Markdown summary generator
docs/ - public documentation
v3/ - local experimental variant, not the canonical build target
cihangirTezcanAESimplementation/ - legacy/provenance implementation

Current Limitations

Runtime CMake/CTest verification in the current development shell is blocked until nvcc can find cl.exe.
GCM coverage is limited to 96-bit IV, empty AAD, and full 16-byte blocks.
CCM coverage is limited to 96-bit nonce, empty AAD, 16-byte tag, and full 16-byte blocks.
XTS-AES coverage is limited to full 16-byte blocks with a 16-byte sector tweak; ciphertext stealing is not implemented.
AES-KW and AES-KWP benchmark rows are GPU key-wrap workload rows. They are not bulk encryption throughput, and CPU baseline rows are not emitted for these modes yet.
Partial-block behavior and non-empty AAD are not benchmarked in v1.
CPU baseline rows are not a controlled CPU performance study.
This project does not claim to be the fastest GPU AES implementation.

Roadmap Direction

The v1 roadmap focuses on:

Open-source documentation and governance
Full practical AES mode coverage
Discoverability for CUDA AES and GPU AES benchmark searches
Versioned releases with reproducible raw benchmark artifacts
