hwbench

A reproducible, cross-backend benchmarking tool for ML inference. Run the same model across multiple hardware backends, get clean latency / throughput / memory numbers, store the results, and check later runs against them for regressions.

Designed to be the kind of harness an ML benchmarking team actually uses day-to-day, and designed so that a Tenstorrent backend can drop in as a peer of the CPU and CUDA backends once hardware is available.

⚠️ v0.1.0 — early. CPU backend is functional. CUDA backend is implemented but unverified locally. TTNN backend is a stub awaiting hardware access. Issues + PRs welcome.

Quick start

git clone https://github.com/OrangeTangy/hwbench.git
cd hwbench
pip install -e ".[models]"

# Benchmark ResNet18 on CPU, 50 iterations, batch size 1
hwbench bench resnet18 --backend cpu --iters 50 --batch 1

# What models are registered?
hwbench models

# What backends are available on this machine?
hwbench backends

# Inspect a saved result
hwbench report results/resnet18_cpu_*.json

Example output (truncated):

                Benchmark: resnet18 / cpu / batch=1 / iters=50
┌────────────────┬───────────┐
│ Metric         │     Value │
├────────────────┼───────────┤
│ p50 latency    │  18.4 ms  │
│ p95 latency    │  21.2 ms  │
│ p99 latency    │  24.7 ms  │
│ mean latency   │  19.1 ms  │
│ throughput     │  52.3 qps │
│ peak memory    │  412.6 MB │
└────────────────┴───────────┘
Saved → results/resnet18_cpu_20260527-184530.json
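The dated filename above suggests a `<model>_<backend>_<timestamp>.json` naming scheme. Below is a minimal sketch of how such a file could be written together with an environment fingerprint. The function names `save_result` and `environment_fingerprint`, the JSON keys, and the fingerprint fields are assumptions made for illustration; they are not hwbench's actual schema, which lives in `storage.py`.

```python
import json
import platform
import sys
import time
from pathlib import Path


def environment_fingerprint() -> dict:
    """Collect a few facts about the machine (illustrative fields only)."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


def save_result(model: str, backend: str, metrics: dict, out_dir: str = "results") -> Path:
    """Write one result to <out_dir>/<model>_<backend>_<timestamp>.json."""
    stamp = time.strftime("%Y%m%d-%H%M%S")  # same shape as 20260527-184530
    path = Path(out_dir) / f"{model}_{backend}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "model": model,
        "backend": backend,
        "metrics": metrics,
        "environment": environment_fingerprint(),
    }
    # sort_keys keeps key order stable so two result files diff cleanly.
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path
```

Calling `save_result("resnet18", "cpu", {"p50_ms": 18.4})` would create a new dated file under `results/` on each run, which is what makes later comparison possible.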

What it does

Concern	How hwbench handles it
Cross-backend	Pluggable `Backend` interface — add a new file under `src/hwbench/backends/` and register it
Reproducibility	Pinned deps, deterministic seeds, Docker image, every result JSON includes the full environment fingerprint
Measurement	Warm-up iterations, monotonic clock, percentile statistics (p50/p95/p99), peak memory via psutil
Storage	JSON files in `results/` — easy to diff, easy to feed into a dashboard
Regression tracking	Re-running produces new dated files; a planned `hwbench compare A.json B.json` command will flag regressions
CI	GitHub Actions runs the test suite on every push
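The Measurement row can be illustrated with a small standard-library sketch: discard warm-up runs, time each iteration with a monotonic clock, then report percentiles. The names `measure` and `percentile`, the result keys, and the throughput definition (batch size divided by mean latency) are assumptions for illustration, not the code in `runner.py` or `metrics.py`. Peak memory is left out so the sketch has no third-party dependency.

```python
import statistics
import time
from typing import Callable


def percentile(sorted_values: list, q: float) -> float:
    """Linearly interpolated percentile of an already sorted list, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def measure(fn: Callable[[], object], iters: int = 50, warmup: int = 5, batch: int = 1) -> dict:
    """Time fn() after warm-up runs and summarize latency in milliseconds."""
    for _ in range(warmup):  # warm-up iterations: timings are discarded
        fn()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    mean_ms = statistics.fmean(samples_ms)
    return {
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        "mean_ms": mean_ms,
        # Assumed definition: items processed per second at the mean latency.
        "throughput_qps": batch * 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

For example, `measure(lambda: sum(range(10_000)), iters=50)` returns a dictionary with the same metrics as the table in the example output, minus peak memory.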

Supported backends

Backend	Status	Notes
`cpu`	✅ Working	PyTorch CPU, with optional `torch.compile`
`cuda`	🟡 Implemented, untested locally	Standard PyTorch CUDA path. Run on a machine with an NVIDIA GPU.
`ttnn`	🔲 Stub	To be implemented once hardware access is available. See `src/hwbench/backends/ttnn_stub.py` for the interface this backend will implement.

Adding a new backend takes one file of roughly 100 lines of code. See DESIGN.md.
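To give a feel for what a pluggable backend might look like, here is a hedged sketch of an abstract base class plus a registry. The method names (`is_available`, `prepare`, `run`), the `BACKENDS` dictionary, the `register` decorator, and the toy `EchoBackend` are all hypothetical; the real interface is defined in `src/hwbench/backends/base.py` and described in DESIGN.md.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class Backend(ABC):
    """Hypothetical backend shape; not the actual hwbench interface."""

    name: str

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this backend can run on the current machine."""

    @abstractmethod
    def prepare(self, model: Any) -> Any:
        """Move or compile the model for this backend."""

    @abstractmethod
    def run(self, prepared_model: Any, inputs: Any) -> Any:
        """Execute one inference call."""


BACKENDS: Dict[str, Type[Backend]] = {}


def register(cls: Type[Backend]) -> Type[Backend]:
    """Class decorator that records a backend under its name."""
    BACKENDS[cls.name] = cls
    return cls


@register
class EchoBackend(Backend):
    """Toy backend that calls the model directly, with no device handling."""

    name = "echo"

    def is_available(self) -> bool:
        return True

    def prepare(self, model: Any) -> Any:
        return model

    def run(self, prepared_model: Any, inputs: Any) -> Any:
        return prepared_model(inputs)
```

With a shape like this, the benchmark runner only needs to look up a backend by name and call the same three methods, regardless of the hardware behind it.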

Supported models

Model	Source	Default input
`resnet18`	torchvision	1×3×224×224 image
`resnet50`	torchvision	1×3×224×224 image
`bert-base`	transformers	tokenized "Hello world"
`gpt2`	transformers	tokenized "The quick brown fox"
`whisper-tiny`	transformers	30s audio chunk (synthetic)

Add models by appending to the registry in src/hwbench/models.py.
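As an illustration of what appending to a registry could involve, the sketch below pairs each model name with a loader and a default-input factory. `ModelSpec`, `MODELS`, `register_model`, and the toy `double` entry are assumptions for illustration; the actual registry format is whatever `src/hwbench/models.py` defines. The toy entry avoids torch so the sketch runs anywhere.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical registry entry: how to load a model and build its input."""

    name: str
    source: str
    load: Callable[[], Any]
    make_input: Callable[[int], Any]  # takes a batch size


MODELS: Dict[str, ModelSpec] = {}


def register_model(spec: ModelSpec) -> None:
    """Add a model to the registry, refusing silent overwrites."""
    if spec.name in MODELS:
        raise ValueError(f"model {spec.name!r} is already registered")
    MODELS[spec.name] = spec


# Toy entry: a "model" that doubles every element of its input list.
register_model(
    ModelSpec(
        name="double",
        source="toy",
        load=lambda: (lambda xs: [2 * x for x in xs]),
        make_input=lambda batch: [1.0] * batch,
    )
)
```

Keeping the default input next to the loader means a command such as `hwbench bench resnet18 --batch 1` would not need any model-specific flags to construct a valid input.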

Project layout

hwbench/
├── README.md
├── DESIGN.md              ← the architecture decisions, read this first
├── pyproject.toml
├── Dockerfile
├── Makefile
├── src/hwbench/
│   ├── cli.py             ← Click-based CLI entrypoint
│   ├── runner.py          ← benchmark orchestration: warmup + measurement loop
│   ├── models.py          ← model registry
│   ├── metrics.py         ← BenchmarkResult, percentile math
│   ├── storage.py         ← JSON persistence
│   └── backends/
│       ├── base.py        ← Backend abstract class
│       ├── cpu.py
│       ├── cuda.py
│       └── ttnn_stub.py
├── tests/
└── .github/workflows/ci.yml

Why this exists

ML inference benchmarking is one of those tasks where every team rolls their own hacky script, then everyone's results are slightly different because nobody shipped a reproducible harness. I'm building hwbench to be the simple, opinionated default — well-tested core, easy to plug new backends into, results that survive sharing.

Also: I'm applying for the Machine Learning Applications & Benchmarking internship at Tenstorrent, and this project is meant to demonstrate exactly the workflow that role describes. The TTNN backend is intentionally stubbed and waiting for someone with hardware to plug in — that someone is hopefully future-me on the team.

License

Apache-2.0.

Author

Written by @OrangeTangy. Feedback, issues, and PRs welcome — especially from anyone who has done ML benchmarking professionally.

Companion project: tensix-field-guide — a visual intro to Tenstorrent's processor architecture.
