hwbench

A reproducible, cross-backend benchmarking tool for ML inference. Run the same model across multiple hardware backends, get clean latency / throughput / memory numbers, store the results, regress against them later.

Designed to be the kind of harness an ML benchmarking team actually uses day-to-day, and designed so a Tenstorrent (TTNN) backend can drop in as a peer alongside CPU and CUDA once hardware is available.

⚠️ v0.1.0 — early. CPU backend is functional. CUDA backend is implemented but unverified locally. TTNN backend is a stub awaiting hardware access. Issues + PRs welcome.


Quick start

git clone https://github.com/OrangeTangy/hwbench.git
cd hwbench
pip install -e ".[models]"

# Benchmark ResNet18 on CPU, 50 iterations, batch size 1
hwbench bench resnet18 --backend cpu --iters 50 --batch 1

# What models are registered?
hwbench models

# What backends are available on this machine?
hwbench backends

# Inspect a saved result
hwbench report results/resnet18_cpu_*.json

Example output (truncated):

                Benchmark: resnet18 / cpu / batch=1 / iters=50
┌────────────────┬───────────┐
│ Metric         │     Value │
├────────────────┼───────────┤
│ p50 latency    │  18.4 ms  │
│ p95 latency    │  21.2 ms  │
│ p99 latency    │  24.7 ms  │
│ mean latency   │  19.1 ms  │
│ throughput     │  52.3 qps │
│ peak memory    │  412.6 MB │
└────────────────┴───────────┘
Saved → results/resnet18_cpu_20260527-184530.json
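
The saved-file name in the example output follows a model_backend_timestamp pattern. Below is a minimal sketch of how such a file could be written, assuming a flat JSON payload. The function names (environment_fingerprint, result_path, save_result) and the JSON keys are hypothetical and are not taken from storage.py, and the fingerprint here covers only a few standard-library fields rather than the full environment.

```python
import json
import platform
import sys
from datetime import datetime
from pathlib import Path


def environment_fingerprint() -> dict:
    """Collect a few environment facts (hypothetical subset, stdlib only)."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


def result_path(model: str, backend: str, when: datetime,
                root: Path = Path("results")) -> Path:
    """Build a path such as results/resnet18_cpu_20260527-184530.json."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return root / f"{model}_{backend}_{stamp}.json"


def save_result(metrics: dict, model: str, backend: str, when: datetime,
                root: Path = Path("results")) -> Path:
    """Write metrics plus the fingerprint as indented, key-sorted JSON."""
    path = result_path(model, backend, when, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "model": model,
        "backend": backend,
        "metrics": metrics,
        "environment": environment_fingerprint(),
    }
    # Sorted keys and fixed indentation keep two result files easy to diff.
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path
```

Sorting keys is a design choice made for this sketch: it keeps the line order stable between runs, so a plain diff of two result files shows only the values that changed.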

What it does

| Concern | How hwbench handles it |
| --- | --- |
| Cross-backend | Pluggable Backend interface: add a new file under src/hwbench/backends/ and register it |
| Reproducibility | Pinned deps, deterministic seeds, Docker image; every result JSON includes the full environment fingerprint |
| Measurement | Warm-up iterations, monotonic clock, percentile statistics (p50/p95/p99), peak memory via psutil |
| Storage | JSON files in results/, easy to diff and easy to feed into a dashboard |
| Regression tracking | Re-running produces new dated files; a planned hwbench compare A.json B.json command will flag regressions |
| CI | GitHub Actions runs the test suite on every push |
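
The measurement approach in the table (warm-up, monotonic clock, percentiles) can be sketched as follows. This is an illustration under assumptions, not the code in runner.py or metrics.py: the names measure and percentile, the nearest-rank percentile method, and the throughput formula (batch size divided by mean latency) are all choices made for this sketch. Peak-memory sampling via psutil is left out to keep it standard-library only.

```python
import math
import time
from typing import Callable


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure(fn: Callable[[], object], warmup: int = 5, iters: int = 50,
            batch: int = 1) -> dict:
    """Time fn() iters times after warmup untimed calls."""
    # Warm-up calls absorb one-time costs (caches, lazy init) and are discarded.
    for _ in range(warmup):
        fn()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "mean_ms": mean_ms,
        # Assumed definition: samples per second derived from mean latency.
        "throughput_qps": float("inf") if mean_ms == 0 else batch * 1000.0 / mean_ms,
    }


if __name__ == "__main__":
    stats = measure(lambda: sum(range(100_000)), warmup=3, iters=20)
    print(stats)
```

Under the assumed throughput definition, a mean latency of 19.1 ms at batch size 1 gives 1000 / 19.1 ≈ 52.4 qps, which is in line with the example output in Quick start.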

Supported backends

| Backend | Status | Notes |
| --- | --- | --- |
| cpu | ✅ Working | PyTorch CPU, with optional torch.compile |
| cuda | 🟡 Implemented, untested locally | Standard PyTorch CUDA path. Run on a machine with an NVIDIA GPU. |
| ttnn | 🔲 Stub | Will be implemented once hardware access is sorted. See src/hwbench/backends/ttnn_stub.py for the interface this backend will implement. |

Adding a new backend takes a single file of roughly 100 lines. See DESIGN.md.
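
To give a feel for what a backend file might contain, here is a minimal sketch of an abstract base class plus a registration decorator. Everything in it is hypothetical: the method names (is_available, prepare, run_once), the BACKENDS dictionary, and the register_backend decorator are assumptions for illustration, and the real interface lives in src/hwbench/backends/base.py and DESIGN.md. The toy backend uses a plain Python callable in place of a PyTorch model so the example runs without extra dependencies.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable

# Hypothetical registry mapping a backend name to its class.
BACKENDS: dict[str, type] = {}


def register_backend(cls: type) -> type:
    """Class decorator that records a backend under its name attribute."""
    BACKENDS[cls.name] = cls
    return cls


class Backend(ABC):
    """Assumed shape of a backend: check, prepare once, then run repeatedly."""

    name: str = ""

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this backend can run on the current machine."""

    @abstractmethod
    def prepare(self, model: Callable[[Any], Any], example_input: Any) -> None:
        """Move the model and input to the device; do any one-time setup."""

    @abstractmethod
    def run_once(self) -> Any:
        """Run a single inference and block until it has finished."""


@register_backend
class EchoBackend(Backend):
    """Toy backend that calls the model directly on the host."""

    name = "echo"

    def is_available(self) -> bool:
        return True

    def prepare(self, model: Callable[[Any], Any], example_input: Any) -> None:
        self._model = model
        self._input = example_input

    def run_once(self) -> Any:
        return self._model(self._input)


if __name__ == "__main__":
    backend = BACKENDS["echo"]()
    backend.prepare(lambda x: x * 2, 21)
    print(backend.run_once())
```

Keeping run_once blocking matters for asynchronous devices: if a call returns before the device has finished, the timing loop would measure launch overhead rather than inference latency.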

Supported models

| Model | Source | Default input |
| --- | --- | --- |
| resnet18 | torchvision | 1×3×224×224 image |
| resnet50 | torchvision | 1×3×224×224 image |
| bert-base | transformers | tokenized "Hello world" |
| gpt2 | transformers | tokenized "The quick brown fox" |
| whisper-tiny | transformers | 30s audio chunk (synthetic) |

Add models by appending to the registry in src/hwbench/models.py.
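
As a rough illustration of what a registry entry could look like, the sketch below pairs a model loader with a function that builds its default input. The ModelSpec dataclass, MODEL_REGISTRY, and register_model are hypothetical names chosen for this example; the actual structure is defined in src/hwbench/models.py. A toy callable stands in for a torchvision or transformers model so the example has no third-party dependencies.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical registry entry: how to load a model and build its input."""

    name: str
    source: str
    load: Callable[[], Callable[[Any], Any]]
    make_input: Callable[[int], Any]  # takes the batch size


MODEL_REGISTRY: dict[str, ModelSpec] = {}


def register_model(spec: ModelSpec) -> None:
    """Add a spec to the registry, rejecting duplicate names."""
    if spec.name in MODEL_REGISTRY:
        raise ValueError(f"model already registered: {spec.name}")
    MODEL_REGISTRY[spec.name] = spec


# Toy entry: a "model" that doubles each element of a list-shaped batch.
register_model(
    ModelSpec(
        name="toy-double",
        source="example",
        load=lambda: (lambda xs: [2 * x for x in xs]),
        make_input=lambda batch: [1.0] * batch,
    )
)

if __name__ == "__main__":
    spec = MODEL_REGISTRY["toy-double"]
    model = spec.load()
    print(model(spec.make_input(4)))
```

Storing a loader function rather than a loaded model keeps listing the registry cheap, since weights are only materialized when a benchmark actually requests that model.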

Project layout

hwbench/
├── README.md
├── DESIGN.md              ← the architecture decisions, read this first
├── pyproject.toml
├── Dockerfile
├── Makefile
├── src/hwbench/
│   ├── cli.py             ← Click-based CLI entrypoint
│   ├── runner.py          ← benchmark orchestration: warmup + measurement loop
│   ├── models.py          ← model registry
│   ├── metrics.py         ← BenchmarkResult, percentile math
│   ├── storage.py         ← JSON persistence
│   └── backends/
│       ├── base.py        ← Backend abstract class
│       ├── cpu.py
│       ├── cuda.py
│       └── ttnn_stub.py
├── tests/
└── .github/workflows/ci.yml

Why this exists

ML inference benchmarking is a task where nearly every team writes its own ad hoc script, and results then differ slightly from team to team because nobody ships a reproducible harness. I'm building hwbench to be the simple, opinionated default: a well-tested core, easy to plug new backends into, with results that survive sharing.

Also: I'm applying for the Machine Learning Applications & Benchmarking internship at Tenstorrent, and this project is meant to demonstrate exactly the workflow that role describes. The TTNN backend is intentionally stubbed and waiting for someone with hardware to plug in — that someone is hopefully future-me on the team.

License

Apache-2.0.

Author

Written by @OrangeTangy. Feedback, issues, and PRs welcome — especially from anyone who has done ML benchmarking professionally.

Companion project: tensix-field-guide — a visual intro to Tenstorrent's processor architecture.

About

Cross-backend ML inference benchmarking tool — CPU/CUDA/TTNN. Designed for the Tenstorrent ML Benchmarking intern role.
