A reproducible, cross-backend benchmarking tool for ML inference. Run the same model across multiple hardware backends, get clean latency / throughput / memory numbers, store the results, regress against them later.
Designed to be the kind of harness an ML benchmarking team actually uses day-to-day, and designed so that a Tenstorrent backend can drop in as a peer of the existing backends once hardware is available.
⚠️ v0.1.0 — early. CPU backend is functional. CUDA backend is implemented but unverified locally. TTNN backend is a stub awaiting hardware access. Issues + PRs welcome.
git clone https://github.com/OrangeTangy/hwbench.git
cd hwbench
pip install -e ".[models]"
# Benchmark ResNet18 on CPU, 50 iterations, batch size 1
hwbench bench resnet18 --backend cpu --iters 50 --batch 1
# What models are registered?
hwbench models
# What backends are available on this machine?
hwbench backends
# Inspect a saved result
hwbench report results/resnet18_cpu_*.json
Example output (truncated):
Benchmark: resnet18 / cpu / batch=1 / iters=50
┌────────────────┬───────────┐
│ Metric │ Value │
├────────────────┼───────────┤
│ p50 latency │ 18.4 ms │
│ p95 latency │ 21.2 ms │
│ p99 latency │ 24.7 ms │
│ mean latency │ 19.1 ms │
│ throughput │ 52.3 qps │
│ peak memory │ 412.6 MB │
└────────────────┴───────────┘
Saved → results/resnet18_cpu_20260527-184530.json
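The saved file name above looks like `<model>_<backend>_<timestamp>.json`. As a rough sketch of how such a file could be written together with an environment fingerprint, the snippet below uses only the standard library. The function names (`environment_fingerprint`, `save_result`) and the JSON layout are assumptions made for illustration; the real schema is whatever src/hwbench/storage.py writes.

```python
import json
import platform
import sys
import time
from pathlib import Path


def environment_fingerprint():
    """Collect basic facts about the machine and interpreter (stdlib only)."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }


def save_result(metrics, model, backend, results_dir="results"):
    """Write metrics plus the fingerprint to <model>_<backend>_<timestamp>.json."""
    stamp = time.strftime("%Y%m%d-%H%M%S")  # e.g. 20260527-184530
    path = Path(results_dir) / f"{model}_{backend}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "model": model,
        "backend": backend,
        "metrics": metrics,
        "environment": environment_fingerprint(),  # hypothetical field name
    }
    # Sorted keys and indentation keep the files easy to diff.
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

A fuller fingerprint would presumably also record library versions (for example the PyTorch version) and the device name, which this sketch leaves out.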
| Concern | How hwbench handles it |
|---|---|
| Cross-backend | Pluggable Backend interface — add a new file under src/hwbench/backends/ and register it |
| Reproducibility | Pinned deps, deterministic seeds, Docker image, every result JSON includes the full environment fingerprint |
| Measurement | Warm-up iterations, monotonic clock, percentile statistics (p50/p95/p99), peak memory via psutil |
| Storage | JSON files in results/ — easy to diff, easy to feed into a dashboard |
| Regression tracking | Re-running produces new dated files; a planned hwbench compare A.json B.json command will flag regressions |
| CI | GitHub Actions runs the test suite on every push |
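To make the Measurement row concrete, here is a minimal sketch of a warm-up plus timed loop with percentile statistics. It is not the code in runner.py or metrics.py: the names `measure` and `percentile`, the linear-interpolation percentile, and the throughput formula (batch size divided by mean latency) are all assumptions. Peak memory via psutil is left out so the sketch stays standard-library only.

```python
import statistics
import time


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def measure(fn, warmup=5, iters=50, batch=1):
    """Time fn() after warm-up and summarize latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    mean_ms = statistics.fmean(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "mean_ms": mean_ms,
        # Assumed definition: items processed per second at the mean latency.
        "throughput_qps": batch * 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

Calling `measure(lambda: model(x), iters=50)` returns a dict of summary statistics. On an asynchronous device such as a GPU, `fn` would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before the timer stops, otherwise only the kernel launch is timed.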
| Backend | Status | Notes |
|---|---|---|
| cpu | ✅ Working | PyTorch CPU, with optional torch.compile |
| cuda | 🟡 Implemented, untested locally | Standard PyTorch CUDA path. Run on a machine with an NVIDIA GPU. |
| ttnn | 🔲 Stub | Will be implemented once hardware access is sorted. See backends/ttnn_stub.py for the interface this backend will implement. |
Adding a new backend = one ~100 LOC file. See DESIGN.md.
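As a rough idea of what "one file plus a registration" could look like, the sketch below defines an abstract base class, a registry, and a toy backend. Everything here is hypothetical: the method names (`is_available`, `prepare`), the `BACKENDS` dict, and the `register` decorator are illustrative and may not match backends/base.py. DESIGN.md remains the authoritative description.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical interface; the real one lives in backends/base.py."""

    name = "base"

    @abstractmethod
    def is_available(self):
        """Return True if this backend can run on the current machine."""

    @abstractmethod
    def prepare(self, model, example_input):
        """Place model and input on the device; return a zero-argument callable."""


BACKENDS = {}  # hypothetical registry: backend name -> backend class


def register(cls):
    """Class decorator that adds a backend to the registry under its name."""
    BACKENDS[cls.name] = cls
    return cls


@register
class EchoBackend(Backend):
    """Toy backend that calls the model directly, standing in for a real device."""

    name = "echo"

    def is_available(self):
        return True

    def prepare(self, model, example_input):
        # A real backend would move the model and input to its device here.
        return lambda: model(example_input)
```

With this shape, a runner can look up `BACKENDS["echo"]()`, check `is_available()`, and time the callable returned by `prepare(...)` without knowing anything about the device behind it.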
| Model | Source | Default input |
|---|---|---|
| resnet18 | torchvision | 1×3×224×224 image |
| resnet50 | torchvision | 1×3×224×224 image |
| bert-base | transformers | tokenized "Hello world" |
| gpt2 | transformers | tokenized "The quick brown fox" |
| whisper-tiny | transformers | 30s audio chunk (synthetic) |
Add models by appending to the registry in src/hwbench/models.py.
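A registry entry presumably pairs a model constructor with a default-input factory. The sketch below shows one way that could be structured; `ModelSpec`, `MODEL_REGISTRY`, and `register_model` are made-up names for illustration and are not taken from src/hwbench/models.py. A toy model is used so the snippet runs without PyTorch installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical registry entry: how to build a model and its default input."""

    source: str
    build: Callable[[], Any]
    make_input: Callable[[int], Any]  # takes a batch size


MODEL_REGISTRY: Dict[str, ModelSpec] = {}


def register_model(name, spec):
    """Add a model under a unique name; refuse silent overwrites."""
    if name in MODEL_REGISTRY:
        raise ValueError(f"model {name!r} is already registered")
    MODEL_REGISTRY[name] = spec


# Toy entry so the sketch runs anywhere. A torchvision entry might instead use
# build=lambda: torchvision.models.resnet18(weights=None) and
# make_input=lambda batch: torch.randn(batch, 3, 224, 224).
register_model(
    "identity",
    ModelSpec(
        source="builtin",
        build=lambda: (lambda x: x),
        make_input=lambda batch: [[0.0] * 4 for _ in range(batch)],
    ),
)
```

Keeping the input factory next to the constructor means the CLI's batch option can be passed straight through to `make_input`.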
hwbench/
├── README.md
├── DESIGN.md             ← the architecture decisions, read this first
├── pyproject.toml
├── Dockerfile
├── Makefile
├── src/hwbench/
│   ├── cli.py            ← Click-based CLI entrypoint
│   ├── runner.py         ← benchmark orchestration: warmup + measurement loop
│   ├── models.py         ← model registry
│   ├── metrics.py        ← BenchmarkResult, percentile math
│   ├── storage.py        ← JSON persistence
│   └── backends/
│       ├── base.py       ← Backend abstract class
│       ├── cpu.py
│       ├── cuda.py
│       └── ttnn_stub.py
├── tests/
└── .github/workflows/ci.yml
ML inference benchmarking is one of those tasks where every team rolls their own hacky script, then everyone's results are slightly different because nobody shipped a reproducible harness. I'm building hwbench to be the simple, opinionated default — well-tested core, easy to plug new backends into, results that survive sharing.
Also: I'm applying for the Machine Learning Applications & Benchmarking internship at Tenstorrent, and this project is meant to demonstrate exactly the workflow that role describes. The TTNN backend is intentionally stubbed and waiting for someone with hardware to plug in — that someone is hopefully future-me on the team.
Apache-2.0.
Written by @OrangeTangy. Feedback, issues, and PRs welcome — especially from anyone who has done ML benchmarking professionally.
Companion project: tensix-field-guide — a visual intro to Tenstorrent's processor architecture.