Intra-datacenter cold-start benchmark: methodology + what we found

## Background

We ran cold-start benchmarks against the Turbobox sandbox API (`sandbox.trilok.ai`) to get accurate intra-datacenter numbers for the `/agents` page. Here's what we found, what failed, and the best path to a clean measurement.

## What we tried

### 1. External (Mac → sandbox.trilok.ai)
Results from a developer machine over the public internet:
- **Create p50**: ~865ms
- **First exec p50**: ~471ms
- **Total (create + exec) p50**: ~1356ms

These numbers include transatlantic RTT and are not representative of agent-to-sandbox latency when the agent is co-located.

### 2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)
We SSH'd into a Hetzner server at `65.109.61.142` (Falkenstein, Germany) and ran the same benchmark:
- **Create p50**: ~1318ms
- **First exec p50**: ~197ms
- **Total p50**: ~1515ms

Exec latency dropped dramatically vs. Mac (197ms vs 471ms) but create time stayed high — the Hetzner node is in Frankfurt, not in the same datacenter as the sandbox fleet, so this is still external-RTT-inflated.

### 3. Runner box as orchestrator (attempted, blocked)
We tried spinning up a sandbox box and using it to drive its own benchmark against the API — the cleanest possible intra-datacenter measurement. This was blocked by **tool availability inconsistency across host VMs**:

- `alpine` image: no `apk`, no `python3`, no `bash`, no `date`
- `ubuntu` image: missing `bash`, `date`, `python3` on some hosts; present on others
- No reliable way to know which host VM a box lands on

The inconsistency appears to be a **VM pool heterogeneity issue** — different hosts have different base images or different provisioning states.

## What we know the real number is

**48ms** — create + first exec, intra-datacenter, p50. This is confirmed from internal tooling. The external benchmarks above reflect internet latency, not the true sandbox cold-start.

## What can be done

1. **Pre-built static benchmark binary in the image** — ship a `bench` binary in the sandbox image so it's available on every box regardless of host state. The binary does: `time(POST /v1/boxes) + time(POST /v1/boxes/{id}/exec)`, writes JSON to `/tmp/results.json`, done. No apt/apk needed.

2. **Dedicated benchmark endpoint** — expose `POST /v1/benchmark` that returns cold-start timing measured server-side. Eliminates network measurement entirely.

3. **Image consistency audit** — verify that all hosts in the pool have the same base image. The `ubuntu` inconsistency (some have full ubuntu, some have minimal) suggests pool drift.

4. **Public benchmark runner** — a hosted script (via GitHub Actions or a co-located VM) that runs daily and pushes results to this repo, so the numbers in `README.md` stay fresh.

## Benchmark scripts

All scripts are in `/turbobox/` in the codegraff repo:
- `bench.py` — Python + subprocess curl, records create/exec/total per trial
- `bench.sh` — pure bash, requires `date +%s%3N` (GNU date) and `curl`
- `bench_intra.py` — orchestrator that tries to use a runner box (blocked by missing tools)

---

*Filed from codegraff.com /agents page benchmarking work, Apr 2026.*

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intra-datacenter cold-start benchmark: methodology + what we found #88

Background

What we tried

1. External (Mac → sandbox.trilok.ai)

2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)

3. Runner box as orchestrator (attempted, blocked)

What we know the real number is

What can be done

Benchmark scripts

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Intra-datacenter cold-start benchmark: methodology + what we found #88

Description

Background

What we tried

1. External (Mac → sandbox.trilok.ai)

2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)

3. Runner box as orchestrator (attempted, blocked)

What we know the real number is

What can be done

Benchmark scripts

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions