Skip to content

Intra-datacenter cold-start benchmark: methodology + what we found #88

@justrach

Description

@justrach

Background

We ran cold-start benchmarks against the Turbobox sandbox API (sandbox.trilok.ai) to get accurate intra-datacenter numbers for the /agents page. Here's what we found, what failed, and the best path to a clean measurement.

What we tried

1. External (Mac → sandbox.trilok.ai)

Results from a developer machine over the public internet:

  • Create p50: ~865ms
  • First exec p50: ~471ms
  • Total (create + exec) p50: ~1356ms

These numbers include transatlantic RTT and are not representative of agent-to-sandbox latency when the agent is co-located.

2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)

We SSH'd into a Hetzner server at 65.109.61.142 (Falkenstein, Germany) and ran the same benchmark:

  • Create p50: ~1318ms
  • First exec p50: ~197ms
  • Total p50: ~1515ms

Exec latency dropped dramatically vs. Mac (197ms vs 471ms) but create time stayed high — the Hetzner node is in Frankfurt, not in the same datacenter as the sandbox fleet, so this is still external-RTT-inflated.

3. Runner box as orchestrator (attempted, blocked)

We tried spinning up a sandbox box and using it to drive its own benchmark against the API — the cleanest possible intra-datacenter measurement. This was blocked by tool availability inconsistency across host VMs:

  • alpine image: no apk, no python3, no bash, no date
  • ubuntu image: missing bash, date, python3 on some hosts; present on others
  • No reliable way to know which host VM a box lands on

The inconsistency appears to be a VM pool heterogeneity issue — different hosts have different base images or different provisioning states.

What we know the real number is

48ms — create + first exec, intra-datacenter, p50. This is confirmed from internal tooling. The external benchmarks above reflect internet latency, not the true sandbox cold-start.

What can be done

  1. Pre-built static benchmark binary in the image — ship a bench binary in the sandbox image so it's available on every box regardless of host state. The binary does: time(POST /v1/boxes) + time(POST /v1/boxes/{id}/exec), writes JSON to /tmp/results.json, done. No apt/apk needed.

  2. Dedicated benchmark endpoint — expose POST /v1/benchmark that returns cold-start timing measured server-side. Eliminates network measurement entirely.

  3. Image consistency audit — verify that all hosts in the pool have the same base image. The ubuntu inconsistency (some have full ubuntu, some have minimal) suggests pool drift.

  4. Public benchmark runner — a hosted script (via GitHub Actions or a co-located VM) that runs daily and pushes results to this repo, so the numbers in README.md stay fresh.

Benchmark scripts

All scripts are in /turbobox/ in the codegraff repo:

  • bench.py — Python + subprocess curl, records create/exec/total per trial
  • bench.sh — pure bash, requires date +%s%3N (GNU date) and curl
  • bench_intra.py — orchestrator that tries to use a runner box (blocked by missing tools)

Filed from codegraff.com /agents page benchmarking work, Apr 2026.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions