Deterministic, reproducible benchmark for autonomous cloud-workstation agents. Tests whether an AI system can independently complete multi-step office automation tasks — reading fixture data, creating artifacts, and communicating results — without human intervention.
| Dimension | Weight | Question Answered |
|---|---|---|
| Functional Correctness | 35% | Did it do the task right? (find invoices, reconcile payments, identify mismatches) |
| Artifact Quality | 15% | Are outputs well-formed and accurate? (reports, messages, summaries) |
| Autonomy | 25% | Did it complete without human help? (no clarifications, no timeouts, no crashes) |
| Setup Complexity | 10% | How much configuration was needed? (zero-setup vs manual provisioning) |
| Speed | 5% | How fast did it complete? (wall-clock time) |
| Observability | 10% | Can you audit what it did? (logs, tool calls, artifact trails) |
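The weights in the table sum to 100%. As a rough sketch of how they could combine into one number, the snippet below assumes each dimension is scored on a 0 to 1 scale and that the composite is a plain weighted sum. The README does not state the actual aggregation rule, so `compositeScore` and the dimension keys are illustrative names, not the benchmark's own code.

```javascript
// Weights copied from the table above, expressed as fractions of 1.
const WEIGHTS = {
  functionalCorrectness: 0.35,
  artifactQuality: 0.15,
  autonomy: 0.25,
  setupComplexity: 0.10,
  speed: 0.05,
  observability: 0.10,
};

// Assumption: every dimension score is in [0, 1] and the composite is linear.
function compositeScore(dimensionScores) {
  let total = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const value = dimensionScores[name];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`Expected ${name} in [0, 1], got ${value}`);
    }
    total += weight * value;
  }
  // Return a 0..100 score rounded to one decimal place.
  return Math.round(total * 1000) / 10;
}

const exampleComposite = compositeScore({
  functionalCorrectness: 0.9,
  artifactQuality: 0.8,
  autonomy: 1.0,
  setupComplexity: 1.0,
  speed: 0.5,
  observability: 1.0,
});
console.log(exampleComposite); // 91
```

Because Speed carries only 5%, halving it in this example costs 2.5 points, while the same drop in Functional Correctness would cost 17.5.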
# 1. Configure environment
cp benchmark.env.example benchmark.local.env
# Edit benchmark.local.env with your OpenRouter API key
# 2. Build agent images (first time only)
node prepare-images.mjs --openclaw-version latest
# 3. Run benchmark with default model (Claude Sonnet 4.6)
node run-suite.mjs --task cwab-001,cwab-001b
# 4. Run across multiple models
bash run-all-models.sh
# 5. Generate report
node generate-report.mjs --run-dir results/<run-folder>

Setup: 6 vendor invoices + 8 payment records with intentional mismatches (underpayments, overpayments, name mismatches, orphaned payments).
Prompt: "Find vendor invoices, reconcile against payments CSV, create a reconciliation report highlighting mismatches, save it, and send a summary to finance-ops."
Scoring checks:
- All 6 invoices referenced with exact amounts
- All 8 payments referenced with exact amounts
- 5 mismatches identified (underpaid, overpaid, unpaid, orphaned)
- Report artifact created via API
- Finance-ops message sent via API
- No fabricated vendor data
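To make the mismatch categories concrete, here is a minimal sketch of the kind of reconciliation an agent has to perform. The invoice and payment rows are hypothetical (the real fixture data is not reproduced in this README), amounts are whole dollars for simplicity, and `reconcile` is an illustrative helper rather than part of the benchmark.

```javascript
// Hypothetical rows; the real fixture has 6 invoices and 8 payments.
const invoices = [
  { id: "INV-1", vendor: "Acme Supplies", amount: 1200 },
  { id: "INV-2", vendor: "Bravo Design", amount: 850 },
  { id: "INV-3", vendor: "Cedar Logistics", amount: 400 },
];
const payments = [
  { invoiceId: "INV-1", amount: 1200 },
  { invoiceId: "INV-2", amount: 800 },
  { invoiceId: "INV-9", amount: 300 }, // references no known invoice
];

function reconcile(invoiceRows, paymentRows) {
  // Sum payments per invoice id so split payments are handled.
  const paidByInvoice = new Map();
  for (const p of paymentRows) {
    paidByInvoice.set(p.invoiceId, (paidByInvoice.get(p.invoiceId) ?? 0) + p.amount);
  }

  const findings = [];
  for (const inv of invoiceRows) {
    if (!paidByInvoice.has(inv.id)) {
      findings.push({ id: inv.id, status: "unpaid", delta: -inv.amount });
      continue;
    }
    const delta = paidByInvoice.get(inv.id) - inv.amount;
    if (delta < 0) findings.push({ id: inv.id, status: "underpaid", delta });
    else if (delta > 0) findings.push({ id: inv.id, status: "overpaid", delta });
  }

  // Payments that point at an invoice id we have never seen are orphaned.
  const knownIds = new Set(invoiceRows.map((inv) => inv.id));
  for (const [invoiceId, amount] of paidByInvoice) {
    if (!knownIds.has(invoiceId)) {
      findings.push({ id: invoiceId, status: "orphaned", delta: amount });
    }
  }
  return findings;
}

const findings = reconcile(invoices, payments);
console.log(findings.map((f) => `${f.id}: ${f.status} (${f.delta})`));
```

With these sample rows the result lists INV-2 as underpaid by 50, INV-3 as unpaid, and INV-9 as orphaned. Name mismatches would need an additional fuzzy vendor comparison that this sketch leaves out.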
Setup: Prior report from yesterday + 2 new invoices with 1 mismatch.
Prompt: "Read the prior report, reconcile new invoices, update the report preserving all previous conclusions, and send a delta summary."
Scoring checks:
- Prior report is referenced
- Old conclusions preserved (3 items)
- New items correctly added (2 items)
- Updated report saved
- Correct total outstanding calculated
- No contradiction of prior data
Awarded by deterministic fixture validators that parse system output against ground truth. No self-reporting. Each point corresponds to specific evidence found in the output (e.g., "Bravo Design underpaid by $50" = +5 pts).
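A minimal sketch of what an evidence-based check might look like follows. It assumes a rule only fires when the vendor, the mismatch keyword, and the exact amount appear in the same sentence. The rule shape and `scoreEvidence` are hypothetical; the real validators live in the fixture server and may match differently.

```javascript
// Hypothetical rule list; only the Bravo Design example comes from the text above.
const EVIDENCE_RULES = [
  { label: "Bravo Design underpaid by $50", vendor: "Bravo Design", keyword: "underpaid", amount: 50, points: 5 },
];

function scoreEvidence(output, rules) {
  // Split on newlines or on whitespace that follows sentence punctuation.
  const sentences = output.split(/\n|(?<=[.!?])\s+/);
  const awarded = [];
  for (const rule of rules) {
    // Match "$50" or "$50.00" but not "$500" or "$50.25".
    const amountPattern = new RegExp(`\\$\\s?${rule.amount}(?:\\.00)?(?!\\d|\\.\\d)`);
    const found = sentences.some((sentence) => {
      const lower = sentence.toLowerCase();
      return (
        lower.includes(rule.vendor.toLowerCase()) &&
        lower.includes(rule.keyword) &&
        amountPattern.test(sentence)
      );
    });
    if (found) awarded.push(rule);
  }
  return {
    total: awarded.reduce((sum, rule) => sum + rule.points, 0),
    awarded: awarded.map((rule) => rule.label),
  };
}

const evidenceResult = scoreEvidence(
  "Acme Supplies was paid in full. Bravo Design was underpaid by $50.",
  EVIDENCE_RULES
);
console.log(evidenceResult.total); // 5
```

Requiring all three pieces of evidence in one sentence keeps the check deterministic and makes it harder to earn points by mentioning a vendor and an amount in unrelated places.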
Probability that at least 1 of 3 attempts succeeds autonomously (score ≥ 70, zero interventions):
Pass@3 = 1 - (1 - p)^3
A system with a 33% per-attempt success rate still has roughly a 70% chance of succeeding at least once in three attempts (1 - 0.67^3 ≈ 0.70).
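The formula is easy to check numerically. The helper below is a small sketch that assumes attempts are independent and share the same success probability `p`; `passAtK` is an illustrative name.

```javascript
// Probability that at least one of k independent attempts succeeds.
function passAtK(p, k) {
  if (typeof p !== "number" || p < 0 || p > 1) {
    throw new RangeError("p must be a probability in [0, 1]");
  }
  if (!Number.isInteger(k) || k < 1) {
    throw new RangeError("k must be a positive integer");
  }
  return 1 - Math.pow(1 - p, k);
}

console.log(passAtK(0.33, 3).toFixed(3)); // "0.699"
```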
Percentage of attempts completing without human help AND scoring ≥ 70. Distinguishes "works sometimes" from "works reliably."
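As a sketch, the rate can be computed from a list of attempt records. The field names `score` and `interventions` are assumptions made for illustration; the actual shape of the benchmark's attempt records is not documented here.

```javascript
const PASS_THRESHOLD = 70; // score cutoff stated above

// Fraction of attempts that needed no human help AND scored at least the threshold.
function autonomousSuccessRate(attempts) {
  if (attempts.length === 0) return 0;
  const successes = attempts.filter(
    (a) => a.interventions === 0 && a.score >= PASS_THRESHOLD
  ).length;
  return successes / attempts.length;
}

// Hypothetical attempt records.
const sampleAttempts = [
  { score: 92, interventions: 0 }, // counts
  { score: 65, interventions: 0 }, // below threshold
  { score: 88, interventions: 1 }, // needed human help
  { score: 71, interventions: 0 }, // counts
];
console.log(autonomousSuccessRate(sampleAttempts)); // 0.5
```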
Wall-clock minutes from prompt to final output. Includes API calls, tool execution, LLM generation.
Total LLM tokens consumed (prompt + completion). Lower = cheaper + faster.
Estimated USD: (prompt_tokens × input_price + completion_tokens × output_price) / 1,000,000, where prices are in USD per million tokens. Uses OpenRouter pricing.
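A worked example of the cost formula, as a sketch. The prices used here are placeholders chosen for easy arithmetic, not actual OpenRouter rates, and `estimateCostUsd` is an illustrative name.

```javascript
// Prices are expressed in USD per one million tokens.
function estimateCostUsd({ promptTokens, completionTokens, inputPricePerMTok, outputPricePerMTok }) {
  return (promptTokens * inputPricePerMTok + completionTokens * outputPricePerMTok) / 1_000_000;
}

// Placeholder prices: 3 USD/M input tokens, 15 USD/M output tokens.
const sampleCost = estimateCostUsd({
  promptTokens: 200_000,
  completionTokens: 20_000,
  inputPricePerMTok: 3,
  outputPricePerMTok: 15,
});
console.log(sampleCost); // 0.9
```

Here the prompt side contributes 0.60 USD and the completion side 0.30 USD, which shows how a long context can dominate cost even when output is priced higher per token.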
score / 100 capped at 1.0. Measures whether created artifacts (reports, messages) are actually correct, not just present.
1.0 if reports + messages exist, 0.25 if nothing logged. Measures debuggability.
| System | Adapter | How It Works |
|---|---|---|
| Construct | HTTP API | Sends prompt to Construct agent API, polls for completion, extracts final message |
| OpenClaw | CLI | Runs openclaw agent --local --json in Docker container, captures CLI output |
| Hermes Agent | CLI | Runs hermes chat -Q in Docker container, captures CLI output |
All three systems use the same LLM (configured via CWAB_MODEL_ID) to ensure a fair comparison.
The benchmark supports any OpenRouter model. Results from our test runs:
| Model | Provider | Price | Construct Score | Pass@3 |
|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | Paid | 92 | 100% |
| Gemma 4 31B IT | Google | Free | 91 | 100% |
| GLM-4.5 Air | Z-AI | Free | 91 | 100% |
| Nemotron 3 Super 120B | NVIDIA | Free | 88 | 100% |
| GPT-OSS 120B | OpenAI | Free | 78 | 100% |
See results/CWAB_MULTI_MODEL_COMPARISON_REPORT.md for full comparison.
run-suite.mjs # One-command orchestrator
├── run-benchmark.mjs # Core runner (Docker per attempt)
│ ├── Fixture Server (fixtures/cwab-fixture-server.mjs)
│ ├── System Wrapper (wrappers/)
│ └── Validator (wrappers/fixture-validator-wrapper.mjs)
└── generate-report.mjs # DOCX/HTML report generation
- run-suite.mjs starts the fixture server on port 6789
- For each system × task × attempt:
  - Resets fixture with seeded data
  - Spins up Docker container with system wrapper
  - Wrapper submits prompt to agent, captures output
  - Sends output + fixture state to validator
  - Validator scores against ground truth
- Aggregates results into summary.json + findings.md
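The loop described above amounts to enumerating every system × task × attempt combination. The sketch below only builds that plan; it does not start Docker or the fixture server. `buildRunPlan` is a hypothetical helper, and the count of 3 attempts is an assumption taken from the Pass@3 metric.

```javascript
// Enumerate every (system, task, attempt) combination in a stable order.
function buildRunPlan(systems, tasks, attemptsPerTask) {
  const plan = [];
  for (const system of systems) {
    for (const task of tasks) {
      for (let attempt = 1; attempt <= attemptsPerTask; attempt++) {
        plan.push({ system, task, attempt });
      }
    }
  }
  return plan;
}

// System and task ids as listed for systems.json and tasks.json.
const runPlan = buildRunPlan(
  ["construct", "openclaw", "hermes"],
  ["cwab-001", "cwab-001b"],
  3
);
console.log(runPlan.length); // 18
console.log(runPlan[0]); // { system: 'construct', task: 'cwab-001', attempt: 1 }
```

Building the plan up front makes the run order deterministic, which fits the benchmark's goal of reproducible results.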
- Same model for all systems
- Same seeded fixture data per attempt
- Same validator logic (not self-reported)
- Docker isolation per attempt
- Idempotent runs (each writes to timestamped directory)
.
├── tasks.json # Task definitions (cwab-001, cwab-001b)
├── systems.json # System configs (construct, openclaw, hermes)
├── run-suite.mjs # One-command entry point
├── run-benchmark.mjs # Core benchmark runner
├── run-all-models.sh # Batch runner across 5 models
├── generate-report.mjs # DOCX/HTML report generator
├── prepare-images.mjs # Build Docker images for OpenClaw/Hermes
├── compose.yaml # Docker Compose setup
├── benchmark.env.example # Environment template
├── fixtures/
│ └── cwab-fixture-server.mjs # HTTP fixture server + validators
└── wrappers/
    ├── construct-http-wrapper.mjs # Construct HTTP adapter
    ├── generic-cli-agent-wrapper.mjs # CLI adapters (OpenClaw, Hermes)
    ├── fixture-validator-wrapper.mjs # Deterministic scoring
    └── README.md # Wrapper contract docs
- Node.js 20+
- Docker Desktop / Docker Engine
- OpenRouter API key (for LLM calls)
- Construct worker running locally (for Construct benchmarks)
# Required
OPENROUTER_API_KEY=sk-or-...
CWAB_MODEL_ID=anthropic/claude-sonnet-4.6
# For Construct benchmarks
CONSTRUCT_BENCHMARK_URL=http://host.docker.internal:8787
# Optional overrides
CONSTRUCT_BENCHMARK_TIMEOUT_MS=270000
HERMES_MAX_TURNS=90
OPENCLAW_TIMEOUT_SECONDS=1200

- Create a wrapper in wrappers/ that accepts JSON on stdin
- Add an entry to systems.json with Docker config
- Run: node run-suite.mjs --include your_system
See wrappers/README.md for the full wrapper contract.
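For orientation only, here is a rough sketch of the core of a wrapper that takes JSON input and produces JSON output. The field names (`taskId`, `prompt`, `ok`, `output`, `error`) and the `handleRequest` function are assumptions made up for this example; the authoritative contract is the one in wrappers/README.md.

```javascript
// Hypothetical request/response shape; check wrappers/README.md for the real one.
function handleRequest(rawInput, runAgent) {
  let request;
  try {
    request = JSON.parse(rawInput);
  } catch (err) {
    return JSON.stringify({ ok: false, error: "invalid JSON on stdin" });
  }
  if (typeof request.prompt !== "string" || request.prompt.length === 0) {
    return JSON.stringify({ ok: false, error: "missing prompt" });
  }
  try {
    // runAgent stands in for whatever invokes your system (HTTP call, CLI, ...).
    const output = runAgent(request.prompt);
    return JSON.stringify({ ok: true, taskId: request.taskId ?? null, output });
  } catch (err) {
    return JSON.stringify({ ok: false, error: String(err.message ?? err) });
  }
}

// Stub agent used only to demonstrate the flow.
const echoAgent = (prompt) => `received: ${prompt}`;
const wrapperReply = handleRequest(
  JSON.stringify({ taskId: "cwab-001", prompt: "Find vendor invoices" }),
  echoAgent
);
console.log(wrapperReply);

// In a real wrapper, the handler would be fed from stdin, for example:
// let raw = "";
// process.stdin.on("data", (chunk) => (raw += chunk));
// process.stdin.on("end", () => process.stdout.write(handleRequest(raw, echoAgent)));
```

Keeping the parsing and error handling in a pure function makes the wrapper testable without Docker or a live agent.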
MIT — Part of the Construct Computer monorepo.