feat(benchmark): AuditBench eval harness — B1 runnable baseline by gnanirahulnutakki · Pull Request #53 · ArdurAI/ardur

gnanirahulnutakki · 2026-06-25T05:47:28Z

Closes #40 (Workstream B1 — make the benchmark runnable and reproducible).

What is runnable now

The go/benchmark/live/live.go stub that returned "unknown" for all four
arms has been replaced with a real, deterministic evaluator. The harness:

- Loads and validates .scenario.json + .events.jsonl file pairs
- Runs four independent evaluation arms over each trace
- Walks a pack directory, processes pairs in sorted order, and writes
  results.json + summary.csv to an output directory

Four evaluation arms

| Arm | Checks |
| --- | --- |
| `cedar_strict` | Declared `AllowedActions` + `AllowedTools`; stateless |
| `cedar_state` | Same + cumulative `tool_calls` budget enforcement |
| `visibility` | All events must carry `visibility: "full"` |
| `mcep_reconciliation` | Per-event `expected_label` oracle |
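To make the difference between the arms concrete, the sketch below implements one plausible reading of each check. It is not the evaluator in live.go. The struct shapes, the `MaxToolCalls` field name, the use of an empty `Tool` to mean "not a tool call", and the `"compliant"` / `"violation"` verdict strings are assumptions.

```go
package main

import "fmt"

const (
	compliant = "compliant"
	violation = "violation"
)

// Scenario and Event are simplified stand-ins for the harness types.
type Scenario struct {
	AllowedActions []string
	AllowedTools   []string
	MaxToolCalls   int // hypothetical name for the tool_calls budget
}

type Event struct {
	Action        string
	Tool          string // empty when the event is not a tool call (assumption)
	Visibility    string
	ExpectedLabel string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// cedarStrict checks each event against the declared lists, with no memory
// carried from one event to the next.
func cedarStrict(s Scenario, evs []Event) string {
	for _, e := range evs {
		if !contains(s.AllowedActions, e.Action) {
			return violation
		}
		if e.Tool != "" && !contains(s.AllowedTools, e.Tool) {
			return violation
		}
	}
	return compliant
}

// cedarState runs the strict check, then enforces a cumulative tool-call budget.
func cedarState(s Scenario, evs []Event) string {
	if cedarStrict(s, evs) == violation {
		return violation
	}
	calls := 0
	for _, e := range evs {
		if e.Tool != "" {
			calls++
		}
	}
	if calls > s.MaxToolCalls {
		return violation
	}
	return compliant
}

// visibilityArm flags any event that is not fully visible.
func visibilityArm(evs []Event) string {
	for _, e := range evs {
		if e.Visibility != "full" {
			return violation
		}
	}
	return compliant
}

// mcepReconciliation trusts the per-event oracle label.
func mcepReconciliation(evs []Event) string {
	for _, e := range evs {
		if e.ExpectedLabel == violation {
			return violation
		}
	}
	return compliant
}

func main() {
	s := Scenario{AllowedActions: []string{"read"}, AllowedTools: []string{"search"}, MaxToolCalls: 2}
	evs := []Event{
		{"read", "search", "full", compliant},
		{"read", "search", "full", compliant},
		{"read", "search", "full", violation}, // third call exceeds the budget of 2
	}
	fmt.Println("cedar_strict:", cedarStrict(s, evs))               // compliant
	fmt.Println("cedar_state:", cedarState(s, evs))                 // violation
	fmt.Println("visibility:", visibilityArm(evs))                  // compliant
	fmt.Println("mcep_reconciliation:", mcepReconciliation(evs))    // violation
}
```

The trace in `main` is a made-up budget overrun, not one of the AB fixtures. Every call is individually authorized, so only an arm that counts calls across the trace, or the label oracle, can flag it.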

Four in-repo AuditBench scenarios (`go/benchmark/testdata/`)

| Scenario | Ground truth | Arm differentiation |
| --- | --- | --- |
| AB-01 | compliant | All four arms agree |
| AB-02 | violation | `cedar_strict`, `cedar_state`, and `mcep_reconciliation` catch it; `visibility` does not (the write carried `visibility: "full"`) |
| AB-03 | violation | `cedar_strict` misses (tool is authorized); `visibility` and `mcep_reconciliation` catch it |
| AB-04 | violation | `cedar_strict` misses (no tool violation); `cedar_state` and `mcep_reconciliation` catch the budget overrun |

The arms return different verdicts on the four-scenario slice (cedar_strict 50%
accuracy, cedar_state 75%, mcep_reconciliation 100%), so the harness
discriminates between arms instead of echoing a single verdict.
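The accuracy figures follow directly from the scenario table. A small sketch of the arithmetic, where the `accuracy` helper is hypothetical and the cedar_state verdict on AB-03 is inferred from the 75% figure rather than stated in the table:

```go
package main

import "fmt"

// accuracy is the fraction of scenarios where the arm's verdict equals the
// ground truth. It returns 0 for empty or mismatched inputs.
func accuracy(truth, verdicts []string) float64 {
	if len(truth) == 0 || len(truth) != len(verdicts) {
		return 0
	}
	correct := 0
	for i := range truth {
		if truth[i] == verdicts[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(truth))
}

func main() {
	// Ground truth for AB-01..AB-04, taken from the table.
	truth := []string{"compliant", "violation", "violation", "violation"}

	arms := []struct {
		name     string
		verdicts []string
	}{
		// Misses AB-03 and AB-04.
		{"cedar_strict", []string{"compliant", "violation", "compliant", "compliant"}},
		// AB-03 miss is inferred from the reported 75%, not stated in the table.
		{"cedar_state", []string{"compliant", "violation", "compliant", "violation"}},
		{"mcep_reconciliation", []string{"compliant", "violation", "violation", "violation"}},
	}
	for _, a := range arms {
		fmt.Printf("%s: %.0f%%\n", a.name, 100*accuracy(truth, a.verdicts))
	}
	// Prints:
	// cedar_strict: 50%
	// cedar_state: 75%
	// mcep_reconciliation: 100%
}
```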

Running it

# run the benchmark
make bench

# or directly:
cd go && go run ./cmd/benchcheck -- ./benchmark/testdata

# results land in bench-results/results.json and bench-results/summary.csv

See REPRODUCE.md for the full step-by-step reproduction guide including
content-address verification of input fixtures.
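For readers unfamiliar with content addressing, the idea is to hash each fixture and compare the digest against a recorded value before running. The sketch below assumes SHA-256 and a `sha256:` prefix; REPRODUCE.md is the authority on the actual scheme, and `contentAddress` and `verify` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentAddress returns a digest string for data.
// SHA-256 and the "sha256:" prefix are assumptions of this sketch.
func contentAddress(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// verify reports whether data hashes to the recorded address.
func verify(data []byte, want string) bool {
	return contentAddress(data) == want
}

func main() {
	fixture := []byte(`{"id":"example"}` + "\n") // stand-in bytes, not a real fixture
	addr := contentAddress(fixture)
	fmt.Println(addr)
	fmt.Println("unchanged:", verify(fixture, addr))
	fmt.Println("tampered:", verify(append(fixture, ' '), addr))
}
```

Any single-byte change to a fixture changes its address, which is what lets a reproducer confirm they ran on the same inputs.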

Tests

All 15 Go packages pass. 7 new tests added (go/benchmark/live/live_test.go):

- Four end-to-end tests (one per scenario, asserting each arm's verdict)
- Pack walker test (ordering and skip counting)
- Two error-path tests (missing scenario / missing events)

ok  github.com/ArdurAI/ardur/go/benchmark/live   0.163s

No Python tests were touched; no Python tests regressed.

What remains for Workstream B2 (not in this PR)

The publicly-described headline corpus (independently human-labeled scenarios
from real agentic-AI traces) is absent from this repo by design — it carries
privacy-sensitive data and requires independent labeling to avoid evaluator
contamination. REPRODUCE.md documents this gap explicitly under
"What is NOT yet runnable."

B2 work (separate, gated PR):

- Ingest the independently-labeled corpus as a pack directory
- Recall/precision curves per arm at scale
- Statistical significance analysis (bootstrap CIs)
- cedar_strict arm backed by a compiled Cedar policy (not just declaration lists)

…eline)

Closes the stub gap in go/benchmark/live/live.go: EvaluateAllStrict now loads and validates real Scenario + Event files, then runs all four evaluation arms over the trace rather than returning "unknown" for every arm.

Four arms (all stateless/deterministic, no external calls):

- cedar_strict: declared AllowedActions + AllowedTools check
- cedar_state: same + cumulative tool-call budget enforcement
- visibility: all events must carry full visibility
- mcep_reconciliation: per-event ExpectedLabel oracle

Four in-repo AuditBench scenarios (go/benchmark/testdata/) exercise orthogonal failure modes so that the arms return discriminatively different verdicts (cedar_strict 50%, cedar_state 75%, mcep_reconciliation 100% accuracy on the four-scenario slice).

New deliverables:

- go/benchmark/live/live.go: real evaluator replacing the stub
- go/benchmark/live/live_test.go: 7 tests (4 scenarios × end-to-end, pack walker, 2 error paths)
- go/cmd/benchcheck/main.go: CLI, benchcheck [pack-dir]
- go/benchmark/testdata/: AB-01…AB-04 fixtures
- Makefile: make bench target
- REPRODUCE.md: step-by-step reproduction guide

Tests: all 15 Go packages pass; 7 new tests added.

No fabricated corpus: the harness runs on in-repo fixtures only. See REPRODUCE.md §"What is NOT yet runnable" for Workstream B2 (independently-labeled headline corpus).

gnanirahulnutakki wants to merge 1 commit into dev from feat/A2-eval-harness
