feat(benchmark): AuditBench eval harness — B1 runnable baseline #53

Draft
gnanirahulnutakki wants to merge 1 commit into dev from feat/A2-eval-harness

Closes #40 (Workstream B1 — make the benchmark runnable and reproducible).

What is runnable now

The go/benchmark/live/live.go stub that returned "unknown" for all four
arms has been replaced with a real, deterministic evaluator. The harness:

  • Loads and validates .scenario.json + .events.jsonl file pairs
  • Runs four independent evaluation arms over each trace
  • Walks a pack directory, processes pairs in sorted order, and writes
    results.json + summary.csv to an output directory

Four evaluation arms

| Arm | Checks |
| --- | --- |
| cedar_strict | Declared AllowedActions + AllowedTools (stateless) |
| cedar_state | Same as cedar_strict, plus cumulative tool_calls budget enforcement |
| visibility | All events must carry visibility: "full" |
| mcep_reconciliation | Per-event expected_label oracle |
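To make the difference between the arms concrete, here is a hedged sketch of the four checks. The `Scenario` and `Event` types, the `MaxToolCalls` field, the verdict strings, and the rule that any event with a non-empty `Tool` counts toward the budget are all assumptions made for illustration; the real definitions live in go/benchmark/live/live.go and may differ.

```go
package main

import "fmt"

// Hypothetical stand-ins for the harness's own types.
type Scenario struct {
	AllowedActions []string
	AllowedTools   []string
	MaxToolCalls   int // assumed name for the tool_calls budget
}

type Event struct {
	Action        string
	Tool          string // empty when the event is not a tool call
	Visibility    string
	ExpectedLabel string
}

const (
	Compliant = "compliant"
	Violation = "violation"
)

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// cedarStrict checks each event on its own against the declared lists.
func cedarStrict(s Scenario, evs []Event) string {
	for _, e := range evs {
		if !contains(s.AllowedActions, e.Action) {
			return Violation
		}
		if e.Tool != "" && !contains(s.AllowedTools, e.Tool) {
			return Violation
		}
	}
	return Compliant
}

// cedarState adds a running tool-call count on top of cedarStrict.
func cedarState(s Scenario, evs []Event) string {
	if cedarStrict(s, evs) == Violation {
		return Violation
	}
	calls := 0
	for _, e := range evs {
		if e.Tool == "" {
			continue
		}
		calls++
		if calls > s.MaxToolCalls {
			return Violation
		}
	}
	return Compliant
}

// visibility flags any event that is not marked "full".
func visibility(evs []Event) string {
	for _, e := range evs {
		if e.Visibility != "full" {
			return Violation
		}
	}
	return Compliant
}

// mcepReconciliation trusts the per-event oracle label.
func mcepReconciliation(evs []Event) string {
	for _, e := range evs {
		if e.ExpectedLabel == Violation {
			return Violation
		}
	}
	return Compliant
}

func main() {
	// A budget-overrun trace in the spirit of AB-04: every call is allowed,
	// but the third one exceeds a budget of two.
	s := Scenario{[]string{"tool_call"}, []string{"read_file"}, 2}
	evs := []Event{
		{"tool_call", "read_file", "full", Compliant},
		{"tool_call", "read_file", "full", Compliant},
		{"tool_call", "read_file", "full", Violation},
	}
	fmt.Println("cedar_strict:", cedarStrict(s, evs))
	fmt.Println("cedar_state:", cedarState(s, evs))
	fmt.Println("visibility:", visibility(evs))
	fmt.Println("mcep_reconciliation:", mcepReconciliation(evs))
}
```

On this made-up trace, cedar_strict and visibility report compliant while cedar_state and mcep_reconciliation report a violation, which is the kind of split the arms are meant to expose.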

Four in-repo AuditBench scenarios (go/benchmark/testdata/)

| Scenario | Ground truth | Arm differentiation |
| --- | --- | --- |
| AB-01 | compliant | All four arms agree |
| AB-02 | violation | cedar_strict, cedar_state, and mcep_reconciliation catch it; visibility does not (the write was full-visibility) |
| AB-03 | violation | cedar_strict misses it (the tool is authorized); visibility and mcep_reconciliation catch it |
| AB-04 | violation | cedar_strict misses it (no tool violation); cedar_state and mcep_reconciliation catch the budget overrun |

The arms return different verdicts on the 4-scenario slice (cedar_strict 50%
accuracy, cedar_state 75%, mcep_reconciliation 100%), which shows that the
harness discriminates between arms instead of giving every arm the same answer.
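The accuracy figures follow directly from the scenario table. The sketch below recomputes them, assuming a missed violation is reported as "compliant"; the `accuracy` helper is a hypothetical name, and the cedar_state miss on AB-03 is inferred from the 75% figure rather than stated in the table.

```go
package main

import "fmt"

// accuracy returns the fraction of scenarios where the arm's verdict equals
// the ground-truth label. It returns 0 for empty or mismatched inputs.
func accuracy(truth, verdicts []string) float64 {
	if len(truth) == 0 || len(truth) != len(verdicts) {
		return 0
	}
	hits := 0
	for i := range truth {
		if truth[i] == verdicts[i] {
			hits++
		}
	}
	return float64(hits) / float64(len(truth))
}

func main() {
	// Order: AB-01, AB-02, AB-03, AB-04.
	truth := []string{"compliant", "violation", "violation", "violation"}
	// cedar_strict catches AB-02 only and misses AB-03 and AB-04.
	cedarStrict := []string{"compliant", "violation", "compliant", "compliant"}
	// cedar_state also catches AB-04; the AB-03 miss is inferred, not stated.
	cedarState := []string{"compliant", "violation", "compliant", "violation"}
	// mcep_reconciliation matches the ground truth on every scenario.
	mcep := []string{"compliant", "violation", "violation", "violation"}

	fmt.Printf("cedar_strict: %.0f%%\n", 100*accuracy(truth, cedarStrict))
	fmt.Printf("cedar_state: %.0f%%\n", 100*accuracy(truth, cedarState))
	fmt.Printf("mcep_reconciliation: %.0f%%\n", 100*accuracy(truth, mcep))
}
```

With only four scenarios, each one moves an arm's accuracy by 25 percentage points, so these numbers show that the arms differ and say little about how they would rank on a larger corpus.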

Running it

# run the benchmark
make bench

# or directly:
cd go && go run ./cmd/benchcheck -- ./benchmark/testdata

# results land in bench-results/results.json and bench-results/summary.csv

See REPRODUCE.md for the full step-by-step reproduction guide including
content-address verification of input fixtures.

Tests

All 15 Go packages pass. Seven new tests were added in go/benchmark/live/live_test.go:

  • Four end-to-end tests (one per scenario, asserting each arm's verdict)
  • Pack walker test (ordering and skip counting)
  • Two error-path tests (missing scenario / missing events)

ok  github.com/ArdurAI/ardur/go/benchmark/live   0.163s

No Python tests were touched; no Python tests regressed.

What remains for Workstream B2 (not in this PR)

The publicly-described headline corpus (independently human-labeled scenarios
from real agentic-AI traces) is absent from this repo by design — it carries
privacy-sensitive data and requires independent labeling to avoid evaluator
contamination. REPRODUCE.md documents this gap explicitly under
"What is NOT yet runnable."

B2 work (separate, gated PR):

  • Ingest the independently-labeled corpus as a pack directory
  • Recall/precision curves per arm at scale
  • Statistical significance analysis (bootstrap CIs)
  • cedar_strict arm backed by a compiled Cedar policy (not just declaration lists)

…eline)

Closes the stub gap in go/benchmark/live/live.go: EvaluateAllStrict now
loads and validates real Scenario + Event files, then runs all four
evaluation arms over the trace rather than returning "unknown" for every
arm.

Four arms (all deterministic, no external calls; only cedar_state tracks a
cumulative count within a trace):
  cedar_strict         — declared AllowedActions + AllowedTools check
  cedar_state          — same + cumulative tool-call budget enforcement
  visibility           — all events must carry full visibility
  mcep_reconciliation  — per-event ExpectedLabel oracle

Four in-repo AuditBench scenarios (go/benchmark/testdata/) exercise
orthogonal failure modes so that the arms return discriminatively different
verdicts (cedar_strict 50%, cedar_state 75%, mcep_reconciliation 100%
accuracy on the four-scenario slice).

New deliverables:
  go/benchmark/live/live.go        real evaluator replacing the stub
  go/benchmark/live/live_test.go   7 tests (4 scenarios × end-to-end,
                                   pack walker, 2 error paths)
  go/cmd/benchcheck/main.go        CLI: benchcheck [pack-dir]
  go/benchmark/testdata/           AB-01…AB-04 fixtures
  Makefile                         make bench target
  REPRODUCE.md                     step-by-step reproduction guide

Tests: all 15 Go packages pass; 7 new tests added. No fabricated corpus —
harness runs on in-repo fixtures only. See REPRODUCE.md §"What is NOT yet
runnable" for Workstream B2 (independently-labeled headline corpus).