feat(benchmark): AuditBench eval harness — B1 runnable baseline #53
Draft
gnanirahulnutakki wants to merge 1 commit into
…eline)
Closes the stub gap in go/benchmark/live/live.go: EvaluateAllStrict now
loads and validates real Scenario + Event files, then runs all four
evaluation arms over the trace rather than returning "unknown" for every
arm.
Four arms (all stateless/deterministic, no external calls):
cedar_strict — declared AllowedActions + AllowedTools check
cedar_state — same + cumulative tool-call budget enforcement
visibility — all events must carry full visibility
mcep_reconciliation — per-event ExpectedLabel oracle
Four in-repo AuditBench scenarios (go/benchmark/testdata/) exercise
orthogonal failure modes so that the arms return discriminatively different
verdicts (cedar_strict 50%, cedar_state 75%, mcep_reconciliation 100%
accuracy on the four-scenario slice).
New deliverables:
  go/benchmark/live/live.go        real evaluator replacing the stub
  go/benchmark/live/live_test.go   7 tests (4 end-to-end scenario tests, 1 pack walker test, 2 error-path tests)
  go/cmd/benchcheck/main.go        CLI: benchcheck [pack-dir]
  go/benchmark/testdata/           AB-01…AB-04 fixtures
  Makefile                         make bench target
  REPRODUCE.md                     step-by-step reproduction guide
Tests: all 15 Go packages pass; 7 new tests added. No fabricated corpus —
harness runs on in-repo fixtures only. See REPRODUCE.md §"What is NOT yet
runnable" for Workstream B2 (independently-labeled headline corpus).
Closes #40 (Workstream B1 — make the benchmark runnable and reproducible).
What is runnable now
The go/benchmark/live/live.go stub that returned "unknown" for all four arms has been replaced with a real, deterministic evaluator. The harness:
- loads and validates .scenario.json + .events.jsonl file pairs
- writes results.json + summary.csv to an output directory

Four evaluation arms
cedar_strict: declared AllowedActions + AllowedTools check (stateless)
cedar_state: same as cedar_strict, plus cumulative tool_calls budget enforcement
visibility: every event must carry visibility: "full"
mcep_reconciliation: per-event expected_label oracle

Four in-repo AuditBench scenarios (go/benchmark/testdata/)
The arms return meaningfully different verdicts (cedar_strict 50% accuracy,
cedar_state 75%, mcep_reconciliation 100% on the 4-scenario slice), confirming
the harness is doing discriminative evaluation.
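To make the difference between the arms concrete, here is a minimal sketch of the four checks as pure functions over a trace. It is not the code in go/benchmark/live/live.go: the Scenario and Event shapes, the ToolCallBudget field, the "allow"/"deny" verdict vocabulary, and the rule that one "deny" label denies the whole trace are all assumptions made for illustration.

```go
package main

import "fmt"

// Scenario and Event are hypothetical shapes, not the harness's real types.
type Scenario struct {
	AllowedActions []string
	AllowedTools   []string
	ToolCallBudget int // assumed: maximum cumulative tool calls per trace
}

type Event struct {
	Action        string
	Tool          string // empty when the event is not a tool call
	Visibility    string
	ExpectedLabel string // assumed vocabulary: "allow" or "deny"
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// cedarStrict is stateless: each event is checked against the declared lists.
func cedarStrict(s Scenario, trace []Event) string {
	for _, e := range trace {
		if !contains(s.AllowedActions, e.Action) {
			return "deny"
		}
		if e.Tool != "" && !contains(s.AllowedTools, e.Tool) {
			return "deny"
		}
	}
	return "allow"
}

// cedarState adds a running tool-call counter on top of cedarStrict.
func cedarState(s Scenario, trace []Event) string {
	if cedarStrict(s, trace) == "deny" {
		return "deny"
	}
	calls := 0
	for _, e := range trace {
		if e.Tool != "" {
			calls++
			if calls > s.ToolCallBudget {
				return "deny"
			}
		}
	}
	return "allow"
}

// visibility denies the trace if any event lacks full visibility.
func visibility(trace []Event) string {
	for _, e := range trace {
		if e.Visibility != "full" {
			return "deny"
		}
	}
	return "allow"
}

// mcepReconciliation reads the per-event oracle label; in this sketch a
// single "deny" label denies the whole trace (an assumption).
func mcepReconciliation(trace []Event) string {
	for _, e := range trace {
		if e.ExpectedLabel == "deny" {
			return "deny"
		}
	}
	return "allow"
}

func main() {
	s := Scenario{
		AllowedActions: []string{"tool_call"},
		AllowedTools:   []string{"search"},
		ToolCallBudget: 2,
	}
	// Three permitted calls: every call is individually allowed,
	// but the third one exceeds the budget.
	trace := []Event{
		{Action: "tool_call", Tool: "search", Visibility: "full", ExpectedLabel: "allow"},
		{Action: "tool_call", Tool: "search", Visibility: "full", ExpectedLabel: "allow"},
		{Action: "tool_call", Tool: "search", Visibility: "full", ExpectedLabel: "deny"},
	}
	fmt.Printf("cedar_strict=%s cedar_state=%s visibility=%s mcep_reconciliation=%s\n",
		cedarStrict(s, trace), cedarState(s, trace), visibility(trace), mcepReconciliation(trace))
}
```

On this budget-overrun trace the stateless arm answers "allow" while the stateful arm answers "deny", which is the kind of disagreement that would produce different per-arm accuracy on a labeled slice.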
Running it
See REPRODUCE.md for the full step-by-step reproduction guide, including content-address verification of input fixtures.
Tests
All 15 Go packages pass. 7 new tests added (go/benchmark/live/live_test.go): 4 end-to-end scenario tests, 1 pack walker test, and 2 error-path tests.
No Python tests were touched; no Python tests regressed.
What remains for Workstream B2 (not in this PR)
The publicly-described headline corpus (independently human-labeled scenarios
from real agentic-AI traces) is absent from this repo by design — it carries
privacy-sensitive data and requires independent labeling to avoid evaluator
contamination.
REPRODUCE.md documents this gap explicitly under "What is NOT yet runnable."
B2 work (separate, gated PR):
cedar_strict arm backed by a compiled Cedar policy (not just declaration lists)