Add LLM-as-judge scorer, dispatcher validation, and eval aggregation by argen · Pull Request #38 · argen/hornero

argen · 2026-05-21T08:48:53Z

Summary

Adds three major components to the evaluation and harness infrastructure:

LLM-as-judge scorer (evals/judge.py) — provider-agnostic scoring tool that reads synthesis outputs and scores them against the rubric using Claude, GPT-4, or Gemini. Complements human scoring with fast, repeatable LLM judgments.
Dispatcher validation (scripts/validate-dispatcher.sh) — structural checks ensuring routing fixtures are consistent with the catalog, descriptions are complete, and referenced skills have scenarios frontmatter. Integrated into smoke-check.sh as Step 12.
Enhanced eval aggregation — updated evals/score.sh to handle both human (score_human.json) and LLM (score_llm.json) scores, report deltas, and flag runs where |human − llm| ≥ 0.25 for calibration review.

Key Changes

evals/judge.py (330 lines)
- Abstract Judge base class with provider-specific implementations: AnthropicJudge, OpenAIJudge, GeminiJudge
- Lazy SDK imports — only the chosen provider's SDK is required
- Reads rubric.md + synthesis.md, calls LLM with strict JSON schema, validates response, writes score_llm.json with metadata (provider, model, timestamp, mean)
- CLI with --dry-run, --provider, --model flags; respects EVAL_JUDGE_PROVIDER and EVAL_JUDGE_MODEL env vars
- Strict score validation: only {0, 0.25, 0.5, 0.75, 1.0} allowed; all four dimensions + rationales required
evals/score.sh (refactored)
- Renamed human scoring output to score_human.json (backward-compatible: reads score.json if score_human.json absent)
- Added --judge flag to delegate to evals/judge.py
- Aggregation now reads both human and LLM scores, computes delta, flags |delta| ≥ 0.25 with ← review marker
- Human score takes precedence for reported mean; LLM shown alongside for calibration
scripts/validate-dispatcher.sh (141 lines)
- Check 1: routes.tsv → CATALOG.md dead reference detection
- Check 2: CATALOG.md description completeness (no empty "When to invoke" fields)
- Check 3: scenarios frontmatter presence in all referenced skills
- Integrated into smoke-check.sh Step 12
tests/dispatcher/routes.tsv (53 lines)
- Routing fixtures documenting expected dispatch for representative prompts
- Covers slash commands, artifact intake signals, verb+shape triggers
- Format: prompt<TAB>expected_skill<TAB>expected_role<TAB>notes
hooks/session-start-check.sh (27 lines)
- Lightweight structural guard on SessionStart
- Verifies presence of CATALOG.md, DISPATCHER.md, chronicle/SCHEMA.md
- Warns on stderr; does not block session
Documentation
- evals/README.md — updated with LLM judge workflow, provider table, recommended scoring sequence
- examples/use-cases.md — six PM workflows showing multi-skill sequences (discovery interview cycle, strategy kernel review, launch readiness, finance review, weekly rhythm, async decision capture)
- playbooks/harness-guide.md — determinism invariants, five layers (dispatcher, skill anatomy, chronicle, lifecycle hooks, build-time checks), verification suite
- CONTRIBUTING.md — added dispatcher routing fixtures section, reference to harness-guide.md

Notable Implementation Details

Provider abstraction: Each judge implements score(system, user) → str. SDK import failures are caught and reported with install

https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT

…cher tests Harness guide (playbooks/harness-guide.md): - Documents all five determinism layers (dispatcher, skill anatomy, chronicle, lifecycle hooks, build-time consistency checks) with verification commands - Quick-reference table mapping what-you-changed to commands to run - TDD discipline section covering the invariants that must not be broken - Common failure modes and diagnosis Use-case how-to guide (examples/use-cases.md): - Six multi-skill PM workflows: discovery interview cycle, strategy kernel review, launch readiness sweep, finance/pricing review, weekly PM rhythm, stakeholder prep - Each workflow shows skills in sequence, what persists in the chronicle, and example prompts LLM-as-judge evals (evals/judge.py, evals/score.sh): - judge.py: calls Claude API to score a synthesis on 4 rubric dimensions; writes score_llm.json in the same schema as the human scorer - score.sh: dual-score mode (score_human.json + score_llm.json); aggregate shows both side-by-side with delta flagging (|delta| >= 0.25 → review); --judge flag delegates to judge.py - evals/README.md: documents the recommended workflow (LLM-first, human reviews flagged runs) and calibration target (mean delta < 0.1) Dispatcher routing tests (tests/dispatcher/routes.tsv, scripts/validate-dispatcher.sh): - 28 routing fixtures covering slash commands, artifact intake, verb+shape triggers, and documented multi-match escalation cases - validate-dispatcher.sh: three deterministic checks — dead references in routes.tsv, CATALOG.md description completeness, scenarios frontmatter presence - validate-dispatcher.sh added as step 12 in smoke-check.sh Session-start integrity check (hooks/session-start-check.sh, hooks/hooks.json): - Lightweight guard that verifies CATALOG.md, DISPATCHER.md, and chronicle/SCHEMA.md are present; warns on stderr without blocking the session - Wired into hooks.json SessionStart alongside the existing banner hook README.md updated to reference harness-guide.md, use-cases.md, evals dual-score path, and validate-dispatcher.sh in the CI section. https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT

evals/judge.py refactored with a Judge ABC and three implementations (AnthropicJudge, OpenAIJudge, GeminiJudge). Each SDK is lazy-imported, so users only install the one they use. Defaults: anthropic + claude-sonnet-4-6 (no behavior change for existing users). New CLI flags: --provider {anthropic|openai|gemini} (default: anthropic) --model <name> (default: provider-specific) New env var overrides: EVAL_JUDGE_PROVIDER default provider EVAL_JUDGE_MODEL default model score_llm.json now records judge_provider + judge_model so aggregate reports can detect provider-specific bias. evals/score.sh --judge now passes through extra args, so cross-provider runs work: ./evals/score.sh --judge <dir> --provider openai Docs updated: - evals/README.md: provider table, install hints, cross-provider workflow - playbooks/harness-guide.md: calibration section notes that cross-provider judging is a precision technique when providers disagree https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT

…ispatcher fixtures CONTRIBUTING.md: - Add a top-of-file reference to playbooks/harness-guide.md as the canonical determinism reference for contributors - Add a "Dispatcher routing fixtures" section explaining tests/dispatcher/routes.tsv and when contributors need to update it - Update the pre-PR checklist to include validate-dispatcher.sh - Clarify that validate-dispatcher runs transitively via smoke-check.sh step 12 (not as a separate CI step) README.md: same clarification on CI scope for validate-dispatcher. https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT

claude added 3 commits May 21, 2026 06:48

argen merged commit 68a686c into main May 21, 2026
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add LLM-as-judge scorer, dispatcher validation, and eval aggregation#38

Add LLM-as-judge scorer, dispatcher validation, and eval aggregation#38
argen merged 3 commits into
mainfrom
claude/add-harness-guide-mlmEW

argen commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

argen commented May 21, 2026

Summary

Key Changes

Notable Implementation Details

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants