Skip to content

Add LLM-as-judge scorer, dispatcher validation, and eval aggregation#38

Merged
argen merged 3 commits into
mainfrom
claude/add-harness-guide-mlmEW
May 21, 2026
Merged

Add LLM-as-judge scorer, dispatcher validation, and eval aggregation#38
argen merged 3 commits into
mainfrom
claude/add-harness-guide-mlmEW

Conversation

@argen

@argen argen commented May 21, 2026

Copy link
Copy Markdown
Owner

Summary

Adds three major components to the evaluation and harness infrastructure:

  1. LLM-as-judge scorer (evals/judge.py) — provider-agnostic scoring tool that reads synthesis outputs and scores them against the rubric using Claude, GPT-4, or Gemini. Complements human scoring with fast, repeatable LLM judgments.

  2. Dispatcher validation (scripts/validate-dispatcher.sh) — structural checks ensuring routing fixtures are consistent with the catalog, descriptions are complete, and referenced skills have scenarios frontmatter. Integrated into smoke-check.sh as Step 12.

  3. Enhanced eval aggregation — updated evals/score.sh to handle both human (score_human.json) and LLM (score_llm.json) scores, report deltas, and flag runs where |human − llm| ≥ 0.25 for calibration review.

Key Changes

  • evals/judge.py (330 lines)

    • Abstract Judge base class with provider-specific implementations: AnthropicJudge, OpenAIJudge, GeminiJudge
    • Lazy SDK imports — only the chosen provider's SDK is required
    • Reads rubric.md + synthesis.md, calls LLM with strict JSON schema, validates response, writes score_llm.json with metadata (provider, model, timestamp, mean)
    • CLI with --dry-run, --provider, --model flags; respects EVAL_JUDGE_PROVIDER and EVAL_JUDGE_MODEL env vars
    • Strict score validation: only {0, 0.25, 0.5, 0.75, 1.0} allowed; all four dimensions + rationales required
  • evals/score.sh (refactored)

    • Renamed human scoring output to score_human.json (backward-compatible: reads score.json if score_human.json absent)
    • Added --judge flag to delegate to evals/judge.py
    • Aggregation now reads both human and LLM scores, computes delta, flags |delta| ≥ 0.25 with ← review marker
    • Human score takes precedence for reported mean; LLM shown alongside for calibration
  • scripts/validate-dispatcher.sh (141 lines)

    • Check 1: routes.tsv → CATALOG.md dead reference detection
    • Check 2: CATALOG.md description completeness (no empty "When to invoke" fields)
    • Check 3: scenarios frontmatter presence in all referenced skills
    • Integrated into smoke-check.sh Step 12
  • tests/dispatcher/routes.tsv (53 lines)

    • Routing fixtures documenting expected dispatch for representative prompts
    • Covers slash commands, artifact intake signals, verb+shape triggers
    • Format: prompt<TAB>expected_skill<TAB>expected_role<TAB>notes
  • hooks/session-start-check.sh (27 lines)

    • Lightweight structural guard on SessionStart
    • Verifies presence of CATALOG.md, DISPATCHER.md, chronicle/SCHEMA.md
    • Warns on stderr; does not block session
  • Documentation

    • evals/README.md — updated with LLM judge workflow, provider table, recommended scoring sequence
    • examples/use-cases.md — six PM workflows showing multi-skill sequences (discovery interview cycle, strategy kernel review, launch readiness, finance review, weekly rhythm, async decision capture)
    • playbooks/harness-guide.md — determinism invariants, five layers (dispatcher, skill anatomy, chronicle, lifecycle hooks, build-time checks), verification suite
    • CONTRIBUTING.md — added dispatcher routing fixtures section, reference to harness-guide.md

Notable Implementation Details

  • Provider abstraction: Each judge implements score(system, user) → str. SDK import failures are caught and reported with install

https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT

claude added 3 commits May 21, 2026 06:48
…cher tests

Harness guide (playbooks/harness-guide.md):
- Documents all five determinism layers (dispatcher, skill anatomy, chronicle,
  lifecycle hooks, build-time consistency checks) with verification commands
- Quick-reference table mapping what-you-changed to commands to run
- TDD discipline section covering the invariants that must not be broken
- Common failure modes and diagnosis

Use-case how-to guide (examples/use-cases.md):
- Six multi-skill PM workflows: discovery interview cycle, strategy kernel review,
  launch readiness sweep, finance/pricing review, weekly PM rhythm, stakeholder prep
- Each workflow shows skills in sequence, what persists in the chronicle, and
  example prompts

LLM-as-judge evals (evals/judge.py, evals/score.sh):
- judge.py: calls Claude API to score a synthesis on 4 rubric dimensions;
  writes score_llm.json in the same schema as the human scorer
- score.sh: dual-score mode (score_human.json + score_llm.json); aggregate
  shows both side-by-side with delta flagging (|delta| >= 0.25 → review);
  --judge flag delegates to judge.py
- evals/README.md: documents the recommended workflow (LLM-first, human
  reviews flagged runs) and calibration target (mean delta < 0.1)

Dispatcher routing tests (tests/dispatcher/routes.tsv, scripts/validate-dispatcher.sh):
- 28 routing fixtures covering slash commands, artifact intake, verb+shape
  triggers, and documented multi-match escalation cases
- validate-dispatcher.sh: three deterministic checks — dead references in
  routes.tsv, CATALOG.md description completeness, scenarios frontmatter presence
- validate-dispatcher.sh added as step 12 in smoke-check.sh

Session-start integrity check (hooks/session-start-check.sh, hooks/hooks.json):
- Lightweight guard that verifies CATALOG.md, DISPATCHER.md, and
  chronicle/SCHEMA.md are present; warns on stderr without blocking the session
- Wired into hooks.json SessionStart alongside the existing banner hook

README.md updated to reference harness-guide.md, use-cases.md, evals dual-score
path, and validate-dispatcher.sh in the CI section.

https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT
evals/judge.py refactored with a Judge ABC and three implementations
(AnthropicJudge, OpenAIJudge, GeminiJudge). Each SDK is lazy-imported,
so users only install the one they use. Defaults: anthropic +
claude-sonnet-4-6 (no behavior change for existing users).

New CLI flags:
  --provider {anthropic|openai|gemini}   (default: anthropic)
  --model <name>                          (default: provider-specific)

New env var overrides:
  EVAL_JUDGE_PROVIDER  default provider
  EVAL_JUDGE_MODEL     default model

score_llm.json now records judge_provider + judge_model so aggregate
reports can detect provider-specific bias.

evals/score.sh --judge now passes through extra args, so cross-provider
runs work: ./evals/score.sh --judge <dir> --provider openai

Docs updated:
- evals/README.md: provider table, install hints, cross-provider workflow
- playbooks/harness-guide.md: calibration section notes that cross-provider
  judging is a precision technique when providers disagree

https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT
…ispatcher fixtures

CONTRIBUTING.md:
- Add a top-of-file reference to playbooks/harness-guide.md as the canonical
  determinism reference for contributors
- Add a "Dispatcher routing fixtures" section explaining tests/dispatcher/routes.tsv
  and when contributors need to update it
- Update the pre-PR checklist to include validate-dispatcher.sh
- Clarify that validate-dispatcher runs transitively via smoke-check.sh step 12
  (not as a separate CI step)

README.md: same clarification on CI scope for validate-dispatcher.

https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT
@argen argen merged commit 68a686c into main May 21, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants