Add LLM-as-judge scorer, dispatcher validation, and eval aggregation#38
Merged
Conversation
…cher tests Harness guide (playbooks/harness-guide.md): - Documents all five determinism layers (dispatcher, skill anatomy, chronicle, lifecycle hooks, build-time consistency checks) with verification commands - Quick-reference table mapping what-you-changed to commands to run - TDD discipline section covering the invariants that must not be broken - Common failure modes and diagnosis Use-case how-to guide (examples/use-cases.md): - Six multi-skill PM workflows: discovery interview cycle, strategy kernel review, launch readiness sweep, finance/pricing review, weekly PM rhythm, stakeholder prep - Each workflow shows skills in sequence, what persists in the chronicle, and example prompts LLM-as-judge evals (evals/judge.py, evals/score.sh): - judge.py: calls Claude API to score a synthesis on 4 rubric dimensions; writes score_llm.json in the same schema as the human scorer - score.sh: dual-score mode (score_human.json + score_llm.json); aggregate shows both side-by-side with delta flagging (|delta| >= 0.25 → review); --judge flag delegates to judge.py - evals/README.md: documents the recommended workflow (LLM-first, human reviews flagged runs) and calibration target (mean delta < 0.1) Dispatcher routing tests (tests/dispatcher/routes.tsv, scripts/validate-dispatcher.sh): - 28 routing fixtures covering slash commands, artifact intake, verb+shape triggers, and documented multi-match escalation cases - validate-dispatcher.sh: three deterministic checks — dead references in routes.tsv, CATALOG.md description completeness, scenarios frontmatter presence - validate-dispatcher.sh added as step 12 in smoke-check.sh Session-start integrity check (hooks/session-start-check.sh, hooks/hooks.json): - Lightweight guard that verifies CATALOG.md, DISPATCHER.md, and chronicle/SCHEMA.md are present; warns on stderr without blocking the session - Wired into hooks.json SessionStart alongside the existing banner hook README.md updated to reference harness-guide.md, use-cases.md, evals dual-score path, and validate-dispatcher.sh in the CI section. https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT
evals/judge.py refactored with a Judge ABC and three implementations
(AnthropicJudge, OpenAIJudge, GeminiJudge). Each SDK is lazy-imported,
so users only install the one they use. Defaults: anthropic +
claude-sonnet-4-6 (no behavior change for existing users).
New CLI flags:
--provider {anthropic|openai|gemini} (default: anthropic)
--model <name> (default: provider-specific)
New env var overrides:
EVAL_JUDGE_PROVIDER default provider
EVAL_JUDGE_MODEL default model
score_llm.json now records judge_provider + judge_model so aggregate
reports can detect provider-specific bias.
evals/score.sh --judge now passes through extra args, so cross-provider
runs work: ./evals/score.sh --judge <dir> --provider openai
Docs updated:
- evals/README.md: provider table, install hints, cross-provider workflow
- playbooks/harness-guide.md: calibration section notes that cross-provider
judging is a precision technique when providers disagree
https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT
…ispatcher fixtures CONTRIBUTING.md: - Add a top-of-file reference to playbooks/harness-guide.md as the canonical determinism reference for contributors - Add a "Dispatcher routing fixtures" section explaining tests/dispatcher/routes.tsv and when contributors need to update it - Update the pre-PR checklist to include validate-dispatcher.sh - Clarify that validate-dispatcher runs transitively via smoke-check.sh step 12 (not as a separate CI step) README.md: same clarification on CI scope for validate-dispatcher. https://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds three major components to the evaluation and harness infrastructure:
LLM-as-judge scorer (
evals/judge.py) — provider-agnostic scoring tool that reads synthesis outputs and scores them against the rubric using Claude, GPT-4, or Gemini. Complements human scoring with fast, repeatable LLM judgments.Dispatcher validation (
scripts/validate-dispatcher.sh) — structural checks ensuring routing fixtures are consistent with the catalog, descriptions are complete, and referenced skills have scenarios frontmatter. Integrated intosmoke-check.shas Step 12.Enhanced eval aggregation — updated
evals/score.shto handle both human (score_human.json) and LLM (score_llm.json) scores, report deltas, and flag runs where|human − llm| ≥ 0.25for calibration review.Key Changes
evals/judge.py(330 lines)Judgebase class with provider-specific implementations:AnthropicJudge,OpenAIJudge,GeminiJudgerubric.md+synthesis.md, calls LLM with strict JSON schema, validates response, writesscore_llm.jsonwith metadata (provider, model, timestamp, mean)--dry-run,--provider,--modelflags; respectsEVAL_JUDGE_PROVIDERandEVAL_JUDGE_MODELenv vars{0, 0.25, 0.5, 0.75, 1.0}allowed; all four dimensions + rationales requiredevals/score.sh(refactored)score_human.json(backward-compatible: readsscore.jsonifscore_human.jsonabsent)--judgeflag to delegate toevals/judge.py|delta| ≥ 0.25with← reviewmarkerscripts/validate-dispatcher.sh(141 lines)smoke-check.shStep 12tests/dispatcher/routes.tsv(53 lines)prompt<TAB>expected_skill<TAB>expected_role<TAB>noteshooks/session-start-check.sh(27 lines)CATALOG.md,DISPATCHER.md,chronicle/SCHEMA.mdDocumentation
evals/README.md— updated with LLM judge workflow, provider table, recommended scoring sequenceexamples/use-cases.md— six PM workflows showing multi-skill sequences (discovery interview cycle, strategy kernel review, launch readiness, finance review, weekly rhythm, async decision capture)playbooks/harness-guide.md— determinism invariants, five layers (dispatcher, skill anatomy, chronicle, lifecycle hooks, build-time checks), verification suiteCONTRIBUTING.md— added dispatcher routing fixtures section, reference to harness-guide.mdNotable Implementation Details
score(system, user) → str. SDK import failures are caught and reported with installhttps://claude.ai/code/session_01JWU8bLomUJVpCL8qyMB3XT