feat(evals): synthetic-perturbation judge validator — primary automated judge validation (L480) #255

Merged
cipher813 merged 1 commit into main from feat/judge-perturbation-validator-260529
May 29, 2026

@cipher813
Owner

The primary automated judge-validation mechanism from the 2026-05-29 L480 re-scope (config #372), after Brian's process-vs-outcome challenge established that outcome-IC validates the system, not the judge.

What it does

Validates the LLM-as-judge on its actual construct — process quality — with zero human labels. Take a known-good agent output, apply a deterministic corruption targeting one rubric dimension, run the judge on reference + corrupted, assert the targeted dimension drops. Ground truth is constructed (we authored the corruption), so no annotation. Tests sensitivity (does it notice degradation?) + dimension-specificity (does the right dimension move?) — catching rubber-stamp judges, halo effects, and verbosity bias.
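The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in evals/perturbation.py: `strip_numbers`, `toy_judge`, `check_perturbation`, the `tolerance` parameter, and the sample reference text are all hypothetical names and values chosen for the example, and the toy judge is a surface-feature stand-in for the real LLM judge.

```python
import re
from typing import Callable, Dict

Scores = Dict[str, int]  # rubric dimension -> score on a 1..5 scale


def strip_numbers(text: str) -> str:
    """Deterministic corruption (hypothetical): replace numeric tokens with a vague word."""
    return re.sub(r"\d+(?:\.\d+)?%?", "some", text)


def toy_judge(text: str) -> Scores:
    """Stand-in for the LLM judge (assumption): scores from simple surface features."""
    return {
        "numerical_grounding": 5 if re.search(r"\d", text) else 1,
        "citation_grounding": 4 if "[source:" in text else 1,
    }


def check_perturbation(
    reference: str,
    corrupt: Callable[[str], str],
    judge_fn: Callable[[str], Scores],
    target: str,
    tolerance: int = 1,
) -> dict:
    """Judge reference and corrupted text, then compare per-dimension scores."""
    ref_scores = judge_fn(reference)
    bad_scores = judge_fn(corrupt(reference))
    target_drop = ref_scores[target] - bad_scores[target]
    off_target = [
        abs(ref_scores[dim] - bad_scores[dim]) for dim in ref_scores if dim != target
    ]
    return {
        # Sensitivity: the judge noticed the degradation at all.
        "caught": target_drop > 0,
        # Dimension-specificity: untargeted dimensions stayed roughly put.
        "specific": all(move <= tolerance for move in off_target),
        "target_drop": target_drop,
    }


REFERENCE = "Revenue grew 12% to 4.2B [source: 10-Q]; margin expanded 150 bps."
result = check_perturbation(REFERENCE, strip_numbers, toy_judge, "numerical_grounding")
print(result)
```

A rubber-stamp judge that returns the same scores for any input would come back with `caught` false here, and a halo-prone judge that drops every dimension together would fail the `specific` check.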

Explicitly NOT outcome-IC. Never touches stock returns. Outcome (realized alpha) is a separate, firewalled system diagnostic — reasoning quality and 21d return are weakly correlated, so validating/tuning the judge on outcomes would Goodhart it into a luck-predictor.

Pieces

  • evals/perturbation.py — reference fixtures (sector_quant + sector_qual), 8 deterministic corruptions (strip numbers, break rank-vs-score coherence, flatten calibration, gut completeness, strip citations, flatten reasoning depth, misalign evidence, verbosity-pad probe), battery runner with injectable judge_fn (so harness logic is testable offline), markdown scorecard.
  • tests/test_judge_perturbation.py: 16 tests covering corruption determinism and battery logic via a fake judge. Runs in regular mocked CI (no API key).
  • tests/live_smoke/judge_perturbation_smoke.py — paths-filtered live CI gate; tolerant caught-rate threshold (0.75) over a 4-corruption subset; clean skip on forks without the secret.
  • .github/workflows/judge-perturbation-smoke.yml — triggers on judge/perturbation file changes; checks out alpha-engine-config (gitignored rubric prompts) + uses ANTHROPIC_API_KEY.
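For readers who want a feel for how an injectable `judge_fn` keeps the harness testable offline, here is a rough sketch of a battery runner, a caught-rate gate, and a markdown scorecard. Only the `judge_fn` injection idea and the 0.75 threshold come from this PR; `Corruption`, `Result`, `run_battery`, `caught_rate`, `scorecard`, the fake judge, and the `>=` comparison are assumptions made for illustration and may not match the real module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, int]
JudgeFn = Callable[[str], Scores]


@dataclass(frozen=True)
class Corruption:  # hypothetical shape, not the real evals/perturbation.py type
    name: str
    target: str                  # rubric dimension expected to drop
    apply: Callable[[str], str]  # deterministic text -> text transform


@dataclass(frozen=True)
class Result:
    corruption: str
    target: str
    before: int
    after: int

    @property
    def caught(self) -> bool:
        return self.after < self.before


def run_battery(reference: str, battery: List[Corruption], judge_fn: JudgeFn) -> List[Result]:
    """Score the reference once, then each corrupted variant, with the injected judge."""
    ref_scores = judge_fn(reference)
    results = []
    for c in battery:
        bad_scores = judge_fn(c.apply(reference))
        results.append(Result(c.name, c.target, ref_scores[c.target], bad_scores[c.target]))
    return results


def caught_rate(results: List[Result]) -> float:
    return sum(r.caught for r in results) / len(results) if results else 0.0


def scorecard(results: List[Result]) -> str:
    """Render results as a markdown table."""
    rows = ["| corruption | dimension | before | after | caught |", "|---|---|---|---|---|"]
    for r in results:
        mark = "yes" if r.caught else "no"
        rows.append(f"| {r.corruption} | {r.target} | {r.before} | {r.after} | {mark} |")
    return "\n".join(rows)


def fake_judge(text: str) -> Scores:
    """Offline stand-in judge: no API key needed, so it can run in mocked CI."""
    return {
        "citation_grounding": 4 if "[source:" in text else 1,
        "reasoning_depth": 4 if "because" in text else 1,
    }


REFERENCE = "We overweight the name because margins are expanding [source: 10-Q]."
BATTERY = [
    Corruption("strip_citations", "citation_grounding", lambda t: t.replace(" [source: 10-Q]", "")),
    Corruption("flatten_reasoning", "reasoning_depth", lambda t: t.split(" because")[0] + "."),
]
results = run_battery(REFERENCE, BATTERY, fake_judge)
print(scorecard(results))
assert caught_rate(results) >= 0.75  # tolerant gate, as in the live smoke test
```

In the live smoke test the same runner would presumably receive a function that calls the real judge, which is why the harness logic itself never needs network access.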

Validated live

Ran against claude-haiku-4-5, 4/4 caught: numerical_grounding 5→1, ranking_coherence 5→2, citation_grounding 4→1, reasoning_depth 4→1. (Two fixture bugs were found and fixed during validation: the qual fixture tripped the degenerate_input pre-check, and the first ranking corruption was too weak. Scores travel with picks, so it needed a score-vs-rank contradiction.)
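As a quick worked check of the numbers above against the 0.75 gate, the arithmetic looks like this. The scores are the ones reported in this PR; the `>=` comparison is an assumption about how the smoke test applies the threshold.

```python
# (before, after) scores reported for the live run against claude-haiku-4-5.
reported = {
    "numerical_grounding": (5, 1),
    "ranking_coherence": (5, 2),
    "citation_grounding": (4, 1),
    "reasoning_depth": (4, 1),
}

THRESHOLD = 0.75  # tolerant caught-rate threshold from the smoke test

drops = {dim: before - after for dim, (before, after) in reported.items()}
caught = sum(1 for drop in drops.values() if drop > 0)
rate = caught / len(reported)
passes = rate >= THRESHOLD  # assumed comparison

print(drops)   # {'numerical_grounding': 4, 'ranking_coherence': 3, 'citation_grounding': 3, 'reasoning_depth': 3}
print(rate)    # 1.0
print(passes)  # True
```

Under that assumption a single miss on the 4-corruption subset (3/4 = 0.75) would still pass, which is what makes the threshold tolerant to one noisy judgment.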

Full research tests/ suite: 1663 passed.

Scope note

This is Phase A (core harness + CI gate). The weekly sensitivity scorecard (emit to S3 + surface in the evaluator email) is the scoped Phase B follow-up — it needs a Lambda with Anthropic access (the EvalRollingMean Lambda is an aggregation stage without it). Tracked in L480.

Independent of the schema-v18 PR (#254); this touches only evals/ + tests + a workflow.

🤖 Generated with Claude Code

…ed judge validation (L480)

Validates the LLM-as-judge on its actual construct (process quality)
with ZERO human labels. Takes a known-good agent output, applies a
deterministic corruption targeting one rubric dimension, runs the judge
on both, and asserts the targeted dimension drops. Constructed ground
truth → no annotation. Tests sensitivity + dimension-specificity
(catches rubber-stamp judges, halo effects, verbosity bias).

Explicitly NOT outcome-IC — never touches stock returns. Outcome is a
separate, firewalled system diagnostic (reasoning quality and 21d return
are weakly correlated; tuning the judge on outcome would Goodhart it).

- evals/perturbation.py: reference fixtures (sector_quant + sector_qual),
  8 deterministic corruptions, battery runner with injectable judge_fn,
  markdown scorecard.
- tests/test_judge_perturbation.py: 16 tests — corruption determinism +
  battery logic with a fake judge (regular mocked CI, no key).
- tests/live_smoke/judge_perturbation_smoke.py: paths-filtered live CI
  gate; tolerant caught-rate threshold (0.75) over a 4-corruption subset.
  Verified live against claude-haiku-4-5: 4/4 caught (numerical_grounding
  5→1, ranking_coherence 5→2, citation_grounding 4→1, reasoning_depth 4→1).
- .github/workflows/judge-perturbation-smoke.yml: triggers on judge/
  perturbation file changes; checks out alpha-engine-config for the
  gitignored rubric prompts + uses ANTHROPIC_API_KEY (clean skip on forks).

Full research tests/ suite: 1663 passed. Weekly scorecard stage is the
scoped Phase B follow-up (needs a Lambda with Anthropic access).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit bd68e47 into main May 29, 2026
2 checks passed
@cipher813 cipher813 deleted the feat/judge-perturbation-validator-260529 branch May 29, 2026 19:37