fix(0.40.4): runCampaign actually invokes judges (was a phase-1 stub) #100

Merged
drewstone merged 1 commit into main from fix/0.40.4-wire-judge-scoring
May 25, 2026

Conversation

@drewstone
Contributor

CRITICAL — the shipped runCampaign never judged.

runJudgeCell was a hardcoded stub returning { composite: 0, notes: 'phase-1-stub' }, and JudgeConfig only modeled a single-LLM-prompt judge (systemPrompt/buildPrompt) — too narrow for real consumers (gtm scores with a 3-model ensemble + deterministic checks). The measurement primitive did not measure. Surfaced by building the first real consumer (gtm) against it — prove-one-before-fanning.

Fix

  • JudgeConfig is now score()-based: score({ artifact, scenario, signal }) returns a JudgeScore. A function, not a fixed prompt shape — ensembles, deterministic checks, or a single LLM call all conform. Matches what primitives-integration-spec.md already documented; the impl was the stub.
  • runJudgeCell calls judge.score (stub deleted).
  • Fail-loud: a thrown judge now invalidates the cell (recorded as error), NOT folded into a fake composite:0 that poisons aggregates — the exact anti-pattern the spec forbids.
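A minimal sketch of what the score()-based contract and the fail-loud cell could look like. Only runJudgeCell, JudgeConfig, JudgeScore, composite, notes, and the artifact/scenario/signal input names come from this PR; the field types, the JudgeInput and JudgeCell names, the name field, and the ensembleJudge helper are assumptions made for illustration, and the real implementation may differ.

```typescript
// Hypothetical shapes for illustration; the package's real types may differ.
interface JudgeInput {
  artifact: string; // output under evaluation (type assumed)
  scenario: string; // scenario identifier (type assumed)
  signal?: unknown; // extra context; its real type is not shown in this PR
}

interface JudgeScore {
  composite: number;
  notes?: string;
}

interface JudgeConfig {
  name: string; // assumed field, used here only to label cells
  // Any scoring strategy conforms as long as it returns a JudgeScore.
  score(input: JudgeInput): Promise<JudgeScore>;
}

// A cell holds either a score or an error, never a placeholder score.
interface JudgeCell {
  judge: string;
  score?: JudgeScore;
  error?: string;
}

async function runJudgeCell(judge: JudgeConfig, input: JudgeInput): Promise<JudgeCell> {
  try {
    return { judge: judge.name, score: await judge.score(input) };
  } catch (err) {
    // Fail loud: record the failure instead of inventing composite: 0.
    return { judge: judge.name, error: err instanceof Error ? err.message : String(err) };
  }
}

// Example judge: an ensemble gated by a deterministic check.
// The members are stand-in functions, not real model calls.
type Member = (input: JudgeInput) => Promise<number>;

function ensembleJudge(name: string, members: Member[]): JudgeConfig {
  return {
    name,
    async score(input) {
      if (input.artifact.trim().length === 0) {
        // A genuine 0 decided by the judge, unlike the old stub's constant 0.
        return { composite: 0, notes: "deterministic check: empty artifact" };
      }
      const scores = await Promise.all(members.map((m) => m(input)));
      const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
      return { composite: mean, notes: `mean of ${scores.length} members` };
    },
  };
}
```

Because score is just a function, a single-prompt LLM judge, a pure deterministic checker, and the ensemble above would all plug into runJudgeCell without changes to the runner.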

Tests (the regression guard the original lacked)

  • asserts the REAL composite flows from judge.score (not 0 / not phase-1-stub), and aggregates reflect it
  • asserts a thrown judge sets cell.error + records no fake score
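To illustrate why the second assertion matters, here is a small sketch of an aggregate that skips errored cells. The PR does not show how aggregates are computed, so the Cell shape and the aggregate function below are hypothetical; the point is only that an errored cell should be counted as errored, not averaged in as a zero.

```typescript
// Hypothetical cell shape: a cell carries a score or an error, not both.
interface Cell {
  score?: { composite: number };
  error?: string;
}

interface Aggregate {
  mean: number | null; // null when no cell was scored
  scored: number;
  errored: number;
}

// Mean composite over scored cells only; errored cells are counted separately.
function aggregate(cells: Cell[]): Aggregate {
  const composites: number[] = [];
  let errored = 0;
  for (const cell of cells) {
    if (cell.error !== undefined) {
      errored += 1;
    } else if (cell.score !== undefined) {
      composites.push(cell.score.composite);
    }
  }
  const mean =
    composites.length > 0
      ? composites.reduce((a, b) => a + b, 0) / composites.length
      : null;
  return { mean, scored: composites.length, errored };
}

// Two real scores and one judge failure.
const cells: Cell[] = [
  { score: { composite: 0.9 } },
  { score: { composite: 0.7 } },
  { error: "judge threw" },
];
const result = aggregate(cells);
```

With these inputs the mean over scored cells is about 0.8. Folding the failed cell in as a fake composite of 0 would have pulled it down to roughly 0.53, which is the kind of poisoned aggregate the fail-loud rule is meant to prevent.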

Suite 1421/1421. Typecheck + lint clean. Unblocks ALL of Phase 4 — no consumer migration could measure until this. 0.40.3 → 0.40.4.

CRITICAL: the shipped runCampaign never called judges. runJudgeCell was a
hardcoded stub returning { composite: 0, notes: 'phase-1-stub' }, and
JudgeConfig only modeled a single-LLM-prompt judge (systemPrompt/buildPrompt)
— too narrow for real consumers (gtm scores with a 3-model ENSEMBLE +
deterministic checks). The measurement primitive did not measure. Surfaced by
building the first real consumer (gtm) against it — prove-one-before-fanning.

Fix:
- JudgeConfig is now score()-based: score({ artifact, scenario, signal })
  returns a JudgeScore. A function, not a fixed prompt shape: ensembles,
  deterministic checks, or a single LLM call all conform. (Matches what
  primitives-integration-spec.md already documented; the impl was the stub.)
- runJudgeCell calls judge.score (deleted the stub).
- Fail-loud: a thrown judge now invalidates the cell (recorded as error),
  NOT folded into a fake composite:0 that poisons aggregates. That is the
  exact anti-pattern the spec forbids.

Tests (the regression guard the original lacked): assert the REAL composite
flows from judge.score (not 0 / not 'phase-1-stub'), aggregates reflect it,
and a thrown judge sets cell.error + records no fake score. Suite 1421/1421.

Unblocks ALL of Phase 4 — no consumer migration could measure until this.
0.40.3 → 0.40.4 (npm + python lockstep).
@drewstone drewstone merged commit 2155397 into main May 25, 2026
1 check passed
@drewstone drewstone deleted the fix/0.40.4-wire-judge-scoring branch May 25, 2026 18:40