fix(0.40.4): runCampaign actually invokes judges (was a phase-1 stub) by drewstone · Pull Request #100 · tangle-network/agent-eval

drewstone · 2026-05-25T18:38:22Z

CRITICAL: the shipped runCampaign never invoked its judges.

runJudgeCell was a hardcoded stub returning { composite: 0, notes: 'phase-1-stub' }, and JudgeConfig only modeled a single-LLM-prompt judge (systemPrompt/buildPrompt) — too narrow for real consumers (gtm scores with a 3-model ensemble + deterministic checks). The measurement primitive did not measure. Surfaced by building the first real consumer (gtm) against it — prove-one-before-fanning.

Fix

- JudgeConfig is now score()-based: score({ artifact, scenario, signal }) returns a JudgeScore. It is a function, not a fixed prompt shape, so ensembles, deterministic checks, or a single LLM call all conform. This matches what primitives-integration-spec.md already documented; only the implementation was a stub.
- runJudgeCell calls judge.score (stub deleted).
- Fail-loud: a thrown judge now invalidates the cell (recorded as error). It is NOT folded into a fake composite:0 that poisons aggregates, which is the exact anti-pattern the spec forbids.
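The three points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: only `score`, `artifact`, `scenario`, `signal`, `composite`, `notes`, and `error` come from the description; the `name`/`judge` fields, the `scenario.id` shape, the `unknown` type for `signal`, and the `nonEmpty` and `ensemble` helpers are hypothetical.

```typescript
// Hypothetical shapes for illustration; not the library's actual types.
interface JudgeScore {
  composite: number;
  notes?: string;
}

interface JudgeInput {
  artifact: string;
  scenario: { id: string }; // assumed shape
  signal?: unknown; // type not given in the PR description
}

interface JudgeConfig {
  name: string; // assumed field
  // Any scoring strategy fits behind this one function.
  score(input: JudgeInput): Promise<JudgeScore> | JudgeScore;
}

interface JudgeCell {
  judge: string;
  score?: JudgeScore;
  error?: string;
}

// Calls the judge and records either its real score or an error,
// never a placeholder composite.
async function runJudgeCell(judge: JudgeConfig, input: JudgeInput): Promise<JudgeCell> {
  try {
    return { judge: judge.name, score: await judge.score(input) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { judge: judge.name, error: message };
  }
}

// A deterministic check conforms to JudgeConfig without any LLM call.
const nonEmpty: JudgeConfig = {
  name: "non-empty",
  score: ({ artifact }) => ({ composite: artifact.trim().length > 0 ? 1 : 0 }),
};

// An ensemble is just another score() that averages its members.
function ensemble(name: string, members: JudgeConfig[]): JudgeConfig {
  if (members.length === 0) throw new Error("ensemble needs at least one judge");
  return {
    name,
    async score(input) {
      const scores = await Promise.all(members.map((m) => m.score(input)));
      const total = scores.reduce((sum, s) => sum + s.composite, 0);
      return { composite: total / scores.length, notes: `mean of ${scores.length} judges` };
    },
  };
}
```

Because a failing member rejects the `Promise.all`, the whole ensemble cell is recorded as an error in this sketch; whether the real gtm ensemble tolerates partial failures is not stated in the PR.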

Tests (the regression guard the original lacked)

- asserts the REAL composite flows from judge.score (not 0 / not phase-1-stub), and aggregates reflect it
- asserts a thrown judge sets cell.error + records no fake score
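A minimal sketch of what such a regression guard might look like, written without a test framework so it stays self-contained. The `Judge` and `JudgeCell` shapes, the `aggregate` helper, and `check` are hypothetical stand-ins; the actual tests in this PR run inside the project's own suite and may be structured differently.

```typescript
// Hypothetical shapes for illustration; not the library's actual types.
interface JudgeScore { composite: number; notes?: string }
interface JudgeCell { score?: JudgeScore; error?: string }
type Judge = { score(input: { artifact: string }): Promise<JudgeScore> };

async function runJudgeCell(judge: Judge, artifact: string): Promise<JudgeCell> {
  try {
    return { score: await judge.score({ artifact }) };
  } catch (err) {
    return { error: err instanceof Error ? err.message : String(err) };
  }
}

// Mean over scored cells only; errored cells are counted, not averaged in.
function aggregate(cells: JudgeCell[]): { mean: number | null; errored: number } {
  let total = 0;
  let scored = 0;
  let errored = 0;
  for (const cell of cells) {
    if (cell.score) {
      total += cell.score.composite;
      scored += 1;
    } else {
      errored += 1;
    }
  }
  return { mean: scored > 0 ? total / scored : null, errored };
}

function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(`regression: ${message}`);
}

async function regressionGuard(): Promise<string> {
  const real: Judge = { score: async () => ({ composite: 0.8, notes: "real judge" }) };
  const broken: Judge = { score: async () => { throw new Error("judge unavailable"); } };

  // The composite must be the judge's own value, not the old stub's 0.
  const ok = await runJudgeCell(real, "draft");
  check(ok.score !== undefined && ok.score.composite === 0.8, "composite must come from judge.score");
  check(ok.score !== undefined && ok.score.notes !== "phase-1-stub", "stub notes must not appear");

  // A thrown judge sets error and leaves no score behind.
  const failed = await runJudgeCell(broken, "draft");
  check(failed.error === "judge unavailable", "thrown judge must set cell.error");
  check(failed.score === undefined, "thrown judge must not record a fake score");

  // The errored cell must not drag the aggregate toward 0.
  const summary = aggregate([ok, failed]);
  check(summary.mean === 0.8, "aggregate must reflect only real scores");
  check(summary.errored === 1, "errored cell must be counted");
  return "ok";
}
```

Had the errored cell been folded in as composite 0, the mean here would be 0.4 instead of 0.8, which is the kind of silent skew the fail-loud change is meant to prevent.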

Suite 1421/1421. Typecheck + lint clean. Unblocks ALL of Phase 4 — no consumer migration could measure until this. 0.40.3 → 0.40.4.

CRITICAL: the shipped runCampaign never called judges. runJudgeCell was a hardcoded stub returning { composite: 0, notes: 'phase-1-stub' }, and JudgeConfig only modeled a single-LLM-prompt judge (systemPrompt/buildPrompt), too narrow for real consumers (gtm scores with a 3-model ENSEMBLE + deterministic checks). The measurement primitive did not measure. Surfaced by building the first real consumer (gtm) against it: prove-one-before-fanning.

Fix:
- JudgeConfig is now score()-based: score({ artifact, scenario, signal }) returns a JudgeScore. A function, not a fixed prompt shape, so ensembles, deterministic checks, or a single LLM call all conform. (Matches what primitives-integration-spec.md already documented; the impl was the stub.)
- runJudgeCell calls judge.score (deleted the stub).
- Fail-loud: a thrown judge now invalidates the cell (recorded as error), NOT folded into a fake composite:0 that poisons aggregates, the exact anti-pattern the spec forbids.

Tests (the regression guard the original lacked): assert the REAL composite flows from judge.score (not 0 / not 'phase-1-stub'), aggregates reflect it, and a thrown judge sets cell.error + records no fake score.

Suite 1421/1421. Unblocks ALL of Phase 4: no consumer migration could measure until this. 0.40.3 → 0.40.4 (npm + python lockstep).

drewstone merged commit 2155397 into main May 25, 2026
1 check passed

drewstone deleted the fix/0.40.4-wire-judge-scoring branch May 25, 2026 18:40
