fix: weight static + runtime in execution correctness scoring #23
Can you elaborate on this, @lezama? I thought the LLM-judge piece was removed in my previous, giant PR. |
Execution correctness was computed as the unweighted fraction of passed runtime assertions, which discarded the static-analysis score entirely and ignored the per-pattern/per-assertion weights the runtime already computes. As a result, incomplete or insecure code (e.g. tripping a forbidden pattern) could still score full correctness.
Replace _score_assertions with _score_correctness, which averages the static and runtime sub-scores (each already weighted by the runtime), and only counts a dimension when the test defines checks for it.
Hard execution failures (fatal/execution errors before any assertion runs) force correctness to 0.0 so static pattern matches cannot rescue code that does not run; code that runs but fails some assertions still earns partial credit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7268b17 to 47eab26
|
you're right — that note was stale, my bad. this PR was branched off trunk before #22 merged, so its base still had the old quality/judge scaffolding (hence the "30% of overall" line). it predated your removal. just rebased onto current trunk and reframed the PR. #22 already does the static + runtime weighted averaging, so i've scoped this down to the actual net-new:
updated the description + the reference-solution-mode fixtures, 33 tests green. keeping it as this one focused commit 🙂 |
Problem
After #22, execution correctness averages the runtime's static.score and runtime.score (each already weighted), which was the right move. But two gaps remained: static.score (pattern matches) averages back up to ~0.5. Static matching should not rescue code that never executed.
Fix
_score_correctness(raw, test) reads which dimensions apply from the test definition (static_checks / runtime_checks), not the runtime output, then averages the applicable, runtime-weighted sub-scores. A hard crash (fatal/execution error before any assertion runs, detected via zero runtime weight or a synthetic execution_error/fatal_error entry) forces correctness to 0.0. Code that runs but fails some assertions still earns partial credit.
Before / after (real e-hooks-001: 4 weighted static patterns + 2 runtime assertions)
The first two rows match #22's behavior; the crash row is the net-new fix.
Tests
New python/tests/test_scoring.py (10 cases): static+runtime averaging, single-dimension scoring, forbidden hard-fail, crash detection (both signals), and the control case (runs but half assertions fail → keeps partial credit). Reference-solution-mode fixtures updated to define real checks so definition-based applicability is exercised. Full suite passes (33), ruff clean.
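To illustrate why the fixtures need real checks, here is a small sketch of definition-based applicability. The helper applicable_dimensions and all three fixtures are hypothetical, as are the pattern, assert and weight keys inside each check; only static_checks and runtime_checks come from the description.

```python
from typing import Any


def applicable_dimensions(test: dict[str, Any]) -> list[str]:
    """List the dimensions a test definition asks to be scored."""
    dims = []
    if test.get("static_checks"):
        dims.append("static")
    if test.get("runtime_checks"):
        dims.append("runtime")
    return dims


# Hypothetical fixture with no checks: nothing applies, so the
# applicability logic is never exercised.
empty_fixture = {"id": "ref-solution-empty"}

# Hypothetical fixture defining checks in both dimensions.
full_fixture = {
    "id": "ref-solution-full",
    "static_checks": [{"pattern": "example_pattern", "weight": 1.0}],
    "runtime_checks": [{"assert": "example_assertion", "weight": 1.0}],
}

# Hypothetical single-dimension fixture: only static checks are defined.
static_only_fixture = {
    "id": "ref-solution-static",
    "static_checks": [{"pattern": "example_pattern", "weight": 1.0}],
    "runtime_checks": [],
}

print(applicable_dimensions(empty_fixture))        # []
print(applicable_dimensions(full_fixture))         # ['static', 'runtime']
print(applicable_dimensions(static_only_fixture))  # ['static']
```

A fixture like empty_fixture would pass through scoring without ever touching the applicability branch, which is presumably why the reference-solution-mode fixtures were given real checks.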