fix: weight static + runtime in execution correctness scoring #23
Open
lezama wants to merge 1 commit into
Execution correctness was computed as the unweighted fraction of passed runtime assertions, which discarded the static-analysis score entirely and ignored the per-pattern/per-assertion weights the runtime already computes. As a result, incomplete or insecure code (e.g. tripping a forbidden pattern) could still score full correctness.

Replace _score_assertions with _score_correctness, which averages the static and runtime sub-scores (each already weighted by the runtime), and only counts a dimension when the test defines checks for it.

Hard execution failures (fatal/execution errors before any assertion runs) force correctness to 0.0 so static pattern matches cannot rescue code that does not run; code that runs but fails some assertions still earns partial credit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
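To make the weighting problem concrete, here is a minimal sketch of how an unweighted pass fraction can diverge from a weighted one. The function names and the example weights are hypothetical and are not taken from the harness; in the PR the weighted value comes from the runtime rather than being recomputed in Python.

```python
from typing import Sequence


def unweighted_fraction(passed: Sequence[bool]) -> float:
    """Share of assertions that passed, ignoring weights (the old behaviour as described)."""
    return sum(passed) / len(passed) if passed else 0.0


def weighted_fraction(passed: Sequence[bool], weights: Sequence[float]) -> float:
    """Share of total weight earned by passing assertions (hypothetical helper)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    earned = sum(w for ok, w in zip(passed, weights) if ok)
    return earned / total


# Hypothetical example: the heavier assertion passes, the lighter one fails.
passed = [True, False]
weights = [3.0, 1.0]

old_score = unweighted_fraction(passed)         # 1 of 2 assertions passed
new_score = weighted_fraction(passed, weights)  # 3.0 of 4.0 weight earned
```

With these made-up weights the two numbers differ (0.5 versus 0.75), which is the kind of gap that appears when per-assertion weights are dropped.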
Can you elaborate on this, @lezama? I thought the LLM-judge piece was removed in my previous, giant PR.
Problem
Execution correctness was computed as the unweighted fraction of passed runtime assertions (`_score_assertions`). This had three issues:

- The runtime returns `static.score` (required/forbidden patterns), but the harness never used it. Code matching zero required patterns could still score full correctness.
- Tests define a `"weight"` on every check; the runtime honors them, but the Python side recomputed an unweighted average.
- Code that tripped a `forbidden_patterns` rule (e.g. a direct SQL query) still got full correctness.

Fix
`_score_correctness(raw, test)` averages the static and runtime sub-scores, each already weighted by the WordPress runtime, and counts a dimension only when the test actually defines checks for it (so a test with only runtime assertions is scored purely on runtime, and vice versa). Applicability is read from the test definition, not the runtime output.

Hard crashes (fatal/execution error before any assertion runs) force correctness to `0.0`: the runtime is ground truth, so static pattern matches must not rescue code that doesn't run. Code that runs but fails some assertions still earns partial credit.

Before / after (real `e-hooks-001`: 4 weighted static patterns + 2 runtime assertions)

`get_post_meta` (runtime 2/2, static 2.0/3.0)

Tests
New `python/tests/test_scoring.py` (10 cases): static+runtime averaging, single-dimension scoring, forbidden hard-fail, crash detection (both signals), and the control case (runs but half assertions fail → keeps partial credit). Full suite passes (17), ruff clean.

Not included
The `quality` / LLM-judge dimension (30% of overall) is still unimplemented in the runtime; tracked separately.

🤖 Generated with Claude Code
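For readers skimming the description, the rule under Fix can be sketched roughly as follows. This is not the PR's implementation: the function name `score_correctness_sketch`, the result keys `fatal_error`, `execution_error`, `static`, and `runtime`, and the test-definition keys `required_patterns` and `assertions` are all assumptions made for illustration. Only `forbidden_patterns` and `static.score` are named in the description.

```python
from typing import Any, List, Mapping


def score_correctness_sketch(raw: Mapping[str, Any], test: Mapping[str, Any]) -> float:
    """Average the static and runtime sub-scores for the dimensions a test defines.

    All key names are hypothetical; the real harness may use different ones.
    """
    # Hard crash: the code never reached its assertions, so static matches
    # must not earn any credit (assumed crash signals).
    if raw.get("fatal_error") or raw.get("execution_error"):
        return 0.0

    parts: List[float] = []

    # Applicability comes from the test definition, not the runtime output.
    if test.get("required_patterns") or test.get("forbidden_patterns"):
        parts.append(float(raw.get("static", {}).get("score", 0.0)))
    if test.get("assertions"):
        parts.append(float(raw.get("runtime", {}).get("score", 0.0)))

    # A test that defines no checks at all scores 0.0 in this sketch.
    return sum(parts) / len(parts) if parts else 0.0


# Illustration using the sub-scores quoted above (runtime 2/2, static 2.0/3.0).
raw_result = {"static": {"score": 2.0 / 3.0}, "runtime": {"score": 1.0}}
test_def = {"required_patterns": ["get_post_meta"], "assertions": ["a", "b"]}
combined = score_correctness_sketch(raw_result, test_def)  # (2/3 + 1.0) / 2
```

Under these assumptions a runtime-only test is scored purely on its runtime sub-score, and any crash signal yields `0.0` regardless of the static sub-score. Whether the harness reports the same combined value for `e-hooks-001` depends on details not shown in this description.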