fix: weight static + runtime in execution correctness scoring #23

Open
lezama wants to merge 1 commit into trunk from fix/execution-correctness-scoring

lezama commented Jun 16, 2026

Contributor

Problem

After #22, execution correctness averages the runtime's static.score and runtime.score (each already weighted), which was the right move. But two gaps remained:

  1. A hard crash could still earn partial credit. If code throws a fatal/execution error before any assertion runs, the runtime sub-score is 0 but a high static.score (pattern matches) averages back up to ~0.5. Static matching should not rescue code that never executed.
  2. Applicability was read from the runtime output, not the test. A crash zeroes the runtime weight, so an output-based check would conclude "runtime doesn't apply" and score the test on static alone — masking the crash.
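
The two gaps can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the `raw` layout (a `score` and a `weight` per dimension) and the helper names `naive_mean` and `applicable_from_output` are assumptions made for the example, not the runtime's actual output format or the code in this PR.

```python
def naive_mean(static_score: float, runtime_score: float) -> float:
    """Plain average of the two sub-scores, with no crash handling."""
    return (static_score + runtime_score) / 2


def applicable_from_output(raw: dict) -> list[str]:
    """Output-based applicability: a dimension counts only if its weight is > 0."""
    return [d for d in ("static", "runtime") if raw.get(d, {}).get("weight", 0) > 0]


# Hypothetical runtime output for code that crashed before any assertion ran.
crashed = {
    "static": {"score": 1.0, "weight": 3.0},
    "runtime": {"score": 0.0, "weight": 0.0},
}

# Gap 1: perfect static matching lifts a crash to half credit.
gap1 = naive_mean(crashed["static"]["score"], crashed["runtime"]["score"])

# Gap 2: the zero runtime weight makes runtime look inapplicable,
# so the test ends up scored on static alone.
dims = applicable_from_output(crashed)
gap2 = sum(crashed[d]["score"] for d in dims) / len(dims)

print(gap1, dims, gap2)
```

Under these assumptions the crash scores 0.5 with plain averaging, and a full 1.0 when applicability is inferred from the output.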

Fix

_score_correctness(raw, test) decides which dimensions apply from the test definition (static_checks / runtime_checks) rather than from the runtime output, then averages the applicable sub-scores, each already weighted by the runtime. A hard crash forces correctness to 0.0. A hard crash is a fatal/execution error raised before any assertion runs, detected by a zero runtime weight or by a synthetic execution_error/fatal_error entry. Code that runs but fails some assertions still earns partial credit.
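
A minimal sketch of that logic, for readers who want the shape without opening the diff. It is not the code in this PR: the function name `score_correctness_sketch`, the `raw` field names (`score`, `weight`, `assertions`, `type`), and the dict-based test definition are all assumptions chosen to mirror the description above.

```python
from typing import Any

# Synthetic entry names taken from the description; where they appear in the
# runtime output (an "assertions" list with a "type" key) is an assumption.
CRASH_TYPES = {"execution_error", "fatal_error"}


def is_hard_crash(raw: dict[str, Any]) -> bool:
    """Crash signals: zero runtime weight, or a synthetic error entry."""
    runtime = raw.get("runtime") or {}
    if runtime.get("weight", 0.0) == 0:
        return True
    return any(a.get("type") in CRASH_TYPES for a in runtime.get("assertions", []))


def score_correctness_sketch(raw: dict[str, Any], test: dict[str, Any]) -> float:
    # Applicability comes from the test definition, never from the output.
    has_static = bool(test.get("static_checks"))
    has_runtime = bool(test.get("runtime_checks"))

    # Code that never executed gets no credit, whatever static matched.
    if has_runtime and is_hard_crash(raw):
        return 0.0

    scores = []
    if has_static:
        scores.append(float((raw.get("static") or {}).get("score", 0.0)))
    if has_runtime:
        scores.append(float((raw.get("runtime") or {}).get("score", 0.0)))
    return sum(scores) / len(scores) if scores else 0.0


both = {"static_checks": ["pattern"], "runtime_checks": ["assertion"]}
crash = {"static": {"score": 1.0}, "runtime": {"score": 0.0, "weight": 0.0}}
half = {
    "static": {"score": 1.0},
    "runtime": {"score": 0.5, "weight": 2.0, "assertions": [{"type": "assertion"}]},
}
print(score_correctness_sketch(crash, both))  # 0.0
print(score_correctness_sketch(half, both))   # 0.75
```

The crash check runs before any averaging, so the order of the two steps is what keeps a perfect static score from leaking into a crashed run.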

Before / after (real e-hooks-001: 4 weighted static patterns + 2 runtime assertions)

| Case | Before | After |
| --- | --- | --- |
| Registers both hooks but wrong meta key (runtime 2/2, static 2.0/3.0) | 0.8334 | 0.8334 |
| Works but trips a forbidden pattern (e.g. direct SQL) | 0.5000 | 0.5000 |
| Static perfect but code crashes before assertions | 0.5000 | 0.0000 |

The first two rows match #22's behavior; the crash row is the net-new fix.
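
For anyone checking the table by hand, the "after" column can be reproduced as below. Two things here are assumptions, not facts from the runtime: that static.score arrives already rounded to four decimals (0.6667 for 2.0/3.0), which would explain 0.8334 rather than the exact 0.8333, and that a forbidden pattern zeroes the static sub-score while runtime stays at 1.0. The helper `mean4` is hypothetical.

```python
from decimal import Decimal


def mean4(*scores: str) -> Decimal:
    """Average sub-scores given as decimal strings, rounded to 4 places."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.0001"))


# Row 1: runtime 2/2 = 1.0, static 2.0/3.0 assumed pre-rounded to 0.6667.
wrong_meta_key = mean4("1.0", "0.6667")

# Row 2: runtime 1.0, static assumed hard-failed to 0.0 by the forbidden pattern.
forbidden = mean4("1.0", "0.0")

# Row 3: a hard crash skips the average entirely and forces 0.0.
crash = Decimal("0.0000")

print(wrong_meta_key, forbidden, crash)
```

Decimal is used instead of float so that the half-way value 0.83335 rounds deterministically.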

Tests

New python/tests/test_scoring.py (10 cases): static+runtime averaging, single-dimension scoring, forbidden hard-fail, crash detection (both signals), and the control case (runs but half assertions fail → keeps partial credit). Reference-solution-mode fixtures updated to define real checks so definition-based applicability is exercised. Full suite passes (33), ruff clean.

github-actions Bot commented Jun 16, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: lezama <migueluy@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

lezama requested a review from JasonTheAdams June 16, 2026 13:46

@JasonTheAdams (Member)

> The quality / LLM-judge dimension (30% of overall) is still unimplemented in the runtime, tracked separately.

Can you elaborate on this, @lezama? I thought the LLM-judge piece was removed in my previous, giant PR.

Execution correctness was computed as the unweighted fraction of passed
runtime assertions, which discarded the static-analysis score entirely and
ignored the per-pattern/per-assertion weights the runtime already computes.
As a result, incomplete or insecure code (e.g. tripping a forbidden pattern)
could still score full correctness.

Replace _score_assertions with _score_correctness, which averages the
static and runtime sub-scores (each already weighted by the runtime), and
only counts a dimension when the test defines checks for it. Hard execution
failures (fatal/execution errors before any assertion runs) force
correctness to 0.0 so static pattern matches cannot rescue code that does
not run; code that runs but fails some assertions still earns partial credit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lezama force-pushed the fix/execution-correctness-scoring branch from 7268b17 to 47eab26 June 17, 2026 20:00

lezama commented Jun 17, 2026

Contributor Author

you're right — that note was stale, my bad. this PR was branched off trunk before #22 merged, so its base still had the old quality/judge scaffolding (hence the "30% of overall" line). it predated your removal.

just rebased onto current trunk and reframed the PR. #22 already does the static + runtime weighted averaging, so i've scoped this down to the actual net-new:

  1. a hard crash forces correctness to 0.0 — if code throws before any assertion runs, a high static score shouldn't average it back up to ~0.5. static matches shouldn't rescue code that never executed.
  2. applicability is read from the test definition, not the runtime output — otherwise a crash zeroes the runtime weight and the test gets scored on static alone, hiding the crash.

updated the description + the reference-solution-mode fixtures, 33 tests green. keeping it as this one focused commit 🙂

2 participants