fix: weight static + runtime in execution correctness scoring by lezama · Pull Request #23 · WordPress/wp-bench

lezama · 2026-06-16T13:27:33Z

Problem

After #22, execution correctness averages the runtime's static.score and runtime.score (each already weighted), which was the right move. But two gaps remained:

A hard crash could still earn partial credit. If code throws a fatal/execution error before any assertion runs, the runtime sub-score is 0 but a high static.score (pattern matches) averages back up to ~0.5. Static matching should not rescue code that never executed.
Applicability was read from the runtime output, not the test. A crash zeroes the runtime weight, so an output-based check would conclude "runtime doesn't apply" and score the test on static alone — masking the crash.
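
To make the two gaps concrete, here is a minimal sketch. It is not the repository's code: the dict layout and the `total_weight` field are assumptions for illustration, and only `static.score` / `runtime.score` come from the description above. Both helper functions are hypothetical.

```python
def average_all(raw):
    """Always average both sub-scores (the behavior described for #22)."""
    return (raw["static"]["score"] + raw["runtime"]["score"]) / 2


def average_by_output(raw):
    """Hypothetical variant: a dimension 'applies' only if the runtime
    output reports a non-zero weight for it (`total_weight` is assumed)."""
    scores = [
        raw[dim]["score"]
        for dim in ("static", "runtime")
        if raw[dim]["total_weight"] > 0
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Perfect static matches, but the code crashed before any assertion ran,
# so the runtime side reports zero score and zero weight.
crashed = {
    "static": {"score": 1.0, "total_weight": 3.0},
    "runtime": {"score": 0.0, "total_weight": 0.0},
}

print(average_all(crashed))        # 0.5 (crash still earns partial credit)
print(average_by_output(crashed))  # 1.0 (crash is masked entirely)
```

Under these assumptions, plain averaging hands the crash half credit, and an output-based applicability check is worse still because it drops the runtime dimension altogether.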

Fix

_score_correctness(raw, test) reads which dimensions apply from the test definition (static_checks / runtime_checks), not the runtime output, then averages the applicable, runtime-weighted sub-scores. A hard crash (fatal/execution error before any assertion runs, detected via zero runtime weight or a synthetic execution_error/fatal_error entry) forces correctness to 0.0. Code that runs but fails some assertions still earns partial credit.
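
A minimal sketch of the logic described above, under stated assumptions. It is not the actual `_score_correctness` implementation: the function name `score_correctness_sketch`, the `total_weight` field, and the `results` / `type` layout of the runtime output are hypothetical. Only `static_checks`, `runtime_checks`, `static.score`, `runtime.score`, and the `execution_error` / `fatal_error` names come from the description.

```python
CRASH_TYPES = {"execution_error", "fatal_error"}


def score_correctness_sketch(raw, test):
    """Average the applicable sub-scores; a hard crash forces 0.0."""
    # Applicability comes from the test definition, never from the output.
    static_applies = bool(test.get("static_checks"))
    runtime_applies = bool(test.get("runtime_checks"))

    runtime = raw.get("runtime", {})
    if runtime_applies:
        # Two crash signals: zero runtime weight, or a synthetic error entry.
        crashed = runtime.get("total_weight", 0.0) == 0 or any(
            entry.get("type") in CRASH_TYPES
            for entry in runtime.get("results", [])
        )
        if crashed:
            return 0.0

    scores = []
    if static_applies:
        scores.append(raw.get("static", {}).get("score", 0.0))
    if runtime_applies:
        scores.append(runtime.get("score", 0.0))
    return sum(scores) / len(scores) if scores else 0.0


both = {"static_checks": ["pattern"], "runtime_checks": ["assertion"]}
static_only = {"static_checks": ["pattern"], "runtime_checks": []}

crash = {
    "static": {"score": 1.0},
    "runtime": {"score": 0.0, "total_weight": 0.0,
                "results": [{"type": "fatal_error"}]},
}
half_failed = {
    "static": {"score": 1.0},
    "runtime": {"score": 0.5, "total_weight": 2.0,
                "results": [{"type": "assertion"}, {"type": "assertion"}]},
}

print(score_correctness_sketch(crash, both))        # 0.0
print(score_correctness_sketch(half_failed, both))  # 0.75
```

Because applicability is taken from `test`, a static-only test is never treated as crashed just because the runtime weight is zero, while a test that defines runtime checks cannot fall back to static-only scoring after a crash.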

Before / after (real `e-hooks-001`: 4 weighted static patterns + 2 runtime assertions)

| Case | Before | After |
| --- | --- | --- |
| Registers both hooks but wrong meta key (runtime 2/2, static 2.0/3.0) | 0.8334 | 0.8334 |
| Works but trips a forbidden pattern (e.g. direct SQL) | 0.5000 | 0.5000 |
| Static perfect but code crashes before assertions | 0.5000 | 0.0000 |

The first two rows match #22's behavior; the crash row is the net-new fix.

Tests

New python/tests/test_scoring.py (10 cases): static+runtime averaging, single-dimension scoring, forbidden hard-fail, crash detection (both signals), and the control case (runs but half assertions fail → keeps partial credit). Reference-solution-mode fixtures updated to define real checks so definition-based applicability is exercised. Full suite passes (33), ruff clean.

github-actions · 2026-06-16T13:27:54Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: lezama <migueluy@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

JasonTheAdams · 2026-06-17T15:26:44Z

> The quality / LLM-judge dimension (30% of overall) is still unimplemented in the runtime (tracked separately).

Can you elaborate on this, @lezama? I thought the LLM-judge piece was removed in my previous, giant PR.

Execution correctness was computed as the unweighted fraction of passed runtime assertions, which discarded the static-analysis score entirely and ignored the per-pattern/per-assertion weights the runtime already computes. As a result, incomplete or insecure code (e.g. tripping a forbidden pattern) could still score full correctness.

Replace _score_assertions with _score_correctness, which averages the static and runtime sub-scores (each already weighted by the runtime), and only counts a dimension when the test defines checks for it. Hard execution failures (fatal/execution errors before any assertion runs) force correctness to 0.0 so static pattern matches cannot rescue code that does not run; code that runs but fails some assertions still earns partial credit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lezama · 2026-06-17T20:06:13Z

you're right — that note was stale, my bad. this PR was branched off trunk before #22 merged, so its base still had the old quality/judge scaffolding (hence the "30% of overall" line). it predated your removal.

just rebased onto current trunk and reframed the PR. #22 already does the static + runtime weighted averaging, so i've scoped this down to the actual net-new:

a hard crash forces correctness to 0.0 — if code throws before any assertion runs, a high static score shouldn't average it back up to ~0.5. static matches shouldn't rescue code that never executed.
applicability is read from the test definition, not the runtime output — otherwise a crash zeroes the runtime weight and the test gets scored on static alone, hiding the crash.

updated the description + the reference-solution-mode fixtures, 33 tests green. keeping it as this one focused commit 🙂

lezama requested a review from JasonTheAdams on June 16, 2026 13:46

lezama force-pushed the fix/execution-correctness-scoring branch from 7268b17 to 47eab26 on June 17, 2026 20:00

lezama wants to merge 1 commit into trunk from fix/execution-correctness-scoring
