fix: weight static + runtime in execution correctness scoring by lezama · Pull Request #23 · WordPress/wp-bench

lezama · 2026-06-16T13:27:33Z

Problem

Execution correctness was computed as the unweighted fraction of passed runtime assertions (_score_assertions):

passed = sum(1 for a in assertions if a.get("passed"))
return round(passed / len(assertions), 4)

This had three issues:

1. The static-analysis score was discarded entirely. The runtime computes a weighted static.score (required/forbidden patterns), but the harness never used it. Code matching zero required patterns could still score full correctness.
2. Per-pattern / per-assertion weights were ignored. The datasets define "weight" on every check; the runtime honors them, but the Python side recomputed an unweighted average.
3. Forbidden-pattern hard fails were ignored. A model tripping a forbidden_patterns rule (e.g. a direct SQL query) still got full correctness.
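To make the weighting issue concrete, here is a minimal sketch contrasting the old unweighted fraction with a weight-aware one. `weighted_score` and the sample check names are hypothetical illustrations, not code from this PR, and the runtime's real weighting may differ in detail.

```python
def unweighted_score(assertions):
    """Old behaviour quoted above: every assertion counts equally."""
    if not assertions:
        return 0.0
    passed = sum(1 for a in assertions if a.get("passed"))
    return round(passed / len(assertions), 4)


def weighted_score(assertions):
    """Hypothetical weighted variant: each check contributes its "weight"."""
    total = sum(a.get("weight", 1.0) for a in assertions)
    if total == 0:
        return 0.0
    earned = sum(a.get("weight", 1.0) for a in assertions if a.get("passed"))
    return round(earned / total, 4)


# Illustrative data only: a heavy check fails, a light one passes.
checks = [
    {"name": "hook_registered", "weight": 3.0, "passed": False},
    {"name": "returns_string", "weight": 1.0, "passed": True},
]
print(unweighted_score(checks))  # 0.5
print(weighted_score(checks))    # 0.25
```

The unweighted version reports 0.5 even though the check carrying three quarters of the weight failed.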

Fix

_score_correctness(raw, test) averages the static and runtime sub-scores — each already weighted by the WordPress runtime — and counts a dimension only when the test actually defines checks for it (so a test with only runtime assertions is scored purely on runtime, and vice versa). Applicability is read from the test definition, not the runtime output.
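The averaging rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the function name `score_correctness_sketch` and the keys `static_checks`, `assertions`, `static.score`, and `runtime.score` are placeholders for whatever the harness and datasets actually use.

```python
def score_correctness_sketch(raw, test):
    """Average the sub-scores of the dimensions the test actually defines."""
    # Applicability is read from the test definition, not the runtime output.
    has_static = bool(test.get("static_checks"))   # assumed key name
    has_runtime = bool(test.get("assertions"))     # assumed key name

    parts = []
    if has_static:
        parts.append(float(raw.get("static", {}).get("score", 0.0)))
    if has_runtime:
        parts.append(float(raw.get("runtime", {}).get("score", 0.0)))

    if not parts:
        return 0.0  # nothing to score against
    return round(sum(parts) / len(parts), 4)


raw = {"static": {"score": 0.5}, "runtime": {"score": 1.0}}
both = {"static_checks": [{"pattern": "add_action"}], "assertions": [{"name": "a"}]}
runtime_only = {"assertions": [{"name": "a"}]}

print(score_correctness_sketch(raw, both))          # 0.75
print(score_correctness_sketch(raw, runtime_only))  # 1.0
```

Reading applicability from `test` means a stray `static.score` in the runtime output cannot drag down a test that defines no static checks.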

Hard crashes (fatal/execution error before any assertion runs) force correctness to 0.0: the runtime is ground truth, so static pattern matches must not rescue code that doesn't run. Code that runs but fails some assertions still earns partial credit.
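A minimal sketch of the crash gate, assuming the runtime output exposes a `fatal` flag, an `error` message, and a list of assertion results under `runtime`. Those key names and both helper functions are hypothetical; the real signals may be shaped differently.

```python
def crashed_before_assertions(raw):
    """True when an error was reported and no assertion results were recorded."""
    runtime = raw.get("runtime") or {}
    errored = bool(runtime.get("fatal")) or bool(runtime.get("error"))
    ran_any = bool(runtime.get("assertions"))
    return errored and not ran_any


def gate_on_crash(raw, blended_score):
    """Force 0.0 on a hard crash; otherwise keep the blended score."""
    return 0.0 if crashed_before_assertions(raw) else blended_score


crash = {"runtime": {"fatal": True, "assertions": []}}
partial = {"runtime": {"assertions": [{"passed": True}, {"passed": False}]}}

print(gate_on_crash(crash, 0.5))    # 0.0
print(gate_on_crash(partial, 0.5))  # 0.5
```

The second case is the control: code that runs but fails some assertions is not treated as a crash and keeps its partial credit.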

Before / after (real `e-hooks-001`: 4 weighted static patterns + 2 runtime assertions)

| Case | Before | After |
| --- | --- | --- |
| Registers both hooks but wrong meta key, no `get_post_meta` (runtime 2/2, static 2.0/3.0) | 1.0000 | 0.8334 |
| Works but trips a forbidden pattern (e.g. direct SQL) | 1.0000 | 0.5000 |
| Static perfect but code crashes | 0.0000 | 0.0000 |
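The "after" values in the first two rows can be reproduced as a plain mean of the two sub-scores. The sketch below assumes the static sub-score arrives already rounded to four places (2.0/3.0 reported as 0.6667) and that a forbidden-pattern hit sets the static sub-score to 0.0; neither detail is stated outright above. `mean_4dp` is a hypothetical helper, and it uses `Decimal` with half-up rounding only to keep the arithmetic deterministic.

```python
from decimal import Decimal, ROUND_HALF_UP


def mean_4dp(*scores):
    """Mean of decimal strings, rounded half-up to four places."""
    total = sum((Decimal(s) for s in scores), Decimal("0"))
    mean = total / len(scores)
    return mean.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


# Row 1: runtime 2/2 = 1.0, static 2.0/3.0 reported as 0.6667 (assumed).
print(mean_4dp("1.0", "0.6667"))  # 0.8334
# Row 2: runtime 1.0, static forced to 0.0 by the forbidden pattern (assumed).
print(mean_4dp("1.0", "0.0"))     # 0.5000
```

The third row is not an average at all: the hard-crash rule sets correctness to 0.0 regardless of the static sub-score.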

Tests

New python/tests/test_scoring.py (10 cases): static+runtime averaging, single-dimension scoring, forbidden hard-fail, crash detection (both signals), and the control case (code runs but half the assertions fail, so it keeps partial credit). Full suite passes (17 tests); ruff is clean.

Not included

The quality / LLM-judge dimension (30% of overall) is still unimplemented in the runtime — tracked separately.

🤖 Generated with Claude Code

Execution correctness was computed as the unweighted fraction of passed runtime assertions, which discarded the static-analysis score entirely and ignored the per-pattern/per-assertion weights the runtime already computes. As a result, incomplete or insecure code (e.g. tripping a forbidden pattern) could still score full correctness.

Replace _score_assertions with _score_correctness, which averages the static and runtime sub-scores (each already weighted by the runtime), and only counts a dimension when the test defines checks for it.

Hard execution failures (fatal/execution errors before any assertion runs) force correctness to 0.0 so static pattern matches cannot rescue code that does not run; code that runs but fails some assertions still earns partial credit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-16T13:27:54Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: lezama <migueluy@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

JasonTheAdams · 2026-06-17T15:26:44Z

The quality / LLM-judge dimension (30% of overall) is still unimplemented in the runtime — tracked separately.

Can you elaborate on this, @lezama? I thought the LLM-judge piece was removed in my previous, giant PR.

lezama requested a review from JasonTheAdams June 16, 2026 13:46

lezama wants to merge 1 commit into trunk from fix/execution-correctness-scoring
