fix: weight static + runtime in execution correctness scoring #23

Open
lezama wants to merge 1 commit into trunk from fix/execution-correctness-scoring

lezama commented Jun 16, 2026

Problem

Execution correctness was computed as the unweighted fraction of passed runtime assertions (_score_assertions):

passed = sum(1 for a in assertions if a.get("passed"))
return round(passed / len(assertions), 4)

This had three issues:

  1. The static-analysis score was discarded entirely. The runtime computes a weighted static.score (required/forbidden patterns), but the harness never used it. Code matching zero required patterns could still score full correctness.
  2. Per-pattern / per-assertion weights were ignored. The datasets define "weight" on every check; the runtime honors them, but the Python side recomputed an unweighted average.
  3. Forbidden-pattern hard fails were ignored. A model tripping a forbidden_patterns rule (e.g. a direct SQL query) still got full correctness.
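To make issue 2 concrete, here is a minimal sketch contrasting an unweighted pass fraction with a weight-aware one. The check names and weights are hypothetical, the empty-list guard is an addition for the sake of a runnable example, and `weighted_fraction` is only an illustration of what honoring "weight" means; the fix itself reads the score the runtime already computed rather than recomputing it in Python.

```python
def unweighted_fraction(assertions):
    """Old behaviour: every assertion counts the same, weights are ignored."""
    if not assertions:  # guard added for this sketch only
        return 0.0
    passed = sum(1 for a in assertions if a.get("passed"))
    return round(passed / len(assertions), 4)


def weighted_fraction(assertions):
    """Illustrative weight-aware variant: each check contributes its "weight"."""
    total = sum(a.get("weight", 1.0) for a in assertions)
    if total == 0:
        return 0.0
    earned = sum(a.get("weight", 1.0) for a in assertions if a.get("passed"))
    return round(earned / total, 4)


# Hypothetical data: the heavily weighted check is the one that fails.
assertions = [
    {"name": "hook_registered", "weight": 1.0, "passed": True},
    {"name": "meta_saved", "weight": 3.0, "passed": False},
]

print(unweighted_fraction(assertions))  # 0.5
print(weighted_fraction(assertions))    # 0.25
```

With these made-up weights, the unweighted average reports twice the credit that the weighted checks would grant.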

Fix

_score_correctness(raw, test) averages the static and runtime sub-scores — each already weighted by the WordPress runtime — and counts a dimension only when the test actually defines checks for it (so a test with only runtime assertions is scored purely on runtime, and vice versa). Applicability is read from the test definition, not the runtime output.
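The averaging rule can be sketched roughly as follows. This is not the code in this PR: the function name `score_correctness_sketch` and the dictionary keys (`static`, `runtime`, `score`, `required_patterns`, `forbidden_patterns`, `assertions`) are assumptions about the data shapes, chosen only to illustrate that applicability comes from the test definition while the sub-scores come from the runtime output.

```python
def score_correctness_sketch(raw, test):
    """Average the sub-scores of the dimensions the test actually defines.

    raw  -- runtime output (assumed shape: {"static": {"score": x}, "runtime": {"score": y}})
    test -- test definition (assumed shape: {"static": {...patterns...}, "assertions": [...]})
    """
    static_def = test.get("static", {})
    has_static = bool(
        static_def.get("required_patterns") or static_def.get("forbidden_patterns")
    )
    has_runtime = bool(test.get("assertions"))

    parts = []
    if has_static:
        parts.append(float(raw.get("static", {}).get("score", 0.0)))
    if has_runtime:
        parts.append(float(raw.get("runtime", {}).get("score", 0.0)))

    if not parts:  # the test defines no checks at all
        return 0.0
    return round(sum(parts) / len(parts), 4)


# A test with only runtime assertions ignores whatever static score the runtime reports.
runtime_only_test = {"assertions": [{"name": "a1"}, {"name": "a2"}]}
raw = {"static": {"score": 0.0}, "runtime": {"score": 0.75}}
print(score_correctness_sketch(raw, runtime_only_test))  # 0.75

# A test with both dimensions averages them.
both_test = {"static": {"required_patterns": ["add_action"]}, "assertions": [{"name": "a1"}]}
print(score_correctness_sketch({"static": {"score": 0.5}, "runtime": {"score": 1.0}}, both_test))  # 0.75
```

Reading applicability from `test` rather than `raw` means a runtime that omits or zero-fills a section cannot silently change which dimensions count.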

Hard crashes (fatal/execution error before any assertion runs) force correctness to 0.0: the runtime is ground truth, so static pattern matches must not rescue code that doesn't run. Code that runs but fails some assertions still earns partial credit.
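The crash override might look something like the sketch below. The key names `fatal_error`, `execution_error`, and `assertions` are hypothetical stand-ins for whatever signals the runtime output actually carries; the point is only the ordering: a hard crash zeroes the score, while a run that completes with failing assertions keeps its partial credit.

```python
def is_hard_crash(raw):
    """True when the code errored before any assertion produced a result.

    Assumed keys (hypothetical): "fatal_error", "execution_error", "assertions".
    """
    errored = bool(raw.get("fatal_error") or raw.get("execution_error"))
    ran_any_assertion = bool(raw.get("assertions"))
    return errored and not ran_any_assertion


def apply_crash_override(score, raw):
    """Force correctness to 0.0 on a hard crash; otherwise keep the computed score."""
    return 0.0 if is_hard_crash(raw) else score


# Static patterns matched perfectly, but the code never ran.
crashed = {"fatal_error": "Uncaught Error: call to undefined function", "assertions": []}
print(apply_crash_override(1.0, crashed))  # 0.0

# The code ran; half the assertions failed. Partial credit survives.
partial = {"assertions": [{"passed": True}, {"passed": False}]}
print(apply_crash_override(0.5, partial))  # 0.5
```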

Before / after (real e-hooks-001: 4 weighted static patterns + 2 runtime assertions)

| Case | before | after |
| --- | --- | --- |
| Registers both hooks but wrong meta key, no get_post_meta (runtime 2/2, static 2.0/3.0) | 1.0000 | 0.8334 |
| Works but trips a forbidden pattern (e.g. direct SQL) | 1.0000 | 0.5000 |
| Static perfect but code crashes | 0.0000 | 0.0000 |
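The "after" column can be checked by hand with the averaging rule. The sketch below assumes that a forbidden-pattern hit zeroes the static sub-score while the runtime sub-score stays at 1.0; that reading is inferred from the 0.5000 figure, not stated explicitly above. For the first row the exact average is about 0.8333, so the last digit of the reported 0.8334 presumably depends on where rounding is applied to the sub-scores.

```python
# Row 1: runtime 2/2, weighted static 2.0/3.0
runtime = 2 / 2
static = 2.0 / 3.0
row1 = (static + runtime) / 2
print(round(row1, 4))  # 0.8333 (the table reports 0.8334; see the note on rounding)

# Row 2: code works (runtime 1.0) but a forbidden pattern hard-fails static.
# Assumption: the hard fail sets the static sub-score to 0.0.
static_forbidden = 0.0
row2 = (static_forbidden + 1.0) / 2
print(round(row2, 4))  # 0.5

# Row 3: a hard crash forces correctness to 0.0 regardless of static matches.
row3 = 0.0
print(round(row3, 4))  # 0.0
```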

Tests

New python/tests/test_scoring.py (10 cases): static+runtime averaging, single-dimension scoring, forbidden hard-fail, crash detection (both signals), and the control case (code runs but half the assertions fail → keeps partial credit). Full suite passes (17 tests), ruff clean.

Not included

The quality / LLM-judge dimension (30% of overall) is still unimplemented in the runtime — tracked separately.

🤖 Generated with Claude Code

Execution correctness was computed as the unweighted fraction of passed
runtime assertions, which discarded the static-analysis score entirely and
ignored the per-pattern/per-assertion weights the runtime already computes.
As a result, incomplete or insecure code (e.g. tripping a forbidden pattern)
could still score full correctness.

Replace _score_assertions with _score_correctness, which averages the
static and runtime sub-scores (each already weighted by the runtime), and
only counts a dimension when the test defines checks for it. Hard execution
failures (fatal/execution errors before any assertion runs) force
correctness to 0.0 so static pattern matches cannot rescue code that does
not run; code that runs but fails some assertions still earns partial credit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 16, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: lezama <migueluy@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@lezama lezama requested a review from JasonTheAdams June 16, 2026 13:46
@JasonTheAdams

> The quality / LLM-judge dimension (30% of overall) is still unimplemented in the runtime — tracked separately.

Can you elaborate on this, @lezama? I thought the LLM-judge piece was removed in my previous, giant PR.
