Add abilities test harness and training signals by Jameswlepage · Pull Request #19 · WordPress/wp-bench

Jameswlepage · 2026-02-05T21:53:00Z

Summary

Add an Abilities test kind to the harness, dataset export, and scoring pipeline.
Introduce a WP-CLI ability runner in the runtime and a minimal built-in ability for sanity checks.
Expand tool-use verifiers (schema validity, permissions, confirmation, method checks) and trace outputs.
Document abilities suites and add an experimental wp-abilities-v1 suite.

Why

Enables tool-use evaluation against real WordPress abilities.
Provides dense, structured verifier signals and traces that can be used for training, regression testing, and comparisons.

Testing

Not run (local runtime not started in this environment).

Copilot

Pull request overview

This PR introduces a comprehensive abilities test framework for evaluating AI models' tool-use capabilities with WordPress Abilities API. It adds a new test type alongside existing execution and knowledge tests, enabling evaluation of multi-step tool-calling workflows with structured verifiers for training signals.

Changes:

Added WP-CLI ability runner in runtime with built-in wpbench/get_site_info ability for sanity checks
Extended Python harness with AbilityTest dataclass, execute_ability method, and abilities scoring with 9+ verifier types (schema validity, permissions, confirmation, method checks, etc.)
Added experimental wp-abilities-v1 suite with knowledge and abilities tests, plus comprehensive documentation

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
runtime/wp-bench-runtime.php	Added CLI_Ability class registration and wpbench/get_site_info ability with proper hooks
runtime/src/class-cli-ability.php	New WP-CLI command for executing abilities with validation, confirmation, and permission checks
runtime/src/class-sandbox.php	Added execute_snippet method for running setup/teardown fixtures
python/wp_bench/scoring.py	Added abilities score field with normalized weighted averaging
python/wp_bench/output.py	Added conditional abilities column to comparison table
python/wp_bench/environment.py	Added AbilityResult dataclass and execute_ability method
python/wp_bench/datasets.py	Added AbilityTest dataclass and loading logic for abilities tests from local/HuggingFace
python/wp_bench/core.py	Added tool call parsing, abilities scoring with 9+ verifiers, and _run_abilities_tests method
python/wp_bench/cli.py	Updated dry-run to display abilities test count
datasets/suites/wp-abilities-v1/knowledge/abilities-api.json	Added knowledge test for abilities API hook usage
datasets/suites/wp-abilities-v1/abilities/core.json	Added basic tool-use test for wpbench/get_site_info ability
datasets/export_dataset.py	Added abilities fields to export schema with backward compatibility
datasets/README.md	Documented abilities test schema, verifiers, expected_outputs, and expected_state formats
README.md	Updated overview to mention optional abilities tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-05T22:07:25Z

+| `fixtures` | object | Setup/teardown or initial state |
+| `verifiers` | array | Verifier names to score this task |
+| `expected_outputs` | array | Optional expected tool outputs |
+| `expected_state` | object | Optional expected state |


The documentation states that expected_state is an "object" in the field table (line 91), but the description in lines 131-148 and the code implementation show that it should be a list of checks. The documentation should clarify that expected_state can be either a list of check objects, or an object with a "checks" key containing a list, or a single check object with an "ability" key.

Copilot · 2026-02-05T22:07:25Z

+		foreach ( [ 'confirm', 'confirmation', 'nonce' ] as $key ) {
+			if ( ! empty( $input[ $key ] ) ) {


The has_confirmation method checks for 'nonce' as a confirmation signal. However, in WordPress, a nonce is typically used for CSRF protection rather than explicit user confirmation of a destructive action. Using 'nonce' as a confirmation signal might be misleading or create confusion. Consider removing 'nonce' from the list of confirmation keys, or document that nonce is accepted as a confirmation signal in this specific context.

Copilot · 2026-02-05T22:07:25Z

+    def _run_abilities_tests(self, tests: List[AbilityTest]) -> None:
+        """Run tool-use ability tests in parallel."""
+        if not tests:
+            return
+
+        limit = self.config.run.limit or len(tests)
+        tests_to_run = tests[:limit]
+        concurrency = self.config.run.concurrency
+
+        def process_test(test: AbilityTest) -> Dict[str, Any]:
+            try:
+                history: List[str] = []
+                tool_calls: List[Dict[str, Any]] = []
+                observations: List[Dict[str, Any]] = []
+                trace: List[Dict[str, Any]] = []
+                final_answer = ""
+
+                setup = test.fixtures.get("setup") if isinstance(test.fixtures, dict) else None
+                teardown = test.fixtures.get("teardown") if isinstance(test.fixtures, dict) else None
+
+                for step in range(test.max_steps):
+                    prompt = render_abilities_prompt(test, history=history)
+                    completion = self.model.generate(prompt)
+                    trace_step: Dict[str, Any] = {
+                        "step": step,
+                        "prompt": prompt,
+                        "completion": completion,
+                    }
+                    parsed = parse_tool_call(completion)
+                    if parsed is None:
+                        final_answer = completion
+                        trace_step["tool_call"] = None
+                        trace.append(trace_step)
+                        break
+
+                    tool_calls.append(parsed)
+                    result = self.environment.execute_ability(
+                        ability=parsed.get("ability", ""),
+                        input_data=parsed.get("input"),
+                        method=parsed.get("method"),
+                        setup=setup if step == 0 else None,
+                        teardown=teardown if step == test.max_steps - 1 else None,
+                    )
+                    observation = result.raw if result.raw else {"success": False, "error": "no_result"}
+                    observations.append(observation)
+                    trace_step["tool_call"] = parsed
+                    trace_step["observation"] = observation
+                    trace.append(trace_step)
+                    history.append(f"TOOL_CALL: {parsed}")
+                    history.append(f"OBSERVATION: {observation}")
+                else:
+                    final_answer = ""
+
+                state_checks = _run_state_checks(test.expected_state, self.environment) if test.expected_state else []
+                score, verifier_results = score_abilities(
+                    test,
+                    tool_calls,
+                    observations,
+                    final_answer,
+                    state_checks=state_checks,
+                )
+
+                return {
+                    "test_id": test.id,
+                    "type": "abilities",
+                    "prompt_hash": sha256(render_abilities_prompt(test)),
+                    "tool_calls": tool_calls,
+                    "observations": observations,
+                    "trace": trace,
+                    "state_checks": state_checks,
+                    "final_answer": final_answer,
+                    "verifiers": verifier_results,
+                    "score": score,
+                }
+            except Exception as e:
+                raise TestError(test.id, "abilities", e) from e
+
+        with create_progress() as progress:
+            task = progress.add_task("Abilities", total=len(tests_to_run))
+            with ThreadPoolExecutor(max_workers=concurrency) as executor:
+                futures = {executor.submit(process_test, test): test for test in tests_to_run}
+                for future in as_completed(futures):
+                    try:
+                        result = future.result()
+                        with self._lock:
+                            self.aggregator.add_abilities(result["score"])
+                            self.records.append(result)
+                        progress.update(task, advance=1)
+                    except TestError:
+                        for f in futures:
+                            f.cancel()
+                        raise


The _run_abilities_tests method is duplicated between BenchmarkRunner (lines 500-591) and SingleModelRunner (lines 880-971) with identical implementations. This violates the DRY principle and makes maintenance more difficult. Consider extracting this method to a shared base class, or making one of the classes use the other's implementation.

Copilot · 2026-02-05T22:07:26Z

+def render_abilities_prompt(test: AbilityTest, history: List[str] | None = None) -> str:
+    prompt = [test.prompt]
+    prompt.append(
+        'Return a tool call as JSON: {"ability": "name", "input": {...}}'
+    )
+    prompt.append(f"Allowed abilities: {', '.join(test.allowed_abilities)}")
+    if history:
+        prompt.append("History:")
+        prompt.extend(history)
+    prompt.append("If no tool is needed, answer normally.")
+    return "\n".join(prompt)


The prompt rendering in render_abilities_prompt does not include information about the method (GET vs POST) that should be used for read-only vs write abilities. Models may not know to include the "method" field in their tool calls, which could lead to failures in method_valid checks. Consider adding guidance about HTTP methods in the prompt based on the allowed abilities' readonly annotations.

Copilot · 2026-02-05T22:07:26Z

+    if observation.get("confirmation_required"):
+        return False


The function _observation_confirmation_ok returns True as the default case when neither confirmation_ok is explicitly set nor confirmation_required is truthy. However, if observation.get("confirmation_required") returns False explicitly (as opposed to None), this function will still return True. This logic assumes that confirmation_required being absent or False means confirmation is OK, which may be correct. Consider documenting this behavior or making the logic more explicit about the intended semantics.

Suggested change

if observation.get("confirmation_required"):

return False

required = observation.get("confirmation_required")

if required is True:

return False

# If confirmation_required is missing or explicitly False, we treat

# confirmation as not required and therefore OK by default.

Copilot · 2026-02-05T22:07:27Z

+                    fixtures=fixtures if isinstance(fixtures, dict) else {},
+                    verifiers=verifiers if isinstance(verifiers, list) else [],
+                    expected_outputs=expected_outputs if isinstance(expected_outputs, list) else None,
+                    expected_state=expected_state if isinstance(expected_state, dict) else None,


When expected_state is an empty dict, it's normalized to None in line 143. However, an empty dict could be a valid test case where no state is expected. The distinction between "no expected state specified" (None) and "expecting empty state" (empty dict) is lost. Consider preserving empty dicts or documenting that empty dicts are treated as None.

Copilot · 2026-02-05T22:07:27Z

+                        setup=setup if step == 0 else None,
+                        teardown=teardown if step == test.max_steps - 1 else None,


The teardown is only executed on the last step (step == test.max_steps - 1), but if the model stops early (when parsed is None at line 529), the teardown will never execute. This could lead to test isolation issues if teardown is needed to clean up state. Consider executing teardown in a finally block or when the loop ends, regardless of how it ends.

Copilot · 2026-02-05T22:07:27Z

+        return False
+    if observation.get("success") is True:
+        return True
+    return True


The function _observation_permission_ok returns True as the default case when permission_ok is not explicitly set and there are no permission errors. However, if an observation has success=False for reasons other than permission errors, this function will still return True. This could lead to incorrect scoring where a failed ability call is considered to have passed the permission check. Consider returning True only when success is explicitly True, or when permission_ok is explicitly True.

Suggested change

return True

return False

Copilot · 2026-02-05T22:07:27Z

+    has_tool_calls = len(tool_calls) > 0
+    results["tool_calls_present"] = has_tool_calls
+    results["schema_validity"] = has_tool_calls and all(_tool_call_schema_valid(t) for t in tool_calls)
+    results["allowed_abilities_only"] = all(t.get("ability") in test.allowed_abilities for t in tool_calls)


The line results["allowed_abilities_only"] will evaluate to True when tool_calls is empty, because all() returns True for an empty sequence. This means that if no tool calls are made, the test will pass the "allowed_abilities_only" verifier even though it might be expected to fail if the test requires tool calls. Consider checking if tool_calls is non-empty before applying this verifier, or document that this is the intended behavior.

Suggested change

results["allowed_abilities_only"] = all(t.get("ability") in test.allowed_abilities for t in tool_calls)

results["allowed_abilities_only"] = has_tool_calls and all(

t.get("ability") in test.allowed_abilities for t in tool_calls

)

Copilot · 2026-02-05T22:07:28Z

+            "knowledge": 0.3,
+            "correctness": 0.4,
+            "quality": 0.3,


The default weights dictionary includes an "abilities" weight of 0.2, which will cause the weights to sum to 1.2 (0.3 + 0.4 + 0.3 + 0.2). When abilities is None and excluded from the calculation, the normalization logic in overall() will correctly handle this. However, the default weights are misleading as they appear to sum to more than 1.0. Consider either removing "abilities" from the default weights (since it's optional) or adjusting all weights to sum to 1.0 when abilities is included (e.g., 0.25, 0.35, 0.2, 0.2).

Suggested change

"knowledge": 0.3,

"correctness": 0.4,

"quality": 0.3,

"knowledge": 0.25,

"correctness": 0.35,

"quality": 0.2,

feat: add abilities harness and training signals

3ff60c2

Jameswlepage requested a review from Copilot February 5, 2026 22:00

Copilot started reviewing on behalf of Jameswlepage February 5, 2026 22:00 View session

Copilot AI reviewed Feb 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add abilities test harness and training signals#19

Add abilities test harness and training signals#19
Jameswlepage wants to merge 1 commit into
trunkfrom
feat/abilities-harness

Jameswlepage commented Feb 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Copilot AI Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		foreach ( [ 'confirm', 'confirmation', 'nonce' ] as $key ) {
		if ( ! empty( $input[ $key ] ) ) {

-    if observation.get("confirmation_required"):
-        return False
+    required = observation.get("confirmation_required")
+    if required is True:
+        return False
+    # If confirmation_required is missing or explicitly False, we treat
+    # confirmation as not required and therefore OK by default.

		setup=setup if step == 0 else None,
		teardown=teardown if step == test.max_steps - 1 else None,

Conversation

Jameswlepage commented Feb 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants