Skip to content

Add abilities test harness and training signals#19

Draft
Jameswlepage wants to merge 1 commit into
trunkfrom
feat/abilities-harness
Draft

Add abilities test harness and training signals#19
Jameswlepage wants to merge 1 commit into
trunkfrom
feat/abilities-harness

Conversation

@Jameswlepage

Copy link
Copy Markdown
Contributor

Summary

  • Add an Abilities test kind to the harness, dataset export, and scoring pipeline.
  • Introduce a WP-CLI ability runner in the runtime and a minimal built-in ability for sanity checks.
  • Expand tool-use verifiers (schema validity, permissions, confirmation, method checks) and trace outputs.
  • Document abilities suites and add an experimental wp-abilities-v1 suite.

Why

  • Enables tool-use evaluation against real WordPress abilities.
  • Provides dense, structured verifier signals and traces that can be used for training, regression testing, and comparisons.

Testing

  • Not run (local runtime not started in this environment).

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a comprehensive abilities test framework for evaluating AI models' tool-use capabilities with WordPress Abilities API. It adds a new test type alongside existing execution and knowledge tests, enabling evaluation of multi-step tool-calling workflows with structured verifiers for training signals.

Changes:

  • Added WP-CLI ability runner in runtime with built-in wpbench/get_site_info ability for sanity checks
  • Extended Python harness with AbilityTest dataclass, execute_ability method, and abilities scoring with 9+ verifier types (schema validity, permissions, confirmation, method checks, etc.)
  • Added experimental wp-abilities-v1 suite with knowledge and abilities tests, plus comprehensive documentation

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
runtime/wp-bench-runtime.php Added CLI_Ability class registration and wpbench/get_site_info ability with proper hooks
runtime/src/class-cli-ability.php New WP-CLI command for executing abilities with validation, confirmation, and permission checks
runtime/src/class-sandbox.php Added execute_snippet method for running setup/teardown fixtures
python/wp_bench/scoring.py Added abilities score field with normalized weighted averaging
python/wp_bench/output.py Added conditional abilities column to comparison table
python/wp_bench/environment.py Added AbilityResult dataclass and execute_ability method
python/wp_bench/datasets.py Added AbilityTest dataclass and loading logic for abilities tests from local/HuggingFace
python/wp_bench/core.py Added tool call parsing, abilities scoring with 9+ verifiers, and _run_abilities_tests method
python/wp_bench/cli.py Updated dry-run to display abilities test count
datasets/suites/wp-abilities-v1/knowledge/abilities-api.json Added knowledge test for abilities API hook usage
datasets/suites/wp-abilities-v1/abilities/core.json Added basic tool-use test for wpbench/get_site_info ability
datasets/export_dataset.py Added abilities fields to export schema with backward compatibility
datasets/README.md Documented abilities test schema, verifiers, expected_outputs, and expected_state formats
README.md Updated overview to mention optional abilities tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datasets/README.md
| `fixtures` | object | Setup/teardown or initial state |
| `verifiers` | array | Verifier names to score this task |
| `expected_outputs` | array | Optional expected tool outputs |
| `expected_state` | object | Optional expected state |

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states that expected_state is an "object" in the field table (line 91), but the description in lines 131-148 and the code implementation show that it should be a list of checks. The documentation should clarify that expected_state can be either a list of check objects, or an object with a "checks" key containing a list, or a single check object with an "ability" key.

Copilot uses AI. Check for mistakes.
Comment on lines +235 to +236
foreach ( [ 'confirm', 'confirmation', 'nonce' ] as $key ) {
if ( ! empty( $input[ $key ] ) ) {

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_confirmation method checks for 'nonce' as a confirmation signal. However, in WordPress, a nonce is typically used for CSRF protection rather than explicit user confirmation of a destructive action. Using 'nonce' as a confirmation signal might be misleading or create confusion. Consider removing 'nonce' from the list of confirmation keys, or document that nonce is accepted as a confirmation signal in this specific context.

Copilot uses AI. Check for mistakes.
Comment thread python/wp_bench/core.py
Comment on lines +500 to +591
def _run_abilities_tests(self, tests: List[AbilityTest]) -> None:
"""Run tool-use ability tests in parallel."""
if not tests:
return

limit = self.config.run.limit or len(tests)
tests_to_run = tests[:limit]
concurrency = self.config.run.concurrency

def process_test(test: AbilityTest) -> Dict[str, Any]:
try:
history: List[str] = []
tool_calls: List[Dict[str, Any]] = []
observations: List[Dict[str, Any]] = []
trace: List[Dict[str, Any]] = []
final_answer = ""

setup = test.fixtures.get("setup") if isinstance(test.fixtures, dict) else None
teardown = test.fixtures.get("teardown") if isinstance(test.fixtures, dict) else None

for step in range(test.max_steps):
prompt = render_abilities_prompt(test, history=history)
completion = self.model.generate(prompt)
trace_step: Dict[str, Any] = {
"step": step,
"prompt": prompt,
"completion": completion,
}
parsed = parse_tool_call(completion)
if parsed is None:
final_answer = completion
trace_step["tool_call"] = None
trace.append(trace_step)
break

tool_calls.append(parsed)
result = self.environment.execute_ability(
ability=parsed.get("ability", ""),
input_data=parsed.get("input"),
method=parsed.get("method"),
setup=setup if step == 0 else None,
teardown=teardown if step == test.max_steps - 1 else None,
)
observation = result.raw if result.raw else {"success": False, "error": "no_result"}
observations.append(observation)
trace_step["tool_call"] = parsed
trace_step["observation"] = observation
trace.append(trace_step)
history.append(f"TOOL_CALL: {parsed}")
history.append(f"OBSERVATION: {observation}")
else:
final_answer = ""

state_checks = _run_state_checks(test.expected_state, self.environment) if test.expected_state else []
score, verifier_results = score_abilities(
test,
tool_calls,
observations,
final_answer,
state_checks=state_checks,
)

return {
"test_id": test.id,
"type": "abilities",
"prompt_hash": sha256(render_abilities_prompt(test)),
"tool_calls": tool_calls,
"observations": observations,
"trace": trace,
"state_checks": state_checks,
"final_answer": final_answer,
"verifiers": verifier_results,
"score": score,
}
except Exception as e:
raise TestError(test.id, "abilities", e) from e

with create_progress() as progress:
task = progress.add_task("Abilities", total=len(tests_to_run))
with ThreadPoolExecutor(max_workers=concurrency) as executor:
futures = {executor.submit(process_test, test): test for test in tests_to_run}
for future in as_completed(futures):
try:
result = future.result()
with self._lock:
self.aggregator.add_abilities(result["score"])
self.records.append(result)
progress.update(task, advance=1)
except TestError:
for f in futures:
f.cancel()
raise

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _run_abilities_tests method is duplicated between BenchmarkRunner (lines 500-591) and SingleModelRunner (lines 880-971) with identical implementations. This violates the DRY principle and makes maintenance more difficult. Consider extracting this method to a shared base class, or making one of the classes use the other's implementation.

Copilot uses AI. Check for mistakes.
Comment thread python/wp_bench/core.py
Comment on lines +40 to +50
def render_abilities_prompt(test: AbilityTest, history: List[str] | None = None) -> str:
prompt = [test.prompt]
prompt.append(
'Return a tool call as JSON: {"ability": "name", "input": {...}}'
)
prompt.append(f"Allowed abilities: {', '.join(test.allowed_abilities)}")
if history:
prompt.append("History:")
prompt.extend(history)
prompt.append("If no tool is needed, answer normally.")
return "\n".join(prompt)

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prompt rendering in render_abilities_prompt does not include information about the method (GET vs POST) that should be used for read-only vs write abilities. Models may not know to include the "method" field in their tool calls, which could lead to failures in method_valid checks. Consider adding guidance about HTTP methods in the prompt based on the allowed abilities' readonly annotations.

Copilot uses AI. Check for mistakes.
Comment thread python/wp_bench/core.py
Comment on lines +143 to +144
if observation.get("confirmation_required"):
return False

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function _observation_confirmation_ok returns True as the default case when neither confirmation_ok is explicitly set nor confirmation_required is truthy. However, if observation.get("confirmation_required") returns False explicitly (as opposed to None), this function will still return True. This logic assumes that confirmation_required being absent or False means confirmation is OK, which may be correct. Consider documenting this behavior or making the logic more explicit about the intended semantics.

Suggested change
if observation.get("confirmation_required"):
return False
required = observation.get("confirmation_required")
if required is True:
return False
# If confirmation_required is missing or explicitly False, we treat
# confirmation as not required and therefore OK by default.

Copilot uses AI. Check for mistakes.
fixtures=fixtures if isinstance(fixtures, dict) else {},
verifiers=verifiers if isinstance(verifiers, list) else [],
expected_outputs=expected_outputs if isinstance(expected_outputs, list) else None,
expected_state=expected_state if isinstance(expected_state, dict) else None,

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When expected_state is an empty dict, it's normalized to None in line 143. However, an empty dict could be a valid test case where no state is expected. The distinction between "no expected state specified" (None) and "expecting empty state" (empty dict) is lost. Consider preserving empty dicts or documenting that empty dicts are treated as None.

Copilot uses AI. Check for mistakes.
Comment thread python/wp_bench/core.py
Comment on lines +540 to +541
setup=setup if step == 0 else None,
teardown=teardown if step == test.max_steps - 1 else None,

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The teardown is only executed on the last step (step == test.max_steps - 1), but if the model stops early (when parsed is None at line 529), the teardown will never execute. This could lead to test isolation issues if teardown is needed to clean up state. Consider executing teardown in a finally block or when the loop ends, regardless of how it ends.

Copilot uses AI. Check for mistakes.
Comment thread python/wp_bench/core.py
return False
if observation.get("success") is True:
return True
return True

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function _observation_permission_ok returns True as the default case when permission_ok is not explicitly set and there are no permission errors. However, if an observation has success=False for reasons other than permission errors, this function will still return True. This could lead to incorrect scoring where a failed ability call is considered to have passed the permission check. Consider returning True only when success is explicitly True, or when permission_ok is explicitly True.

Suggested change
return True
return False

Copilot uses AI. Check for mistakes.
Comment thread python/wp_bench/core.py
has_tool_calls = len(tool_calls) > 0
results["tool_calls_present"] = has_tool_calls
results["schema_validity"] = has_tool_calls and all(_tool_call_schema_valid(t) for t in tool_calls)
results["allowed_abilities_only"] = all(t.get("ability") in test.allowed_abilities for t in tool_calls)

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line results["allowed_abilities_only"] will evaluate to True when tool_calls is empty, because all() returns True for an empty sequence. This means that if no tool calls are made, the test will pass the "allowed_abilities_only" verifier even though it might be expected to fail if the test requires tool calls. Consider checking if tool_calls is non-empty before applying this verifier, or document that this is the intended behavior.

Suggested change
results["allowed_abilities_only"] = all(t.get("ability") in test.allowed_abilities for t in tool_calls)
results["allowed_abilities_only"] = has_tool_calls and all(
t.get("ability") in test.allowed_abilities for t in tool_calls
)

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +19
"knowledge": 0.3,
"correctness": 0.4,
"quality": 0.3,

Copilot AI Feb 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default weights dictionary includes an "abilities" weight of 0.2, which will cause the weights to sum to 1.2 (0.3 + 0.4 + 0.3 + 0.2). When abilities is None and excluded from the calculation, the normalization logic in overall() will correctly handle this. However, the default weights are misleading as they appear to sum to more than 1.0. Consider either removing "abilities" from the default weights (since it's optional) or adjusting all weights to sum to 1.0 when abilities is included (e.g., 0.25, 0.35, 0.2, 0.2).

Suggested change
"knowledge": 0.3,
"correctness": 0.4,
"quality": 0.3,
"knowledge": 0.25,
"correctness": 0.35,
"quality": 0.2,

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants