Add abilities test harness and training signals#19
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a comprehensive abilities test framework for evaluating AI models' tool-use capabilities with WordPress Abilities API. It adds a new test type alongside existing execution and knowledge tests, enabling evaluation of multi-step tool-calling workflows with structured verifiers for training signals.
Changes:
- Added WP-CLI ability runner in runtime with built-in wpbench/get_site_info ability for sanity checks
- Extended Python harness with AbilityTest dataclass, execute_ability method, and abilities scoring with 9+ verifier types (schema validity, permissions, confirmation, method checks, etc.)
- Added experimental wp-abilities-v1 suite with knowledge and abilities tests, plus comprehensive documentation
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| runtime/wp-bench-runtime.php | Added CLI_Ability class registration and wpbench/get_site_info ability with proper hooks |
| runtime/src/class-cli-ability.php | New WP-CLI command for executing abilities with validation, confirmation, and permission checks |
| runtime/src/class-sandbox.php | Added execute_snippet method for running setup/teardown fixtures |
| python/wp_bench/scoring.py | Added abilities score field with normalized weighted averaging |
| python/wp_bench/output.py | Added conditional abilities column to comparison table |
| python/wp_bench/environment.py | Added AbilityResult dataclass and execute_ability method |
| python/wp_bench/datasets.py | Added AbilityTest dataclass and loading logic for abilities tests from local/HuggingFace |
| python/wp_bench/core.py | Added tool call parsing, abilities scoring with 9+ verifiers, and _run_abilities_tests method |
| python/wp_bench/cli.py | Updated dry-run to display abilities test count |
| datasets/suites/wp-abilities-v1/knowledge/abilities-api.json | Added knowledge test for abilities API hook usage |
| datasets/suites/wp-abilities-v1/abilities/core.json | Added basic tool-use test for wpbench/get_site_info ability |
| datasets/export_dataset.py | Added abilities fields to export schema with backward compatibility |
| datasets/README.md | Documented abilities test schema, verifiers, expected_outputs, and expected_state formats |
| README.md | Updated overview to mention optional abilities tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | `fixtures` | object | Setup/teardown or initial state | | ||
| | `verifiers` | array | Verifier names to score this task | | ||
| | `expected_outputs` | array | Optional expected tool outputs | | ||
| | `expected_state` | object | Optional expected state | |
There was a problem hiding this comment.
The documentation states that expected_state is an "object" in the field table (line 91), but the description in lines 131-148 and the code implementation show that it should be a list of checks. The documentation should clarify that expected_state can be either a list of check objects, or an object with a "checks" key containing a list, or a single check object with an "ability" key.
| foreach ( [ 'confirm', 'confirmation', 'nonce' ] as $key ) { | ||
| if ( ! empty( $input[ $key ] ) ) { |
There was a problem hiding this comment.
The has_confirmation method checks for 'nonce' as a confirmation signal. However, in WordPress, a nonce is typically used for CSRF protection rather than explicit user confirmation of a destructive action. Using 'nonce' as a confirmation signal might be misleading or create confusion. Consider removing 'nonce' from the list of confirmation keys, or document that nonce is accepted as a confirmation signal in this specific context.
| def _run_abilities_tests(self, tests: List[AbilityTest]) -> None: | ||
| """Run tool-use ability tests in parallel.""" | ||
| if not tests: | ||
| return | ||
|
|
||
| limit = self.config.run.limit or len(tests) | ||
| tests_to_run = tests[:limit] | ||
| concurrency = self.config.run.concurrency | ||
|
|
||
| def process_test(test: AbilityTest) -> Dict[str, Any]: | ||
| try: | ||
| history: List[str] = [] | ||
| tool_calls: List[Dict[str, Any]] = [] | ||
| observations: List[Dict[str, Any]] = [] | ||
| trace: List[Dict[str, Any]] = [] | ||
| final_answer = "" | ||
|
|
||
| setup = test.fixtures.get("setup") if isinstance(test.fixtures, dict) else None | ||
| teardown = test.fixtures.get("teardown") if isinstance(test.fixtures, dict) else None | ||
|
|
||
| for step in range(test.max_steps): | ||
| prompt = render_abilities_prompt(test, history=history) | ||
| completion = self.model.generate(prompt) | ||
| trace_step: Dict[str, Any] = { | ||
| "step": step, | ||
| "prompt": prompt, | ||
| "completion": completion, | ||
| } | ||
| parsed = parse_tool_call(completion) | ||
| if parsed is None: | ||
| final_answer = completion | ||
| trace_step["tool_call"] = None | ||
| trace.append(trace_step) | ||
| break | ||
|
|
||
| tool_calls.append(parsed) | ||
| result = self.environment.execute_ability( | ||
| ability=parsed.get("ability", ""), | ||
| input_data=parsed.get("input"), | ||
| method=parsed.get("method"), | ||
| setup=setup if step == 0 else None, | ||
| teardown=teardown if step == test.max_steps - 1 else None, | ||
| ) | ||
| observation = result.raw if result.raw else {"success": False, "error": "no_result"} | ||
| observations.append(observation) | ||
| trace_step["tool_call"] = parsed | ||
| trace_step["observation"] = observation | ||
| trace.append(trace_step) | ||
| history.append(f"TOOL_CALL: {parsed}") | ||
| history.append(f"OBSERVATION: {observation}") | ||
| else: | ||
| final_answer = "" | ||
|
|
||
| state_checks = _run_state_checks(test.expected_state, self.environment) if test.expected_state else [] | ||
| score, verifier_results = score_abilities( | ||
| test, | ||
| tool_calls, | ||
| observations, | ||
| final_answer, | ||
| state_checks=state_checks, | ||
| ) | ||
|
|
||
| return { | ||
| "test_id": test.id, | ||
| "type": "abilities", | ||
| "prompt_hash": sha256(render_abilities_prompt(test)), | ||
| "tool_calls": tool_calls, | ||
| "observations": observations, | ||
| "trace": trace, | ||
| "state_checks": state_checks, | ||
| "final_answer": final_answer, | ||
| "verifiers": verifier_results, | ||
| "score": score, | ||
| } | ||
| except Exception as e: | ||
| raise TestError(test.id, "abilities", e) from e | ||
|
|
||
| with create_progress() as progress: | ||
| task = progress.add_task("Abilities", total=len(tests_to_run)) | ||
| with ThreadPoolExecutor(max_workers=concurrency) as executor: | ||
| futures = {executor.submit(process_test, test): test for test in tests_to_run} | ||
| for future in as_completed(futures): | ||
| try: | ||
| result = future.result() | ||
| with self._lock: | ||
| self.aggregator.add_abilities(result["score"]) | ||
| self.records.append(result) | ||
| progress.update(task, advance=1) | ||
| except TestError: | ||
| for f in futures: | ||
| f.cancel() | ||
| raise |
There was a problem hiding this comment.
The _run_abilities_tests method is duplicated between BenchmarkRunner (lines 500-591) and SingleModelRunner (lines 880-971) with identical implementations. This violates the DRY principle and makes maintenance more difficult. Consider extracting this method to a shared base class, or making one of the classes use the other's implementation.
| def render_abilities_prompt(test: AbilityTest, history: List[str] | None = None) -> str: | ||
| prompt = [test.prompt] | ||
| prompt.append( | ||
| 'Return a tool call as JSON: {"ability": "name", "input": {...}}' | ||
| ) | ||
| prompt.append(f"Allowed abilities: {', '.join(test.allowed_abilities)}") | ||
| if history: | ||
| prompt.append("History:") | ||
| prompt.extend(history) | ||
| prompt.append("If no tool is needed, answer normally.") | ||
| return "\n".join(prompt) |
There was a problem hiding this comment.
The prompt rendering in render_abilities_prompt does not include information about the method (GET vs POST) that should be used for read-only vs write abilities. Models may not know to include the "method" field in their tool calls, which could lead to failures in method_valid checks. Consider adding guidance about HTTP methods in the prompt based on the allowed abilities' readonly annotations.
| if observation.get("confirmation_required"): | ||
| return False |
There was a problem hiding this comment.
The function _observation_confirmation_ok returns True as the default case when neither confirmation_ok is explicitly set nor confirmation_required is truthy. However, if observation.get("confirmation_required") returns False explicitly (as opposed to None), this function will still return True. This logic assumes that confirmation_required being absent or False means confirmation is OK, which may be correct. Consider documenting this behavior or making the logic more explicit about the intended semantics.
| if observation.get("confirmation_required"): | |
| return False | |
| required = observation.get("confirmation_required") | |
| if required is True: | |
| return False | |
| # If confirmation_required is missing or explicitly False, we treat | |
| # confirmation as not required and therefore OK by default. |
| fixtures=fixtures if isinstance(fixtures, dict) else {}, | ||
| verifiers=verifiers if isinstance(verifiers, list) else [], | ||
| expected_outputs=expected_outputs if isinstance(expected_outputs, list) else None, | ||
| expected_state=expected_state if isinstance(expected_state, dict) else None, |
There was a problem hiding this comment.
When expected_state is an empty dict, it's normalized to None in line 143. However, an empty dict could be a valid test case where no state is expected. The distinction between "no expected state specified" (None) and "expecting empty state" (empty dict) is lost. Consider preserving empty dicts or documenting that empty dicts are treated as None.
| setup=setup if step == 0 else None, | ||
| teardown=teardown if step == test.max_steps - 1 else None, |
There was a problem hiding this comment.
The teardown is only executed on the last step (step == test.max_steps - 1), but if the model stops early (when parsed is None at line 529), the teardown will never execute. This could lead to test isolation issues if teardown is needed to clean up state. Consider executing teardown in a finally block or when the loop ends, regardless of how it ends.
| return False | ||
| if observation.get("success") is True: | ||
| return True | ||
| return True |
There was a problem hiding this comment.
The function _observation_permission_ok returns True as the default case when permission_ok is not explicitly set and there are no permission errors. However, if an observation has success=False for reasons other than permission errors, this function will still return True. This could lead to incorrect scoring where a failed ability call is considered to have passed the permission check. Consider returning True only when success is explicitly True, or when permission_ok is explicitly True.
| return True | |
| return False |
| has_tool_calls = len(tool_calls) > 0 | ||
| results["tool_calls_present"] = has_tool_calls | ||
| results["schema_validity"] = has_tool_calls and all(_tool_call_schema_valid(t) for t in tool_calls) | ||
| results["allowed_abilities_only"] = all(t.get("ability") in test.allowed_abilities for t in tool_calls) |
There was a problem hiding this comment.
The line results["allowed_abilities_only"] will evaluate to True when tool_calls is empty, because all() returns True for an empty sequence. This means that if no tool calls are made, the test will pass the "allowed_abilities_only" verifier even though it might be expected to fail if the test requires tool calls. Consider checking if tool_calls is non-empty before applying this verifier, or document that this is the intended behavior.
| results["allowed_abilities_only"] = all(t.get("ability") in test.allowed_abilities for t in tool_calls) | |
| results["allowed_abilities_only"] = has_tool_calls and all( | |
| t.get("ability") in test.allowed_abilities for t in tool_calls | |
| ) |
| "knowledge": 0.3, | ||
| "correctness": 0.4, | ||
| "quality": 0.3, |
There was a problem hiding this comment.
The default weights dictionary includes an "abilities" weight of 0.2, which will cause the weights to sum to 1.2 (0.3 + 0.4 + 0.3 + 0.2). When abilities is None and excluded from the calculation, the normalization logic in overall() will correctly handle this. However, the default weights are misleading as they appear to sum to more than 1.0. Consider either removing "abilities" from the default weights (since it's optional) or adjusting all weights to sum to 1.0 when abilities is included (e.g., 0.25, 0.35, 0.2, 0.2).
| "knowledge": 0.3, | |
| "correctness": 0.4, | |
| "quality": 0.3, | |
| "knowledge": 0.25, | |
| "correctness": 0.35, | |
| "quality": 0.2, |
Summary
Why
Testing