Merged
7 changes: 7 additions & 0 deletions .gitignore
@@ -228,3 +228,10 @@ scripts/convergence_agent/
# Generated plots / scratch scripts dropped at repo root
*.png
save_plots.py


# Eval result JSON files are per-run and regenerated by the runners.
evals/results/

# Paper drafts and supporting material (kept in a separate repo).
papers/
65 changes: 65 additions & 0 deletions evals/README.md
@@ -0,0 +1,65 @@
# Evals

This folder holds **evaluations** ("evals" for short) of the MCP server's
behavior. The goal is to turn opinions about how the server should behave
into **numbers** that can be re-measured when the code changes — the same
way `tests/` turns "the code should be correct" into a runnable assertion.

## What is an "eval"? (for non-AI-engineers)

In AI-driven software, an **eval** is the same thing a unit test is in
regular software, with one wrinkle: the system under test includes a
language model whose output is not bit-for-bit reproducible. So an eval
scores aggregate behavior across many inputs ("on this set of 20 prompts,
18 picked the right tool") rather than asserting one specific output.

You write an eval the same way you write a regression test:

1. Pick a behavior you care about. ("The server should reject a malformed
request before it spends compute on it.")
2. Build a small fixed set of inputs that exercise that behavior. (Say, 20
deliberately-wrong prompts.)
3. Run them through the system and record a numeric score.
4. Commit the inputs, the runner, and the result so the next person can
re-run and compare.
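
The four steps above can be sketched in a few lines. This is only an illustration, not one of the runners in this folder: `pick_tool`, the prompts, and the expected tool names are hypothetical stand-ins for a real system under test.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical fixed input set: (prompt, tool we expect the system to pick).
CASES: list[tuple[str, str]] = [
    ("compute vorticity from the wind file", "run_analysis"),
    ("draw a map of surface temperature", "plot_dataset"),
    ("what is the total area of this mesh?", "run_analysis"),
]


def pick_tool(prompt: str) -> str:
    """Stand-in for the system under test (assumption: a trivial keyword rule)."""
    return "plot_dataset" if "map" in prompt or "draw" in prompt else "run_analysis"


def run_eval(system: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the system picked the expected tool."""
    hits = sum(1 for prompt, expected in cases if system(prompt) == expected)
    return hits / len(cases)


if __name__ == "__main__":
    print(f"tool-pick accuracy: {run_eval(pick_tool, CASES):.0%}")
```

The score is a rate over the whole set, so a single flaky case moves the number without failing the run outright.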

Evals do **not** prove correctness. They measure *how often* the system
does the right thing on a fixed sample. They are most useful for catching
regressions ("we used to pick the right tool 90% of the time, now it's
60%") and for putting numbers on architectural decisions.

## What's in here

| Folder | What it measures |
|---|---|
| [`schema_rejection/`](schema_rejection/) | How often the typed tool boundary catches malformed calls before any work happens — the "did we waste compute on garbage?" number |
| [`tool_retrieval/`](tool_retrieval/) | How often a simple text retriever (BM25) finds the right tool by description — the "is our tool catalog still navigable as it grows?" number |

Both run end-to-end in under 30 seconds on a laptop with no external
dependencies. They are cheap enough to add to CI.
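
For readers who have not met BM25, the sketch below shows the scoring idea in pure Python. It is not the code in `tool_retrieval/` (which is not shown here): the tool descriptions are invented, `k1=1.5` and `b=0.75` are just commonly used defaults, and the IDF uses the always-non-negative `log(1 + ...)` variant.

```python
from __future__ import annotations

import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (assumption: good enough for a sketch)."""
    return text.lower().split()


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return document names sorted by Okapi BM25 score, best match first."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for t in tokens.values() for term in set(t))

    def score(doc_tokens: list[str]) -> float:
        tf = Counter(doc_tokens)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long descriptions get less credit per hit.
            norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    return sorted(tokens, key=lambda name: score(tokens[name]), reverse=True)


# Hypothetical tool catalog, for illustration only.
TOOLS = {
    "run_analysis": "run a numerical analysis such as curl or zonal mean on a grid",
    "plot_dataset": "plot a variable or mesh from a dataset as an image",
}
```

Calling `bm25_rank("plot a variable", TOOLS)` ranks `plot_dataset` first, because "plot" and "variable" appear only in that description.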

## How to run

```bash
uv run python -m evals.schema_rejection.run
uv run python -m evals.tool_retrieval.run
```

Each runner writes a JSON file under `results/` named with a timestamp.
Result files are gitignored — they regenerate on each run; the runner
itself is the source of truth.
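
As a rough sketch of what "a JSON file named with a timestamp" can look like, the helper below writes a report dict to a results directory. The function name, the UTC timestamp format, and the `prefix` argument are assumptions for illustration; the actual runners may name files differently.

```python
from __future__ import annotations

import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_report(report: dict, results_dir: Path, prefix: str) -> Path:
    """Write `report` as JSON to results_dir/<prefix>_<UTC timestamp>.json."""
    results_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = results_dir / f"{prefix}_{stamp}.json"
    out_path.write_text(json.dumps(report, indent=2))
    return out_path


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing lands in the repository.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_report({"caught_rate": 0.9}, Path(tmp) / "results", "schema").name)
```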

## When to add a new eval

Add one when you're about to make a decision and want a number to defend
it. Some good triggers:

- We're considering exposing more tools — does retrieval still work?
- We're refactoring an entry-point — does it still reject malformed input?
- A bug class has appeared twice — write the eval before the third time.

Bad triggers (use `tests/` instead):

- Asserting a single specific output for a single specific input.
- Checking a function's signature or contract.
- Anything that should be a unit test of a Python function.
Empty file added evals/__init__.py
Empty file.
75 changes: 75 additions & 0 deletions evals/schema_rejection/README.md
@@ -0,0 +1,75 @@
# Schema-rejection eval

## What this measures (plain language)

When someone asks an AI assistant to "compute vorticity from the wind file,"
the AI translates that into a call like:

```
run_analysis(operation="curl", grid_path=..., data_path=..., u_variable=..., v_variable=...)
```

There are many ways this call can be **wrong**:

- The AI omitted `u_variable` because it didn't read the data file carefully.
- The AI typed `operation="curl_calculation"` instead of `operation="curl"`.
- The AI passed `grid_path="/path/that/does/not/exist.nc"`.
- The AI passed a string where a list was expected, or vice versa.

For each of these, three things can happen:

1. **Caught at the schema boundary.** The server's parameter checks reject
the call before any actual analysis runs. Best case — costs nothing.
2. **Caught at the file/IO boundary.** The call passes schema validation,
tries to open a file, and fails with a clear error. Acceptable.
3. **Silent failure.** The call passes both, runs to completion, and
returns a plausible-looking but wrong result with no error at all. **This
is the bug class**: the AI gets back something that looks like an answer
when it shouldn't.

**This eval asks: how often does the typed boundary actually catch a bad
call?**
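
To make "caught at the schema boundary" concrete, here is a minimal sketch of a check that runs before any file is opened. It is not the server's actual validation: `REQUIRED_PARAMS`, `SchemaError`, and `validate_call` are hypothetical names, and the required-parameter table is inferred from the example call above.

```python
from __future__ import annotations

# Assumption: a hand-written table of required parameters per operation.
# The real server's schema is not shown here and may differ.
REQUIRED_PARAMS: dict[str, set[str]] = {
    "inspect_mesh": {"grid_path"},
    "curl": {"grid_path", "data_path", "u_variable", "v_variable"},
}


class SchemaError(ValueError):
    """Raised when a call is rejected before any file IO or computation."""


def validate_call(operation: str, **kwargs: object) -> None:
    """Reject unknown operations and missing required parameters."""
    if operation not in REQUIRED_PARAMS:
        raise SchemaError(f"unknown operation {operation!r}")
    missing = REQUIRED_PARAMS[operation] - set(kwargs)
    if missing:
        raise SchemaError(f"{operation} is missing {sorted(missing)}")
```

Both `operation="curl_calculation"` and a `curl` call without `u_variable` raise `SchemaError` here, so neither reaches the file system.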

## What "good" looks like

For ~20 deliberately-malformed inputs:

- **>70% caught at schema or IO layer** = the boundary is doing real work.
- **<30%** = the boundary is too loose; the AI can drive it into silent
failures by sending well-formed-looking nonsense.
- **0 silent failures** = required. If we produce a plausible-looking
number from a malformed request, that is a bug we must fix.
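
One possible way to encode these thresholds is sketched below. The function and its return labels are hypothetical, and the text above does not say how to read a rate between 30% and 70%, so the sketch reports that range as unclassified rather than guessing.

```python
def verdict(caught_rate: float, silent_failures: int) -> str:
    """Map the caught rate and silent-failure count onto the thresholds above."""
    if silent_failures > 0:
        return "fail"  # any silent failure is a bug, regardless of the rate
    if caught_rate > 0.70:
        return "good"
    if caught_rate < 0.30:
        return "too_loose"
    return "unclassified"  # assumption: the 30-70% band is not defined above
```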

## How to run

```bash
uv run python -m evals.schema_rejection.run
```

Writes a JSON report to `evals/results/schema_<timestamp>.json` and prints
a summary table. Exits with a non-zero status if any silent failure occurred,
which makes it suitable as a CI gate.

## What this does NOT measure

This eval cannot catch the kind of silent failure where the schema accepts
the call, the file opens cleanly, and the **answer is physically wrong**
(e.g., curl returned in the wrong units because of sphere-radius scaling).
That class needs a downstream validator with physical priors — expected
magnitude, expected units, expected sign — which is a separate piece of
work.
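
To give a feel for what such a downstream validator might check, here is a rough sketch covering units and magnitude only. Nothing like it exists in this PR: `Prior`, `check_prior`, and the vorticity bound are all hypothetical, and the bound is illustrative rather than a vetted physical limit.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prior:
    """Hypothetical physical prior for one output field."""

    units: str
    max_abs: float  # largest plausible magnitude, expressed in `units`


def check_prior(values: list[float], units: str, prior: Prior) -> list[str]:
    """Return human-readable violations; an empty list means 'plausible'."""
    problems = []
    if units != prior.units:
        problems.append(f"units {units!r} != expected {prior.units!r}")
    if any(not math.isfinite(v) for v in values):
        problems.append("non-finite values present")
    peak = max((abs(v) for v in values if math.isfinite(v)), default=0.0)
    if peak > prior.max_abs:
        problems.append(f"peak magnitude {peak:g} exceeds {prior.max_abs:g}")
    return problems


# Assumed bound for illustration only: relative vorticity in 1/s.
VORTICITY = Prior(units="s-1", max_abs=1e-2)
```

A result that is off by a sphere-radius factor would typically exceed the magnitude bound by orders of magnitude, which is the kind of error this check is meant to surface.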

## Reading the output

The runner classifies each call into one of:

| Outcome | Meaning |
|---|---|
| `schema_rejected` | Server raised before any file IO. Best case. |
| `io_rejected` | Server tried to open a file/path and failed visibly. Acceptable. |
| `runtime_error` | Computation started but raised an exception. Acceptable but worse. |
| `silent_pass` | Returned a result dict without an error. **Bug if the input was malformed.** |

The headline number is `caught_rate = (schema_rejected + io_rejected + runtime_error) / total`.
We want that as high as possible. The danger number is the `silent_pass` count —
we want that to be **zero** for malformed inputs.
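
The classification and the two headline numbers could be computed along the lines of the sketch below. The exception-to-outcome mapping and the choice to count only `expected == "reject"` cases in `total` are assumptions; the actual runner is not shown here and may decide both differently.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable

CAUGHT = ("schema_rejected", "io_rejected", "runtime_error")


def classify(call: Callable[[], object]) -> str:
    """Run one call and bucket it (assumed mapping from exception type)."""
    try:
        call()
    except (TypeError, ValueError):
        return "schema_rejected"
    except OSError:  # includes FileNotFoundError
        return "io_rejected"
    except Exception:
        return "runtime_error"
    return "silent_pass"


def summarize(outcomes: list[tuple[str, str]]) -> dict:
    """Summarize (expected, outcome) pairs, counting malformed cases only."""
    malformed = [outcome for expected, outcome in outcomes if expected == "reject"]
    counts = Counter(malformed)
    caught = sum(counts[name] for name in CAUGHT)
    return {
        "total": len(malformed),
        "caught_rate": caught / len(malformed) if malformed else 0.0,
        "silent_pass": counts["silent_pass"],
    }


def exit_code(summary: dict) -> int:
    """Non-zero when any malformed input slipped through silently."""
    return 1 if summary["silent_pass"] else 0
```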
Empty file.
228 changes: 228 additions & 0 deletions evals/schema_rejection/cases.py
@@ -0,0 +1,228 @@
"""Eval cases — deliberately malformed calls to run_analysis / plot_dataset.

Each case is a dict with:
- id: short slug for the report
- description: one-line plain-English description of the bug
- tool: 'run_analysis' or 'plot_dataset'
- kwargs: the call to make
- expected: 'reject' (we want the boundary to catch it) or 'accept'
(the call is well-formed and should run cleanly — a sanity baseline)

Cases marked 'accept' are baseline sanity checks: if too many of them fail,
the eval itself is broken. Cases marked 'reject' are the actual measurement.
"""

from __future__ import annotations


def build_cases(grid_path: str, grid_path_with_data: tuple[str, str]) -> list[dict]:
    """Return the case list, parameterized by the synthetic fixture paths."""
    grid_only = grid_path
    grid_for_data, data_path = grid_path_with_data
    missing_path = "/nonexistent/path/that/cannot/possibly/exist.nc"

    return [
        # ---- BASELINES: well-formed calls that SHOULD succeed ----
        {
            "id": "baseline_inspect_mesh",
            "description": "Well-formed inspect_mesh on a valid grid",
            "tool": "run_analysis",
            "kwargs": {"operation": "inspect_mesh", "grid_path": grid_only},
            "expected": "accept",
        },
        {
            "id": "baseline_calculate_area",
            "description": "Well-formed calculate_area on a valid grid",
            "tool": "run_analysis",
            "kwargs": {"operation": "calculate_area", "grid_path": grid_only},
            "expected": "accept",
        },
        # ---- MALFORMED: schema-level violations ----
        {
            "id": "wrong_operation_typo",
            "description": "operation='curl_calculation' instead of 'curl'",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "curl_calculation",
                "grid_path": grid_only,
                "data_path": data_path,
                "u_variable": "u",
                "v_variable": "v",
            },
            "expected": "reject",
        },
        {
            "id": "wrong_operation_empty",
            "description": "operation='' (empty string)",
            "tool": "run_analysis",
            "kwargs": {"operation": ""},
            "expected": "reject",
        },
        {
            "id": "wrong_operation_made_up",
            "description": "operation='fluxulate' — does not exist",
            "tool": "run_analysis",
            "kwargs": {"operation": "fluxulate", "grid_path": grid_only},
            "expected": "reject",
        },
        # ---- MALFORMED: missing required parameter ----
        {
            "id": "missing_grid_path",
            "description": "inspect_mesh without grid_path",
            "tool": "run_analysis",
            "kwargs": {"operation": "inspect_mesh"},
            "expected": "reject",
        },
        {
            "id": "missing_data_path",
            "description": "inspect_variable with grid but no data",
            "tool": "run_analysis",
            "kwargs": {"operation": "inspect_variable", "grid_path": grid_only},
            "expected": "reject",
        },
        {
            "id": "missing_variable_name",
            "description": "calculate_zonal_mean without variable_name",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "calculate_zonal_mean",
                "grid_path": grid_for_data,
                "data_path": data_path,
            },
            "expected": "reject",
        },
        {
            "id": "missing_u_variable",
            "description": "curl with v_variable but no u_variable",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "curl",
                "grid_path": grid_for_data,
                "data_path": data_path,
                "v_variable": "v",
            },
            "expected": "reject",
        },
        {
            "id": "missing_center_for_azimuthal",
            "description": "azimuthal_mean without center_lon/lat/radius",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "azimuthal_mean",
                "grid_path": grid_for_data,
                "data_path": data_path,
                "variable_name": "temperature",
            },
            "expected": "reject",
        },
        {
            "id": "missing_bbox_bounds",
            "description": "subset_bbox without lon_bounds/lat_bounds",
            "tool": "run_analysis",
            "kwargs": {"operation": "subset_bbox", "grid_path": grid_only},
            "expected": "reject",
        },
        {
            "id": "missing_data_path_a",
            "description": "compare_fields missing data_path_a",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "compare_fields",
                "variable_name": "temperature",
                "data_path_b": data_path,
            },
            "expected": "reject",
        },
        {
            "id": "missing_target_grid_for_remap",
            "description": "remap_variable without target_grid_path",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "remap_variable",
                "grid_path": grid_for_data,
                "data_path": data_path,
                "variable_name": "temperature",
            },
            "expected": "reject",
        },
        {
            "id": "missing_data_paths_for_ensemble",
            "description": "ensemble_mean without data_paths",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "ensemble_mean",
                "variable_name": "temperature",
            },
            "expected": "reject",
        },
        {
            "id": "missing_output_path_for_export",
            "description": "export without output_path",
            "tool": "run_analysis",
            "kwargs": {"operation": "export"},
            "expected": "reject",
        },
        # ---- MALFORMED: nonexistent file paths (IO layer should catch) ----
        {
            "id": "nonexistent_grid",
            "description": "inspect_mesh against a path that does not exist",
            "tool": "run_analysis",
            "kwargs": {"operation": "inspect_mesh", "grid_path": missing_path},
            "expected": "reject",
        },
        {
            "id": "nonexistent_data",
            "description": "inspect_variable with nonexistent data file",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "inspect_variable",
                "grid_path": grid_for_data,
                "data_path": missing_path,
            },
            "expected": "reject",
        },
        # ---- MALFORMED: plot_dataset variants ----
        {
            "id": "plot_unknown_type",
            "description": "plot_dataset with plot_type='holography'",
            "tool": "plot_dataset",
            "kwargs": {"plot_type": "holography", "grid_path": grid_only},
            "expected": "reject",
        },
        {
            "id": "plot_missing_variable",
            "description": "plot_dataset variable plot but no variable_name",
            "tool": "plot_dataset",
            "kwargs": {
                "plot_type": "variable",
                "grid_path": grid_for_data,
                "data_path": data_path,
            },
            "expected": "reject",
        },
        {
            "id": "plot_variable_does_not_exist",
            "description": "plot_dataset for variable 'pixiedust' (not in file)",
            "tool": "plot_dataset",
            "kwargs": {
                "plot_type": "variable",
                "grid_path": grid_for_data,
                "data_path": data_path,
                "variable_name": "pixiedust",
            },
            "expected": "reject",
        },
        # ---- MALFORMED: wrong-arity bbox bounds ----
        {
            "id": "bbox_wrong_arity",
            "description": "subset_bbox with lon_bounds=[10] (needs 2 floats)",
            "tool": "run_analysis",
            "kwargs": {
                "operation": "subset_bbox",
                "grid_path": grid_only,
                "lon_bounds": [10.0],
                "lat_bounds": [0.0, 10.0],
            },
            "expected": "reject",
        },
    ]