UXARRAY · rajeeja · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/.gitignore b/.gitignore
@@ -228,3 +228,10 @@ scripts/convergence_agent/
 # Generated plots / scratch scripts dropped at repo root
 *.png
 save_plots.py
+
+
+# Eval result JSON files are per-run and regenerated by the runners.
+evals/results/
+
+# Paper drafts and supporting material (kept in a separate repo).
+papers/
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,65 @@
+# Evals
+
+This folder holds **evaluations** ("evals" for short) of the MCP server's
+behavior. The goal is to turn opinions about how the server should behave
+into **numbers** that can be re-measured when the code changes — the same
+way `tests/` turns "the code should be correct" into a runnable assertion.
+
+## What is an "eval"? (for non-AI-engineers)
+
+In AI-driven software, an **eval** plays the role a unit test plays in
+regular software, with one wrinkle: the system under test includes a
+language model whose output is not bit-for-bit reproducible. So an eval
+scores aggregate behavior across many inputs ("on this set of 20 prompts,
+18 picked the right tool") rather than asserting one specific output.
+
+You write an eval the same way you write a regression test:
+
+1. Pick a behavior you care about. ("The server should reject a malformed
+   request before it spends compute on it.")
+2. Build a small fixed set of inputs that exercise that behavior. (Say, 20
+   deliberately-wrong prompts.)
+3. Run them through the system and record a numeric score.
+4. Commit the inputs and the runner so the next person can re-run and
+   compare. Result files are regenerated per run and are not committed.
+
+Evals do **not** prove correctness. They measure *how often* the system
+does the right thing on a fixed sample. They are most useful for catching
+regressions ("we used to pick the right tool 90% of the time, now it's
+60%") and for putting numbers on architectural decisions.
+
+## What's in here
+
+| Folder | What it measures |
+|---|---|
+| [`schema_rejection/`](schema_rejection/) | How often the typed tool boundary catches malformed calls before any work happens — the "did we waste compute on garbage?" number |
+| [`tool_retrieval/`](tool_retrieval/) | How often a simple text retriever (BM25) finds the right tool by description — the "is our tool catalog still navigable as it grows?" number |
+
+Both run end-to-end in under 30 seconds on a laptop with no external
+dependencies. They are cheap enough to add to CI.
+
+## How to run
+
+```bash
+uv run python -m evals.schema_rejection.run
+uv run python -m evals.tool_retrieval.run
+```
+
+Each runner writes a timestamped JSON file under `evals/results/`.
+Result files are gitignored because they are regenerated on each run; the
+runner itself is the source of truth.
+
+## When to add a new eval
+
+Add one when you're about to make a decision and want a number to defend
+it. Some good triggers:
+
+- We're considering exposing more tools — does retrieval still work?
+- We're refactoring an entry-point — does it still reject malformed input?
+- A bug class has appeared twice — write the eval before the third time.
+
+Bad triggers (use `tests/` instead):
+
+- Asserting a single specific output for a single specific input.
+- Checking a function's signature or contract.
+- Anything that should be a unit test of a Python function.
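Review sketch for `evals/README.md`: the aggregate scoring described there ("on this set of 20 prompts, 18 picked the right tool") and the regression comparison ("90% of the time, now it's 60%") could look roughly like this. The names `pass_rate` and `regressed` and the 0.05 tolerance are hypothetical and are not part of this PR.

```python
from __future__ import annotations


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of eval inputs on which the system did the right thing."""
    if not outcomes:
        raise ValueError("an eval needs at least one input")
    return sum(outcomes) / len(outcomes)


def regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when the score dropped below the baseline by more than `tolerance`."""
    return (baseline - current) > tolerance


# Hypothetical run: 18 of 20 prompts picked the right tool.
outcomes = [True] * 18 + [False] * 2
score = pass_rate(outcomes)
print(f"score: {score:.2f}")

# Compare a previous score against a new one.
print("regressed:", regressed(baseline=0.90, current=0.60))
```

A tolerance is used instead of strict equality because the system under test is not bit-for-bit reproducible, so small run-to-run differences should not count as regressions.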
diff --git a/evals/__init__.py b/evals/__init__.py
diff --git a/evals/schema_rejection/README.md b/evals/schema_rejection/README.md
@@ -0,0 +1,75 @@
+# Schema-rejection eval
+
+## What this measures (plain language)
+
+When someone asks an AI assistant to "compute vorticity from the wind file,"
+the AI translates that into a call like:
+
+```
+run_analysis(operation="curl", grid_path=..., data_path=..., u_variable=..., v_variable=...)
+```
+
+There are many ways this call can be **wrong**:
+
+- The AI omitted `u_variable` because it didn't read the data file carefully.
+- The AI typed `operation="curl_calculation"` instead of `operation="curl"`.
+- The AI passed `grid_path="/path/that/does/not/exist.nc"`.
+- The AI passed a string where a list was expected, or vice versa.
+
+For each of these, three things can happen:
+
+1. **Caught at the schema boundary.** The server's parameter checks reject
+   the call before any actual analysis runs. Best case — costs nothing.
+2. **Caught at the file/IO boundary.** The call passes schema validation,
+   tries to open a file, and fails with a clear error. Acceptable.
+3. **Silent failure.** The call passes both, runs to completion, and
+   returns a plausible-looking but wrong result with no error at all.
+   **This is the bug class**: the AI gets back something that looks like
+   an answer when it shouldn't.
+
+**This eval asks: how often does the typed boundary actually catch a bad
+call?**
+
+## What "good" looks like
+
+For ~20 deliberately-malformed inputs:
+
+- **>70% caught at schema or IO layer** = the boundary is doing real work.
+- **<30%** = the boundary is too loose; the AI can drive it into silent
+  failures by sending well-formed-looking nonsense.
+- **0 silent failures** = required. If we produce a plausible-looking
+  number from a malformed request, that is a bug we must fix.
+
+## How to run
+
+```bash
+uv run python -m evals.schema_rejection.run
+```
+
+Writes a JSON report to `evals/results/schema_<timestamp>.json` and prints
+a summary table. Exits with a non-zero status if any silent failure
+occurred, which makes it suitable for CI.
+
+## What this does NOT measure
+
+This eval cannot catch the kind of silent failure where the schema accepts
+the call, the file opens cleanly, and the **answer is physically wrong**
+(e.g., curl returned in the wrong units because of sphere-radius scaling).
+That class needs a downstream validator with physical priors — expected
+magnitude, expected units, expected sign — which is a separate piece of
+work.
+
+## Reading the output
+
+The runner classifies each call into one of:
+
+| Outcome | Meaning |
+|---|---|
+| `schema_rejected` | Server raised before any file IO. Best case. |
+| `io_rejected` | Server tried to open a file/path and failed visibly. Acceptable. |
+| `runtime_error` | Computation started but raised an exception. Acceptable but worse. |
+| `silent_pass` | Returned a result dict without an error. **Bug if the input was malformed.** |
+
+The headline number is `caught_rate = (schema_rejected + io_rejected + runtime_error) / total`.
+We want that as high as possible. The danger number is the `silent_pass` count —
+we want that to be **zero** for malformed inputs.
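Review sketch for `evals/schema_rejection/README.md`: the runner is not shown in this part of the diff, so the following is only one way the outcome table could be turned into the two headline numbers. The `summarize` helper and the record shape (`expected`, `outcome`) are assumptions. The README divides by `total` without saying whether the `accept` baselines are included. This sketch assumes the rate is computed over `expected == "reject"` cases only.

```python
from __future__ import annotations

from collections import Counter

# Outcomes that count as "caught" in the README's caught_rate formula.
CAUGHT = {"schema_rejected", "io_rejected", "runtime_error"}


def summarize(records: list[dict]) -> dict:
    """Aggregate per-case outcomes into caught_rate and the silent_pass count."""
    # Assumption: only malformed cases enter the rate; baselines are excluded.
    malformed = [r for r in records if r["expected"] == "reject"]
    counts = Counter(r["outcome"] for r in malformed)
    caught = sum(counts[name] for name in CAUGHT)
    return {
        "total_malformed": len(malformed),
        "caught_rate": caught / len(malformed) if malformed else 0.0,
        "silent_pass": counts["silent_pass"],
    }


# Hypothetical records, one per eval case.
records = [
    {"id": "a", "expected": "reject", "outcome": "schema_rejected"},
    {"id": "b", "expected": "reject", "outcome": "io_rejected"},
    {"id": "c", "expected": "reject", "outcome": "runtime_error"},
    {"id": "d", "expected": "reject", "outcome": "silent_pass"},
    {"id": "e", "expected": "accept", "outcome": "silent_pass"},
]
summary = summarize(records)
print(summary)
# A CI wrapper could exit non-zero whenever summary["silent_pass"] > 0.
```

A `silent_pass` on a baseline (`expected == "accept"`) is the desired result, so it is excluded from the danger count here.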
diff --git a/evals/schema_rejection/__init__.py b/evals/schema_rejection/__init__.py
diff --git a/evals/schema_rejection/cases.py b/evals/schema_rejection/cases.py
@@ -0,0 +1,228 @@
+"""Eval cases — deliberately malformed calls to run_analysis / plot_dataset.
+
+Each case is a dict with:
+- id: short slug for the report
+- description: one-line plain-English description of the bug
+- tool: 'run_analysis' or 'plot_dataset'
+- kwargs: the call to make
+- expected: 'reject' (we want the boundary to catch it) or 'accept'
+  (the call is well-formed and should run cleanly — a sanity baseline)
+
+Cases marked 'accept' are baseline sanity checks: if too many of them fail,
+the eval itself is broken. Cases marked 'reject' are the actual measurement.
+"""
+
+from __future__ import annotations
+
+
+def build_cases(grid_path: str, grid_path_with_data: tuple[str, str]) -> list[dict]:
+    """Return the case list, parameterized by the synthetic fixture paths."""
+    grid_only = grid_path
+    grid_for_data, data_path = grid_path_with_data
+    missing_path = "/nonexistent/path/that/cannot/possibly/exist.nc"
+
+    return [
+        # ---- BASELINES: well-formed calls that SHOULD succeed ----
+        {
+            "id": "baseline_inspect_mesh",
+            "description": "Well-formed inspect_mesh on a valid grid",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "inspect_mesh", "grid_path": grid_only},
+            "expected": "accept",
+        },
+        {
+            "id": "baseline_calculate_area",
+            "description": "Well-formed calculate_area on a valid grid",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "calculate_area", "grid_path": grid_only},
+            "expected": "accept",
+        },
+        # ---- MALFORMED: schema-level violations ----
+        {
+            "id": "wrong_operation_typo",
+            "description": "operation='curl_calculation' instead of 'curl'",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "curl_calculation",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+                "u_variable": "u",
+                "v_variable": "v",
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "wrong_operation_empty",
+            "description": "operation='' (empty string)",
+            "tool": "run_analysis",
+            "kwargs": {"operation": ""},
+            "expected": "reject",
+        },
+        {
+            "id": "wrong_operation_made_up",
+            "description": "operation='fluxulate' — does not exist",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "fluxulate", "grid_path": grid_only},
+            "expected": "reject",
+        },
+        # ---- MALFORMED: missing required parameter ----
+        {
+            "id": "missing_grid_path",
+            "description": "inspect_mesh without grid_path",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "inspect_mesh"},
+            "expected": "reject",
+        },
+        {
+            "id": "missing_data_path",
+            "description": "inspect_variable with grid_path but no data_path",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "inspect_variable", "grid_path": grid_only},
+            "expected": "reject",
+        },
+        {
+            "id": "missing_variable_name",
+            "description": "calculate_zonal_mean without variable_name",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "calculate_zonal_mean",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "missing_u_variable",
+            "description": "curl with v_variable but no u_variable",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "curl",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+                "v_variable": "v",
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "missing_center_for_azimuthal",
+            "description": "azimuthal_mean without center_lon/lat/radius",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "azimuthal_mean",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+                "variable_name": "temperature",
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "missing_bbox_bounds",
+            "description": "subset_bbox without lon_bounds/lat_bounds",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "subset_bbox", "grid_path": grid_only},
+            "expected": "reject",
+        },
+        {
+            "id": "missing_data_path_a",
+            "description": "compare_fields missing data_path_a",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "compare_fields",
+                "variable_name": "temperature",
+                "data_path_b": data_path,
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "missing_target_grid_for_remap",
+            "description": "remap_variable without target_grid_path",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "remap_variable",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+                "variable_name": "temperature",
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "missing_data_paths_for_ensemble",
+            "description": "ensemble_mean without data_paths",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "ensemble_mean",
+                "variable_name": "temperature",
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "missing_output_path_for_export",
+            "description": "export without output_path",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "export"},
+            "expected": "reject",
+        },
+        # ---- MALFORMED: nonexistent file paths (IO layer should catch) ----
+        {
+            "id": "nonexistent_grid",
+            "description": "inspect_mesh against a path that does not exist",
+            "tool": "run_analysis",
+            "kwargs": {"operation": "inspect_mesh", "grid_path": missing_path},
+            "expected": "reject",
+        },
+        {
+            "id": "nonexistent_data",
+            "description": "inspect_variable with nonexistent data file",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "inspect_variable",
+                "grid_path": grid_for_data,
+                "data_path": missing_path,
+            },
+            "expected": "reject",
+        },
+        # ---- MALFORMED: plot_dataset variants ----
+        {
+            "id": "plot_unknown_type",
+            "description": "plot_dataset with plot_type='holography'",
+            "tool": "plot_dataset",
+            "kwargs": {"plot_type": "holography", "grid_path": grid_only},
+            "expected": "reject",
+        },
+        {
+            "id": "plot_missing_variable",
+            "description": "plot_type='variable' without variable_name",
+            "tool": "plot_dataset",
+            "kwargs": {
+                "plot_type": "variable",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+            },
+            "expected": "reject",
+        },
+        {
+            "id": "plot_variable_does_not_exist",
+            "description": "plot_dataset for variable 'pixiedust' (not in file)",
+            "tool": "plot_dataset",
+            "kwargs": {
+                "plot_type": "variable",
+                "grid_path": grid_for_data,
+                "data_path": data_path,
+                "variable_name": "pixiedust",
+            },
+            "expected": "reject",
+        },
+        # ---- MALFORMED: wrong-arity bbox bounds ----
+        {
+            "id": "bbox_wrong_arity",
+            "description": "subset_bbox with lon_bounds=[10.0] (needs 2 floats)",
+            "tool": "run_analysis",
+            "kwargs": {
+                "operation": "subset_bbox",
+                "grid_path": grid_only,
+                "lon_bounds": [10.0],
+                "lat_bounds": [0.0, 10.0],
+            },
+            "expected": "reject",
+        },
+    ]
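Review sketch for `evals/schema_rejection/cases.py`: because every case is a plain dict, a cheap structural check can catch typos in the case list itself, such as a duplicate `id`, a missing key, or an `expected` value other than `'reject'` or `'accept'`. `validate_cases` is a hypothetical helper, not part of this PR. In the repo it would be called on the result of `build_cases(...)`; here it runs on a small inline sample so the block stays self-contained.

```python
from __future__ import annotations

# Keys and values taken from the module docstring in cases.py.
REQUIRED_KEYS = {"id", "description", "tool", "kwargs", "expected"}
ALLOWED_TOOLS = {"run_analysis", "plot_dataset"}
ALLOWED_EXPECTED = {"reject", "accept"}


def validate_cases(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the cases are well-formed."""
    problems: list[str] = []
    seen: set[str] = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case #{index}>")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
            continue
        if case["id"] in seen:
            problems.append(f"{label}: duplicate id")
        seen.add(case["id"])
        if case["tool"] not in ALLOWED_TOOLS:
            problems.append(f"{label}: unknown tool {case['tool']!r}")
        if case["expected"] not in ALLOWED_EXPECTED:
            problems.append(f"{label}: unknown expected value {case['expected']!r}")
        if not isinstance(case["kwargs"], dict):
            problems.append(f"{label}: kwargs must be a dict")
    return problems


# Hypothetical sample with a duplicate id and a misspelled expected value.
sample = [
    {"id": "ok", "description": "fine", "tool": "run_analysis",
     "kwargs": {"operation": "inspect_mesh"}, "expected": "reject"},
    {"id": "ok", "description": "dup", "tool": "plot_dataset",
     "kwargs": {}, "expected": "rejct"},
]
for problem in validate_cases(sample):
    print(problem)
```

Running a check like this before the eval keeps a malformed case definition from being mistaken for a server-side rejection.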