Replies: 1 comment
One eval split I would really like to see in MCP-land is which boundary the text crossed before it became risky. A lot of test sets flatten everything into "bad prompt in, bad behavior out," but agent failures usually look more specific than that.

Useful buckets in practice:

- direct user injection
- retrieved/web/document injection
- tool-result injection
- structured tool-argument abuse
- outbound exfiltration attempts
- benign administrative/security text that mentions attacks but should stay allowed

If an eval suite can preserve those distinctions, it becomes much easier to compare systems meaningfully. Otherwise a tool that is great at obvious user jailbreaks but weak on tool-result or JSON-wrapped abuse can still look deceptively strong.

I would also strongly recommend scoring both detection quality and policy quality separately. Catching something as suspicious is different from mapping it to the right action (allow / warn / block / redact / require approval).
## Pre-submission Checklist

## Your Idea

### Problem
MCP servers expose tools but have no standard way to describe how those tools should behave, how well an LLM should invoke them, or whether a full task can be completed with them. Without a protocol-level definition, every client invents its own eval format, and server authors cannot write evals once and have them run everywhere.
The goal is to let MCP servers optionally ship a suite of evaluations alongside their tools — analogous to a library shipping unit tests — so that clients can validate tool-use correctness, tool output quality, and end-to-end task completion against any frontier model.
### Goals
In scope:
- `evals/list` discovery endpoint (parallel to `tools/list`, `resources/list`, `prompts/list`)
- `Eval` object schema (level, grading type, input, expected)
- `EvalResult` shape (local to client)

Explicitly deferred:
### Capability
Servers that expose evals declare the `evals` capability:

```json
{
  "capabilities": {
    "evals": {
      "listChanged": true
    }
  }
}
```

`listChanged` mirrors `tools.listChanged` — servers that update their eval suite can emit `notifications/evals/list_changed`. No client capability is required to call `evals/list`.
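For illustration, the change notification would presumably mirror `notifications/tools/list_changed` and carry no parameters. A minimal TypeScript sketch, where the interface name is an assumption rather than part of the proposal:

```typescript
// Sketch only: assumed shape, mirroring notifications/tools/list_changed.
interface EvalListChangedNotification {
  jsonrpc: "2.0";
  method: "notifications/evals/list_changed";
  // No params: clients simply re-fetch via evals/list on receipt.
  params?: Record<string, never>;
}

// Example of the wire message a server might emit after updating its eval suite.
const notification: EvalListChangedNotification = {
  jsonrpc: "2.0",
  method: "notifications/evals/list_changed",
};
```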
### Protocol Messages

#### `evals/list`

Request:
{ "jsonrpc": "2.0", "id": 1, "method": "evals/list", "params": { "cursor": "optional-cursor", "level": "invocation" } }levelis an optional filter. Clients that only want to run a subset of evals (e.g., onlyinvocationevals in a fast pre-flight check) can filter at the source.Response:
{ "jsonrpc": "2.0", "id": 1, "result": { "evals": [ ], // Eval[] array "nextCursor": "optional-next-cursor" } }Pagination follows the same cursor model as every other MCP list primitive.
### Data Model
**`Eval` input types** (discriminated by `type`, which also encodes the level)

`input.type` serves as the level discriminant, following MCP's existing `Content` pattern (`TextContent | ImageContent | ...`). A separate top-level `level` field is redundant and omitted.

**Expected types** (discriminated by `type`, matching `gradingType`)
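The proposal's concrete field tables are not reproduced above, so the following TypeScript sketch is illustrative only: one plausible shape for `Eval` and its `input`/`expected` unions, consistent with the fields the surrounding prose does mention (`gradingType`, `toolName`, `arguments`, `maxTurns`, the judge rubric). Any field name beyond those is an assumption.

```typescript
// Illustrative sketch, not the proposal's actual schema.

type GradingType = "exact-match" | "llm-as-judge";

// input.type doubles as the eval level, like MCP's Content discriminator.
type EvalInput =
  | { type: "invocation"; prompt: string }                                      // which tool should the model call?
  | { type: "execution"; toolName: string; arguments: Record<string, unknown> } // call this tool, check its output
  | { type: "scenario"; prompt: string; maxTurns: number };                     // multi-turn agent run

// expected.type matches gradingType.
type EvalExpected =
  | {
      type: "exact-match";
      toolName?: string;                   // for invocation-level evals
      arguments?: Record<string, unknown>; // argument keys that must match
      content?: unknown;                   // for execution-level evals
    }
  | { type: "llm-as-judge"; rubric: string };

interface Eval {
  name: string;
  description?: string;
  gradingType: GradingType;
  input: EvalInput;
  expected: EvalExpected;
}
```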
**Constraint:** `scenario` + `exact-match` is explicitly invalid. Multi-turn agent trajectories are non-deterministic; exact-match grading cannot apply.

### Execution Semantics

| Level | Grading | Pass criteria |
| --- | --- | --- |
| `invocation` | `exact-match` | `toolName` and all specified `arguments` keys match. |
| `invocation` | `llm-as-judge` | Judge-graded. |
| `execution` | `exact-match` | Client calls `tools/call` with given arguments. Pass if result `content` deeply equals expected. |
| `execution` | `llm-as-judge` | Judge-graded. |
| `scenario` | `llm-as-judge` | Agent loop runs up to `maxTurns`. Judge receives full transcript + rubric. |

**Tool list for `invocation` evals:** the client fetches the server's current tools via `tools/list` before running — tools are not embedded in the eval. This keeps evals decoupled from specific tool schema versions.
**Side effects:** `execution`-level evals call real tools against the live server. The spec SHOULD recommend that server authors avoid placing destructive or irreversible operations in `execution` evals without explicit documentation. Clients SHOULD surface this to users before running.
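Putting the two `exact-match` rows from the table above into code, a client-side grader might look roughly like this. It reuses the illustrative shapes from the Data Model sketch; nothing here is prescribed by the proposal:

```typescript
import { isDeepStrictEqual } from "node:util";

// invocation + exact-match: pass if the model picked the expected tool and
// every argument key listed in `expected.arguments` matches what it produced.
function gradeInvocationExactMatch(
  modelCall: { toolName: string; arguments: Record<string, unknown> },
  expected: { toolName?: string; arguments?: Record<string, unknown> },
): boolean {
  if (expected.toolName !== undefined && modelCall.toolName !== expected.toolName) {
    return false;
  }
  for (const [key, value] of Object.entries(expected.arguments ?? {})) {
    if (!isDeepStrictEqual(modelCall.arguments[key], value)) {
      return false;
    }
  }
  return true;
}

// execution + exact-match: pass if the tools/call result content deeply
// equals the expected content.
function gradeExecutionExactMatch(resultContent: unknown, expectedContent: unknown): boolean {
  return isDeepStrictEqual(resultContent, expectedContent);
}
```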
### EvalResult (local to client)

Results are not sent to the server.
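The proposal keeps the result shape client-local, and its exact fields are not spelled out in the text above. As an illustration only, a client might record something like the following (every field name here is an assumption):

```typescript
// Illustrative client-local record; never sent back to the server.
interface EvalResult {
  evalName: string;                               // which Eval produced this result
  level: "invocation" | "execution" | "scenario";
  gradingType: "exact-match" | "llm-as-judge";
  passed: boolean;
  model: string;           // pass/fail is only meaningful relative to this model
  judgeRationale?: string; // populated for llm-as-judge grading
  transcript?: unknown[];  // full trace for scenario-level runs
}
```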
### Backward Compatibility
No breaking changes. Servers without the `evals` capability are unaffected. Clients that do not understand the `evals` capability ignore it. The new methods and types are purely additive.

### Security Considerations
- `execution` evals invoke real tools — clients MUST NOT run evals in untrusted contexts without user acknowledgment.
- `llm-as-judge` results are model-dependent. Pass/fail is meaningful only relative to the model used; the spec must not imply universal pass/fail semantics.

### Scope