Replies: 1 comment
One eval split I would really like to see in MCP-land is which boundary the text crossed before it became risky. A lot of test sets flatten everything into "bad prompt in, bad behavior out," but agent failures usually look more specific than that.

Useful buckets in practice:

- direct user injection
- retrieved/web/document injection
- tool-result injection
- structured tool-argument abuse
- outbound exfiltration attempts
- benign administrative/security text that mentions attacks but should stay allowed

If an eval suite can preserve those distinctions, it becomes much easier to compare systems meaningfully. Otherwise a tool that is great at obvious user jailbreaks but weak on tool-result or JSON-wrapped abuse can still look deceptively strong.

I would also strongly recommend scoring both detection quality and policy quality separately. Catching something as suspicious is different from mapping it to the right action (allow / warn / block / redact / require approval).
## Pre-submission Checklist

## Your Idea

### Problem
MCP servers expose tools but have no standard way to describe how those tools should behave, how well an LLM should invoke them, or whether a full task can be completed with them. Without a protocol-level definition, every client invents its own eval format, and server authors cannot write evals once and have them run everywhere.
The goal is to let MCP servers optionally ship a suite of evaluations alongside their tools — analogous to a library shipping unit tests — so that clients can validate tool-use correctness, tool output quality, and end-to-end task completion against any frontier model.
### Goals
In scope:
- `evals/list` discovery endpoint (parallel to `tools/list`, `resources/list`, `prompts/list`)
- `Eval` object schema (level, grading type, input, expected)
- `EvalResult` shape (local to client)

Explicitly deferred:
### Capability
Servers that expose evals declare the `evals` capability:

```json
{
  "capabilities": {
    "evals": {
      "listChanged": true
    }
  }
}
```

`listChanged` mirrors `tools.listChanged` — servers that update their eval suite can emit `notifications/evals/list_changed`. No client capability is required to call `evals/list`.
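For illustration, the change notification would presumably mirror `notifications/tools/list_changed` and carry no parameters. A minimal TypeScript sketch, where the interface name is an assumption rather than part of the proposal:

```typescript
// Sketch only: assumed shape, mirroring notifications/tools/list_changed.
interface EvalListChangedNotification {
  jsonrpc: "2.0";
  method: "notifications/evals/list_changed";
  // No params: clients simply re-fetch via evals/list on receipt.
  params?: Record<string, never>;
}

// Example of the wire message a server might emit after updating its eval suite.
const notification: EvalListChangedNotification = {
  jsonrpc: "2.0",
  method: "notifications/evals/list_changed",
};
```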
### Protocol Messages

#### `evals/list`

Request:
{ "jsonrpc": "2.0", "id": 1, "method": "evals/list", "params": { "cursor": "optional-cursor", "level": "invocation" } }levelis an optional filter. Clients that only want to run a subset of evals (e.g., onlyinvocationevals in a fast pre-flight check) can filter at the source.Response:
{ "jsonrpc": "2.0", "id": 1, "result": { "evals": [ ], // Eval[] array "nextCursor": "optional-next-cursor" } }Pagination follows the same cursor model as every other MCP list primitive.
### Data Model
**`Eval` input types** (discriminated by `type`, which also encodes the level)

`input.type` serves as the level discriminant, following MCP's existing `Content` pattern (`TextContent | ImageContent | ...`). A separate top-level `level` field is redundant and omitted.

**Expected types** (discriminated by `type`, matching `gradingType`)
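The proposal's concrete field tables are not reproduced above, so the following TypeScript sketch is illustrative only: one plausible shape for `Eval` and its `input`/`expected` unions, consistent with the fields the surrounding prose does mention (`gradingType`, `toolName`, `arguments`, `maxTurns`, the judge rubric). Any field name beyond those is an assumption.

```typescript
// Illustrative sketch, not the proposal's actual schema.

type GradingType = "exact-match" | "llm-as-judge";

// input.type doubles as the eval level, like MCP's Content discriminator.
type EvalInput =
  | { type: "invocation"; prompt: string }                                      // which tool should the model call?
  | { type: "execution"; toolName: string; arguments: Record<string, unknown> } // call this tool, check its output
  | { type: "scenario"; prompt: string; maxTurns: number };                     // multi-turn agent run

// expected.type matches gradingType.
type EvalExpected =
  | {
      type: "exact-match";
      toolName?: string;                   // for invocation-level evals
      arguments?: Record<string, unknown>; // argument keys that must match
      content?: unknown;                   // for execution-level evals
    }
  | { type: "llm-as-judge"; rubric: string };

interface Eval {
  name: string;
  description?: string;
  gradingType: GradingType;
  input: EvalInput;
  expected: EvalExpected;
}
```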
**Constraint:** `scenario` + `exact-match` is explicitly invalid. Multi-turn agent trajectories are non-deterministic; exact-match grading cannot apply.

### Execution Semantics

| Level | Grading | Pass criteria |
| --- | --- | --- |
| `invocation` | `exact-match` | `toolName` and all specified `arguments` keys match. |
| `invocation` | `llm-as-judge` | Judge-graded. |
| `execution` | `exact-match` | Client calls `tools/call` with given arguments. Pass if result `content` deeply equals expected. |
| `execution` | `llm-as-judge` | Judge-graded. |
| `scenario` | `llm-as-judge` | Agent loop runs up to `maxTurns`. Judge receives full transcript + rubric. |

**Tool list for `invocation` evals:** the client fetches the server's current tools via `tools/list` before running — tools are not embedded in the eval. This keeps evals decoupled from specific tool schema versions.
**Side effects:** `execution`-level evals call real tools against the live server. The spec SHOULD recommend that server authors avoid placing destructive or irreversible operations in `execution` evals without explicit documentation. Clients SHOULD surface this to users before running.
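Putting the two `exact-match` rows from the table above into code, a client-side grader might look roughly like this. It reuses the illustrative shapes from the Data Model sketch; nothing here is prescribed by the proposal:

```typescript
import { isDeepStrictEqual } from "node:util";

// invocation + exact-match: pass if the model picked the expected tool and
// every argument key listed in `expected.arguments` matches what it produced.
function gradeInvocationExactMatch(
  modelCall: { toolName: string; arguments: Record<string, unknown> },
  expected: { toolName?: string; arguments?: Record<string, unknown> },
): boolean {
  if (expected.toolName !== undefined && modelCall.toolName !== expected.toolName) {
    return false;
  }
  for (const [key, value] of Object.entries(expected.arguments ?? {})) {
    if (!isDeepStrictEqual(modelCall.arguments[key], value)) {
      return false;
    }
  }
  return true;
}

// execution + exact-match: pass if the tools/call result content deeply
// equals the expected content.
function gradeExecutionExactMatch(resultContent: unknown, expectedContent: unknown): boolean {
  return isDeepStrictEqual(resultContent, expectedContent);
}
```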
### EvalResult (local to client)

Results are not sent to the server.
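The proposal keeps the result shape client-local, and its exact fields are not spelled out in the text above. As an illustration only, a client might record something like the following (every field name here is an assumption):

```typescript
// Illustrative client-local record; never sent back to the server.
interface EvalResult {
  evalName: string;                               // which Eval produced this result
  level: "invocation" | "execution" | "scenario";
  gradingType: "exact-match" | "llm-as-judge";
  passed: boolean;
  model: string;           // pass/fail is only meaningful relative to this model
  judgeRationale?: string; // populated for llm-as-judge grading
  transcript?: unknown[];  // full trace for scenario-level runs
}
```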
### Backward Compatibility
No breaking changes. Servers without the `evals` capability are unaffected. Clients that do not understand the `evals` capability ignore it. The new methods and types are purely additive.

### Security Considerations
- `execution` evals invoke real tools — clients MUST NOT run evals in untrusted contexts without user acknowledgment.
- `llm-as-judge` results are model-dependent. Pass/fail is meaningful only relative to the model used; the spec must not imply universal pass/fail semantics.

### Scope