Add agent evaluations
Overview
Add agent evaluation capabilities so builders can measure and improve the quality of their agent experiences. Evaluations use a judge model to score agent responses against configurable criteria. Builders can run on-demand evaluations with test cases, view detailed result reports with improvement guidance, and optionally enable runtime evaluations that score live invocations to track quality over time.
Context
Current State
- The AgentDetailPage shows sessions, an invoke panel, latency summary, streamed response, and deployment configuration, but has no evaluation or quality measurement features
- The Invocation model stores prompt_text, thinking_text, and response_text for each invocation, which provides the input/output pairs needed for evaluation
- The ConfigEntry model stores agent-level key-value configuration and could be used to persist evaluation settings
- The agent handler (agents/strands_agent/src/handler.py) processes invocations via agent.stream_async() and yields streaming text, so runtime evaluation hooks would need to run post-completion
- SUPPORTED_MODELS in agents.py lists available Bedrock models; a judge model can be selected from this list
- The InvokePanel component handles prompt submission and streaming; evaluation test cases would use a similar invocation flow
- No evaluation framework, scoring, or judge model integration exists in the codebase
Key Files
| File | Role |
| --- | --- |
| frontend/src/pages/AgentDetailPage.tsx | Agent detail with sessions, invoke, and deployment |
| frontend/src/components/InvokePanel.tsx | Invoke form and streaming |
| frontend/src/hooks/useInvoke.ts | Invocation state management |
| backend/app/routers/invocations.py | SSE invocation endpoint, session management |
| backend/app/routers/agents.py | Agent CRUD, SUPPORTED_MODELS |
| backend/app/models/invocation.py | Invocation ORM (prompt, response, timing, tokens) |
| backend/app/models/config_entry.py | Agent config key-value pairs |
| backend/app/models/agent.py | Agent ORM model |
| agents/strands_agent/src/handler.py | Runtime invocation handler |
| frontend/src/api/types.ts | Shared TypeScript types |
Technology Stack
- Backend: Python, FastAPI, SQLAlchemy, SQLite
- Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
- Agent Runtime: Strands SDK on Bedrock AgentCore Runtime
- Models: Amazon Bedrock (Claude, Nova families)
Requirements
R1: On-Demand Evaluation Configuration and Execution
Builder should be able to configure on-demand evaluations, run one, and see the results.
- Create new ORM models in backend/app/models/evaluation.py:
  - EvaluationSuite, a named collection of test cases for an agent:
    - id (integer, primary key)
    - agent_id (integer, FK to agents, not null)
    - name (string, not null), e.g. "Customer Support Quality"
    - judge_model_id (string, not null): Bedrock model ID used as the judge (from SUPPORTED_MODELS)
    - criteria (text, JSON array): list of scoring criteria, each with name, description, and weight (e.g. [{"name": "relevance", "description": "Is the response relevant to the prompt?", "weight": 1.0}, {"name": "helpfulness", "description": "Does the response help the user accomplish their goal?", "weight": 1.0}])
    - created_at, updated_at (datetime)
  - EvaluationTestCase, an individual test case within a suite:
    - id (integer, primary key)
    - suite_id (integer, FK to evaluation_suites, not null)
    - name (string, nullable): optional label
    - prompt (text, not null): the input prompt to send to the agent
    - expected_response (text, nullable): optional reference answer for comparison
    - context (text, nullable): optional additional context the judge should consider
    - created_at (datetime)
  - EvaluationRun, a single execution of a suite:
    - id (integer, primary key)
    - suite_id (integer, FK to evaluation_suites, not null)
    - qualifier (string, default "DEFAULT"): agent endpoint qualifier used
    - status (string, not null): pending, running, complete, error
    - started_at, completed_at (datetime)
    - summary_scores (text, JSON): aggregated scores per criterion after completion
  - EvaluationResult, the per-test-case result within a run:
    - id (integer, primary key)
    - run_id (integer, FK to evaluation_runs, not null)
    - test_case_id (integer, FK to evaluation_test_cases, not null)
    - invocation_id (string, nullable): FK to the invocation created for this test case
    - agent_response (text, nullable): the agent's actual response
    - scores (text, JSON): per-criterion scores, e.g. {"relevance": 4, "helpfulness": 5}
    - judge_reasoning (text, nullable): the judge model's explanation for its scores
    - status (string): pending, complete, error
    - error_message (text, nullable)
- Create backend endpoints in backend/app/routers/evaluations.py:
  - POST /api/agents/{agent_id}/evaluations/suites: create an evaluation suite with criteria and test cases
  - GET /api/agents/{agent_id}/evaluations/suites: list suites for an agent
  - GET /api/agents/{agent_id}/evaluations/suites/{suite_id}: get suite detail including test cases
  - PUT /api/agents/{agent_id}/evaluations/suites/{suite_id}: update suite criteria or test cases
  - DELETE /api/agents/{agent_id}/evaluations/suites/{suite_id}: delete a suite
  - POST /api/agents/{agent_id}/evaluations/suites/{suite_id}/run: trigger an evaluation run
  - GET /api/agents/{agent_id}/evaluations/runs/{run_id}: get run status and results
- Create a backend evaluation service (backend/app/services/evaluation.py) that orchestrates a run:
  - For each test case in the suite, invoke the agent (reuse the existing invocation pipeline) and capture the response
  - Send each prompt/response pair to the judge model via Bedrock with a scoring prompt that includes the criteria definitions, expected response (if provided), and scoring instructions (rate each criterion 1-5)
  - Parse the judge model's response to extract per-criterion numeric scores and reasoning
  - Store results in EvaluationResult and compute aggregated summary_scores on the EvaluationRun
  - Run evaluations asynchronously (background task) so the API returns immediately with the run ID
- Create frontend components:
  - Add an Evaluations section on the AgentDetailPage (below sessions, above deployment), visible only for deployed agents
  - EvaluationSuiteManager: form to create/edit suites, with name, judge model selector (from SUPPORTED_MODELS), criteria editor (add/remove criteria with name, description, weight), and test case editor (add/remove test cases with prompt, optional expected response)
  - EvaluationRunButton: triggers a run and shows progress (pending/running/complete)
  - EvaluationResultsView: displays results after a run completes (see R2)
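The judge-scoring step of the evaluation service could be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: the function names build_scoring_prompt and parse_judge_output, the prompt wording, and the JSON reply shape are all assumptions made for this example. The Bedrock call itself is left out, so the sketch only covers assembling the scoring prompt and parsing the reply.

```python
import json
import re


def build_scoring_prompt(criteria, prompt, response, expected_response=None):
    """Assemble a judge prompt from criteria definitions and one prompt/response pair.

    Hypothetical helper; the wording of the prompt is an assumption.
    """
    lines = [
        "You are evaluating an AI agent's response.",
        "Rate each criterion from 1 (poor) to 5 (excellent).",
        "",
        "Criteria:",
    ]
    for c in criteria:
        lines.append(f"- {c['name']}: {c['description']}")
    lines += ["", f"Prompt:\n{prompt}", "", f"Agent response:\n{response}"]
    if expected_response:
        lines += ["", f"Expected response:\n{expected_response}"]
    lines += [
        "",
        'Reply with JSON only: {"scores": {"<criterion>": <1-5>}, "reasoning": "<text>"}',
    ]
    return "\n".join(lines)


def parse_judge_output(text, criteria):
    """Extract per-criterion integer scores and reasoning from the judge's reply."""
    # Tolerate prose around the JSON by taking the outermost {...} span.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("judge reply contains no JSON object")
    payload = json.loads(match.group(0))
    scores = {}
    for c in criteria:
        value = payload.get("scores", {}).get(c["name"])
        # Reject missing, non-integer, or out-of-range scores.
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"missing or invalid score for {c['name']!r}")
        scores[c["name"]] = value
    return scores, payload.get("reasoning", "")
```

Raising on a malformed reply lets the caller mark the corresponding EvaluationResult as error and record the message in error_message instead of storing a partial score set.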
R2: Evaluation Report with Improvement Guidance
Builder should get a report of the results and guidance on how to improve various scores.
- The EvaluationResultsView component should display:
  - Summary section at the top:
    - Overall score (weighted average across all criteria and test cases, displayed as a percentage or out of 5)
    - Per-criterion average scores displayed as a horizontal bar chart or score cards (e.g. "Relevance: 4.2/5", "Helpfulness: 3.8/5")
    - Run metadata: suite name, judge model, qualifier, start/end time, test case count
  - Per-test-case detail table:
    - Columns: test case name/prompt (truncated), per-criterion scores, overall score, status
    - Expandable rows showing: full prompt, expected response, agent response, judge reasoning
    - Color-coding: green (4-5), yellow (3), red (1-2) for individual scores
  - Improvement guidance panel:
    - For each criterion that scores below a configurable threshold (default: 3.5/5), generate actionable improvement suggestions
    - Create a backend endpoint GET /api/agents/{agent_id}/evaluations/runs/{run_id}/guidance that sends the aggregated results (low-scoring criteria, sample low-scoring prompt/response pairs) to the judge model with a meta-prompt asking for specific improvement recommendations
    - Guidance should include concrete suggestions such as:
      - System prompt modifications (e.g. "Add instructions to always cite sources")
      - Missing tool integrations (e.g. "The agent lacks a search tool for factual queries")
      - Model selection (e.g. "Consider using a more capable model for complex reasoning tasks")
      - Test case design (e.g. "Expected response for test case 3 may be too specific")
    - Display guidance as a bulleted list grouped by criterion, with a "Regenerate Guidance" button
- Add an EvaluationHistory component that lists past runs for a suite:
  - Table showing run date, overall score, status, and a link to view full results
  - Allow comparison between two runs (side-by-side score diff) to show improvement or regression
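The summary arithmetic behind the report can be illustrated with a short sketch. The helper names below are hypothetical, and the sketch assumes every completed result carries a score for every criterion, so that weighting the per-criterion means gives the same value as a weighted average over all individual scores. The color bands and the 3.5 default threshold come from the requirements above.

```python
def criterion_averages(results, criteria):
    """Mean score per criterion across completed test-case results.

    results: list of {criterion_name: score} dicts (the parsed `scores` JSON).
    """
    averages = {}
    for c in criteria:
        values = [r[c["name"]] for r in results if c["name"] in r]
        if values:
            averages[c["name"]] = sum(values) / len(values)
    return averages


def overall_score(averages, criteria):
    """Weighted average of the per-criterion means, using each criterion's weight."""
    pairs = [(averages[c["name"]], c["weight"]) for c in criteria if c["name"] in averages]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None  # nothing scored, or all weights are zero
    return sum(s * w for s, w in pairs) / total_weight


def color_band(score):
    """Map an individual 1-5 score to the report's color coding."""
    if score >= 4:
        return "green"
    if score >= 3:
        return "yellow"
    return "red"


def criteria_needing_guidance(averages, threshold=3.5):
    """Names of criteria whose average falls below the guidance threshold."""
    return sorted(name for name, avg in averages.items() if avg < threshold)
```

With two results scoring relevance 4 and 2 and helpfulness 5 and 4, and weights of 1.0 and 3.0, the means are 3.0 and 4.5, the overall score is (3.0 × 1.0 + 4.5 × 3.0) / 4.0 = 4.125, and only relevance falls below the 3.5 threshold.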
R3: Runtime Evaluations with Dashboard
Builder should be able to optionally enable evaluations for runtime and see dashboards of evaluation scores over time.
- Add a runtime_eval_enabled boolean field and runtime_eval_config (JSON text) field to the EvaluationSuite model:
  - runtime_eval_config contains: sample_rate (float, 0.0-1.0, default 0.1; the fraction of invocations to evaluate), judge_model_id (string), criteria (reuses suite criteria)
- When runtime evaluation is enabled for a suite:
  - After each agent invocation completes (in the SSE streaming endpoint in invocations.py, after session_end), check if the agent has any suites with runtime_eval_enabled=True
  - Based on sample_rate, probabilistically decide whether to evaluate this invocation
  - If selected, queue a background evaluation task that sends the invocation's prompt/response to the judge model and stores the result as an EvaluationResult linked to both the run (a synthetic "runtime" run per day or per batch) and the original invocation
- Create a RuntimeEvalRun or extend EvaluationRun with a run_type field (on_demand vs runtime) to distinguish manually triggered runs from automated runtime evaluations
- Create backend endpoints:
  - PUT /api/agents/{agent_id}/evaluations/suites/{suite_id}/runtime: enable/disable runtime evaluation and configure sample rate
  - GET /api/agents/{agent_id}/evaluations/runtime/scores returns time-series evaluation scores:
    - Accepts start_date, end_date, granularity (hourly, daily, weekly) query parameters
    - Returns per-criterion average scores bucketed by time period
- Create frontend components:
  - Runtime evaluation toggle on the suite configuration form: switch to enable, sample rate slider (1%-100%), judge model selector
  - Evaluation Dashboard section on the AgentDetailPage (shown when runtime eval is enabled):
    - Time-series line chart showing per-criterion scores over time (x-axis: date, y-axis: score 1-5, one line per criterion)
    - Time range selector (last 7 days, 30 days, 90 days)
    - Summary statistics: current average vs. previous period, trend indicator (improving/stable/regressing)
    - Drill-down: clicking a data point shows the individual evaluated invocations for that time bucket with their scores
    - Alert indicators: if any criterion's rolling average drops below the threshold (default 3.5), show a warning badge on the Evaluations section header and in the dashboard
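Two pieces of the runtime flow lend themselves to a small sketch: the sampling decision and the time bucketing behind the scores endpoint. The function names are hypothetical, the data is passed in as plain tuples rather than ORM rows, and starting weekly buckets on Monday is an assumption, since the requirements do not say where a week begins.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def should_evaluate(sample_rate, rng=random):
    """Return True for roughly `sample_rate` of calls (0.0 = never, 1.0 = always).

    rng.random() yields values in [0.0, 1.0), so the comparison is exact at both ends.
    """
    return rng.random() < sample_rate


def bucket_start(ts, granularity):
    """Truncate a timestamp to the start of its hourly, daily, or weekly bucket."""
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return day
    if granularity == "weekly":
        # Assumption: weeks start on Monday (datetime.weekday() == 0).
        return day - timedelta(days=day.weekday())
    raise ValueError(f"unknown granularity: {granularity!r}")


def bucket_scores(rows, granularity):
    """Average per-criterion scores within each time bucket.

    rows: iterable of (timestamp, {criterion_name: score}) pairs.
    Returns {bucket_start: {criterion_name: average}} ordered by bucket.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ts, scores in rows:
        key = bucket_start(ts, granularity)
        for name, value in scores.items():
            sums[key][name] += value
            counts[key][name] += 1
    return {
        key: {name: sums[key][name] / counts[key][name] for name in sums[key]}
        for key in sorted(sums)
    }
```

Accepting the random source as a parameter makes the sampling decision reproducible in tests by passing a seeded random.Random instance.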
Testing
- Run backend tests: cd backend && make test
- Run frontend typecheck: cd frontend && npx tsc --noEmit
- Verify on-demand evaluation:
  - Create an evaluation suite with 2-3 criteria and 3-5 test cases
  - Trigger a run and verify it progresses through pending -> running -> complete
  - Each test case produces an invocation, agent response, and judge scores
  - Summary scores aggregate correctly across test cases
- Verify evaluation report:
  - Results display per-criterion scores with correct color coding
  - Expanding a test case row shows full prompt, response, and judge reasoning
  - Improvement guidance generates actionable suggestions for low-scoring criteria
  - Evaluation history shows past runs with scores
- Verify runtime evaluation:
  - Enable runtime eval on a suite with a sample rate of 1.0 (100%) for testing
  - Invoke the agent several times and verify evaluation results are created for each invocation
  - The time-series endpoint returns correct score buckets
  - Dashboard chart renders scores over time
  - Reduce sample rate to 0.5 and verify approximately half of invocations are evaluated
  - Disable runtime eval and verify no new evaluations are created
- Database:
  - New tables are created without affecting existing tables
  - Deleting a suite cascades to test cases, runs, and results
  - Deleting an agent cascades to evaluation suites
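The cascade checks can be exercised against an in-memory SQLite database, as in the sketch below. It uses the standard sqlite3 module instead of the project's SQLAlchemy models, trims each table down to its keys, and assumes the table name evaluation_results, which the requirements do not state. One detail worth noting: SQLite leaves foreign key enforcement off by default, so ON DELETE CASCADE has no effect unless PRAGMA foreign_keys is enabled on the connection; the application's database connections would likely need the same setting.

```python
import sqlite3

# Key columns only; the evaluation_results table name is an assumption.
SCHEMA = """
CREATE TABLE agents (id INTEGER PRIMARY KEY);
CREATE TABLE evaluation_suites (
    id INTEGER PRIMARY KEY,
    agent_id INTEGER NOT NULL REFERENCES agents(id) ON DELETE CASCADE
);
CREATE TABLE evaluation_test_cases (
    id INTEGER PRIMARY KEY,
    suite_id INTEGER NOT NULL REFERENCES evaluation_suites(id) ON DELETE CASCADE
);
CREATE TABLE evaluation_runs (
    id INTEGER PRIMARY KEY,
    suite_id INTEGER NOT NULL REFERENCES evaluation_suites(id) ON DELETE CASCADE
);
CREATE TABLE evaluation_results (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES evaluation_runs(id) ON DELETE CASCADE,
    test_case_id INTEGER NOT NULL REFERENCES evaluation_test_cases(id) ON DELETE CASCADE
);
"""

EVAL_TABLES = (
    "evaluation_suites",
    "evaluation_test_cases",
    "evaluation_runs",
    "evaluation_results",
)


def open_db():
    """Create an in-memory database with foreign key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def seed(conn):
    """Insert one agent with one suite, test case, run, and result."""
    conn.execute("INSERT INTO agents (id) VALUES (1)")
    conn.execute("INSERT INTO evaluation_suites (id, agent_id) VALUES (1, 1)")
    conn.execute("INSERT INTO evaluation_test_cases (id, suite_id) VALUES (1, 1)")
    conn.execute("INSERT INTO evaluation_runs (id, suite_id) VALUES (1, 1)")
    conn.execute(
        "INSERT INTO evaluation_results (id, run_id, test_case_id) VALUES (1, 1, 1)"
    )


def row_counts(conn):
    """Row count for each evaluation table."""
    return {
        table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in EVAL_TABLES
    }
```

After seeding, deleting the suite row (DELETE FROM evaluation_suites WHERE id = 1) should leave every count at zero while the agent row remains; deleting the agent row instead should clear the suites and everything beneath them.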
Out of Scope
- Custom judge prompts (the scoring prompt is system-defined based on criteria)
- Multi-turn conversation evaluation (each test case is a single prompt/response)
- Evaluation of tool use quality (only the final text response is scored)
- Automated remediation (changing agent config based on eval results)
- Cost tracking for judge model invocations
- Exporting evaluation results (CSV, JSON)
- Comparison across different agents (evaluations are scoped to a single agent)
- Integration with external evaluation frameworks (e.g. RAGAS, DeepEval)