From de7e5eec20772d3b69de95f26c0c4b14da735bab Mon Sep 17 00:00:00 2001
From: Paulo Lacerda
Date: Wed, 10 Jun 2026 13:44:10 -0300
Subject: [PATCH] docs: teach how to inspect each eval run in the Foundry portal

The tutorial previously stopped at the terminal's aggregate pass/fail
output, which makes the eval steps feel like a black box. Add a reusable
`See the run in the Foundry portal` section at the end of step 10
(deep-link via results report_url + manual nav + what each pane shows)
and link back to it from the step 11 multi-turn re-run and the step 11.4
rubric re-run. Step 11.4 also gets an explicit walkthrough of the
`View rubric details` modal so users see the per-dimension judge scores
even though the CLI threshold lives on the aggregate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/tutorial-prompt-agent-quickstart.md | 53 ++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
index e23b868..2897a65 100644
--- a/docs/tutorial-prompt-agent-quickstart.md
+++ b/docs/tutorial-prompt-agent-quickstart.md
@@ -805,6 +805,33 @@ You should see `execution: azd` and `Threshold status: PASSED`. The raw azd
 run details are kept under `.agentops/results/latest/` alongside AgentOps'
 normalized `results.json` and `report.md`.
 
+### See the run in the Foundry portal
+
+`agentops eval run` only prints aggregate pass/fail to the terminal. The
+Foundry portal shows the full per-row, per-evaluator breakdown — useful
+for learning what the judge actually scored and why. Use this anchor
+section any time the tutorial tells you to run an eval.
+
+1. **Open the deep link** — easiest path. Look in
+   `.agentops/results/latest/azd_evaluation.json` for the `report_url`
+   field. That URL goes straight to the evaluation run in the New
+   Foundry experience.
+2. **Or navigate manually** in :
+   1. Pick the `travel-agent-sandbox` project (top selector).
+   2. **Agents** → select **`travel-agent`**.
+   3. Open the **Evaluations** tab.
+   4. Click the most recent run (named after the evaluator, e.g.
+      `smoke-core`).
+3. **What to look at on the run page:**
+   - **Overall metric results** — the aggregate pass rate per evaluator
+     (matches the values AgentOps reports under `aggregate_metrics`).
+   - **Detailed metrics results** — one row per dataset sample with the
+     pass/fail for `coherence`, `fluency`, and the local rubric
+     (`smoke-core`).
+
+> **Tip:** keep this tab open as you iterate. Every new
+> `agentops eval run` creates a new evaluation run in the same list.
+
 ## 11. Harden the gate: conversation-aware dataset and rubric
 
 The smoke gate proves the workspace works. Before generating CI, harden
@@ -850,6 +877,14 @@ agentops eval run
 When it passes, `results.json` records `execution: azd`, the evaluator
 list, the multi-turn dataset kind, and the threshold results.
 
+> **See it in the Foundry portal.** Open the new evaluation run using
+> the deep link in `.agentops/results/latest/azd_evaluation.json`
+> (`report_url`) or the manual nav described in
+> [See the run in the Foundry portal](#see-the-run-in-the-foundry-portal).
+> The **Detailed metrics results** table now shows one row per
+> multi-turn sample, so you can compare how the agent handled the Rome
+> and Lisbon/Seattle scenarios independently.
+
 > **What did this gate test?** Individual synthetic conversation-context
 > turns, not the Foundry portal **Full conversations** preview. AgentOps
 > uses `messages` to preserve the conversation shape and
@@ -1000,6 +1035,24 @@ wrong, AgentOps cannot bind it to an emitted metric — open
 `aggregate_metrics` to see exactly which evaluator names azd produced for
 this recipe.
 
+> **See the per-dimension rubric scores in the Foundry portal.** The
+> CLI threshold lives on the `smoke-core` aggregate, but Foundry still
+> records every dimension the judge scored. Open the run as in
+> [See the run in the Foundry portal](#see-the-run-in-the-foundry-portal),
+> scroll to **Detailed metrics results**, find the `smoke-core` column,
+> and click **View rubric details** on any row. The modal shows:
+>
+> - The aggregated rubric score (e.g. `0.92 / 1.0`).
+> - The judge's free-text explanation of the overall result.
+> - One row per dimension (`correct_itinerary`, `clear_practical_notes`,
+>   `user_satisfaction`, `adherence_to_constraints`,
+>   `itinerary_clarity`, `general_quality`) with the individual score
+>   (1–5), pass/fail badge, and the judge's reason for that dimension.
+>
+> This is the most useful drill-down when you are iterating on the
+> rubric file: it tells you not just *whether* the rubric passed, but
+> *which dimension* drove the result on each sample.
+
 ## 12. Add ASSERT and Red Team to the release gate
 
 The eval gate proves quality. Two additional release-readiness signals