From 29ec3c4c4d115444e97c32e89ac0a21c031b480f Mon Sep 17 00:00:00 2001
From: Paulo Lacerda <pclacerda@gmail.com>
Date: Wed, 10 Jun 2026 12:32:55 -0300
Subject: [PATCH] docs: threshold the aggregate rubric metric, not the
 dimension ids

azd ai agent eval emits one aggregate pass-rate metric per evaluator
(coherence, fluency, smoke-core), not one metric per rubric dimension.
Step 11.3 previously instructed readers to set thresholds on the
dimension ids (correct_itinerary, adherence_to_constraints,
clear_practical_notes), which always fails with
`threshold metric(s) not found in azd results`.

Switch the example thresholds to the evaluator names azd actually emits
(0..1 pass-rate scale) and add a callout explaining why dimension-level
thresholds are not supported today.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/tutorial-prompt-agent-quickstart.md | 26 +++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)
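
Reviewer note (not part of the commit): the binding rule this patch
documents can be sketched as a quick local check. This is a minimal
sketch, not AgentOps or azd code. It assumes `aggregate_metrics` in
`results.json` is a mapping keyed by evaluator name, which the tutorial
text implies but does not spell out. The helper names
`unbound_threshold_keys` and `load_results` are hypothetical, and the
sample values are illustrative rather than real azd output.

```python
import json
from pathlib import Path


def unbound_threshold_keys(thresholds, results):
    """Return threshold keys that have no matching aggregate metric."""
    # Assumption: aggregate_metrics is a mapping keyed by evaluator name.
    emitted = set(results.get("aggregate_metrics", {}))
    return sorted(key for key in thresholds if key not in emitted)


def load_results(path=".agentops/results/latest/results.json"):
    """Load a results file; the default path is the one named in the tutorial."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Illustrative data only, not real azd output.
sample_results = {
    "aggregate_metrics": {"smoke-core": 0.8, "coherence": 0.7, "fluency": 0.9}
}
old_thresholds = {
    "correct_itinerary": ">=4",
    "adherence_to_constraints": ">=4",
    "clear_practical_notes": ">=4",
}
new_thresholds = {"smoke-core": ">=0.6", "coherence": ">=0.6", "fluency": ">=0.6"}

# Dimension ids never appear among the emitted metrics, so all three are unbound.
print(unbound_threshold_keys(old_thresholds, sample_results))
# Evaluator names match the emitted metrics, so nothing is left unbound.
print(unbound_threshold_keys(new_thresholds, sample_results))
```

To run the same check against a real run, pass `load_results()` in place
of `sample_results`, provided the file has the shape assumed above.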

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
index 24d8e75..e23b868 100644
--- a/docs/tutorial-prompt-agent-quickstart.md
+++ b/docs/tutorial-prompt-agent-quickstart.md
@@ -970,11 +970,22 @@ rubrics:
         weight: 0.2
 
 thresholds:
-  correct_itinerary: ">=4"
-  adherence_to_constraints: ">=4"
-  clear_practical_notes: ">=4"
+  smoke-core: ">=0.6"
+  coherence: ">=0.6"
+  fluency: ">=0.6"
 ```
 
+> **Why threshold the evaluator, not the dimensions?** `azd ai agent
+> eval` emits one aggregate pass-rate metric per evaluator
+> (`coherence`, `fluency`, `smoke-core`), not one metric per rubric
+> dimension. The dimension `id`s live inside the local rubric file and
+> guide the judge's prompt, but azd does not surface them as separate
+> metrics today, so thresholds bind to the evaluator names azd actually
+> reports. The `rubrics:` block above is still recorded in
+> `results.json` and the release evidence pack as documentation of what
+> the judge was asked to score. Values are pass rates in `0..1` (e.g.
+> `">=0.6"` means at least 60% of rows passed the evaluator).
+
 **4. Regenerate the recipe and re-run the gate:**
 
 ```powershell
@@ -983,10 +994,11 @@ agentops eval run
 ```
 
 When this passes, the gate enforces both the conversation-context dataset
-and the Travel Agent rubric thresholds. If a dimension name is wrong,
-AgentOps cannot bind the threshold to an emitted metric — open
-`.agentops/results/latest/results.json` to see which rubric metric names
-azd actually produced.
+and the Travel Agent rubric pass-rate threshold. If a threshold key is
+wrong, AgentOps cannot bind it to an emitted metric — open
+`.agentops/results/latest/results.json` and look at
+`aggregate_metrics` to see exactly which evaluator names azd produced
+for this recipe.
 
 ## 12. Add ASSERT and Red Team to the release gate