From 29ec3c4c4d115444e97c32e89ac0a21c031b480f Mon Sep 17 00:00:00 2001
From: Paulo Lacerda
Date: Wed, 10 Jun 2026 12:32:55 -0300
Subject: [PATCH] docs: threshold the aggregate rubric metric, not the dimension ids

azd ai agent eval emits one aggregate pass-rate metric per evaluator
(coherence, fluency, smoke-core), not one metric per rubric dimension.
Step 11.3 previously instructed readers to set thresholds on the
dimension ids (correct_itinerary, adherence_to_constraints,
clear_practical_notes), which always fails with `threshold metric(s)
not found in azd results`.

Switch the example thresholds to the evaluator names azd actually
emits (0..1 pass-rate scale) and add a callout explaining why
dimension-level thresholds are not supported today.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/tutorial-prompt-agent-quickstart.md | 26 +++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
index 24d8e75..e23b868 100644
--- a/docs/tutorial-prompt-agent-quickstart.md
+++ b/docs/tutorial-prompt-agent-quickstart.md
@@ -970,11 +970,22 @@ rubrics:
     weight: 0.2
 
 thresholds:
-  correct_itinerary: ">=4"
-  adherence_to_constraints: ">=4"
-  clear_practical_notes: ">=4"
+  smoke-core: ">=0.6"
+  coherence: ">=0.6"
+  fluency: ">=0.6"
 ```
 
+> **Why threshold the evaluator, not the dimensions?** `azd ai agent
+> eval` emits one aggregate pass-rate metric per evaluator
+> (`coherence`, `fluency`, `smoke-core`), not one metric per rubric
+> dimension. The dimension `id`s live inside the local rubric file and
+> guide the judge's prompt, but azd does not surface them as separate
+> metrics today, so thresholds bind to the evaluator names azd actually
+> reports. The `rubrics:` block above is still recorded in
+> `results.json` and the release evidence pack as documentation of what
+> the judge was asked to score. Values are pass rates in `0..1` (e.g.
+> `">=0.6"` means at least 60% of rows passed the evaluator).
+
 **4. Regenerate the recipe and re-run the gate:**
 
 ```powershell
@@ -983,10 +994,11 @@ agentops eval run
 ```
 
 When this passes, the gate enforces both the conversation-context dataset
-and the Travel Agent rubric thresholds. If a dimension name is wrong,
-AgentOps cannot bind the threshold to an emitted metric — open
-`.agentops/results/latest/results.json` to see which rubric metric names
-azd actually produced.
+and the Travel Agent rubric pass-rate threshold. If a threshold key is
+wrong, AgentOps cannot bind it to an emitted metric — open
+`.agentops/results/latest/results.json` and look at
+`aggregate_metrics` to see exactly which evaluator names azd produced
+for this recipe.
 
 ## 12. Add ASSERT and Red Team to the release gate
 
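
Reviewer note on the patch above: the binding rule it documents (a threshold key must match an emitted evaluator name, and values are pass rates in `0..1`) can be illustrated with a small Python sketch. This is not AgentOps's or azd's actual implementation. The function name `check_thresholds`, the set of comparison operators, the flat name-to-value shape assumed for `aggregate_metrics`, and the sample pass rates are all hypothetical and chosen only for illustration.

```python
import operator
import re

# Comparison operators a threshold expression may start with (assumed set).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Matches expressions such as ">=0.6" or ">= 4".
_EXPR = re.compile(r"^\s*(>=|<=|==|>|<)\s*([0-9]*\.?[0-9]+)\s*$")


def check_thresholds(thresholds, aggregate_metrics):
    """Return {metric: passed}; raise KeyError if a threshold key was not emitted."""
    missing = sorted(k for k in thresholds if k not in aggregate_metrics)
    if missing:
        raise KeyError("threshold metric(s) not found: " + ", ".join(missing))
    outcome = {}
    for name, expr in thresholds.items():
        match = _EXPR.match(expr)
        if match is None:
            raise ValueError(f"unparseable threshold for {name!r}: {expr!r}")
        compare, bound = _OPS[match.group(1)], float(match.group(2))
        outcome[name] = compare(aggregate_metrics[name], bound)
    return outcome


# Hypothetical pass rates on the 0..1 scale, keyed by evaluator name.
emitted = {"smoke-core": 0.75, "coherence": 0.9, "fluency": 0.5}

# Keys from the "+" side of the patch bind to emitted evaluator names.
new_keys = {"smoke-core": ">=0.6", "coherence": ">=0.6", "fluency": ">=0.6"}
print(check_thresholds(new_keys, emitted))

# A key from the "-" side is a rubric dimension id, so nothing binds to it.
try:
    check_thresholds({"correct_itinerary": ">=4"}, emitted)
except KeyError as err:
    print(err.args[0])
```

Under these assumptions, a dimension id fails before any comparison happens, which matches the failure mode described in the commit message. It also shows why the old `">=4"` values would not suit a `0..1` pass rate even if the keys had bound.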