Azure · placerda · Jun 10, 2026
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.18",
+      "version": "0.3.19",
       "keywords": [
         "agentops",
         "evaluation",

diff --git a/.github/plugin/marketplace.json b/.github/plugin/marketplace.json
@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.18",
+      "version": "0.3.19",
       "keywords": [
         "agentops",
         "evaluation",

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,29 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 
 ## [Unreleased]
 
+## [0.3.19] - 2026-06-10
+
+### Fixed
+- **`execution: azd` reports no longer ship empty `Dataset:` lines and empty
+  `## Rows` tables.** The `eval.yaml` parser now recognizes the `dataset_file:`
+  field that `azd ai agent eval init` emits, so `report.md` shows the actual
+  dataset path. When azd returns aggregate metrics only (the normal case), the
+  reporter omits the row tables entirely and instead emits a `## Per-row
+  breakdown` section that links to the Foundry run for the per-sample view.
+- **`agentops eval run` prints a clickable Foundry deep link on success.**
+  After a successful azd run, the CLI now emits a `Foundry run: <url>` line
+  alongside the `results.json`/`report.md` paths so users can jump straight to
+  the per-sample table and rubric drill-downs in the Foundry portal.
+
+### Changed
+- **Shorter azd backend log line.** Replaced the verbose `Running azd backend:
+  azd --no-prompt ai agent eval run --config <long path> --output json` line
+  with a concise `Running azd backend: azd ai agent eval run`; the full
+  command remains captured in the per-failure debug logs introduced in 0.3.18.
+- **`execution: azd` startup line uses a workspace-relative recipe path** so
+  the "delegating to azd ai agent eval" message stays readable on long
+  Windows paths.
+
 ## [0.3.18] - 2026-06-10
 
 ### Fixed

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
@@ -795,15 +795,6 @@ eval_recipe: src/travel-agent/eval.yaml
 Use `--force` only when you intentionally want to regenerate an existing
 `eval.yaml`. For the normal flow, run it without `--force`.
 
-> **What is `smoke-core`?** In the generated `src/travel-agent/eval.yaml`,
-> azd may include an evaluator like `name: smoke-core` with
-> `local_uri: evaluators\smoke-core\rubric_dimensions.json`. That is the
-> local rubric evaluator generated for this quickstart's smoke gate. The
-> built-in evaluators (`builtin.coherence`, `builtin.fluency`) check
-> general response quality; `smoke-core` points at rubric dimensions
-> specific to this Travel Agent. When you add `rubrics:` to
-> `agentops.yaml` later, use the evaluator name that appears here.
-
 Run the gate locally:
 
 ```powershell
@@ -814,6 +805,33 @@ You should see `execution: azd` and `Threshold status: PASSED`. The raw
 azd run details are kept under `.agentops/results/latest/` alongside
 AgentOps' normalized `results.json` and `report.md`.
 
+### See the run in the Foundry portal
+
+`agentops eval run` prints only aggregate pass/fail results to the
+terminal. The Foundry portal shows the full per-row, per-evaluator
+breakdown, which helps you see what the judge actually scored and why.
+Come back to this section whenever the tutorial tells you to run an eval.
+
+1. **Open the deep link** (the easiest path). Copy the `Foundry run:` URL
+   that `agentops eval run` prints, or read the `report_url` field in
+   `.agentops/results/latest/azd_evaluation.json`. That URL goes straight
+   to the evaluation run in the New Foundry experience.
+2. **Or navigate manually** in <https://ai.azure.com>:
+   1. Pick the `travel-agent-sandbox` project (top selector).
+   2. **Agents** → select **`travel-agent`**.
+   3. Open the **Evaluations** tab.
+   4. Click the most recent run (named after the evaluator, e.g.
+      `smoke-core`).
+3. **What to look at on the run page:**
+   - **Overall metric results** — the aggregate pass rate per evaluator
+     (matches the values AgentOps reports under `aggregate_metrics`).
+   - **Detailed metrics results** — one row per dataset sample with the
+     pass/fail for `coherence`, `fluency`, and the local rubric
+     (`smoke-core`).
+
+> **Tip:** Keep this tab open as you iterate. Every new
+> `agentops eval run` creates a new evaluation run in the same list.
+
 ## 11. Harden the gate: conversation-aware dataset and rubric
 
 The smoke gate proves the workspace works. Before generating CI, harden
@@ -859,6 +877,14 @@ agentops eval run
 When it passes, `results.json` records `execution: azd`, the evaluator
 list, the multi-turn dataset kind, and the threshold results.
 
+> **See it in the Foundry portal.** Open the new evaluation run using
+> the deep link in `.agentops/results/latest/azd_evaluation.json`
+> (`report_url`) or the manual navigation described in
+> [See the run in the Foundry portal](#see-the-run-in-the-foundry-portal).
+> The **Detailed metrics results** table now shows one row per
+> multi-turn sample, so you can compare how the agent handled the Rome
+> and Lisbon/Seattle scenarios independently.
+
 > **What did this gate test?** Individual synthetic conversation-context
 > turns, not the Foundry portal **Full conversations** preview. AgentOps
 > uses `messages` to preserve the conversation shape and
@@ -911,6 +937,15 @@ Fill in two kinds of real names: the rubric evaluator name and the rubric
 dimension names. Do not invent values — both must come from files
 `agentops eval init` already generated on disk.
 
+> **About the auto-generated evaluator.** When you ran `agentops eval
+> init`, azd seeded `src/travel-agent/eval.yaml` with two kinds of
+> evaluators: built-ins like `builtin.coherence` and `builtin.fluency`
+> (general response-quality checks) plus a local rubric evaluator —
+> typically `name: smoke-core` — whose `local_uri` points at a JSON file
+> with rubric dimensions specific to this Travel Agent. That local
+> evaluator is what AgentOps `rubrics:` entries bind to. You will reference
+> its `name:` and its dimension `id`s in the next two steps.
+
 **1. Find the evaluator name.** Open `src/travel-agent/eval.yaml` and
 look under `evaluators:` for the entry with a `local_uri`:
 
@@ -970,11 +1005,22 @@ rubrics:
         weight: 0.2
 
 thresholds:
-  correct_itinerary: ">=4"
-  adherence_to_constraints: ">=4"
-  clear_practical_notes: ">=4"
+  smoke-core: ">=0.6"
+  coherence: ">=0.6"
+  fluency: ">=0.6"
 ```
 
+> **Why threshold the evaluator, not the dimensions?** `azd ai agent
+> eval` emits one aggregate pass-rate metric per evaluator
+> (`coherence`, `fluency`, `smoke-core`), not one metric per rubric
+> dimension. The dimension `id`s live inside the local rubric file and
+> guide the judge's prompt, but azd does not surface them as separate
+> metrics today, so thresholds bind to the evaluator names azd actually
+> reports. The `rubrics:` block above is still recorded in
+> `results.json` and the release evidence pack as documentation of what
+> the judge was asked to score. Values are pass rates in `0..1` (e.g.
+> `">=0.6"` means at least 60% of rows passed the evaluator).
+
 **4. Regenerate the recipe and re-run the gate:**
 
 ```powershell
@@ -983,10 +1029,29 @@ agentops eval run
 ```
 
 When this passes, the gate enforces both the conversation-context dataset
-and the Travel Agent rubric thresholds. If a dimension name is wrong,
-AgentOps cannot bind the threshold to an emitted metric — open
-`.agentops/results/latest/results.json` to see which rubric metric names
-azd actually produced.
+and the Travel Agent rubric pass-rate threshold. If a threshold key is
+wrong, AgentOps cannot bind it to an emitted metric — open
+`.agentops/results/latest/results.json` and look at
+`aggregate_metrics` to see exactly which evaluator names azd produced
+for this recipe.
+
+> **See the per-dimension rubric scores in the Foundry portal.** The
+> CLI threshold lives on the `smoke-core` aggregate, but Foundry still
+> records every dimension the judge scored. Open the run as in
+> [See the run in the Foundry portal](#see-the-run-in-the-foundry-portal),
+> scroll to **Detailed metrics results**, find the `smoke-core` column,
+> and click **View rubric details** on any row. The modal shows:
+>
+> - The aggregated rubric score (e.g. `0.92 / 1.0`).
+> - The judge's free-text explanation of the overall result.
+> - One row per dimension (`correct_itinerary`, `clear_practical_notes`,
+>   `user_satisfaction`, `adherence_to_constraints`,
+>   `itinerary_clarity`, `general_quality`) with the individual score
+>   (1–5), pass/fail badge, and the judge's reason for that dimension.
+>
+> This is the most useful drill-down when you are iterating on the
+> rubric file: it tells you not just *whether* the rubric passed, but
+> *which dimension* drove the result on each sample.
 
 ## 12. Add ASSERT and Red Team to the release gate
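
Review note (not part of the patch): the tutorial change above moves the thresholds from rubric dimensions (`">=4"`) to evaluator-level pass rates in `0..1` (`">=0.6"`). The sketch below illustrates how a threshold string of that shape could be checked against `aggregate_metrics`, and why a key that azd never emits cannot be bound. `check_threshold`, `evaluate`, and the sample metric values are hypothetical; the actual AgentOps threshold parser is not shown in this diff and may behave differently.

```python
import operator
import re

# Hypothetical comparison table; AgentOps may support a different set.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_PATTERN = re.compile(r"^\s*(>=|<=|==|>|<)\s*([0-9]*\.?[0-9]+)\s*$")


def check_threshold(expression: str, value: float) -> bool:
    """Return True when `value` satisfies an expression such as '>=0.6'."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"unsupported threshold expression: {expression!r}")
    op, limit = match.group(1), float(match.group(2))
    return _OPS[op](value, limit)


def evaluate(thresholds: dict, aggregate_metrics: dict) -> dict:
    """Map each threshold key to True/False, or None when no metric matches."""
    results = {}
    for name, expression in thresholds.items():
        if name not in aggregate_metrics:
            # A dimension id such as correct_itinerary lands here, because
            # only evaluator-level metrics are reported.
            results[name] = None
            continue
        results[name] = check_threshold(expression, aggregate_metrics[name])
    return results


# Made-up pass rates, for illustration only.
thresholds = {"smoke-core": ">=0.6", "coherence": ">=0.6", "correct_itinerary": ">=4"}
aggregate_metrics = {"smoke-core": 0.8, "coherence": 0.5, "fluency": 1.0}
print(evaluate(thresholds, aggregate_metrics))
# {'smoke-core': True, 'coherence': False, 'correct_itinerary': None}
```

The `None` entry mirrors the failure mode the tutorial describes: a threshold keyed on a dimension name has no emitted metric to bind to.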
 

diff --git a/plugins/agentops/package.json b/plugins/agentops/package.json
@@ -2,7 +2,7 @@
   "name": "agentops-accelerator",
   "displayName": "AgentOps Accelerator — Skills for GitHub Copilot",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.18",
+  "version": "0.3.19",
   "publisher": "AgentOpsAccelerator",
   "icon": "icon.png",
   "license": "MIT",

diff --git a/plugins/agentops/plugin.json b/plugins/agentops/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "agentops-accelerator",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.18",
+  "version": "0.3.19",
   "author": {
     "name": "AgentOps Accelerator",
     "url": "https://github.com/Azure/agentops"

diff --git a/src/agentops/cli/app.py b/src/agentops/cli/app.py
@@ -2766,6 +2766,10 @@ def _run_flat_schema_eval(
     typer.echo(f"{_cli_label('report.md')}:    {_cli_path(output_dir / 'report.md')}")
     if latest_dir is not None:
         typer.echo(f"{_cli_label('latest/')}:      {_cli_path(latest_dir)}")
+    azd_eval = result.config.get("azd_evaluation") if isinstance(result.config, dict) else None
+    report_url = azd_eval.get("report_url") if isinstance(azd_eval, dict) else None
+    if isinstance(report_url, str) and report_url.strip():
+        typer.echo(f"{_cli_label('Foundry run')}:  {report_url.strip()}")
     if result.summary.overall_passed:
         typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}")
         return

diff --git a/src/agentops/core/azd_eval.py b/src/agentops/core/azd_eval.py
@@ -139,6 +139,7 @@ class EvalRecipe(BaseModel):
     name: Optional[str] = None
     agent: Optional[EvalAgent] = None
     dataset_reference: Optional[EvalDatasetReference] = None
+    dataset_file: Optional[str] = None
     evaluators: list[EvalEvaluator] = Field(default_factory=list)
     options: Optional[EvalOptions] = None
 

diff --git a/src/agentops/pipeline/azd_runner.py b/src/agentops/pipeline/azd_runner.py
@@ -118,7 +118,7 @@ def run_azd_eval(
         "--output",
         "json",
     ]
-    notify(f"Running azd backend: {' '.join(command)}")
+    notify("Running azd backend: azd ai agent eval run")
 
     started = time.perf_counter()
     completed = _run_command(
@@ -301,7 +301,9 @@ def normalize_to_results(
             "azd_evaluation": {
                 "recipe_path": str(azd_run.recipe_path),
                 "run_id": azd_run.run_id,
+                "eval_id": _extract_eval_id(azd_run.payload),
                 "status": azd_run.status,
+                "report_url": _extract_report_url(azd_run.payload),
                 "dataset": (
                     recipe.dataset_reference.model_dump(mode="json")
                     if recipe.dataset_reference
@@ -477,6 +479,14 @@ def _extract_status(payload: Dict[str, Any]) -> str:
     return "unknown"
 
 
+def _extract_report_url(payload: Dict[str, Any]) -> Optional[str]:
+    for key in ("report_url", "reportUrl", "report_uri", "url"):
+        value = payload.get(key)
+        if isinstance(value, str) and value.strip().lower().startswith(("http://", "https://")):
+            return value.strip()
+    return None
+
+
 def _extract_item_count(payload: Dict[str, Any]) -> int:
     for key in ("items_total", "item_count", "samples", "max_samples", "row_count"):
         value = payload.get(key)
@@ -585,6 +595,11 @@ def _looks_like_metric_name(name: str) -> bool:
 
 
 def _recipe_dataset_path(recipe: EvalRecipe, recipe_path: Path) -> str:
+    if recipe.dataset_file:
+        dataset = Path(recipe.dataset_file)
+        if not dataset.is_absolute():
+            dataset = recipe_path.parent / dataset
+        return str(dataset)
     ref = recipe.dataset_reference
     if ref and ref.local_uri:
         dataset = Path(ref.local_uri)
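
Review note (not part of the patch): the two helpers touched in this file can be exercised in isolation. The sketch below copies the lookup logic from `_extract_report_url` and the `dataset_file` branch of `_recipe_dataset_path` into standalone functions. The payloads and paths are made up for illustration; the real azd payload shape is not shown in this diff.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def extract_report_url(payload: Dict[str, Any]) -> Optional[str]:
    """Return the first top-level http(s) URL found under the known keys."""
    for key in ("report_url", "reportUrl", "report_uri", "url"):
        value = payload.get(key)
        if isinstance(value, str) and value.strip().lower().startswith(("http://", "https://")):
            return value.strip()
    return None


def resolve_dataset(dataset_file: str, recipe_path: Path) -> Path:
    """Resolve a relative dataset_file against the recipe's directory."""
    dataset = Path(dataset_file)
    if not dataset.is_absolute():
        dataset = recipe_path.parent / dataset
    return dataset


# Hypothetical payloads: whitespace is stripped, non-http schemes are ignored,
# and an empty report_url falls through to the next candidate key.
print(extract_report_url({"reportUrl": "  https://example.invalid/run/1  "}))
# https://example.invalid/run/1
print(extract_report_url({"url": "ftp://example.invalid"}))
# None
print(extract_report_url({"report_url": "", "url": "HTTPS://example.invalid/x"}))
# HTTPS://example.invalid/x

# Hypothetical recipe location; the dataset resolves next to eval.yaml.
print(resolve_dataset("data/smoke.jsonl", Path("src/travel-agent/eval.yaml")).as_posix())
# src/travel-agent/data/smoke.jsonl
```

One point worth confirming in review: the lookup only inspects top-level keys, so a URL nested under another object in the azd payload would not be found.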

diff --git a/src/agentops/pipeline/orchestrator.py b/src/agentops/pipeline/orchestrator.py
@@ -537,10 +537,14 @@ def _run_evaluation_azd(
 
     recipe_path = azd_runner.resolve_eval_recipe(workspace, config)
     recipe = load_eval_recipe(recipe_path)
+    try:
+        recipe_display = recipe_path.relative_to(workspace).as_posix()
+    except ValueError:
+        recipe_display = recipe_path.name
     progress(
         f"execution: {style('azd', 'bold')} - delegating to "
-        f"{style('azd ai agent eval', 'cyan')} with recipe "
-        f"{style(str(recipe_path), 'cyan')}."
+        f"{style('azd ai agent eval', 'cyan')} (recipe "
+        f"{style(recipe_display, 'cyan')})."
     )
 
     azd_run = azd_runner.run_azd_eval(
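
Review note (not part of the patch): the `try`/`except ValueError` fallback above relies on `relative_to` raising when the recipe lies outside the workspace. A minimal sketch of that behavior follows, using `PureWindowsPath` so it runs the same way on any OS. `display_path` and the example paths are hypothetical.

```python
from pathlib import PurePath, PureWindowsPath


def display_path(recipe_path: PurePath, workspace: PurePath) -> str:
    """Prefer a workspace-relative POSIX-style path; fall back to the file name."""
    try:
        return recipe_path.relative_to(workspace).as_posix()
    except ValueError:
        # The recipe is not under the workspace (for example, another drive).
        return recipe_path.name


workspace = PureWindowsPath(r"C:\Users\dev\projects\travel")

inside = PureWindowsPath(r"C:\Users\dev\projects\travel\src\travel-agent\eval.yaml")
print(display_path(inside, workspace))
# src/travel-agent/eval.yaml

outside = PureWindowsPath(r"D:\shared\eval.yaml")
print(display_path(outside, workspace))
# eval.yaml
```

Falling back to the bare file name keeps the log line short, at the cost of hiding where an out-of-workspace recipe actually lives.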

diff --git a/src/agentops/pipeline/reporter.py b/src/agentops/pipeline/reporter.py
@@ -24,7 +24,8 @@ def render(result: RunResult) -> str:
     lines.append(f"- **Target:** `{result.target.raw}` ({result.target.kind})")
     if result.target.protocol:
         lines.append(f"- **Protocol:** {result.target.protocol}")
-    lines.append(f"- **Dataset:** `{result.dataset_path}`")
+    if result.dataset_path:
+        lines.append(f"- **Dataset:** `{result.dataset_path}`")
     lines.append(f"- **Started:** {result.started_at}")
     lines.append(f"- **Duration:** {result.duration_seconds:.2f}s")
     lines.append(f"- **Rows:** {result.summary.items_total}")
@@ -62,22 +63,27 @@ def render(result: RunResult) -> str:
             lines.append(f"| {row.row_index} | {_short(row.error or '', 200)} |")
         lines.append("")
 
-    lines.append("## Rows")
-    lines.append("")
-    lines.append("| # | Latency (s) | Metrics |")
-    lines.append("| --- | --- | --- |")
-    for row in result.rows:
-        lines.append(_row_summary(row))
-    lines.append("")
-
     if result.rows:
+        lines.append("## Rows")
+        lines.append("")
+        lines.append("| # | Latency (s) | Metrics |")
+        lines.append("| --- | --- | --- |")
+        for row in result.rows:
+            lines.append(_row_summary(row))
+        lines.append("")
+
         lines.append("## Row Details")
         lines.append("")
         lines.append("| # | Input | Response | Expected |")
         lines.append("| --- | --- | --- | --- |")
         for row in result.rows:
             lines.append(_row_detail(row))
         lines.append("")
+    else:
+        azd_eval = result.config.get("azd_evaluation")
+        if isinstance(azd_eval, dict):
+            lines.extend(_render_azd_aggregate_note(azd_eval))
+            lines.append("")
 
     cloud = result.config.get("cloud_evaluation")
     if isinstance(cloud, dict):
@@ -120,6 +126,26 @@ def _short(text: str, limit: int) -> str:
     return text if len(text) <= limit else text[: limit - 1] + "…"
 
 
+def _render_azd_aggregate_note(azd: dict) -> List[str]:
+    lines = ["## Per-row breakdown", ""]
+    lines.append(
+        "`execution: azd` reports aggregate metrics only; per-row scores "
+        "are recorded by Foundry."
+    )
+    report_url = azd.get("report_url")
+    if isinstance(report_url, str) and report_url.strip():
+        lines.append("")
+        lines.append(f"**Open the run in Foundry:** {report_url.strip()}")
+    else:
+        lines.append("")
+        lines.append(
+            "Open the latest run in the Foundry portal "
+            "(Agents → your agent → Evaluations) to see the per-sample table "
+            "and rubric drill-downs."
+        )
+    return lines
+
+
 def _render_cloud_evaluation(cloud: dict) -> List[str]:
     lines = ["## Foundry Cloud Session", ""]
     status = str(cloud.get("status") or "unknown")
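
Review note (not part of the patch): to preview what the new `## Per-row breakdown` section looks like in `report.md`, the sketch below copies `_render_azd_aggregate_note` into a standalone function and renders both branches. The URL is a placeholder, not a real Foundry link.

```python
from typing import List


def render_azd_aggregate_note(azd: dict) -> List[str]:
    """Standalone copy of the helper added in this diff."""
    lines = ["## Per-row breakdown", ""]
    lines.append(
        "`execution: azd` reports aggregate metrics only; per-row scores "
        "are recorded by Foundry."
    )
    report_url = azd.get("report_url")
    if isinstance(report_url, str) and report_url.strip():
        lines.append("")
        lines.append(f"**Open the run in Foundry:** {report_url.strip()}")
    else:
        lines.append("")
        lines.append(
            "Open the latest run in the Foundry portal "
            "(Agents → your agent → Evaluations) to see the per-sample table "
            "and rubric drill-downs."
        )
    return lines


# With a deep link (placeholder URL): the section ends with the link line.
print("\n".join(render_azd_aggregate_note({"report_url": "https://example.invalid/run/1"})))
print()
# Without a deep link: the section ends with manual navigation instructions.
print("\n".join(render_azd_aggregate_note({})))
```

Both branches produce five lines: the heading, a blank line, the explanation, a blank line, and either the link or the manual navigation hint.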