Merged
2 changes: 1 addition & 1 deletion .claude-plugin/marketplace.json
@@ -13,7 +13,7 @@
"name": "agentops-accelerator",
"source": "../../plugins/agentops",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-"version": "0.3.18",
+"version": "0.3.19",
"keywords": [
"agentops",
"evaluation",
2 changes: 1 addition & 1 deletion .github/plugin/marketplace.json
@@ -13,7 +13,7 @@
"name": "agentops-accelerator",
"source": "../../plugins/agentops",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-"version": "0.3.18",
+"version": "0.3.19",
"keywords": [
"agentops",
"evaluation",
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,29 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres

## [Unreleased]

## [0.3.19] - 2026-06-10

### Fixed
- **`execution: azd` reports no longer ship empty `Dataset:` lines and empty
`## Rows` tables.** The `eval.yaml` parser now recognizes the `dataset_file:`
field that `azd ai agent eval init` emits, so `report.md` shows the actual
dataset path. When azd returns aggregate metrics only (the normal case), the
reporter omits the row tables entirely and instead emits a `## Per-row
breakdown` section that links to the Foundry run for the per-sample view.
- **`agentops eval run` prints a clickable Foundry deep link on success.**
After a successful azd run, the CLI now emits a `Foundry run: <url>` line
alongside the `results.json`/`report.md` paths so users can jump straight to
the per-sample table and rubric drill-downs in the Foundry portal.

### Changed
- **Shorter azd backend log line.** Replaced the verbose `Running azd backend:
azd --no-prompt ai agent eval run --config <long path> --output json` line
with a concise `Running azd backend: azd ai agent eval run`; the full
command remains captured in the per-failure debug logs introduced in 0.3.18.
- **`execution: azd` startup line uses a workspace-relative recipe path** so
the "delegating to azd ai agent eval" message stays readable on long
Windows paths.
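
The workspace-relative display described in the last entry can be sketched as below. This is a minimal illustration, not the project's API: the helper name `display_recipe_path` is hypothetical, and it assumes the same `relative_to` / `ValueError` fallback that this PR adds in `orchestrator.py`. Pure paths are used so the sketch behaves the same on any OS.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def display_recipe_path(recipe_path: PurePath, workspace: PurePath) -> str:
    """Return a short, readable form of recipe_path for log lines (hypothetical helper)."""
    try:
        # Inside the workspace: show the path relative to it, with forward slashes.
        return recipe_path.relative_to(workspace).as_posix()
    except ValueError:
        # Outside the workspace: fall back to the bare file name.
        return recipe_path.name


# Illustrative paths only; they are not taken from a real workspace.
inside = display_recipe_path(
    PureWindowsPath(r"C:\work\repo\src\travel-agent\eval.yaml"),
    PureWindowsPath(r"C:\work\repo"),
)
outside = display_recipe_path(
    PurePosixPath("/tmp/other/eval.yaml"),
    PurePosixPath("/home/user/repo"),
)
```

The fallback to the file name keeps the message short even when the recipe lives outside the workspace, at the cost of hiding its directory.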

## [0.3.18] - 2026-06-10

### Fixed
97 changes: 81 additions & 16 deletions docs/tutorial-prompt-agent-quickstart.md
@@ -795,15 +795,6 @@ eval_recipe: src/travel-agent/eval.yaml
Use `--force` only when you intentionally want to regenerate an existing
`eval.yaml`. For the normal flow, run it without `--force`.

-> **What is `smoke-core`?** In the generated `src/travel-agent/eval.yaml`,
-> azd may include an evaluator like `name: smoke-core` with
-> `local_uri: evaluators\smoke-core\rubric_dimensions.json`. That is the
-> local rubric evaluator generated for this quickstart's smoke gate. The
-> built-in evaluators (`builtin.coherence`, `builtin.fluency`) check
-> general response quality; `smoke-core` points at rubric dimensions
-> specific to this Travel Agent. When you add `rubrics:` to
-> `agentops.yaml` later, use the evaluator name that appears here.

Run the gate locally:

```powershell
@@ -814,6 +805,33 @@ You should see `execution: azd` and `Threshold status: PASSED`. The raw
azd run details are kept under `.agentops/results/latest/` alongside
AgentOps' normalized `results.json` and `report.md`.

### See the run in the Foundry portal

`agentops eval run` prints only the aggregate pass/fail result to the
terminal. The Foundry portal shows the full per-row, per-evaluator
breakdown, which is useful for learning what the judge actually scored and
why. Return to this section whenever the tutorial tells you to run an eval.

1. **Open the deep link** — easiest path. Look in
`.agentops/results/latest/azd_evaluation.json` for the `report_url`
field. That URL goes straight to the evaluation run in the New
Foundry experience.
2. **Or navigate manually** in <https://ai.azure.com>:
1. Pick the `travel-agent-sandbox` project (top selector).
2. **Agents** → select **`travel-agent`**.
3. Open the **Evaluations** tab.
4. Click the most recent run (named after the evaluator, e.g.
`smoke-core`).
3. **What to look at on the run page:**
- **Overall metric results** — the aggregate pass rate per evaluator
(matches the values AgentOps reports under `aggregate_metrics`).
- **Detailed metrics results** — one row per dataset sample with the
pass/fail for `coherence`, `fluency`, and the local rubric
(`smoke-core`).

> **Tip:** keep this tab open as you iterate. Every new
> `agentops eval run` creates a new evaluation run in the same list.

## 11. Harden the gate: conversation-aware dataset and rubric

The smoke gate proves the workspace works. Before generating CI, harden
@@ -859,6 +877,14 @@ agentops eval run
When it passes, `results.json` records `execution: azd`, the evaluator
list, the multi-turn dataset kind, and the threshold results.

> **See it in the Foundry portal.** Open the new evaluation run using
> the deep link in `.agentops/results/latest/azd_evaluation.json`
> (`report_url`) or the manual navigation described in
> [See the run in the Foundry portal](#see-the-run-in-the-foundry-portal).
> The **Detailed metrics results** table now shows one row per
> multi-turn sample, so you can compare how the agent handled the Rome
> and Lisbon/Seattle scenarios independently.

> **What did this gate test?** Individual synthetic conversation-context
> turns, not the Foundry portal **Full conversations** preview. AgentOps
> uses `messages` to preserve the conversation shape and
@@ -911,6 +937,15 @@ Fill in two kinds of real names: the rubric evaluator name and the rubric
dimension names. Do not invent values — both must come from files
`agentops eval init` already generated on disk.

> **About the auto-generated evaluator.** When you ran `agentops eval
> init`, azd seeded `src/travel-agent/eval.yaml` with two kinds of
> evaluators: built-ins like `builtin.coherence` and `builtin.fluency`
> (general response-quality checks) plus a local rubric evaluator —
> typically `name: smoke-core` — whose `local_uri` points at a JSON file
> with rubric dimensions specific to this Travel Agent. That local
> evaluator is what the AgentOps `rubrics:` entries bind to. You will
> reference its `name:` and its dimension `id`s in the next two steps.

**1. Find the evaluator name.** Open `src/travel-agent/eval.yaml` and
look under `evaluators:` for the entry with a `local_uri`:

@@ -970,11 +1005,22 @@ rubrics:
weight: 0.2

thresholds:
-  correct_itinerary: ">=4"
-  adherence_to_constraints: ">=4"
-  clear_practical_notes: ">=4"
+  smoke-core: ">=0.6"
+  coherence: ">=0.6"
+  fluency: ">=0.6"
```

> **Why threshold the evaluator, not the dimensions?** `azd ai agent
> eval` emits one aggregate pass-rate metric per evaluator
> (`coherence`, `fluency`, `smoke-core`), not one metric per rubric
> dimension. The dimension `id`s live inside the local rubric file and
> guide the judge's prompt, but azd does not surface them as separate
> metrics today, so thresholds bind to the evaluator names azd actually
> reports. The `rubrics:` block above is still recorded in
> `results.json` and the release evidence pack as documentation of what
> the judge was asked to score. Values are pass rates in `0..1` (e.g.
> `">=0.6"` means at least 60% of rows passed the evaluator).
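
To make the pass-rate semantics concrete, here is a minimal sketch of how a threshold such as `">=0.6"` could be checked against an evaluator's aggregate pass rate. The helper names `pass_rate` and `meets_threshold` are hypothetical, and support for operators other than `>=` is an assumption; this is not AgentOps' actual parser.

```python
import operator
import re
from typing import List

# Comparison operators this sketch understands (only ">=" appears in the tutorial).
_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_THRESHOLD = re.compile(r"^\s*(>=|<=|==|>|<)\s*([0-9]*\.?[0-9]+)\s*$")


def pass_rate(row_results: List[bool]) -> float:
    """Fraction of rows that passed an evaluator, in the range 0..1."""
    return sum(row_results) / len(row_results) if row_results else 0.0


def meets_threshold(expression: str, value: float) -> bool:
    """Evaluate a threshold expression such as '>=0.6' against a metric value."""
    match = _THRESHOLD.match(expression)
    if match is None:
        raise ValueError(f"Unsupported threshold expression: {expression!r}")
    symbol, bound = match.groups()
    return _OPERATORS[symbol](value, float(bound))


# Illustrative data: 3 of 5 rows passed, so the pass rate is 0.6.
rate = pass_rate([True, True, True, False, False])
passed = meets_threshold(">=0.6", rate)
```

Under this reading, the old `">=4"` dimension thresholds could never bind, because azd reports pass rates in `0..1` per evaluator rather than 1 to 5 scores per dimension.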

**4. Regenerate the recipe and re-run the gate:**

```powershell
@@ -983,10 +1029,29 @@ agentops eval run
```

When this passes, the gate enforces both the conversation-context dataset
-and the Travel Agent rubric thresholds. If a dimension name is wrong,
-AgentOps cannot bind the threshold to an emitted metric — open
-`.agentops/results/latest/results.json` to see which rubric metric names
-azd actually produced.
+and the Travel Agent rubric pass-rate threshold. If a threshold key is
+wrong, AgentOps cannot bind it to an emitted metric — open
+`.agentops/results/latest/results.json` and look at
+`aggregate_metrics` to see exactly which evaluator names azd produced
+for this recipe.
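
A quick way to spot a mis-keyed threshold is to compare the `thresholds:` keys with the metric names that were emitted. The sketch below assumes `aggregate_metrics` is a flat name-to-value mapping; the real layout of `results.json` may differ, and the helper name and sample values are hypothetical.

```python
from typing import Dict, List


def unbound_threshold_keys(
    thresholds: Dict[str, str], aggregate_metrics: Dict[str, float]
) -> List[str]:
    """Return threshold keys that match no emitted metric, sorted for stable output."""
    return sorted(key for key in thresholds if key not in aggregate_metrics)


# Made-up values for illustration; in practice, load both mappings from
# agentops.yaml and .agentops/results/latest/results.json.
aggregate_metrics = {"coherence": 0.9, "fluency": 1.0, "smoke-core": 0.8}
thresholds = {
    "smoke-core": ">=0.6",
    "coherence": ">=0.6",
    "correct_itinerary": ">=4",  # a dimension id, not an evaluator name
}
missing = unbound_threshold_keys(thresholds, aggregate_metrics)
```

Here `missing` contains only `correct_itinerary`, the dimension-level key that azd does not report as a separate metric.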

> **See the per-dimension rubric scores in the Foundry portal.** The
> CLI threshold lives on the `smoke-core` aggregate, but Foundry still
> records every dimension the judge scored. Open the run as in
> [See the run in the Foundry portal](#see-the-run-in-the-foundry-portal),
> scroll to **Detailed metrics results**, find the `smoke-core` column,
> and click **View rubric details** on any row. The modal shows:
>
> - The aggregated rubric score (e.g. `0.92 / 1.0`).
> - The judge's free-text explanation of the overall result.
> - One row per dimension (`correct_itinerary`, `clear_practical_notes`,
> `user_satisfaction`, `adherence_to_constraints`,
> `itinerary_clarity`, `general_quality`) with the individual score
> (1–5), pass/fail badge, and the judge's reason for that dimension.
>
> This is the most useful drill-down when you are iterating on the
> rubric file: it tells you not just *whether* the rubric passed, but
> *which dimension* drove the result on each sample.

## 12. Add ASSERT and Red Team to the release gate

2 changes: 1 addition & 1 deletion plugins/agentops/package.json
@@ -2,7 +2,7 @@
"name": "agentops-accelerator",
"displayName": "AgentOps Accelerator — Skills for GitHub Copilot",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-"version": "0.3.18",
+"version": "0.3.19",
"publisher": "AgentOpsAccelerator",
"icon": "icon.png",
"license": "MIT",
2 changes: 1 addition & 1 deletion plugins/agentops/plugin.json
@@ -1,7 +1,7 @@
{
"name": "agentops-accelerator",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-"version": "0.3.18",
+"version": "0.3.19",
"author": {
"name": "AgentOps Accelerator",
"url": "https://github.com/Azure/agentops"
4 changes: 4 additions & 0 deletions src/agentops/cli/app.py
@@ -2766,6 +2766,10 @@ def _run_flat_schema_eval(
typer.echo(f"{_cli_label('report.md')}: {_cli_path(output_dir / 'report.md')}")
if latest_dir is not None:
    typer.echo(f"{_cli_label('latest/')}: {_cli_path(latest_dir)}")
azd_eval = result.config.get("azd_evaluation") if isinstance(result.config, dict) else None
report_url = azd_eval.get("report_url") if isinstance(azd_eval, dict) else None
if isinstance(report_url, str) and report_url.strip():
    typer.echo(f"{_cli_label('Foundry run')}: {report_url.strip()}")
if result.summary.overall_passed:
    typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}")
    return
1 change: 1 addition & 0 deletions src/agentops/core/azd_eval.py
@@ -139,6 +139,7 @@ class EvalRecipe(BaseModel):
    name: Optional[str] = None
    agent: Optional[EvalAgent] = None
    dataset_reference: Optional[EvalDatasetReference] = None
    dataset_file: Optional[str] = None
    evaluators: list[EvalEvaluator] = Field(default_factory=list)
    options: Optional[EvalOptions] = None

17 changes: 16 additions & 1 deletion src/agentops/pipeline/azd_runner.py
@@ -118,7 +118,7 @@ def run_azd_eval(
    "--output",
    "json",
]
-notify(f"Running azd backend: {' '.join(command)}")
+notify("Running azd backend: azd ai agent eval run")

started = time.perf_counter()
completed = _run_command(
@@ -301,7 +301,9 @@ def normalize_to_results(
"azd_evaluation": {
    "recipe_path": str(azd_run.recipe_path),
    "run_id": azd_run.run_id,
    "eval_id": _extract_eval_id(azd_run.payload),
    "status": azd_run.status,
    "report_url": _extract_report_url(azd_run.payload),
    "dataset": (
        recipe.dataset_reference.model_dump(mode="json")
        if recipe.dataset_reference
@@ -477,6 +479,14 @@ def _extract_status(payload: Dict[str, Any]) -> str:
return "unknown"


def _extract_report_url(payload: Dict[str, Any]) -> Optional[str]:
    for key in ("report_url", "reportUrl", "report_uri", "url"):
        value = payload.get(key)
        if isinstance(value, str) and value.strip().lower().startswith(("http://", "https://")):
            return value.strip()
    return None


def _extract_item_count(payload: Dict[str, Any]) -> int:
    for key in ("items_total", "item_count", "samples", "max_samples", "row_count"):
        value = payload.get(key)
@@ -585,6 +595,11 @@ def _looks_like_metric_name(name: str) -> bool:


def _recipe_dataset_path(recipe: EvalRecipe, recipe_path: Path) -> str:
    if recipe.dataset_file:
        dataset = Path(recipe.dataset_file)
        if not dataset.is_absolute():
            dataset = recipe_path.parent / dataset
        return str(dataset)
    ref = recipe.dataset_reference
    if ref and ref.local_uri:
        dataset = Path(ref.local_uri)
8 changes: 6 additions & 2 deletions src/agentops/pipeline/orchestrator.py
@@ -537,10 +537,14 @@ def _run_evaluation_azd(

recipe_path = azd_runner.resolve_eval_recipe(workspace, config)
recipe = load_eval_recipe(recipe_path)
try:
    recipe_display = recipe_path.relative_to(workspace).as_posix()
except ValueError:
    recipe_display = recipe_path.name
progress(
    f"execution: {style('azd', 'bold')} - delegating to "
-    f"{style('azd ai agent eval', 'cyan')} with recipe "
-    f"{style(str(recipe_path), 'cyan')}."
+    f"{style('azd ai agent eval', 'cyan')} (recipe "
+    f"{style(recipe_display, 'cyan')})."
)

azd_run = azd_runner.run_azd_eval(
44 changes: 35 additions & 9 deletions src/agentops/pipeline/reporter.py
@@ -24,7 +24,8 @@ def render(result: RunResult) -> str:
lines.append(f"- **Target:** `{result.target.raw}` ({result.target.kind})")
if result.target.protocol:
    lines.append(f"- **Protocol:** {result.target.protocol}")
-lines.append(f"- **Dataset:** `{result.dataset_path}`")
+if result.dataset_path:
+    lines.append(f"- **Dataset:** `{result.dataset_path}`")
lines.append(f"- **Started:** {result.started_at}")
lines.append(f"- **Duration:** {result.duration_seconds:.2f}s")
lines.append(f"- **Rows:** {result.summary.items_total}")
@@ -62,22 +63,27 @@ def render(result: RunResult) -> str:
lines.append(f"| {row.row_index} | {_short(row.error or '', 200)} |")
lines.append("")

-lines.append("## Rows")
-lines.append("")
-lines.append("| # | Latency (s) | Metrics |")
-lines.append("| --- | --- | --- |")
-for row in result.rows:
-    lines.append(_row_summary(row))
-lines.append("")
-
if result.rows:
+    lines.append("## Rows")
+    lines.append("")
+    lines.append("| # | Latency (s) | Metrics |")
+    lines.append("| --- | --- | --- |")
+    for row in result.rows:
+        lines.append(_row_summary(row))
+    lines.append("")

    lines.append("## Row Details")
    lines.append("")
    lines.append("| # | Input | Response | Expected |")
    lines.append("| --- | --- | --- | --- |")
    for row in result.rows:
        lines.append(_row_detail(row))
    lines.append("")
else:
    azd_eval = result.config.get("azd_evaluation")
    if isinstance(azd_eval, dict):
        lines.extend(_render_azd_aggregate_note(azd_eval))
        lines.append("")

cloud = result.config.get("cloud_evaluation")
if isinstance(cloud, dict):
@@ -120,6 +126,26 @@ def _short(text: str, limit: int) -> str:
    return text if len(text) <= limit else text[: limit - 1] + "…"


def _render_azd_aggregate_note(azd: dict) -> List[str]:
    lines = ["## Per-row breakdown", ""]
    lines.append(
        "`execution: azd` reports aggregate metrics only; per-row scores "
        "are recorded by Foundry."
    )
    report_url = azd.get("report_url")
    if isinstance(report_url, str) and report_url.strip():
        lines.append("")
        lines.append(f"**Open the run in Foundry:** {report_url.strip()}")
    else:
        lines.append("")
        lines.append(
            "Open the latest run in the Foundry portal "
            "(Agents → your agent → Evaluations) to see the per-sample table "
            "and rubric drill-downs."
        )
    return lines


def _render_cloud_evaluation(cloud: dict) -> List[str]:
    lines = ["## Foundry Cloud Session", ""]
    status = str(cloud.get("status") or "unknown")