agenticenv · vinodvx · Jun 19, 2026
@@ -46,6 +46,38 @@ jobs:
       - name: Build
         run: make build
 
+  eval-harness:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version-file: go.mod
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: "20"
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: PromptFoo
+        working-directory: eval-harness/promptfoo
+        run: npx --yes promptfoo@latest eval -c config.yaml
+
+      - name: DeepEval
+        working-directory: eval-harness/deepeval
+        run: |
+          python3 -m venv .venv
+          .venv/bin/pip install -r requirements.txt
+          .venv/bin/pytest test_agent.py -v
+
   # examples:local — off until if: true and EXAMPLES_* repo secrets (examples/.env.defaults).
   examples:
     if: false

@@ -25,3 +25,9 @@ pnpm-debug.log*
 .env*.local
 
 .DS_Store
+
+# Python / DeepEval (eval-harness/deepeval)
+eval-harness/deepeval/.venv/
+eval-harness/deepeval/.pytest_cache/
+eval-harness/deepeval/.deepeval/
+eval-harness/deepeval/__pycache__/
@@ -62,7 +62,7 @@ Keep your branch short and descriptive. Sync with `main` before opening a PR: `g
 make check
 ```
 
-Runs `fmt-check`, spell check, `make lint`, `make test`, `make build`, and `make secrets-scan` — same core gates as CI (coverage is CI-only; use `make test-coverage` locally if you want a report).
+Runs `fmt-check`, spell check, `make lint`, `make test`, `make build`, and `make secrets-scan` — same core gates as the main CI job (coverage is CI-only; use `make test-coverage` locally if you want a report). `make test` includes eval-harness Go tests; the full Promptfoo/DeepEval suite runs in CI and via `make eval-harness` (see below).
 
 Also run the full example suite on any code change to catch regressions unit tests may miss:
 
@@ -72,6 +72,14 @@ task examples:all
 
 Requires Task, Docker, and LLM credentials — see [examples/README.md](examples/README.md).
 
+If you change **agent behavior** (e.g. `pkg/agent`, telemetry, tools, runtime) or **`eval-harness/`**, run:
+
+```bash
+make eval-harness
+```
+
+Behavioral regression tests use mock LLM/tools and assert on run output — SDK changes can break them even when eval-harness files are untouched. Requires Node.js and Python 3.10+ — see [eval-harness/README.md](eval-harness/README.md). CI runs this automatically on PRs (`eval-harness` job).
+
 **CI runs automatically** on pull requests to `main` (open a PR or push updates to an existing PR to re-run checks). Pushes or merges to `main` do not trigger CI; use **workflow_dispatch** in GitHub Actions for an on-demand run. Run `make check` locally before opening a PR; CI must pass on the PR before merge.
 
 To run only tests (e.g. while iterating):
@@ -172,6 +180,7 @@ Using the SDK and ran into issues, unclear docs, or confusing behavior? **Raise
 2. **Tests**
    - Add tests for new features and bug fixes.
    - Unit tests go in `*_test.go` files alongside the code.
+   - For agent behavior changes (`pkg/agent`, telemetry, tools, runtime) or **`eval-harness/`** edits, run `make eval-harness` before submitting a PR.
 
 3. **Commits**
    - Use [conventional commits](https://www.conventionalcommits.org) — these drive the release changelog:

@@ -1,4 +1,4 @@
-.PHONY: build install test lint tidy clean fmt fmt-check spell secrets-scan check
+.PHONY: build install test lint tidy clean fmt fmt-check spell secrets-scan check eval-harness
 
 BIN_DIR := cmd/bin
 BINARY := $(BIN_DIR)/agentctl
@@ -26,13 +26,24 @@ install: build
 	cp $(BINARY) $(GOPATH_BIN)/agentctl
 	@echo "Installed to $(GOPATH_BIN)/agentctl"
 
-# Run tests under pkg
+# Run Go tests (pkg, internal, eval-harness runner)
 test:
 	@echo "==> Running tests..."
 	go test ./pkg/... -count=1
 	go test ./internal/... -count=1
+	go test ./eval-harness/... -count=1
 	@echo "==> Tests complete"
 
+# Promptfoo + DeepEval (same as CI eval-harness job). Requires Node.js and Python 3.10+.
+eval-harness:
+	@echo "==> Running eval-harness (Promptfoo + DeepEval)..."
+	cd eval-harness/promptfoo && npx --yes promptfoo@latest eval -c config.yaml
+	cd eval-harness/deepeval && \
+		(test -d .venv || python3 -m venv .venv) && \
+		.venv/bin/pip install -q -r requirements.txt && \
+		.venv/bin/pytest test_agent.py -v
+	@echo "==> Eval-harness complete"
+
 # Run before push: lint, test, build, and secrets scan (same core gates as CI; no auto-format).
 # Coverage is CI-only (`make test-coverage` when you want the report). If fmt-check fails, run `make fmt`.
 check: lint test build secrets-scan

@@ -43,6 +43,7 @@
   - [Agent and worker in separate processes](#agent-and-worker-in-separate-processes)
   - [Conversation](#conversation-message-history)
   - [AG-UI Protocol](#ag-ui-protocol)
+- [Telemetry](#telemetry)
 - [Observability](#observability)
   - [Wire OTLP](#wire-otlp-traces--metrics--logs-in-one-block)
   - [Bring your own tracer / metrics](#bring-your-own-tracer--metrics)
@@ -54,6 +55,7 @@
   - [Code Coverage](#code-coverage)
 - [Setup and run examples](#setup-and-run-examples)
 - [Benchmarks](#benchmarks)
+- [Eval Harness](#eval-harness)
 - [Production Readiness Checklist](#production-readiness-checklist)
 - [Disclaimer](#disclaimer)
 
@@ -328,8 +330,8 @@ Streaming text deltas (`TEXT_MESSAGE_*`) versus the `**RUN_FINISHED**` body ofte
 
 Each LLM completion can report token counts via `[interfaces.LLMUsage](pkg/interfaces/llm.go)` on `[interfaces.LLMResponse.Usage](pkg/interfaces/llm.go)`. OpenAI, Anthropic, and Gemini clients populate `**PromptTokens**`, `**CompletionTokens**`, `**TotalTokens**`, and optional `**CachedPromptTokens**` / `**ReasoningTokens**` when the provider returns them.
 
-- `**Agent.Run` / `RunAsync`:** `**Usage`** on [*AgentRunResult](pkg/agent/agent.go) is the **sum** across all LLM calls in that run (including tool rounds). Use it for cost estimates, quotas, and logging.
-- `**Stream`:** the same aggregate appears as `**Usage*`* on `**RUN_FINISHED**`: assert `**[*AgentRunFinishedEvent](pkg/agent/agent.go)**`, then `**Result**` as `**[*AgentRunResult](pkg/agent/agent.go)**`. OpenAI streaming `**include_usage**` surfaces totals there. Helpers: [examples/shared/utils.go](examples/shared/utils.go) (`UsageFooter`, `RunResultFromFinishedEvent`).
+- **`Agent.Run` / `RunAsync`:** **`LLMUsage`** on [`*AgentRunResult`](pkg/agent/agent.go) is the **sum** across all LLM calls in that run (including tool rounds). Use it for cost estimates, quotas, and logging.
+- **`Stream`:** the same aggregate is on **`LLMUsage`** in the **`RUN_FINISHED`** event's **`Result`** (the event is [`*AgentRunFinishedEvent`](pkg/agent/agent.go); **`Result`** is [`*AgentRunResult`](pkg/agent/agent.go)). OpenAI streaming **`include_usage`** surfaces totals there. Helpers: [examples/shared/utils.go](examples/shared/utils.go) (`LLMUsageFooter`, `RunResultFromFinishedEvent`).
 
 Examples: [examples/simple_agent](examples/simple_agent) (prints usage after `Run`), [examples/agent_with_stream](examples/agent_with_stream) (prints usage on `**RUN_FINISHED**`).
 
@@ -348,7 +350,6 @@ Custom tools may also implement:
 
 - `interfaces.ToolApproval` — tool-level hint for **interactive human approval**. Use this when a person should decide whether the tool runs, and no agent-level approval policy is set.
 - `interfaces.ToolAuthorizer` — tool-level **programmatic authorization**. Use this when code should decide whether the tool runs before approval/execute (for example: scopes, tenancy, environment flags, or feature access). Return `Allow=false` to deny the tool call without executing it.
-- `interfaces.ToolKindProvider` — optional interface that reports the tool's origin category. The built-in tool wrappers already implement it (`"mcp"`, `"a2a"`, `"sub-agent"`, `"retriever"`). Implement it on custom tools when you want to distinguish origin in logs or metrics. Use `interfaces.KindOf(tool)` to read the kind from any tool; returns `"native"` when the interface is not implemented.
 
 ```go
 reg := agent.NewToolRegistry()
@@ -1116,6 +1117,41 @@ for ev := range ch {
 
 ---
 
+## Telemetry
+
+> Telemetry fields are designed to support eval-harness assertions; see [eval-harness/](eval-harness/) for examples with PromptFoo and DeepEval.
+
+Every run populates `AgentTelemetry` inside `AgentRunResult` with behavioral metrics across three areas:
+
+- **Run** — start/end time, total LLM calls, and finish reason (`complete` or `max_iterations`)
+- **Tools** — total calls, failed calls, and per-tool breakdown for registered tools and MCP tools
+- **Storage** — RAG retriever search counts split by mode (`prefetch_searches`, `agentic_searches`) and failure count; all fields are zero when no retriever is configured
+
+```go
+result, err := ag.Run(ctx, "prompt")
+if err != nil {
+    log.Fatal(err) // do not read result.Telemetry when the run failed
+}
+t := result.Telemetry
+fmt.Printf("llm_calls=%d  finish=%s\n", t.Run.TotalLLMCalls, t.Run.FinishReason)
+fmt.Printf("tool_calls=%d  failed=%d\n", t.Tools.TotalCalls, t.Tools.FailedCalls)
+fmt.Printf("retriever_searches=%d  prefetch=%d  agentic=%d\n",
+    t.Storage.TotalRetrieverSearches,
+    t.Storage.PrefetchSearches,
+    t.Storage.AgenticSearches)
+```
+
+**Stream** — telemetry is on `Result.Telemetry` inside the `RUN_FINISHED` event:
+
+```go
+for ev := range ch {
+    if result := shared.RunResultFromFinishedEvent(ev); result != nil {
+        fmt.Println(result.Telemetry.Run.TotalLLMCalls)
+    }
+}
+```
+
+Examples can print a formatted telemetry footer — see [examples/README.md](examples/README.md#run-output).
+
+---
+
 ## Observability
 
 The SDK emits **traces**, **metrics**, and **logs** via OpenTelemetry. All signals are **no-op by default** — if you set nothing, the agent runs without any overhead. Wire them only when you need them.
@@ -1248,7 +1284,7 @@ A Temporal connection (`WithTemporalConfig` or `WithTemporalClient`) is **option
 - **WithMaxSubAgentDepth**: Maximum delegation hops from this agent (default 2). See [Sub-agents](#sub-agents).
 - **WithMaxIterations**: Max LLM rounds (default 5).
 - **WithStream**: Enable `Stream` partial content streaming.
-- **Token usage:** Not a separate option. On `**Run`**, read `**Usage**` on `**[*AgentRunResult](pkg/agent/agent.go)**` when set. On `**Stream**`, assert `**[*AgentRunFinishedEvent](pkg/agent/agent.go)**` with `**[*AgentRunResult](pkg/agent/agent.go)**` in `**Result**` (aggregate across LLM/tool rounds when the provider reports it). See [Token usage](#token-usage-llmusage).
+- **Token usage:** Not a separate option. On **`Run`**, read **`LLMUsage`** on [`*AgentRunResult`](pkg/agent/agent.go) when set. On **`Stream`**, type-assert the event to [`*AgentRunFinishedEvent`](pkg/agent/agent.go) and read **`Result.LLMUsage`** (aggregate across LLM/tool rounds when the provider reports it). See [Token usage](#token-usage-llmusage).
 - **WithLLMSampling**: Pass `&agent.LLMSampling{...}`; nil or zero fields leave that knob to the provider default. Which fields apply where:
   - `**Temperature`** — OpenAI, Anthropic, Gemini.
   - `**MaxTokens**` — OpenAI, Anthropic, Gemini (max output / completion tokens).
@@ -1272,7 +1308,7 @@ A Temporal connection (`WithTemporalConfig` or `WithTemporalClient`) is **option
 Contributors: see **[CONTRIBUTING.md](CONTRIBUTING.md)** for prerequisites (Go, Temporal setup, workflow, and guidelines).
 Project policies: **[SECURITY.md](SECURITY.md)** for vulnerability reporting and **[CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)** for community standards.
 
-Quick commands: `make test` | `make lint` | `make fmt` | `make spell` | `make tidy` | `make test-coverage` (`make lint` runs `gofmt -s`, `misspell`, then `go vet` + `golangci-lint`)
+Quick commands: `make test` | `make check` | `make eval-harness` | `make lint` | `make fmt` | `make spell` | `make tidy` | `make test-coverage` (`make lint` runs `gofmt -s`, `misspell`, then `go vet` + `golangci-lint`)
 
 ## Code Coverage
 
@@ -1319,6 +1355,12 @@ Config-driven benchmark suite to measure agent performance in your environment.
 
 See [benchmarks/README.md](benchmarks/README.md).
 
+## Eval Harness
+
+A behavioral regression suite for agent runs: it verifies tool calls, completion, and telemetry without a live LLM. Use it to catch breaking changes in CI and as a reference for wiring your own agents into eval tools.
+
+See [eval-harness/README.md](eval-harness/README.md).
+
 ## Production Readiness Checklist
 
 - **Run and approval limits** — Use `WithTimeout` and/or a context deadline on `Run` / `Stream`; use `WithApprovalTimeout` when tools require approval (activity retry counts inside workflows are fixed in the SDK, not user-tunable).

@@ -234,8 +234,7 @@ func runResultFromFinishedEvent(ev agent.AgentEvent) *agent.AgentRunResult {
 	if !ok || fin == nil {
 		return nil
 	}
-	res, _ := fin.Result.(*agent.AgentRunResult)
-	return res
+	return fin.Result
 }
 
 func printEvent(ev agent.AgentEvent, streamedContent bool) {

@@ -0,0 +1,161 @@
+# Eval harness
+
+The eval harness runs a single agent execution with a mock LLM and mock tools, then prints JSON to stdout with `content`, `llm_usage`, and `telemetry` fields for evaluation assertions.
+
+## Runner
+
+From the repo root:
+
+```bash
+go run ./eval-harness/runner
+go run ./eval-harness/runner -prompt "custom prompt"
+go run ./eval-harness/runner -runtime temporal
+go run ./eval-harness/runner -tools 2
+go run ./eval-harness/runner -config eval-harness/runner/config.yaml
+```
+
+### Arguments
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `-config` | `eval-harness/runner/config.yaml` | Path to config file |
+| `-prompt` | from config | Override `user_prompt` |
+| `-runtime` | from config | Override `runtime` (`local` or `temporal`) |
+| `-tools` | from config | Override `agent.tool_count` |
+
+### config.yaml
+
+Default path: `eval-harness/runner/config.yaml`
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `runtime` | `local` | `local` or `temporal` |
+| `user_prompt` | — | User message (required) |
+| `agent.name` | `eval-agent` | Agent name |
+| `agent.system_prompt` | built-in eval prompt | System instructions |
+| `agent.tool_count` | `3` | Number of mock tools |
+| `temporal.host` | `localhost` | Temporal host when `runtime: temporal` |
+| `temporal.port` | `7233` | Temporal port |
+| `temporal.namespace` | `default` | Temporal namespace |
+| `temporal.task_queue` | `eval-harness` | Task queue |
+
+Temporal mode uses an embedded local worker. Start Temporal before running (e.g. `task infra:temporal:up` from `examples/`).
+
+### Output
+
+Stdout is always JSON:
+
+```json
+{
+  "content": "eval complete",
+  "llm_usage": { "prompt_tokens": 600, "completion_tokens": 400, "total_tokens": 1000 },
+  "telemetry": { "run": { ... }, "tools": { ... }, "storage": { ... } }
+}
+```
+
+## PromptFoo
+
+Config: `eval-harness/promptfoo/config.yaml`
+
+PromptFoo runs the eval harness as an [exec provider](https://www.promptfoo.dev/docs/providers/custom-script/). Each test invokes the runner once, parses the JSON stdout, and asserts on `content`, `llm_usage`, and `telemetry`.
+
+### Run
+
+```bash
+cd eval-harness/promptfoo
+npx --yes promptfoo@latest eval -c config.yaml
+```
+
+View results in the web UI:
+
+```bash
+npx promptfoo view
+```
+
+Requires Node.js. PromptFoo is installed on demand via `npx`; no local install is required.
+
+### How it works
+
+| Piece | Role |
+|-------|------|
+| **Provider** | `exec:../run_agent.sh` — shared wrapper in `eval-harness/` |
+| **Prompt** | `"run eval check"` — passed as the first arg to `run_agent.sh` (overrides `user_prompt`) |
+| **Output** | Runner JSON on stdout; assertions use `JSON.parse(output)` |
+| **Paths** | `eval-harness/run_agent.sh` resolves repo root and runner config |
+
+The runner accepts PromptFoo’s prompt as a positional argument when `-prompt` is not set. Agent settings (`tool_count`, `runtime`, etc.) still come from `eval-harness/runner/config.yaml`.
+
+### Tests
+
+Four test cases in `config.yaml`, each with a JavaScript assertion on runner JSON:
+
+| Test | Checks |
+|------|--------|
+| all mock tools were called | `telemetry.tools.breakdown` — `eval_tool_1`, `eval_tool_2`, `eval_tool_3`, each called once |
+| agent completed successfully | `telemetry.run.finish_reason === "complete"` and `content === "eval complete"` |
+| no failed tool calls | `telemetry.tools.failed_calls === 0` |
+| llm usage reported | `llm_usage.total_tokens > 0` |
+
+### Customizing
+
+- **Change the prompt** — edit `prompts` in `promptfoo/config.yaml`, or add `vars` and use `{{var}}` in the prompt string.
+- **Change agent behavior** — edit `eval-harness/runner/config.yaml` (tool count, runtime, system prompt), or adjust `eval-harness/run_agent.sh`.
+- **Add tests** — append cases under `tests:` with `type: javascript` and `value:` returning a boolean.
+- **Filter providers** — use `label: eval-agent` in test `options.providers` if you add more providers later.
+
+## DeepEval
+
+Python tests in `eval-harness/deepeval/`. The suite runs the Go eval harness, parses the JSON stdout, and asserts on `content`, `llm_usage`, and `telemetry` — the same output contract as the runner and PromptFoo.
+
+### Run
+
+```bash
+cd eval-harness/deepeval
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+pytest test_agent.py -v
+```
+
+Requires Python 3.10+ and Go. No API key is required for the default tests.
+
+### How it works
+
+1. `harness.run_agent()` calls `eval-harness/run_agent.sh` and parses JSON.
+2. Tests read telemetry from the agent SDK run output.
+3. `assert_test()` runs DeepEval metrics where useful; plain pytest asserts cover the rest.
+
+| Source field | Used for |
+|--------------|----------|
+| `content` | Agent response text |
+| `llm_usage.total_tokens` | Token usage reported |
+| `telemetry.run.finish_reason` | Run completed (`"complete"`) |
+| `telemetry.tools.failed_calls` | No tool failures |
+| `telemetry.tools.total_calls` | Expected call count |
+| `telemetry.tools.breakdown` | Per-tool call counts; fed into `tools_called` for `ToolCorrectnessMetric` |
+
+Example — extract tools from telemetry:
+
+```python
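+from harness import run_agent  # helper module referenced in "How it works"
+
+agent_res = run_agent()  # runs the Go runner via run_agent.sh and returns the parsed JSON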
+tools = list(agent_res["telemetry"]["tools"]["breakdown"].keys())
+finish_reason = agent_res["telemetry"]["run"]["finish_reason"]
+```
+
+### Tests
+
+Two pytest tests in `test_agent.py`:
+
+| Test | Checks |
+|------|--------|
+| `test_agent_completes_with_telemetry` | `content`, `llm_usage`, `finish_reason`, `failed_calls`, `total_calls`, `breakdown` keys |
+| `test_agent_tool_correctness` | `ToolCorrectnessMetric` — `tools_called` from telemetry vs expected tools |
+
+### Customizing
+
+- **Change the prompt** — pass a different string to `run_agent(prompt=...)`.
+- **Change agent behavior** — edit `eval-harness/runner/config.yaml` or `eval-harness/run_agent.sh`.
+- **Add tests** — extend `test_agent.py` with more telemetry asserts or DeepEval `LLMTestCase` fields.
+
+> **Note:** CI runs both PromptFoo and DeepEval on PRs — see `.github/workflows/ci.yml` (`eval-harness` job). Locally: `make eval-harness` from the repo root.
+
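The output contract documented in the eval-harness README above (`content`, `llm_usage`, `telemetry`) can also be checked with plain Python, independent of PromptFoo or DeepEval. The following is a minimal sketch and not part of this change: `check_runner_output` is a hypothetical helper, the sample payload is illustrative rather than real runner output, and the shape of `telemetry.tools.breakdown` (tool name mapped to a call count) is an assumption inferred from the test tables.

```python
import json

# Hypothetical sample of runner stdout; values are illustrative, not real output.
SAMPLE_STDOUT = json.dumps({
    "content": "eval complete",
    "llm_usage": {"prompt_tokens": 600, "completion_tokens": 400, "total_tokens": 1000},
    "telemetry": {
        "run": {"finish_reason": "complete"},
        "tools": {
            "total_calls": 3,
            "failed_calls": 0,
            # Assumed shape: tool name -> call count.
            "breakdown": {"eval_tool_1": 1, "eval_tool_2": 1, "eval_tool_3": 1},
        },
        "storage": {},
    },
})


def check_runner_output(stdout: str, expected_tools: list[str]) -> list[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    out = json.loads(stdout)
    run = out["telemetry"]["run"]
    tools = out["telemetry"]["tools"]
    failures = []
    # The run must finish normally rather than hit the iteration cap.
    if run["finish_reason"] != "complete":
        failures.append(f"finish_reason={run['finish_reason']!r}")
    # No tool call may have failed.
    if tools["failed_calls"] != 0:
        failures.append(f"failed_calls={tools['failed_calls']}")
    # Token usage must be reported.
    if out["llm_usage"]["total_tokens"] <= 0:
        failures.append("no token usage reported")
    # Each expected mock tool must have been called exactly once.
    for name in expected_tools:
        if tools["breakdown"].get(name) != 1:
            failures.append(f"{name} not called exactly once")
    return failures


if __name__ == "__main__":
    expected = ["eval_tool_1", "eval_tool_2", "eval_tool_3"]
    print(check_runner_output(SAMPLE_STDOUT, expected) or "all checks passed")
```

Returning failure messages instead of raising on the first mismatch lets one run report every broken field at once, which can help when an SDK change affects several telemetry values together.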