32 changes: 32 additions & 0 deletions .github/workflows/ci.yml
@@ -46,6 +46,38 @@ jobs:
      - name: Build
        run: make build

  eval-harness:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: PromptFoo
        working-directory: eval-harness/promptfoo
        run: npx --yes promptfoo@latest eval -c config.yaml

      - name: DeepEval
        working-directory: eval-harness/deepeval
        run: |
          python3 -m venv .venv
          .venv/bin/pip install -r requirements.txt
          .venv/bin/pytest test_agent.py -v

  # examples:local — off until if: true and EXAMPLES_* repo secrets (examples/.env.defaults).
  examples:
    if: false
6 changes: 6 additions & 0 deletions .gitignore
@@ -25,3 +25,9 @@ pnpm-debug.log*
.env*.local

.DS_Store

# Python / DeepEval (eval-harness/deepeval)
eval-harness/deepeval/.venv/
eval-harness/deepeval/.pytest_cache/
eval-harness/deepeval/.deepeval/
eval-harness/deepeval/__pycache__/
11 changes: 10 additions & 1 deletion CONTRIBUTING.md
@@ -62,7 +62,7 @@ Keep your branch short and descriptive. Sync with `main` before opening a PR: `g
make check
```

Runs `fmt-check`, spell check, `make lint`, `make test`, `make build`, and `make secrets-scan` — same core gates as CI (coverage is CI-only; use `make test-coverage` locally if you want a report).
Runs `fmt-check`, spell check, `make lint`, `make test`, `make build`, and `make secrets-scan` — same core gates as the main CI job (coverage is CI-only; use `make test-coverage` locally if you want a report). `make test` includes eval-harness Go tests; the full Promptfoo/DeepEval suite runs in CI and via `make eval-harness` (see below).

Also run the full example suite on any code change to catch regressions unit tests may miss:

@@ -72,6 +72,14 @@ task examples:all

Requires Task, Docker, and LLM credentials — see [examples/README.md](examples/README.md).

If you change **agent behavior** (e.g. `pkg/agent`, telemetry, tools, runtime) or **`eval-harness/`**, run:

```bash
make eval-harness
```

Behavioral regression tests use mock LLM/tools and assert on run output — SDK changes can break them even when eval-harness files are untouched. Requires Node.js and Python 3.10+ — see [eval-harness/README.md](eval-harness/README.md). CI runs this automatically on PRs (`eval-harness` job).

**CI runs automatically** on pull requests to `main` (open a PR or push updates to an existing PR to re-run checks). Pushes or merges to `main` do not trigger CI; use **workflow_dispatch** in GitHub Actions for an on-demand run. Run `make check` locally before opening a PR; CI must pass on the PR before merge.

To run only tests (e.g. while iterating):
@@ -172,6 +180,7 @@ Using the SDK and ran into issues, unclear docs, or confusing behavior? **Raise
2. **Tests**
- Add tests for new features and bug fixes.
- Unit tests go in `*_test.go` files alongside the code.
- Agent behavior changes (`pkg/agent`, telemetry, tools, runtime) or **`eval-harness/`** edits — run `make eval-harness` before submitting a PR.

3. **Commits**
- Use [conventional commits](https://www.conventionalcommits.org) — these drive the release changelog:
15 changes: 13 additions & 2 deletions Makefile
@@ -1,4 +1,4 @@
.PHONY: build install test lint tidy clean fmt fmt-check spell secrets-scan check
.PHONY: build install test lint tidy clean fmt fmt-check spell secrets-scan check eval-harness

BIN_DIR := cmd/bin
BINARY := $(BIN_DIR)/agentctl
@@ -26,13 +26,24 @@ install: build
	cp $(BINARY) $(GOPATH_BIN)/agentctl
	@echo "Installed to $(GOPATH_BIN)/agentctl"

# Run tests under pkg
# Run Go tests (pkg, internal, eval-harness runner)
test:
	@echo "==> Running tests..."
	go test ./pkg/... -count=1
	go test ./internal/... -count=1
	go test ./eval-harness/... -count=1
	@echo "==> Tests complete"

# Promptfoo + DeepEval (same as CI eval-harness job). Requires Node.js and Python 3.10+.
eval-harness:
	@echo "==> Running eval-harness (Promptfoo + DeepEval)..."
	cd eval-harness/promptfoo && npx --yes promptfoo@latest eval -c config.yaml
	cd eval-harness/deepeval && \
		(test -d .venv || python3 -m venv .venv) && \
		.venv/bin/pip install -q -r requirements.txt && \
		.venv/bin/pytest test_agent.py -v
	@echo "==> Eval-harness complete"

# Run before push: lint, test, build, and secrets scan (same core gates as CI; no auto-format).
# Coverage is CI-only (`make test-coverage` when you want the report). If fmt-check fails, run `make fmt`.
check: lint test build secrets-scan
52 changes: 47 additions & 5 deletions README.md
@@ -43,6 +43,7 @@
- [Agent and worker in separate processes](#agent-and-worker-in-separate-processes)
- [Conversation](#conversation-message-history)
- [AG-UI Protocol](#ag-ui-protocol)
- [Telemetry](#telemetry)
- [Observability](#observability)
- [Wire OTLP](#wire-otlp-traces--metrics--logs-in-one-block)
- [Bring your own tracer / metrics](#bring-your-own-tracer--metrics)
@@ -54,6 +55,7 @@
- [Code Coverage](#code-coverage)
- [Setup and run examples](#setup-and-run-examples)
- [Benchmarks](#benchmarks)
- [Eval Harness](#eval-harness)
- [Production Readiness Checklist](#production-readiness-checklist)
- [Disclaimer](#disclaimer)

@@ -328,8 +330,8 @@ Streaming text deltas (`TEXT_MESSAGE_*`) versus the `**RUN_FINISHED**` body ofte

Each LLM completion can report token counts via `[interfaces.LLMUsage](pkg/interfaces/llm.go)` on `[interfaces.LLMResponse.Usage](pkg/interfaces/llm.go)`. OpenAI, Anthropic, and Gemini clients populate `**PromptTokens**`, `**CompletionTokens**`, `**TotalTokens**`, and optional `**CachedPromptTokens**` / `**ReasoningTokens**` when the provider returns them.

- `**Agent.Run` / `RunAsync`:** `**Usage`** on [*AgentRunResult](pkg/agent/agent.go) is the **sum** across all LLM calls in that run (including tool rounds). Use it for cost estimates, quotas, and logging.
- `**Stream`:** the same aggregate appears as `**Usage*`* on `**RUN_FINISHED**`: assert `**[*AgentRunFinishedEvent](pkg/agent/agent.go)**`, then `**Result**` as `**[*AgentRunResult](pkg/agent/agent.go)**`. OpenAI streaming `**include_usage**` surfaces totals there. Helpers: [examples/shared/utils.go](examples/shared/utils.go) (`UsageFooter`, `RunResultFromFinishedEvent`).
- `**Agent.Run` / `RunAsync`:** `**LLMUsage**` on [*AgentRunResult](pkg/agent/agent.go) is the **sum** across all LLM calls in that run (including tool rounds). Use it for cost estimates, quotas, and logging.
- `**Stream`:** the same aggregate is on `**LLMUsage**` in `**RUN_FINISHED**` `**Result**` (`**[*AgentRunFinishedEvent](pkg/agent/agent.go)**`; `**Result**` is `**[*AgentRunResult](pkg/agent/agent.go)**`). OpenAI streaming `**include_usage**` surfaces totals there. Helpers: [examples/shared/utils.go](examples/shared/utils.go) (`LLMUsageFooter`, `RunResultFromFinishedEvent`).

Examples: [examples/simple_agent](examples/simple_agent) (prints usage after `Run`), [examples/agent_with_stream](examples/agent_with_stream) (prints usage on `**RUN_FINISHED**`).

@@ -348,7 +350,6 @@ Custom tools may also implement:

- `interfaces.ToolApproval` — tool-level hint for **interactive human approval**. Use this when a person should decide whether the tool runs, and no agent-level approval policy is set.
- `interfaces.ToolAuthorizer` — tool-level **programmatic authorization**. Use this when code should decide whether the tool runs before approval/execute (for example: scopes, tenancy, environment flags, or feature access). Return `Allow=false` to deny the tool call without executing it.
- `interfaces.ToolKindProvider` — optional interface that reports the tool's origin category. The built-in tool wrappers already implement it (`"mcp"`, `"a2a"`, `"sub-agent"`, `"retriever"`). Implement it on custom tools when you want to distinguish origin in logs or metrics. Use `interfaces.KindOf(tool)` to read the kind from any tool; returns `"native"` when the interface is not implemented.

```go
reg := agent.NewToolRegistry()
@@ -1116,6 +1117,41 @@ for ev := range ch {

---

## Telemetry

Every run populates `AgentTelemetry` inside `AgentRunResult` with behavioral metrics across three areas:

> Telemetry fields are designed to support eval harness assertions — see [eval-harness/](eval-harness/) for examples with PromptFoo and DeepEval.

- **Run** — start/end time, total LLM calls, and finish reason (`complete` or `max_iterations`)
- **Tools** — total calls, failed calls, and per-tool breakdown for registered tools and MCP tools
- **Storage** — RAG retriever search counts split by mode (`prefetch_searches`, `agentic_searches`) and failure count; all fields are zero when no retriever is configured

```go
result, _ := ag.Run(ctx, "prompt")
t := result.Telemetry
fmt.Printf("llm_calls=%d finish=%s\n", t.Run.TotalLLMCalls, t.Run.FinishReason)
fmt.Printf("tool_calls=%d failed=%d\n", t.Tools.TotalCalls, t.Tools.FailedCalls)
fmt.Printf("retriever_searches=%d prefetch=%d agentic=%d\n",
	t.Storage.TotalRetrieverSearches,
	t.Storage.PrefetchSearches,
	t.Storage.AgenticSearches)
```

**Stream** — telemetry is on `Result.Telemetry` inside the `RUN_FINISHED` event:

```go
for ev := range ch {
	if result := shared.RunResultFromFinishedEvent(ev); result != nil {
		fmt.Println(result.Telemetry.Run.TotalLLMCalls)
	}
}
```

Examples can print a formatted telemetry footer — see [examples/README.md](examples/README.md#run-output).

---

## Observability

The SDK emits **traces**, **metrics**, and **logs** via OpenTelemetry. All signals are **no-op by default** — if you set nothing, the agent runs without any overhead. Wire them only when you need them.
@@ -1248,7 +1284,7 @@ A Temporal connection (`WithTemporalConfig` or `WithTemporalClient`) is **option
- **WithMaxSubAgentDepth**: Maximum delegation hops from this agent (default 2). See [Sub-agents](#sub-agents).
- **WithMaxIterations**: Max LLM rounds (default 5).
- **WithStream**: Enable `Stream` partial content streaming.
- **Token usage:** Not a separate option. On `**Run`**, read `**Usage**` on `**[*AgentRunResult](pkg/agent/agent.go)**` when set. On `**Stream**`, assert `**[*AgentRunFinishedEvent](pkg/agent/agent.go)**` with `**[*AgentRunResult](pkg/agent/agent.go)**` in `**Result**` (aggregate across LLM/tool rounds when the provider reports it). See [Token usage](#token-usage-llmusage).
- **Token usage:** Not a separate option. On `**Run`**, read `**LLMUsage**` on `**[*AgentRunResult](pkg/agent/agent.go)**` when set. On `**Stream**`, assert `**[*AgentRunFinishedEvent](pkg/agent/agent.go)**` and read `**Result.LLMUsage**` (aggregate across LLM/tool rounds when the provider reports it). See [Token usage](#token-usage-llmusage).
- **WithLLMSampling**: Pass `&agent.LLMSampling{...}`; nil or zero fields leave that knob to the provider default. Which fields apply where:
- `**Temperature`** — OpenAI, Anthropic, Gemini.
- `**MaxTokens**` — OpenAI, Anthropic, Gemini (max output / completion tokens).
@@ -1272,7 +1308,7 @@ A Temporal connection (`WithTemporalConfig` or `WithTemporalClient`) is **option
Contributors: see **[CONTRIBUTING.md](CONTRIBUTING.md)** for prerequisites (Go, Temporal setup, workflow, and guidelines).
Project policies: **[SECURITY.md](SECURITY.md)** for vulnerability reporting and **[CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)** for community standards.

Quick commands: `make test` | `make lint` | `make fmt` | `make spell` | `make tidy` | `make test-coverage` (`make lint` runs `gofmt -s`, `misspell`, then `go vet` + `golangci-lint`)
Quick commands: `make test` | `make check` | `make eval-harness` | `make lint` | `make fmt` | `make spell` | `make tidy` | `make test-coverage` (`make lint` runs `gofmt -s`, `misspell`, then `go vet` + `golangci-lint`)

## Code Coverage

@@ -1319,6 +1355,12 @@ Config-driven benchmark suite to measure agent performance in your environment.

See [benchmarks/README.md](benchmarks/README.md).

## Eval Harness

Behavioral regression suite for agent runs — verify tools, completion, and telemetry without a live LLM. Use it to catch breaking changes in CI and as a reference for wiring your own agents into eval tools.

See [eval-harness/README.md](eval-harness/README.md).

## Production Readiness Checklist

- **Run and approval limits** — Use `WithTimeout` and/or a context deadline on `Run` / `Stream`; use `WithApprovalTimeout` when tools require approval (activity retry counts inside workflows are fixed in the SDK, not user-tunable).
3 changes: 1 addition & 2 deletions cmd/main.go
@@ -234,8 +234,7 @@ func runResultFromFinishedEvent(ev agent.AgentEvent) *agent.AgentRunResult {
if !ok || fin == nil {
return nil
}
res, _ := fin.Result.(*agent.AgentRunResult)
return res
return fin.Result
}

func printEvent(ev agent.AgentEvent, streamedContent bool) {
161 changes: 161 additions & 0 deletions eval-harness/README.md
@@ -0,0 +1,161 @@
# Eval harness

The eval harness runs a single agent execution with a mock LLM and mock tools, then prints JSON to stdout with `content`, `llm_usage`, and `telemetry` fields for evaluation assertions.

## Runner

From the repo root:

```bash
go run ./eval-harness/runner
go run ./eval-harness/runner -prompt "custom prompt"
go run ./eval-harness/runner -runtime temporal
go run ./eval-harness/runner -tools 2
go run ./eval-harness/runner -config eval-harness/runner/config.yaml
```

### Arguments

| Flag | Default | Description |
|------|---------|-------------|
| `-config` | `eval-harness/runner/config.yaml` | Path to config file |
| `-prompt` | from config | Override `user_prompt` |
| `-runtime` | from config | Override `runtime` (`local` or `temporal`) |
| `-tools` | from config | Override `agent.tool_count` |

### config.yaml

Default path: `eval-harness/runner/config.yaml`

| Field | Default | Description |
|-------|---------|-------------|
| `runtime` | `local` | `local` or `temporal` |
| `user_prompt` | — | User message (required) |
| `agent.name` | `eval-agent` | Agent name |
| `agent.system_prompt` | built-in eval prompt | System instructions |
| `agent.tool_count` | `3` | Number of mock tools |
| `temporal.host` | `localhost` | Temporal host when `runtime: temporal` |
| `temporal.port` | `7233` | Temporal port |
| `temporal.namespace` | `default` | Temporal namespace |
| `temporal.task_queue` | `eval-harness` | Task queue |

Temporal mode uses an embedded local worker. Start Temporal before running (e.g. `task infra:temporal:up` from `examples/`).

### Output

Stdout is always JSON (the nested `telemetry` objects are elided as `{ ... }` in this sample):

```json
{
  "content": "eval complete",
  "llm_usage": { "prompt_tokens": 600, "completion_tokens": 400, "total_tokens": 1000 },
  "telemetry": { "run": { ... }, "tools": { ... }, "storage": { ... } }
}
```
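
As a rough illustration of consuming this output, the sketch below parses a runner payload and checks that the three documented top-level keys are present. The sample payload is hypothetical: the nested `telemetry` values are filled in with field names taken from the checks described later in this README, the integer values in `breakdown` are an assumption, and `parse_runner_output` is an illustrative helper rather than part of the harness.

```python
import json

# Hypothetical sample of runner stdout. The nested telemetry values are
# assumptions; the README only shows them elided.
SAMPLE_STDOUT = """
{
  "content": "eval complete",
  "llm_usage": {"prompt_tokens": 600, "completion_tokens": 400, "total_tokens": 1000},
  "telemetry": {
    "run": {"finish_reason": "complete"},
    "tools": {
      "total_calls": 3,
      "failed_calls": 0,
      "breakdown": {"eval_tool_1": 1, "eval_tool_2": 1, "eval_tool_3": 1}
    },
    "storage": {}
  }
}
"""

REQUIRED_KEYS = ("content", "llm_usage", "telemetry")


def parse_runner_output(stdout: str) -> dict:
    """Parse runner stdout and verify the three documented top-level keys."""
    payload = json.loads(stdout)
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"runner output is missing keys: {missing}")
    return payload


if __name__ == "__main__":
    result = parse_runner_output(SAMPLE_STDOUT)
    print(result["content"], result["llm_usage"]["total_tokens"])
```

Failing fast on a missing key keeps later assertions from reporting a confusing `KeyError` when the runner itself is what broke.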

## PromptFoo

Config: `eval-harness/promptfoo/config.yaml`

PromptFoo runs the eval harness as an [exec provider](https://www.promptfoo.dev/docs/providers/custom-script/). Each test invokes the runner once, parses the JSON stdout, and asserts on `content`, `llm_usage`, and `telemetry`.

### Run

```bash
cd eval-harness/promptfoo
npx promptfoo eval -c config.yaml
```

View results in the web UI:

```bash
npx promptfoo view
```

Requires Node.js. PromptFoo is installed on demand via `npx`; no local install is required.

### How it works

| Piece | Role |
|-------|------|
| **Provider** | `exec:../run_agent.sh` — shared wrapper in `eval-harness/` |
| **Prompt** | `"run eval check"` — passed as the first arg to `run_agent.sh` (overrides `user_prompt`) |
| **Output** | Runner JSON on stdout; assertions use `JSON.parse(output)` |
| **Paths** | `eval-harness/run_agent.sh` resolves repo root and runner config |

The runner accepts PromptFoo’s prompt as a positional argument when `-prompt` is not set. Agent settings (`tool_count`, `runtime`, etc.) still come from `eval-harness/runner/config.yaml`.

### Tests

Four test cases in `config.yaml`, each with a JavaScript assertion on runner JSON:

| Test | Checks |
|------|--------|
| all mock tools were called | `telemetry.tools.breakdown` — `eval_tool_1`, `eval_tool_2`, `eval_tool_3`, each called once |
| agent completed successfully | `telemetry.run.finish_reason === "complete"` and `content === "eval complete"` |
| no failed tool calls | `telemetry.tools.failed_calls === 0` |
| llm usage reported | `llm_usage.total_tokens > 0` |
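
The four checks in the table can be read as plain predicates over the parsed runner JSON. The sketch below restates them in Python for reference only; the real assertions live in `promptfoo/config.yaml` as JavaScript. It assumes each `breakdown` value is an integer call count, which the table implies but does not spell out, and both `check_runner_output` and `SAMPLE_PAYLOAD` are hypothetical names.

```python
EXPECTED_TOOLS = ("eval_tool_1", "eval_tool_2", "eval_tool_3")

# Hypothetical parsed runner output; field values are assumptions.
SAMPLE_PAYLOAD = {
    "content": "eval complete",
    "llm_usage": {"prompt_tokens": 600, "completion_tokens": 400, "total_tokens": 1000},
    "telemetry": {
        "run": {"finish_reason": "complete"},
        "tools": {
            "total_calls": 3,
            "failed_calls": 0,
            "breakdown": {"eval_tool_1": 1, "eval_tool_2": 1, "eval_tool_3": 1},
        },
        "storage": {},
    },
}


def check_runner_output(payload: dict) -> dict:
    """Return one boolean per test case listed in the table."""
    run = payload["telemetry"]["run"]
    tools = payload["telemetry"]["tools"]
    breakdown = tools["breakdown"]
    return {
        # Assumes breakdown maps tool name -> integer call count.
        "all mock tools were called": all(
            breakdown.get(name) == 1 for name in EXPECTED_TOOLS
        ),
        "agent completed successfully": (
            run["finish_reason"] == "complete"
            and payload["content"] == "eval complete"
        ),
        "no failed tool calls": tools["failed_calls"] == 0,
        "llm usage reported": payload["llm_usage"]["total_tokens"] > 0,
    }


if __name__ == "__main__":
    for name, passed in check_runner_output(SAMPLE_PAYLOAD).items():
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
```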

### Customizing

- **Change the prompt** — edit `prompts` in `promptfoo/config.yaml`, or add `vars` and use `{{var}}` in the prompt string.
- **Change agent behavior** — edit `eval-harness/runner/config.yaml` (tool count, runtime, system prompt), or adjust `eval-harness/run_agent.sh`.
- **Add tests** — append cases under `tests:` with `type: javascript` and `value:` returning a boolean.
- **Filter providers** — use `label: eval-agent` in test `options.providers` if you add more providers later.

## DeepEval

Python tests in `eval-harness/deepeval/`. The suite runs the Go eval harness, parses the JSON stdout, and asserts on `content`, `llm_usage`, and `telemetry` — the same output contract as the runner and PromptFoo.

### Run

```bash
cd eval-harness/deepeval
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest test_agent.py -v
```

Requires Python 3.10+ and Go. No API key is required for the default tests.

### How it works

1. `harness.run_agent()` calls `eval-harness/run_agent.sh` and parses JSON.
2. Tests read telemetry from the agent SDK run output.
3. `assert_test()` runs DeepEval metrics where useful; plain pytest asserts cover the rest.

| Source field | Used for |
|--------------|----------|
| `content` | Agent response text |
| `llm_usage.total_tokens` | Token usage reported |
| `telemetry.run.finish_reason` | Run completed (`"complete"`) |
| `telemetry.tools.failed_calls` | No tool failures |
| `telemetry.tools.total_calls` | Expected call count |
| `telemetry.tools.breakdown` | Per-tool call counts; fed into `tools_called` for `ToolCorrectnessMetric` |
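
Step 1 might look roughly like the following. This is a sketch, not the contents of the actual `harness` module: the `script` default (relative to `eval-harness/deepeval/`), the injectable `run` parameter, and the error handling are assumptions. Only the call to `run_agent.sh` with the prompt as its first argument and the JSON parsing come from this README.

```python
import json
import subprocess
from typing import Callable


def run_agent(
    prompt: str = "run eval check",
    script: str = "../run_agent.sh",  # assumed path, relative to eval-harness/deepeval/
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> dict:
    """Invoke the shared wrapper script and parse the runner JSON from stdout.

    `run` is injectable so the function can be exercised without Go installed;
    that parameter is an illustration, not part of the documented harness.
    """
    completed = run(
        [script, prompt],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    # Stand-in for the real script so the sketch runs anywhere.
    def fake_run(cmd, **kwargs):
        stdout = json.dumps({"content": "eval complete", "prompt": cmd[1]})
        return subprocess.CompletedProcess(cmd, 0, stdout=stdout, stderr="")

    print(run_agent(run=fake_run)["content"])
```

Using `check=True` surfaces a failed runner as an exception instead of a JSON parse error on empty stdout.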

Example — extract tools from telemetry:

```python
agent_res = run_agent()
tools = list(agent_res["telemetry"]["tools"]["breakdown"].keys())
finish_reason = agent_res["telemetry"]["run"]["finish_reason"]
```

### Tests

Two pytest tests in `test_agent.py`:

| Test | Checks |
|------|--------|
| `test_agent_completes_with_telemetry` | `content`, `llm_usage`, `finish_reason`, `failed_calls`, `total_calls`, `breakdown` keys |
| `test_agent_tool_correctness` | `ToolCorrectnessMetric` — `tools_called` from telemetry vs expected tools |

### Customizing

- **Change the prompt** — pass a different string to `run_agent(prompt=...)`.
- **Change agent behavior** — edit `eval-harness/runner/config.yaml` or `eval-harness/run_agent.sh`.
- **Add tests** — extend `test_agent.py` with more telemetry asserts or DeepEval `LLMTestCase` fields.

> **Note:** CI runs both PromptFoo and DeepEval on PRs — see `.github/workflows/ci.yml` (`eval-harness` job). Locally: `make eval-harness` from the repo root.
