Skip to content

beyhangl/evalcraft

Repository files navigation

Evalcraft

Deterministic tests for AI agents — generated from one real run.

Capture an agent run and evalcraft writes a pytest that locks its tool calls, output shape, and cost — then replays it in CI for $0. Like VCR for HTTP, but it writes the agent tests for you.

CI PyPI Python License


Get Started in 60 Seconds

pip install evalcraft
evalcraft init                # scaffolds tests/cassettes/ and a sample test
pytest --evalcraft            # run with recording

That's it. Your first cassette is recorded, committed to git, and replays for free on every future run. See the 5-minute quickstart for the full walkthrough.


The problem

Agent testing is broken:

  • Expensive. Running 200 tests against GPT-4.1 costs real money. Every commit.
  • Non-deterministic. Tests fail randomly because LLMs aren't functions.
  • No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.

Evalcraft records agent runs as cassettes (like VCR for HTTP) and replays them deterministically — so the tests that exercise your agent's plumbing (tool wiring, control flow, output shape, cost/latency budgets) drop from 10 minutes + $5 to 200ms + $0. For the questions that genuinely need a live model — quality, drift, LLM-judge, RAG — run live-eval on a schedule.


How it works

  Your Agent
      |
      v
+-------------+    record     +--------------+
|  CaptureCtx | ------------> |   Cassette   |  (plain JSON, git-friendly)
|             |               |  (spans[])   |
+-------------+               +------+-------+
                                     |
                    +----------------+----------------+
                    v                v                v
              replay()          MockLLM /        assert_*()
           (zero API calls)    MockTool()       (scorers)
                    |                                 |
                    +----------------+----------------+
                                     v
                            pytest / CI gate
                           (200ms, $0.00)

Install

pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"       # OpenAI SDK adapter
pip install "evalcraft[anthropic]"    # Anthropic SDK adapter
pip install "evalcraft[gemini]"       # Google Gemini adapter
pip install "evalcraft[pydantic-ai]"  # Pydantic AI adapter
pip install "evalcraft[langchain]"    # LangChain/LangGraph adapter

# Everything
pip install "evalcraft[all]"

5-minute quickstart

1. Capture an agent run

from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")

    # Run your agent — wrap tool/LLM calls with record_* methods
    ctx.record_tool_call("get_weather", args={"city": "Paris"}, result={"temp": 18, "condition": "cloudy"})
    ctx.record_llm_call(
        model="gpt-4.1-mini",
        input="User asked about weather. Tool returned: cloudy 18C",
        output="It's 18C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0003,
    )

    ctx.record_output("It's 18C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0003

2. Replay without API calls

from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")

assert run.replayed is True
assert run.cassette.output_text == "It's 18C and cloudy in Paris right now."

3. Assert tool behavior

from evalcraft import replay, assert_tool_called, assert_cost_under

run = replay("tests/cassettes/weather.json")

assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_cost_under(run, max_usd=0.05).passed

4. LLM-as-Judge evaluation

⚠️ These are live scorers. Unlike replay + the structural scorers (which are offline, deterministic, and $0), the LLM-as-Judge / RAG / pairwise scorers call a real model at test time — they cost money, need an API key, and are non-deterministic (use eval_n + confidence intervals). See Offline vs. live scorers.

from evalcraft import replay, assert_output_semantic, assert_factual_consistency

run = replay("tests/cassettes/weather.json")

# Semantic evaluation — uses an LLM to judge output quality
result = assert_output_semantic(run, criteria="Mentions temperature and city name")
assert result.passed

# Factual consistency check
result = assert_factual_consistency(run, ground_truth="Paris is 18C and cloudy")
assert result.passed

5. RAG evaluation metrics

from evalcraft import replay, assert_faithfulness, assert_answer_relevance

run = replay("tests/cassettes/rag_agent.json")
contexts = ["Paris has a population of 2.1 million...", "The Eiffel Tower..."]

# Does the output stay faithful to retrieved context?
assert assert_faithfulness(run, contexts=contexts).passed

# Does the answer address the original question?
assert assert_answer_relevance(run, query="Tell me about Paris").passed

6. Use with pytest

# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message

def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message
pytest tests/ -v
# 200ms, $0.00

7. Pairwise A/B comparison

from evalcraft import pairwise_compare, pairwise_rank

# Compare two agent outputs — LLM judge picks the winner
result = pairwise_compare(cassette_a, cassette_b, criteria="Which is more helpful?")
print(result.winner)      # "A", "B", or "tie"
print(result.confidence)  # 0.0-1.0

# Rank multiple agents via round-robin tournament
rankings = pairwise_rank([agent_a, agent_b, agent_c], criteria="Accuracy and helpfulness")
for entry in rankings:
    print(f"{entry.name}: {entry.wins}W/{entry.losses}L (score {entry.score:.2f})")

Position bias is mitigated by randomizing presentation order.

8. Statistical evaluation with confidence intervals

from evalcraft import eval_n, assert_output_semantic

# Run a scorer 5 times — LLM outputs are non-deterministic, one run means nothing
result = eval_n(run, assert_output_semantic, n=5, criteria="Mentions the city name")
assert result.pass_rate >= 0.8

print(f"Pass rate: {result.pass_rate:.0%} ({result.passes}/{result.n})")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

9. Auto-generate tests from cassettes

evalcraft generate-tests tests/cassettes/weather.json -o tests/test_weather.py
# Generates a complete pytest file with tool, output, cost, token, and latency assertions

10. Diagnose your setup

evalcraft doctor
#   ✓ Python 3.11.5
#   ✓ evalcraft 0.1.0
#   ✓ openai 2.30.0
#   ! anthropic not installed
#   ✓ OPENAI_API_KEY configured
#   ✓ Cassette directory: tests/cassettes/ (3 cassettes)
#   ! 1 stale cassette (>30 days old)
#   ✓ pytest plugin registered

Examples

Four complete, self-contained example projects — each with pre-recorded cassettes, working test suites, and step-by-step READMEs.

Example Scenario What it demonstrates
openai-agent/ Customer support agent (ShopEasy) OpenAIAdapter, tool call assertions, golden sets, MockLLM + MockTool unit tests
anthropic-agent/ Code review bot (PRs via Claude) AnthropicAdapter, multi-turn testing, security assertions, add_sequential_responses
langgraph-workflow/ RAG policy Q&A pipeline LangGraphAdapter, node-order assertions, SpanKind.AGENT_STEP inspection, citation validation
ci-pipeline/ GitHub Actions CI gate GitHub Actions workflow, standalone gate script, cassette refresh strategy

Run any example in 60 seconds (no API key needed)

cd examples/openai-agent
pip install -r requirements.txt
pytest tests/ -v
# 15 tests pass in ~0.3s, $0.00

All cassettes are pre-recorded and committed to the repo. Tests replay them deterministically — no API key, no network calls, no cost.


How Evalcraft compares

An honest comparison against the closest tools. ✅ first-class · ⚠️ partial / via integration · ❌ no · — not applicable.

Evalcraft DeepEval Promptfoo LangSmith Braintrust Ragas
Git-committed cassette replay
Zero-cost CI re-runs ✅ replay ✅ cache ✅ cache ⚠️
pytest-native ❌ CLI/YAML ⚠️ library
First-class Mock LLM / Tools
LLM-as-Judge scoring
RAG metrics ⚠️ ⚠️ ✅ reference
Pairwise A/B ⚠️
Statistical eval w/ confidence intervals ✅ Wilson ⚠️ ⚠️ repeat ⚠️ ⚠️
Auto-generate tests from runs
OSS / self-hostable ⚠️ enterprise ❌ enterprise
Primary focus CI / glue testing LLM eval framework eval + red-team tracing + eval eval + observability RAG metrics
Pricing Free / OSS Free / OSS (+cloud) Free / OSS Paid SaaS (free tier) Paid SaaS (free tier) Free / OSS

What's genuinely distinctive (vs. the table-stakes everyone has): git-committed, PR-diffable cassettes capturing full agent traces (LLM + tool + steps); auto-generating a pytest file from a recorded run; first-class MockLLM / MockTool; and a packaged Wilson-interval statistical helper.

Honest caveats:

  • Zero-cost CI is not unique — Promptfoo (disk cache, on by default) and DeepEval (-c) already make re-runs free. Evalcraft's angle is deterministic replay of a committed artifact, not a lower bill per se.
  • Replay only re-checks a recorded run. It does not re-execute the live model, so on its own it can't catch model/prompt/retrieval drift — see what replay does and doesn't test. For drift, re-record or run a live eval.
  • The LLM-as-Judge, RAG, and pairwise scorers make real, paid model calls at test time — they are not part of the $0 deterministic path.
  • Other strong OSS/self-hostable options not shown: Langfuse, Arize Phoenix, Inspect AI.

Evalcraft is a testing tool for your agent's deterministic glue + budgets — not an observability platform. Use Braintrust / LangSmith / Langfuse for production tracing; use Evalcraft to keep that layer of your suite fast and committed to git.

Sources for the contested rows: Promptfoo caching · DeepEval CI/CD + cache · LangSmith pairwise


Features

Feature Description
Capture Record every LLM call, tool use, and agent decision as a cassette
Replay Re-run cassettes deterministically — no API calls, zero cost
Mock LLM Substitute real LLMs with deterministic mocks (exact / pattern / wildcard)
Mock Tools Mock any tool with static, dynamic, sequential, or error-simulating responses
Scorers 19 built-in assertions: tool calls, output, cost, latency, tokens, LLM-as-Judge, RAG metrics
LLM-as-Judge Semantic evaluation, factual consistency, tone, custom criteria — via OpenAI or Anthropic
RAG Metrics Faithfulness, context relevance, answer relevance, context recall
Pairwise A/B Arena-style comparison — LLM judge picks winner with position-bias mitigation
Statistical Eval Run scorers N times, get pass rate with Wilson score confidence intervals
Diff Compare two cassette runs to detect regressions
Golden Sets Version baselines and detect regressions automatically
Auto-generate evalcraft generate-tests creates pytest files from cassettes
CLI 14 commands: replay, diff, eval, generate-tests, doctor, golden, regression, sanitize, ...
pytest plugin Native fixtures and markers — cassette, mock_llm, @pytest.mark.evalcraft
CI Gate GitHub Action with PR comments, score thresholds, regression detection
JS/TS SDK TypeScript SDK (pre-release, source-only): capture/replay, mocks, 16 scorers, OpenAI/Gemini/Vercel AI adapters

Supported frameworks

Framework Adapter Install
OpenAI SDK OpenAIAdapter — auto-records chat.completions.create (sync + async) evalcraft[openai]
Anthropic SDK AnthropicAdapter — auto-records messages.create (sync + async) evalcraft[anthropic]
Google Gemini GeminiAdapter — auto-records generate_content (sync + async) evalcraft[gemini]
Pydantic AI PydanticAIAdapter — auto-records agent.run / agent.run_sync evalcraft[pydantic-ai]
LangGraph LangGraphAdapter — callback handler for graphs and chains evalcraft[langchain]
CrewAI CrewAIAdapter — instruments Crew.kickoff() evalcraft[crewai]
AutoGen AutoGenAdapter — captures multi-agent conversations evalcraft[autogen]
LlamaIndex LlamaIndexAdapter — hooks into query/retrieval pipeline evalcraft[llamaindex]
Any agent Manual record_tool_call / record_llm_call works with any framework

OpenAI

from evalcraft.adapters import OpenAIAdapter
from evalcraft import CaptureContext
import openai

client = openai.OpenAI()

with CaptureContext(name="openai_run", save_path="tests/cassettes/openai_run.json") as ctx:
    with OpenAIAdapter():  # auto-records all LLM + tool calls
        ctx.record_input("Summarize the French Revolution")

        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": "Summarize the French Revolution"}],
        )

        ctx.record_output(response.choices[0].message.content)

Gemini

from evalcraft.adapters import GeminiAdapter
from evalcraft import CaptureContext
import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-2.0-flash")

with CaptureContext(name="gemini_run", save_path="tests/cassettes/gemini_run.json") as ctx:
    with GeminiAdapter():
        ctx.record_input("What is quantum computing?")
        response = model.generate_content("What is quantum computing?")
        ctx.record_output(response.text)

Pydantic AI

from evalcraft.adapters import PydanticAIAdapter
from evalcraft import CaptureContext
from pydantic_ai import Agent

agent = Agent("openai:gpt-4.1-mini", system_prompt="You are helpful.")

with CaptureContext(name="pydantic_run", save_path="tests/cassettes/pydantic_run.json") as ctx:
    with PydanticAIAdapter():
        ctx.record_input("What's the weather?")
        result = agent.run_sync("What's the weather?")
        ctx.record_output(result.data)

CI/CD integration

GitHub Action

# .github/workflows/evalcraft.yml
- uses: beyhangl/evalcraft@v1
  with:
    test-path: tests/
    cassette-dir: tests/cassettes
    max-cost: '0.50'
    max-regression: '10'
    post-comment: 'true'

The action runs your agent tests, checks cost/regression thresholds, and posts a results table as a PR comment. See examples/ci-pipeline/ for a complete workflow.


Catching drift: live-eval

Replay is deterministic and free because it doesn't run your model — which is exactly why it can't catch model/prompt/retrieval drift. Live-eval is the complementary layer: it runs your real agent over a golden set of inputs, scores the live output, and gates CI when quality regresses against a baseline.

from evalcraft.eval.live import LiveEvalCase, LiveEvalResult, run_live_eval, compare_to_baseline
from evalcraft import assert_output_contains

cases = [LiveEvalCase(name="paris", input="Weather in Paris?",
                      scorers=[lambda c: assert_output_contains(c, "Paris")])]

def runner(case):
    return my_agent.run(case.input)   # your REAL agent — paid, non-deterministic

result = run_live_eval(cases, runner)
comparison = compare_to_baseline(
    result, LiveEvalResult.load("live-baseline.json"), max_score_drop=0.1
)
assert comparison.passed, comparison.summary()

Run it nightly or as a release gate (not on every commit). See Live Eval.


CLI reference

evalcraft [command] [options]
Command Description
evalcraft init Scaffold a test project for your framework
evalcraft capture <script> Run a script with capture enabled
evalcraft replay <cassette> Replay a cassette (zero API calls)
evalcraft diff <old> <new> Compare two cassettes
evalcraft eval <cassette> Run assertions with thresholds
evalcraft info <cassette> Inspect cassette metadata
evalcraft generate-tests <cassette> Auto-generate a pytest file
evalcraft mock <cassette> Generate MockLLM fixtures from a cassette
evalcraft golden save <cassette> Save a golden-set baseline
evalcraft golden compare <cassette> Compare against a baseline
evalcraft regression <cassette> Detect regressions
evalcraft sanitize <cassette> Redact PII and secrets
evalcraft doctor Diagnose setup issues (deps, API keys, cassettes)
evalcraft live-eval <current> --baseline <b> Gate a live-eval run vs a baseline (catch drift)

Data model

Cassette
+-- id, name, agent_name, framework
+-- input_text, output_text
+-- total_tokens, total_cost_usd, total_duration_ms
+-- llm_call_count, tool_call_count
+-- fingerprint  (SHA-256 of span content -- changes when the recording changes)
+-- spans[]
    +-- Span (llm_request / llm_response)
    |   +-- model, token_usage, cost_usd
    |   +-- input, output
    +-- Span (tool_call)
        +-- tool_name, tool_args, tool_result
        +-- duration_ms, error

Cassettes are plain JSON — check them into git, diff them in PRs.


TypeScript / JavaScript SDK

Status: pre-release (source-only). The JS/TS SDK is not yet published to npm. Until it is, build it from source from this repo:

git clone https://github.com/beyhangl/evalcraft
cd evalcraft/packages/evalcraft-js
npm install && npm run build   # emits dist/ (CJS + ESM + type defs)
import {
  CaptureContext, replay, assertToolCalled, assertCostUnder,  // Core
  assertOutputSemantic, assertTone, assertCustomCriteria,     // LLM-as-Judge
  assertFaithfulness, assertContextRelevance,                 // RAG metrics
} from 'evalcraft';
import { wrapOpenAI } from 'evalcraft/adapters/openai';
import { wrapGemini } from 'evalcraft/adapters/gemini';

The JS/TS SDK covers the core workflow — capture, replay, MockLLM/MockTool, and 16 scorers (8 core + 4 LLM-as-Judge + 4 RAG) — with OpenAI, Gemini, and Vercel AI adapters. It is not yet at full parity with the Python SDK.

Python vs JS/TS parity

Capability Python JS/TS
Capture / replay / cassettes
MockLLM / MockTool
Core scorers (tool / output / cost / latency / tokens) ✅ (8) ✅ (8)
LLM-as-Judge scorers ✅ (4) ✅ (4)
RAG metrics ✅ (4) ✅ (4)
Pairwise A/B
Statistical eval (eval_n)
Multi-judge jury / consensus
Hallucination detection
Golden sets / regression / trend
CLI + pytest plugin
Framework adapters 8 (OpenAI, Anthropic, Gemini, Pydantic AI, LangGraph, CrewAI, AutoGen, LlamaIndex) 3 (OpenAI, Gemini, Vercel AI)

Contributing

git clone https://github.com/beyhangl/evalcraft
cd evalcraft
pip install -e ".[dev]"
pytest
  • Format: ruff format .
  • Lint: ruff check .
  • Type check: mypy evalcraft/

PRs welcome. Please open an issue first for significant changes. See CONTRIBUTING.md for details.


Design Partners

We're looking for design partners. evalcraft is early (v0.1.0), and we'd like a few teams to help shape it. Partners get:

  • Hands-on setup help — we'll pair with you to get evalcraft into your CI pipeline
  • Direct access to the maintainer — not a support queue
  • Influence the roadmap — your use cases drive what we build next

Interested? Open an issue and say hi.


License

MIT © 2026 Beyhan Gul. See LICENSE.