Pickled tests whether real agents can answer and build with your product, across declared context paths, using deterministic evidence.
Docs can be correct and agents can still answer wrong. Examples can be correct and agents can still build the wrong thing. Pickled gives you receipts: per-cell verdicts showing which agent, which source, and which context path produced which answer or build result. No LLM grades another LLM.
Five terms:
- Agent is who answers: Claude Code, Codex CLI, Anthropic API, OpenAI API.
- Source is the public context Pickled may score against: a local file, a URL, or a codebase glob. Anything not registered does not count.
- Context is a named delivery path: a mode (memory, inject, web, or mcp) plus an optional source.
- Question asks something and scores the answer on declared facts (and rejected misstatements). Build has the agent edit a workspace and passes when the verifier does.
- Fact / misstatement are the reusable, deterministic match contracts a question scores against.
A task runs as one cell per (agent × context) pair, and each cell is graded on its own.
- Facts. Reusable product truths a question's answer must cover (allOf / anyOf substring matches, normalized).
- Misstatements. Reusable wrong claims the answer must not make. A match is a hard veto to NO.
- Tool paths are real. A web or mcp context that answers without invoking any of its tools is vetoed to NO. Model memory does not count as evidence for a tool path.
- Builds prove themselves. Builds pass or fail on a SWE-bench-style verifier (failToPass + passToPass).
- No LLM grades another LLM. Every signal is a substring match, a recorded tool invocation, or a command result.
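The question-scoring rules above can be sketched in a few lines of Python. This is an illustrative sketch, not Pickled's implementation: the function names, the normalization (lowercasing and whitespace collapsing), the `tool_calls` counter, and the positive verdict label `YES` are all assumptions; only `allOf`, `anyOf`, the context modes, and the `NO` veto come from the text.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def matches(contract: dict, answer: str) -> bool:
    """True when every allOf string, and at least one anyOf string if any
    are given, occurs as a substring of the normalized answer."""
    haystack = normalize(answer)
    all_of = [normalize(s) for s in contract.get("allOf", [])]
    any_of = [normalize(s) for s in contract.get("anyOf", [])]
    if not all_of and not any_of:
        return False  # an empty contract never matches
    if any(s not in haystack for s in all_of):
        return False
    return not any_of or any(s in haystack for s in any_of)


def grade_cell(answer, facts, misstatements, mode, tool_calls):
    """Grade one cell. "YES" is a hypothetical label for a passing verdict."""
    # Tool paths are real: a web or mcp cell with no tool invocation is vetoed.
    if mode in ("web", "mcp") and tool_calls == 0:
        return "NO"
    # Any matched misstatement is a hard veto.
    if any(matches(m, answer) for m in misstatements):
        return "NO"
    # Every expected fact must be covered.
    return "YES" if all(matches(f, answer) for f in facts) else "NO"


install_fact = {"allOf": ["bunx my-product"]}
print(grade_cell("Run bunx my-product.", [install_fact], [], "memory", 0))
```

Every branch is a substring check or a count of recorded tool calls, which is why the same answer always gets the same verdict.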
bunx @pickled-dev/cli init
bunx @pickled-dev/cli check .

schemaVersion: 2
product:
  name: my-product
  description: short one-liner
sources:
  docs: { url: https://docs.my-product.dev/llms-full.txt }
agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5
contexts:
  memory: { mode: memory } # no context, model memory only
  given_docs: { mode: inject, source: docs } # docs content injected
  web_open: { mode: web } # open web discovery
facts:
  install_command:
    statement: my-product installs with bunx my-product.
    match:
      allOf: ["bunx my-product"]
questions:
  - id: install
    question: How do I install my-product?
    agents: [quick]
    contexts: [memory, given_docs, web_open]
    expects: [install_command]
thresholds:
  questions: 60

That question runs three cells, one per context, and grades each on its own. memory answers from model memory; given_docs reads the docs you registered; web_open makes the agent reach the live site through web tools (a cell that answers without invoking a tool is vetoed). Every cell scores the same fact. Compare the verdicts to see which context the agent actually needed to get it right.
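As a rough illustration of how one question fans out into cells, the expansion is a Cartesian product of its agents and contexts. The `expand_cells` helper below is hypothetical and only mirrors the `agents` and `contexts` keys from the config; it does not show how Pickled itself schedules runs.

```python
from itertools import product


def expand_cells(task: dict) -> list[tuple[str, str]]:
    """Return one (agent, context) pair per cell for a question or build."""
    return list(product(task["agents"], task["contexts"]))


question = {
    "id": "install",
    "agents": ["quick"],
    "contexts": ["memory", "given_docs", "web_open"],
}

cells = expand_cells(question)
for agent, context in cells:
    print(f"{question['id']}: agent={agent} context={context}")
```

With one agent and three contexts this yields three cells; adding a second agent would double that to six.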
Questions check what the agent says. Builds check what the agent can do:
builds:
  - id: add-toolbar
    goal: Add a toolbar using my-product.
    agents: [quick]
    contexts: [given_docs]
    trials: 3
    workspace:
      path: ./fixtures/app
      setup: [bun install]
    verifier:
      failToPass:
        - { run: bun test }
      passToPass:
        - { run: bun run typecheck }
    referenceSolution:
      patch: ./fixtures/solutions/add-toolbar.patch

Run builds with:

pickled build .

Each (agent × context) cell runs in a fresh workspace for each trial and reports Built k/n, Partially built k/n, or Did not build k/n.
Before spending tokens, run pickled build . --verify-only to prove the starter fixture fails, the regression guard passes, and the reference patch clears the verifier.
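The three checks named above (starter fails, regression guard passes, reference patch clears the verifier) can be sketched as a pure function. This is a sketch under assumptions: `verify_fixture` and the two runner callables are hypothetical, each runner stands in for "run this command in that workspace and report whether it exited 0", and the real `--verify-only` behavior may differ in detail.

```python
from typing import Callable

# A runner takes a command string and returns True when it exits 0.
Runner = Callable[[str], bool]


def verify_fixture(
    fail_to_pass: list[str],
    pass_to_pass: list[str],
    run_on_starter: Runner,
    run_on_solution: Runner,
) -> list[str]:
    """Return a list of problems; an empty list means the fixture is sound."""
    problems = []
    for cmd in fail_to_pass:
        # These must fail before the change and pass with the reference patch.
        if run_on_starter(cmd):
            problems.append(f"failToPass already passes on the starter: {cmd}")
        if not run_on_solution(cmd):
            problems.append(f"failToPass still fails with the reference patch: {cmd}")
    for cmd in pass_to_pass:
        # Regression guards must pass both before and after the change.
        if not run_on_starter(cmd):
            problems.append(f"passToPass fails on the starter: {cmd}")
        if not run_on_solution(cmd):
            problems.append(f"passToPass fails with the reference patch: {cmd}")
    return problems


# Simulated exit results instead of real subprocess calls.
starter = {"bun test": False, "bun run typecheck": True}
solution = {"bun test": True, "bun run typecheck": True}
print(verify_fixture(["bun test"], ["bun run typecheck"],
                     lambda c: starter[c], lambda c: solution[c]))
```

Injecting the runners keeps the logic testable without a real workspace; a real harness would replace them with subprocess calls.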
Save a run once, then render it without spending more tokens:
bunx @pickled-dev/cli check . --output pickled-report.json
bunx @pickled-dev/cli report pickled-report.json --format markdown

Default JSON is CI-safe: it keeps verdicts, scores, missing facts, misstatements, provenance, and build attempts, but strips source content, full agent answers, transcripts, diffs, and command output. Use --verbose when you need a forensic receipt.
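To make the CI-safe idea concrete, here is a minimal sketch of stripping bulky fields from a report before it is stored. The key names (`sourceContent`, `answer`, `transcript`, `diff`, `commandOutput`) and the report shape are hypothetical, chosen only to mirror the categories listed above; the actual JSON schema may look different.

```python
# Hypothetical key names for the content that the default output leaves out.
VERBOSE_ONLY_KEYS = {"sourceContent", "answer", "transcript", "diff", "commandOutput"}


def strip_verbose(node):
    """Recursively drop verbose-only keys from nested dicts and lists."""
    if isinstance(node, dict):
        return {
            key: strip_verbose(value)
            for key, value in node.items()
            if key not in VERBOSE_ONLY_KEYS
        }
    if isinstance(node, list):
        return [strip_verbose(item) for item in node]
    return node


report = {
    "cells": [
        {
            "verdict": "NO",
            "missingFacts": ["install_command"],
            "answer": "Install it with npm.",
            "transcript": ["..."],
        }
    ]
}
print(strip_verbose(report))
```

The function returns a new structure and leaves the input untouched, so a verbose copy can still be kept alongside the stripped one.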
Full docs: docs.pickled.dev.
MIT
