🥒 pickled

Pickled tests whether real agents can answer and build with your product, across declared context paths, using deterministic evidence.

Why

Docs can be correct and agents can still answer wrong. Examples can be correct and agents can still build the wrong thing. Pickled gives you receipts: per-cell verdicts showing which agent, which source, and which context path produced which answer or build result. No LLM grades another LLM.

How it works

Five terms:

Agent is who answers: Claude Code, Codex CLI, Anthropic API, OpenAI API.
Source is the public context Pickled may score against: a local file, a URL, or a codebase glob. Anything not registered does not count.
Context is a named delivery path: a mode (memory, inject, web, or mcp) plus an optional source.
Question asks something and scores the answer on declared facts (and rejected misstatements). Build has the agent edit a workspace and passes when the verifier does.
Fact / misstatement are the reusable, deterministic match contracts a question scores against.

A task runs as one cell per (agent × context) pair, and each cell is graded on its own.
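
The cell expansion described above can be sketched as a plain cross product. This is only an illustration: the `Cell` type and the `expandCells` helper are hypothetical names, not Pickled's actual internals.

```typescript
// Hypothetical shape; Pickled's real internal types are not shown in this README.
type Cell = { agent: string; context: string };

// Expand a task's agents and contexts into one cell per (agent × context) pair.
function expandCells(agents: string[], contexts: string[]): Cell[] {
  return agents.flatMap((agent) =>
    contexts.map((context) => ({ agent, context })),
  );
}

// One agent and three contexts give three independently graded cells.
const cells = expandCells(["quick"], ["memory", "given_docs", "web_open"]);
console.log(cells.length); // 3
```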

What it checks

Facts. Reusable product truths a question's answer must cover (allOf / anyOf substring matches, normalized).
Misstatements. Reusable wrong claims the answer must not make. A match is a hard veto to NO.
Tool paths are real. A web or mcp context that answers without invoking any of its tools is vetoed to NO. Model memory does not count as evidence for a tool path.
Builds prove themselves. Builds pass or fail on a SWE-bench-style verifier (failToPass + passToPass).
No LLM grades another LLM. Every signal is a substring match, a recorded tool invocation, or a command result.
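
To make the checks above concrete, here is a minimal sketch of how one question cell might be graded deterministically. Everything in it is an assumption made for illustration: the type names, the `gradeCell` helper, the normalization rule (lowercase plus collapsed whitespace), the "YES" label for a passing verdict, and the choice to require every fact. Pickled's real scoring may differ, for example by reporting partial scores.

```typescript
// Hypothetical types; the README only says matches are normalized substring checks.
type Match = { allOf?: string[]; anyOf?: string[] };
type Verdict = "YES" | "NO";
type Mode = "memory" | "inject" | "web" | "mcp";

// Assumed normalization: lowercase and collapse runs of whitespace.
const normalize = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

// True when every allOf needle and at least one anyOf needle appear in the answer.
function matches(answer: string, match: Match): boolean {
  const allOf = match.allOf ?? [];
  const anyOf = match.anyOf ?? [];
  if (allOf.length === 0 && anyOf.length === 0) return false; // empty contract never matches
  const text = normalize(answer);
  const has = (needle: string): boolean => text.includes(normalize(needle));
  return allOf.every(has) && (anyOf.length === 0 || anyOf.some(has));
}

function gradeCell(input: {
  answer: string;
  facts: Match[];
  misstatements: Match[];
  mode: Mode;
  toolInvocations: number; // count of recorded tool calls for this cell
}): Verdict {
  // Tool-path veto: a web or mcp cell must have invoked at least one tool.
  const isToolPath = input.mode === "web" || input.mode === "mcp";
  if (isToolPath && input.toolInvocations === 0) return "NO";

  // Misstatement veto: any matched wrong claim forces NO.
  if (input.misstatements.some((m) => matches(input.answer, m))) return "NO";

  // Assumption: every expected fact must be covered.
  return input.facts.every((f) => matches(input.answer, f)) ? "YES" : "NO";
}

const verdict = gradeCell({
  answer: "Run  BUNX my-product to install.",
  facts: [{ allOf: ["bunx my-product"] }],
  misstatements: [],
  mode: "inject",
  toolInvocations: 0,
});
console.log(verdict); // YES
```

The vetoes run before fact matching, so a cell that states the right fact but skips its tools, or repeats a registered misstatement, still ends at NO.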

Quick start

bunx @pickled-dev/cli init
bunx @pickled-dev/cli check .

Tiny config

schemaVersion: 2

product:
  name: my-product
  description: short one-liner

sources:
  docs: { url: https://docs.my-product.dev/llms-full.txt }

agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5

contexts:
  memory: { mode: memory } # no context, model memory only
  given_docs: { mode: inject, source: docs } # docs content injected
  web_open: { mode: web } # open web discovery

facts:
  install_command:
    statement: my-product installs with bunx my-product.
    match:
      allOf: ["bunx my-product"]

questions:
  - id: install
    question: How do I install my-product?
    agents: [quick]
    contexts: [memory, given_docs, web_open]
    expects: [install_command]

thresholds:
  questions: 60

That question runs three cells, one per context, and grades each on its own. memory answers from model memory; given_docs reads the docs you registered; web_open makes the agent reach the live site through web tools (a cell that answers without invoking a tool is vetoed). Every cell scores the same fact. Compare the verdicts to see which context the agent actually needed to get it right.

Builds

Questions check what the agent says. Builds check what the agent can do:

builds:
  - id: add-toolbar
    goal: Add a toolbar using my-product.
    agents: [quick]
    contexts: [given_docs]
    trials: 3
    workspace:
      path: ./fixtures/app
      setup: [bun install]
    verifier:
      failToPass:
        - { run: bun test }
      passToPass:
        - { run: bun run typecheck }
    referenceSolution:
      patch: ./fixtures/solutions/add-toolbar.patch

Run builds with pickled build . (the trailing dot is the path argument). Each (agent × context) cell runs in a fresh workspace for each trial and reports Built k/n, Partially built k/n, or Did not build k/n.

Before spending tokens, run pickled build . --verify-only to prove the starter fixture fails, the regression guard passes, and the reference patch clears the verifier.
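
The three preflight conditions can be read as a small truth table over command results. The sketch below is hypothetical: `CheckResults` and `preflight` are illustrative names, and it assumes that every failToPass command must fail on the starter fixture, following the SWE-bench convention. It does not run any commands. It only models the pass/fail logic.

```typescript
// Hypothetical model: true means the command exited successfully.
type CheckResults = { failToPass: boolean[]; passToPass: boolean[] };

// Returns a list of problems; an empty list means the fixture is ready for agents.
function preflight(starter: CheckResults, patched: CheckResults): string[] {
  const problems: string[] = [];

  // The starter fixture must fail the failToPass checks, or there is nothing to build.
  if (starter.failToPass.some((ok) => ok)) {
    problems.push("starter fixture already passes a failToPass check");
  }
  // The regression guard must pass before the agent touches anything.
  if (!starter.passToPass.every((ok) => ok)) {
    problems.push("regression guard fails on the starter fixture");
  }
  // With the reference patch applied, every verifier command must pass.
  if (![...patched.failToPass, ...patched.passToPass].every((ok) => ok)) {
    problems.push("reference patch does not clear the verifier");
  }
  return problems;
}

const problems = preflight(
  { failToPass: [false], passToPass: [true] }, // starter: tests fail, typecheck passes
  { failToPass: [true], passToPass: [true] }, // patched: everything passes
);
console.log(problems.length); // 0
```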

Receipts

Save a run once, then render it without spending more tokens:

bunx @pickled-dev/cli check . --output pickled-report.json
bunx @pickled-dev/cli report pickled-report.json --format markdown

Default JSON is CI-safe: it keeps verdicts, scores, missing facts, misstatements, provenance, and build attempts, but strips source content, full agent answers, transcripts, diffs, and command output. Use --verbose when you need a forensic receipt.

License

MIT
