anneal

Let an AI agent improve your code, prompts, and configs — overnight, unattended.

Point anneal at any text file in a git repo, tell it how to measure "better," and walk away. The agent generates hypotheses, runs experiments, keeps winners, discards losers, and compounds learnings — all while you sleep.

Quick Start

uv tool install anneal-cli

Requires Python 3.12+.

# Register a target
anneal register \
  --name my-target \
  --artifact path/to/file.py \
  --eval-mode deterministic \
  --run-cmd "python benchmark.py" \
  --parse-cmd "grep 'score' | awk '{print \$2}'" \
  --direction maximize \
  --scope scope.yaml

# Run experiments
anneal run --target my-target --experiments 20

# Review results — every experiment is a git commit
git log --oneline

Supported Providers

Provider	Models	Notes
Anthropic	claude-*	Default. Claude Code or API
OpenAI	gpt-*	Via OpenAI SDK
Google	gemini-*	Via OpenAI-compatible endpoint
Ollama	ollama/*	Local. $0 cost tracking
LM Studio	lmstudio/*	Local. $0 cost tracking
Any OpenAI-compatible	Custom	Via base URL override

Two Evaluation Modes

Deterministic — shell command produces a number

A shell command produces a numeric score. Run code, parse output, compare. Use for: performance benchmarks, test coverage, file size, build time.

--eval-mode deterministic \
--run-cmd "pytest --cov=src --cov-report=term | grep TOTAL | awk '{print \$4}'" \
--parse-cmd "cat"
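As a rough illustration of the deterministic flow (run, parse, compare), the sketch below pulls a number out of command output and checks it against the current best for a given direction. The helper names `parse_score` and `is_improvement` are hypothetical and are not part of anneal's API. Note that a value such as `85%` from a coverage report needs its percent sign dropped before it can be read as a float, which the regular expression here handles.

```python
import re

# Hypothetical helpers for illustration only; not anneal's internals.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_score(output: str) -> float:
    """Return the last number found in the command output as a float."""
    matches = _NUMBER.findall(output)
    if not matches:
        raise ValueError(f"no numeric score in output: {output!r}")
    return float(matches[-1])


def is_improvement(candidate: float, best: float, direction: str) -> bool:
    """Compare two scores under 'maximize' or 'minimize'."""
    if direction == "maximize":
        return candidate > best
    if direction == "minimize":
        return candidate < best
    raise ValueError(f"unknown direction: {direction!r}")
```

Under these assumptions, `parse_score("85%")` yields `85.0`, and a candidate is kept only when `is_improvement` returns `True` for the registered `--direction`.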

Stochastic — LLM judges N samples against K binary criteria

An LLM judges N samples against K binary (YES/NO) criteria. Use for: prompt quality, documentation clarity, content optimization — anything where output varies between runs.

--eval-mode stochastic \
--criteria eval_criteria.toml

Each criterion is a YES/NO question. Scores aggregate across samples and criteria into a single float.
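One plausible way to collapse the judge's verdicts into a single float is the unweighted fraction of YES answers over all sample and criterion pairs. This is an assumption made for illustration; anneal's actual aggregation may weight or test results differently. The criterion labels are borrowed from the summarizer example in this README, and `aggregate` is a hypothetical helper.

```python
# Hypothetical criteria labels, taken from the summarizer example.
CRITERIA = ["key points captured", "concise", "plain language", "factually accurate"]


def aggregate(verdicts: list[list[bool]]) -> float:
    """Fraction of YES verdicts; verdicts[i][j] is sample i judged on criterion j."""
    flat = [v for sample in verdicts for v in sample]
    if not flat:
        raise ValueError("no verdicts to aggregate")
    return sum(flat) / len(flat)


# Two samples judged on the four criteria: 6 of 8 verdicts are YES.
verdicts = [
    [True, True, False, True],
    [True, True, True, False],
]
print(aggregate(verdicts))  # 0.75
```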

Where Anneal Works

Use Case	Eval Mode	Approx. cost per 50 experiments (USD)
Prompt optimization	stochastic	$8–$13
API response time	deterministic	$2–$5
Test coverage improvement	deterministic	$2–$5
Training config (hyperparams)	deterministic	$2–$8
RAG retrieval prompt	deterministic	$2–$5
System prompt	stochastic	$8–$15
Config tuning (build/infra)	deterministic	$1–$3

Where Anneal Does Not Work

Target	Reason
Binary files, databases	Artifact must be a text file inside a git repo
Embedding model selection	Requires full re-index — not a file edit
Inter-agent protocol changes	Coordinated multi-file edits required
Live system tuning	No git isolation, unsafe to mutate in place
Cross-service optimization	Single-artifact scope only
Database schema migrations	Irreversible side effects

Results

Code Golf — 93.7% size reduction

Shrink a verbose Python file while preserving byte-identical output.

Metric	Value
Target	`examples/code-golf/app.py`
Eval mode	Deterministic (file size in bytes)
Direction	Minimize
Experiments	7
Start score	3,592 bytes
End score	228 bytes
Reduction	93.7%

Score trajectory: 3,592 bytes to 228 bytes over 7 experiments
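The headline figure follows directly from the start and end scores in the table. A quick check of the arithmetic, with an illustrative function name:

```python
def reduction_pct(start: float, end: float) -> float:
    """Percentage by which `end` is smaller than `start`."""
    return (start - end) / start * 100


print(round(reduction_pct(3592, 228), 1))  # 93.7
```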

Prompt Optimization — stochastic eval

Improve an article summarizer prompt against 4 binary criteria across 5 test articles. Scores improve across 10 experiments as the agent iteratively refines the system prompt.

Examples

Prompt Optimization — stochastic eval

The agent rewrites system_prompt.md, generates summaries from 5 test articles, and an LLM judge scores each against 4 binary criteria (key points captured? concise? plain language? factually accurate?).

anneal register \
  --name prompt-optimizer \
  --artifact examples/prompt-optimizer/system_prompt.md \
  --eval-mode stochastic \
  --criteria examples/prompt-optimizer/eval_criteria.toml \
  --direction maximize \
  --scope examples/prompt-optimizer/scope.yaml

anneal run --target prompt-optimizer --experiments 10

Test Coverage — deterministic eval, maximize

The agent adds tests to cover untested code paths. pytest --cov provides the score. Source code is immutable — the agent can only write tests.

anneal register \
  --name test-coverage \
  --artifact examples/test-coverage/tests/test_calculator.py \
  --eval-mode deterministic \
  --run-cmd "bash examples/test-coverage/eval.sh" \
  --parse-cmd "cat" \
  --direction maximize \
  --scope examples/test-coverage/scope.yaml

anneal run --target test-coverage --experiments 10

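Code Golf — deterministic eval, minimize

The agent shrinks app.py while keeping its output byte-identical. The score is the file size in bytes.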

anneal register \
  --name code-golf \
  --artifact examples/code-golf/app.py \
  --eval-mode deterministic \
  --run-cmd "bash examples/code-golf/eval.sh" \
  --parse-cmd "cat" \
  --direction minimize \
  --scope examples/code-golf/scope.yaml

anneal run --target code-golf --experiments 10

Local Artifacts (no git tracking required)

Artifact files don't need to be committed to git. If they're untracked, anneal copies them into the worktree automatically during registration. For files you don't want in version control at all, use --in-place to skip worktree isolation entirely:

anneal register \
  --name local-skill \
  --artifact SKILL.md \
  --eval-mode stochastic \
  --criteria eval_criteria.toml \
  --direction maximize \
  --scope scope.yaml \
  --in-place

Documentation

Doc	What's in it
Overview	Motivation, lineage, and the core idea
Eval Guide	Writing good binary evaluation criteria
Recipes	Copy-paste registration commands for common targets
Use Cases	Where anneal works, where it doesn't, and why
Features	Search strategies, statistical methods, knowledge system
Architecture	Module map and design principles
System Design	Full technical design document
CI Integration	GitHub Actions workflow and status JSON output

Testing

uv run pytest tests/ -x -q          # 820 tests
uv run pytest tests/ --cov=anneal    # With coverage

License

Apache-2.0
