inference-dalek

LLM-based Verus formal verification agent for curve25519-dalek.

inference-dalek generates Verus contracts (requires / ensures), loop invariants, and termination proofs for Rust functions, then repairs verification failures through multi-turn LLM cycles. It targets cryptographic code in the curve25519-dalek family and uses a layered domain model (field → scalar → Edwards → Montgomery → Ristretto) so each function is proved against an accumulated set of already-verified lemmas.

What it does

Given a function with admit() or missing proof blocks, the pipeline runs five stages in order:

Spec generation — emits requires / ensures clauses from the function body and surrounding context.
Assembly — inserts the spec into a scaffold and runs Verus.
Proof generation — on failure, generates loop invariants, decreases clauses, and proof blocks.
Repair loop — classifies Verus errors (22 distinct types), dispatches to a repair-skill agent, re-verifies, and guards against regressions.
Escalation — admits and flags the function for human review after a configurable number of turns.
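
The control flow of the five stages above can be sketched as follows. This is an illustrative outline only, not the code in inference_dalek/repair_loop.py: the names run_pipeline, VerifyResult, and the callable parameters are hypothetical, and the regression guard (reject a repair that increases the error count) is one plausible policy assumed for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class VerifyResult:
    """Hypothetical verifier outcome: an empty error list means success."""
    errors: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def run_pipeline(
    source: str,
    generate_spec: Callable[[str], str],
    generate_proof: Callable[[str, str], str],
    repair: Callable[[str, str], str],
    verify: Callable[[str], VerifyResult],
    max_turns: int = 3,
) -> Tuple[str, str]:
    """Return (code, status), where status is 'verified' or 'escalated'."""
    # Spec generation + assembly: add a spec, then run the verifier once.
    code = generate_spec(source)
    result = verify(code)
    if result.ok:
        return code, "verified"

    # Proof generation: first attempt at invariants / decreases / proof blocks.
    code = generate_proof(code, result.errors[0])
    result = verify(code)

    # Repair loop: bounded number of turns, with a simple regression guard.
    turns = 0
    while not result.ok and turns < max_turns:
        candidate = repair(code, result.errors[0])
        candidate_result = verify(candidate)
        # Assumed guard: keep the candidate only if it does not add errors.
        if len(candidate_result.errors) <= len(result.errors):
            code, result = candidate, candidate_result
        turns += 1

    # Escalation: out of turns and still failing, so hand off to a human.
    return code, "verified" if result.ok else "escalated"
```

Passing the stages in as callables keeps the sketch independent of any LLM backend or Verus installation, so the loop logic can be exercised with stubs.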

For the full design rationale, see pipeline_complete.md.

Install

pip install -e ".[dev]"          # core + pytest
pip install -e ".[openai,dev]"   # include OpenAI backend

Required environment:

ANTHROPIC_API_KEY
VERUS_PATH (auto-detected if verus is on your PATH)

Optional: OPENAI_API_KEY for the OpenAI backend.
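
A pre-flight check for these variables might look like the sketch below. It assumes that auto-detection means a PATH lookup for an executable named verus (done here with shutil.which); the function name resolve_environment is hypothetical and not part of the package.

```python
import os
import shutil


def resolve_environment(env=None):
    """Return (verus_path, missing), where missing lists unset requirements."""
    env = os.environ if env is None else env

    # ANTHROPIC_API_KEY has no fallback, so it is simply required.
    missing = [name for name in ("ANTHROPIC_API_KEY",) if not env.get(name)]

    # Prefer an explicit VERUS_PATH, otherwise look for `verus` on PATH.
    verus_path = env.get("VERUS_PATH") or shutil.which("verus", path=env.get("PATH"))
    if verus_path is None:
        missing.append("VERUS_PATH")

    return verus_path, missing


if __name__ == "__main__":
    path, missing = resolve_environment()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("verus found at", path)
```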

CLI

inference-dalek <subcommand>
# run | proof | struct | project | evaluate | baseline | synthesize | generate

Single-function debug run with full DEBUG logging:

inference-dalek run --sample <id> -v -v --debug-log debug.log

Batch HAB evaluation:

scripts/run_hab_eval.sh <experiment_id> [--model claude-opus-4-6] [--layer-set A]

Verbosity flags: -v = INFO, -v -v = DEBUG. --debug-log <file> writes DEBUG output regardless of the console level.
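
One common way to get this behavior with the standard library is a counted -v flag plus a separate file handler pinned at DEBUG. The sketch below is not the project's actual CLI code; the function names are hypothetical, and the WARNING default for runs without -v is an assumption, since only the -v and -v -v levels are documented above.

```python
import argparse
import logging


def build_parser():
    """Stand-alone parser with only the two logging-related options."""
    parser = argparse.ArgumentParser(prog="logging-sketch")
    parser.add_argument("-v", action="count", default=0, dest="verbose")
    parser.add_argument("--debug-log", metavar="FILE", default=None)
    return parser


def console_level(verbose):
    """Map the number of -v flags to a console log level."""
    if verbose >= 2:
        return logging.DEBUG
    if verbose == 1:
        return logging.INFO
    return logging.WARNING  # assumed default, not stated in the README


def configure_logging(verbose, debug_log=None, name="logging-sketch"):
    """Attach a console handler and, optionally, a DEBUG-level file handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let each handler do its own filtering

    console = logging.StreamHandler()
    console.setLevel(console_level(verbose))
    logger.addHandler(console)

    if debug_log:
        # The file always receives DEBUG output, whatever the console level.
        file_handler = logging.FileHandler(debug_log)
        file_handler.setLevel(logging.DEBUG)
        logger.addHandler(file_handler)

    return logger
```

Setting the logger itself to DEBUG and filtering per handler is what lets the file capture more detail than the console shows.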

Tests

pytest                           # run all tests
pytest -x                        # stop on first failure
pytest -m "not integration"      # skip Verus-binary tests
pytest -m "not slow"             # skip tests marked slow

Tests use a MockProvider for LLM calls and never hit real APIs.
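
The idea behind a mock provider is that pipeline code talks to an object with a small completion interface, so tests can substitute canned responses. The sketch below illustrates that pattern only: ScriptedProvider, its complete method, and generate_spec are hypothetical names, and the interface of the repository's real MockProvider is not shown in this README.

```python
class ScriptedProvider:
    """Hypothetical stand-in for an LLM backend that replays canned replies."""

    def __init__(self, responses):
        self._responses = list(responses)
        self.prompts = []  # recorded so tests can inspect what was asked

    def complete(self, prompt):
        self.prompts.append(prompt)
        if not self._responses:
            raise AssertionError("no scripted response left")
        return self._responses.pop(0)


def generate_spec(provider, function_source):
    """Toy stage: ask the provider for a spec and place it above the source."""
    spec = provider.complete("Write a Verus spec for:\n" + function_source)
    return spec.strip() + "\n" + function_source


def test_generate_spec_uses_scripted_reply():
    provider = ScriptedProvider(["requires x < 10"])
    out = generate_spec(provider, "fn f(x: u8) {}")
    assert out == "requires x < 10\nfn f(x: u8) {}"
    assert len(provider.prompts) == 1  # exactly one (fake) LLM call was made
```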

Architecture

High-level layout:

inference_dalek/agent.py — orchestrator (VerusAgent)
inference_dalek/repair_loop.py — multi-turn verify/repair state machine
inference_dalek/error_routing.py — error type → repair skill dispatch
inference_dalek/stages/ — 18 pluggable pipeline stages
inference_dalek/providers/ — Anthropic / OpenAI backends
inference_dalek/verus/ — subprocess runner, error parsing, code surgery
inference_dalek/project/ — multi-file project verification with recovery cascade
inference_dalek/eval/ — HAB evaluation framework
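
The "error type → repair skill dispatch" role of error_routing.py can be pictured as a lookup table with a fallback. The sketch below is an illustration under stated assumptions, not the module's contents: the three error labels, the message fragments used to classify them, and the handler names are all hypothetical, and they cover only a small subset of the 22 error types mentioned earlier.

```python
from typing import Callable, Dict, List, Tuple

# Assumed message fragments mapped to hypothetical error-type labels.
PATTERNS: List[Tuple[str, str]] = [
    ("precondition not satisfied", "precondition"),
    ("postcondition not satisfied", "postcondition"),
    ("invariant not satisfied", "loop_invariant"),
]


def classify(message: str) -> str:
    """Return the first matching error label, or 'unknown'."""
    lowered = message.lower()
    for fragment, label in PATTERNS:
        if fragment in lowered:
            return label
    return "unknown"


# Placeholder repair skills; real ones would prompt an LLM with the error.
def repair_precondition(code: str) -> str:
    return code + "\n// repair: establish callee precondition"


def repair_postcondition(code: str) -> str:
    return code + "\n// repair: add proof block for ensures clause"


def repair_invariant(code: str) -> str:
    return code + "\n// repair: strengthen loop invariant"


def repair_generic(code: str) -> str:
    return code + "\n// repair: generic retry"


SKILLS: Dict[str, Callable[[str], str]] = {
    "precondition": repair_precondition,
    "postcondition": repair_postcondition,
    "loop_invariant": repair_invariant,
}


def route(message: str) -> Callable[[str], str]:
    """Pick the repair skill for an error message, falling back to generic."""
    return SKILLS.get(classify(message), repair_generic)
```

Keeping classification separate from the dispatch table means a new error type needs only one pattern entry and one table entry.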

See CLAUDE.md for conventions, hard rules, and a detailed module map.

Related work

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052, 2026. End-to-end optimization of multi-stage LLM pipelines; relevant context for tuning agentic verification harnesses like the one in this repository.

License

BSD 3-Clause.
