LLM-based Verus formal verification agent for curve25519-dalek.
inference-dalek generates Verus contracts (requires / ensures), loop invariants, and termination proofs for Rust functions, then repairs verification failures through multi-turn LLM cycles. It targets cryptographic code in the curve25519-dalek family and uses a layered domain model (field → scalar → Edwards → Montgomery → Ristretto) so each function is proved against an accumulated set of already-verified lemmas.
Given a function with admit() or missing proof blocks, the pipeline:
- Spec generation — emits
requires/ensuresclauses from the function body and surrounding context. - Assembly — inserts the spec into a scaffold and runs Verus.
- Proof generation — on failure, generates loop invariants, decreases clauses, and proof blocks.
- Repair loop — classifies Verus errors (22 distinct types), dispatches to a repair-skill agent, re-verifies, and guards against regressions.
- Escalation — admits and flags the function for human review after a configurable number of turns.
For the full design rationale, see pipeline_complete.md.
pip install -e ".[dev]" # core + pytest
pip install -e ".[openai,dev]" # include OpenAI backendRequired environment:
ANTHROPIC_API_KEYVERUS_PATH(auto-detected ifverusis on yourPATH)
Optional: OPENAI_API_KEY for the OpenAI backend.
inference-dalek <subcommand>
# run | proof | struct | project | evaluate | baseline | synthesize | generateSingle-function debug run with full DEBUG logging:
inference-dalek run --sample <id> -v -v --debug-log debug.logBatch HAB evaluation:
scripts/run_hab_eval.sh <experiment_id> [--model claude-opus-4-6] [--layer-set A]Verbosity flags: -v = INFO, -v -v = DEBUG. --debug-log <file> writes DEBUG output regardless of the console level.
pytest # all
pytest -x # stop on first failure
pytest -m "not integration" # skip Verus-binary tests
pytest -m "not slow"Tests use a MockProvider for LLM calls and never hit real APIs.
High-level layout:
inference_dalek/agent.py— orchestrator (VerusAgent)inference_dalek/repair_loop.py— multi-turn verify/repair state machineinference_dalek/error_routing.py— error type → repair skill dispatchinference_dalek/stages/— 18 pluggable pipeline stagesinference_dalek/providers/— Anthropic / OpenAI backendsinference_dalek/verus/— subprocess runner, error parsing, code surgeryinference_dalek/project/— multi-file project verification with recovery cascadeinference_dalek/eval/— HAB evaluation framework
See CLAUDE.md for conventions, hard rules, and detailed module map.
- Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052, 2026. — End-to-end optimization of multi-stage LLM pipelines; relevant context for tuning agentic verification harnesses like the one in this repository.