Litmus

A benchmark for smart-contract vulnerability detectors. It scores the property that decides whether a tool is usable in practice: how many false alarms it raises to surface one real bug. Every bug in the corpus has a runnable Foundry PoC, so a label holds because forge test reproduces the exploit, not because someone tagged it.

Why

Automated and AI-assisted auditors are typically measured on recall: whether the tool finds the bug. The cost that breaks triage is on the other side: false alarms. A tool that reports fifty criticals to surface one real bug buries the signal and wastes the reviewer's time. Open contests routinely take thousands of submissions that collapse, after judging, to a handful of unique valid issues. The rest is noise. No widely used benchmark scores that noise. This one does.

What it is

A corpus of vulnerabilities where each label is backed by an executable proof:

a scan target: the contract source a tool analyses.
a runnable Foundry PoC: a passing test means the exploit is real.
machine-readable ground truth: the location and class of each vulnerability.

A scoring harness runs any tool's findings against the ground truth and reports precision, recall, and false alarms per case. A label is added only after its PoC passes.
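
As a rough illustration of the three reported metrics, here is a minimal sketch. It is not the code in harness/score.py: the `Label` type, the `score` function, and the rule that a finding counts only when its (case, class) pair matches the ground truth exactly are assumptions made for this example. The real matching rule, including how location is compared, is defined by SPEC.md and the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical finding/ground-truth record; not the SPEC.md schema."""
    case: str        # e.g. "case001"
    vuln_class: str  # e.g. "rounding/precision"


def score(findings: list[Label], truth: list[Label]) -> dict[str, float]:
    """Score findings against ground truth by exact (case, class) match (assumed rule)."""
    truth_set = set(truth)
    found = set(findings)
    true_pos = len(found & truth_set)   # findings that hit a labelled bug
    false_pos = len(found - truth_set)  # findings that hit nothing labelled
    cases = {t.case for t in truth}
    return {
        "recall": true_pos / len(truth_set) if truth_set else 0.0,
        "precision": true_pos / len(found) if found else 0.0,
        "false_alarms_per_case": false_pos / len(cases) if cases else 0.0,
    }


# Illustrative data only: two labelled bugs, three findings, one of them correct.
truth = [
    Label("case001", "rounding/precision"),
    Label("case002", "reentrancy/temporal"),
]
findings = [
    Label("case002", "reentrancy/temporal"),  # true positive
    Label("case002", "oracle"),               # false alarm
    Label("case001", "oracle"),               # false alarm
]
result = score(findings, truth)
print(result)
```

Under these assumptions the tool finds one of two bugs (recall 0.5), one of its three findings is correct (precision about 0.33), and it raises two false alarms across two cases (1.0 per case).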

How to run

git init -q && forge install foundry-rs/forge-std
forge test -vv          # synthetic cases (001-003), no network

# the real on-chain case (004, Euler) forks mainnet -- point it at an archive RPC:
ETH_RPC_URL=<your-mainnet-rpc> forge test -vv

# score a tool's findings against the ground truth:
python harness/score.py --findings harness/example-findings.json

The findings format a tool must emit is in SPEC.md, with an example in harness/example-findings.json.

The corpus

Case	Class	Source	PoC
001	rounding / precision	ERC-4626 first-depositor inflation (synthetic)	runnable
002	reentrancy / temporal	read-only reentrancy in an LP price oracle (synthetic)	runnable
003	accounting-desync	cached total vs. fee-on-transfer real balance (synthetic)	runnable
004	accounting-desync (missing solvency check)	Euler Finance, real, $197M	fork (needs `ETH_RPC_URL`)

Case 004 reproduces the March 2023 Euler Finance $197M loss against the deployed mainnet bytecode. Each of the other three is a self-contained reproduction of a known bug class.
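
For readers unfamiliar with the bug class behind case 001, the arithmetic of a first-depositor inflation attack can be sketched in a few lines. This is a generic illustration of the class, not the benchmark's contract or its PoC: the `shares_for_deposit` helper, the 1:1 first-deposit rule, and all amounts are assumptions chosen for the example. Integer division stands in for Solidity's round-down behaviour.

```python
def shares_for_deposit(assets: int, total_supply: int, total_assets: int) -> int:
    """Shares minted for a deposit, rounding down (hypothetical vault rule)."""
    if total_supply == 0:
        return assets  # assumed: the first deposit mints shares 1:1
    return assets * total_supply // total_assets


WAD = 10**18  # 18-decimal token unit

# 1. Attacker is the first depositor: 1 wei in, 1 share out.
total_supply = shares_for_deposit(1, 0, 0)
total_assets = 1

# 2. Attacker transfers tokens straight to the vault; no shares are minted,
#    so the price of the single share is inflated.
donation = 10_000 * WAD
total_assets += donation

# 3. Victim deposits 5,000 tokens; the share calculation rounds down to zero.
victim_deposit = 5_000 * WAD
victim_shares = shares_for_deposit(victim_deposit, total_supply, total_assets)
total_supply += victim_shares
total_assets += victim_deposit

# 4. Attacker redeems the only share for the vault's entire balance.
attacker_payout = 1 * total_assets // total_supply
attacker_profit = attacker_payout - donation - 1

print(victim_shares, attacker_profit == victim_deposit)
```

The victim receives zero shares and the attacker's profit equals the victim's whole deposit. No single line here is syntactically suspicious; the bug only appears once the sequence of deposits is considered.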

First result: stock Slither

Stock Slither (standard detectors, no plugins), scored through the harness on the source-analyzable cases:

Tool	recall	precision	FP rate (alarms / case)
Slither (stock)	0.33	0.14	2.0

It catches the read-only reentrancy, which is where pattern-matching works. It misses the inflation attack (001) and the accounting-desync (003) because it ships no detector for either; both require reasoning about economic intent rather than syntax. Of three PoC-proven exploits it flags one, and none of its other results points at the two that drain the vault. A couple of those extra flags are real low-severity smells rather than pure noise, but none is the exploitable bug. Recall on real impact is what counts, and a low false-positive rate is what makes the output worth reading.

slither . --filter-paths "lib/|test/" --json slither-out.json
python harness/adapters/slither_to_litmus.py slither-out.json > slither-findings.json
python harness/score.py --findings slither-findings.json --only case001,case002,case003
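
The three numbers in the table are consistent with each other, which can be checked with simple arithmetic. The sketch below assumes one ground-truth bug per case and one true positive (case 002); the total of seven findings is an inference from the rounded precision of 0.14, not a figure stated in the results, and the harness may count matches differently.

```python
cases = 3                # cases 001-003, assumed one labelled bug each
true_positives = 1       # the read-only reentrancy in case 002
total_findings = 7       # ASSUMPTION: inferred from precision 0.14, i.e. about 1/7
false_positives = total_findings - true_positives

recall = true_positives / cases
precision = true_positives / total_findings
fp_per_case = false_positives / cases

summary = f"recall={recall:.2f} precision={precision:.2f} fp/case={fp_per_case:.1f}"
print(summary)
```

With one true positive, seven is the only total that rounds to a precision of 0.14 (1/6 rounds to 0.17 and 1/8 to 0.13), and six false alarms over three cases gives the reported 2.0.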

AI-assisted auditors are next.

Status

v0: the case format (SPEC.md), the scoring harness, the Slither adapter, and four cases (three synthetic plus the Euler fork), wired end to end. It grows from there: more of the eight-class taxonomy (reentrancy/temporal, rounding/precision, economic/game-theory, oracle, init/upgrade, accounting-desync, signature/replay, liveness/DoS), more historical exploits, and adapters for common static and AI tools.

License

MIT. Contributions welcome; the case format is in SPEC.md.
