Visual anomaly detection (VAD) arises across many real-world domains β industrial
inspection, medical imaging, road-scene safety, infrastructure monitoring,
remote-sensing change detection β each with its own anomaly definition and
modality, so per-domain training rarely transfers. Vision-language models (VLMs)
applied directly conflate world-knowledge priors with task-specific anomaly
definitions and emit confident but wrong answers. We argue that VAD is a
compositional perception task: locate candidates, compare against normal
references, apply domain knowledge, and commit to a calibrated score. We
therefore present AnomalyClaw, a training-free VAD agent that judges
through multi-turn refutation. Each turn proposes candidate anomalies and
invokes a 13-tool catalog (visual inspection, reference understanding, frozen
expert probes) to refute each candidate against the references; the refutation
score is fused with a parallel Direct VLM judgment on the same backbone. On
our new CrossDomainVAD-12 benchmark (12 domains, 1,418 test items),
AnomalyClaw delivers consistent macro-AUROC gains on every backbone β +6.23 pp
on GPT-5.5, +7.93 pp on Seed2.0-lite, +3.52 pp on Qwen3.5-VL-27B
(
- 2026-05 β Code, benchmark manifests, and pre-computed result tables released.
- 2026-05 β Preprint available on arXiv:2605.10397.
AnomalyClaw runs two parallel branches on the same VLM backbone and fuses their scores:
ββββββββββββββββββββββββββββββββββββββββ
β Direct branch (one VLM call) β
item βββββββΊ β generic-descriptor anomaly score β βββΊ s_direct
(refs+query) β β
ββββββββββββββββββββββββββββββββββββββββ€
β Refutation branch (multi-turn) β
β turn 1 : propose K candidate anomalies
β turn t : pick a tool, observe, refute
β turn N : commit final score β βββΊ s_refute
ββββββββββββββββββββββββββββββββββββββββ
β
βΌ
anomaly_score = Ξ±Β·s_direct + (1-Ξ±)Β·s_refute (Ξ± = 0.5)
The 13-tool catalog the refutation branch can invoke (agent_tools_v8.py):
| Family | Tools |
|---|---|
| Visual inspection | side_by_side_compare, region_zoom, segment_anomaly |
| Reference understanding | reference_retriever, reference_profile, image_diff, rotate_align |
| Frozen expert probes | subspace_ad_score, anomaly_vfm_score, expert_heatmap |
| Structure / texture | blob_layout_viz, change_heatmap_viz, fft_spectrum_viz |
Each tool carries an applicability annotation so the agent does not collapse
into a single primitive across domains. See benchmark/scripts/AGENTS.md for
the full v12 controller and refutation protocol.
A 12-domain, reference-based VAD benchmark under a single per-image AUROC protocol. Each domain contributes 20 / 40 / 120 items for calibration / dev / test (D7 has 98 test); total 1,418 test items.
| ID | Domain | Source | Modality |
|---|---|---|---|
| D1 | Industrial manufacturing | MVTec-AD | RGB |
| D2 | Complex industrial | VisA | RGB |
| D3 | Logical anomalies | MVTec-LOCO | RGB |
| D4 | 3D product | Real3D-AD (rendered) | RGB views |
| D5 | Retail products | GoodsAD | RGB |
| D6 | Infrastructure (concrete) | SDNET2018 | RGB |
| D7 | Remote-sensing change | LEVIR-CD+ | bi-temporal RGB |
| D8 | Dermatology | DermaMNIST | RGB |
| D9 | Brain MRI | BraTS 2021 | MRI slice |
| D10 | Liver CT | BMAD-Liver | CT slice |
| D11 | GI endoscopy | HyperKvasir | RGB |
| D12 | Road safety | BDD100K + RoadAnomaly21 | RGB |
Manifests are versioned in benchmark/manifests_v2/. Image paths use a
portable {DATA_ROOT}/... placeholder, resolved at load time via the
ANOMALYCLAW_DATA env var. See DATA.md for per-dataset download
instructions β we do not redistribute raw images because most upstream
licenses (e.g. MVTec) forbid it.
Macro AUROC on CrossDomainVAD-12 test (n = 1,418).
| Backbone | Direct | AnomalyClaw (Ours) | 95% CI | |
|---|---|---|---|---|
| GPT-5.5 | 0.752 | 0.814 | +6.23 pp | [+4.84, +7.63] |
| Seed2.0-lite | 0.688 | 0.767 | +7.93 pp | [+6.20, +9.53] |
| Qwen3.5-VL-27B | 0.713 | 0.748 | +3.52 pp | [+1.93, +5.11] |
All
The exact run outputs that back the table ship in:
benchmark/results/v2/v12_passive_test/ # Qwen3.5-VL-27B β 0.7480
benchmark/results/v2/v12_passive_test_seedvl/ # Seed2.0-lite β 0.7672
benchmark/results/v2/v12_passive_gpt55_test/ # GPT-5.5 β 0.8135
Each is a directory of D{1..12}.json per-item records, blended with Ξ±=0.5.
git clone https://github.com/jam-cc/AnomalyClaw.git
cd AnomalyClaw
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txtRequired: Python 3.10+, PyTorch 2.x, and an OpenAI-compatible VLM endpoint (local vLLM or a hosted API). The agent itself has no neural training step.
export ANOMALYCLAW_DATA=$PWD/benchmark/data
mkdir -p "$ANOMALYCLAW_DATA"Then follow the per-dataset table in DATA.md to populate
$ANOMALYCLAW_DATA (we ship manifests, retrieval indices, and expert score
caches in-repo β only the raw image folders need to come from upstream).
# Local Qwen3.5-VL-27B via vLLM (the paper setup):
bash benchmark/scripts/launch_qwen35_replicas.sh
export QWEN_API_BASE=http://localhost:8210/v1
export QWEN_MODEL=Qwen3.5-VL-27B
export QWEN_API_KEY=EMPTY
# Or hosted GPT / SeedVL:
# export GPT_API_KEY=... GPT_API_BASE=...
# export SEED_API_KEY=... SEED_API_BASE=...No API keys are baked into the code; each backend reads its own env vars.
python benchmark/scripts/agent_v12.py \
--manifest benchmark/manifests_v2/D1_industrial_manifest.json \
--split test \
--backend qwen3 \
--output benchmark/results/v2/v12_passive_test/D1.json \
--max_turns 3 --max_workers 8 --resumebash benchmark/scripts/run_v12_passive_test.sh # all 12 test domainsReproduce the paper's Table 1 macro AUROC in one command (per backbone):
# Qwen3.5-VL-27B β 0.7480 (paper: 0.748)
python benchmark/scripts/aggregate_v12.py \
--results benchmark/results/v2/v12_passive_test
# Seed2.0-lite β 0.7672 (paper: 0.767)
python benchmark/scripts/aggregate_v12.py \
--results benchmark/results/v2/v12_passive_test_seedvl
# GPT-5.5 β 0.8135 (paper: 0.814)
python benchmark/scripts/aggregate_v12.py \
--results benchmark/results/v2/v12_passive_gpt55_testaggregate_v12.py implements the paper's exact aggregation: AD-mode filter
(mode == "anomaly_detection", both branch scores present, no error) and
Ξ±=0.5 blend of direct_score + v9_score.
To get per-item metrics on a single domain file:
python benchmark/scripts/evaluate.py \
--results benchmark/results/v2/v12_passive_test/D1.json \
--output /tmp/D1_metrics.jsonFor the MMAD-MCQA fair-baseline pipeline (paper Β§4.6):
python benchmark/scripts/mmad_eval_v12_mmad.py --help
python benchmark/scripts/mmad_eval_single_letter.py --helpAnomalyClaw/
βββ benchmark/
β βββ BENCHMARK_SPEC.json # CrossDomainVAD-12 spec
β βββ manifests_v2/ # 12-domain canonical manifests
β βββ retrieval_index/D*_index.npz # DINOv2 reference embeddings
β βββ results/ # paper-final result JSONs + expert caches
β βββ scripts/
β βββ agent_v12.py # canonical AD agent β
β βββ agent_v12_mmad.py # MMAD-MCQA variant
β βββ agent_v12_logitdirect.py # logit-Direct deployment variant
β βββ agent_prompt_v{9,10,12_mmad*}.py
β βββ agent_tools_v8.py # 13-tool catalog β
β βββ infer.py # backend clients + image utils
β βββ evaluate.py # macro AUROC + bootstrap CI
β βββ mmad_eval_v12_mmad*.py # MMAD-MCQA evaluators
β βββ baseline_*.py # AnomalyDINO / VisualAD / AD-Copilot / IAD-R1
β βββ expert_*.py # SubspaceAD / AnomalyVFM wrappers
β βββ run_*.sh, launch_*.sh # reproducer shells
βββ experts/ # upstream baseline clone targets (README only)
Released under CC BY-NC 4.0 β non-commercial use only. Third-party datasets and expert baselines retain their original licenses.
@article{jiang2026anomalyclaw,
title = {AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation},
author = {Jiang, Xi and Zhao, Yinjie and Yang, Zesheng and Zheng, Feng},
journal = {arXiv preprint arXiv:2605.10397},
year = {2026},
eprint = {2605.10397},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.10397}
}Built on top of Qwen2.5/3.5-VL, DINOv2, and the public AnomalyVFM, AnomalyDINO, SubspaceAD (and other baselines). We thank the MMAD authors for the MCQA benchmark we use in Β§4.6.

