This repository provides the official code and evaluation tools for VisReason, including dataset indexing, API and vLLM inference scripts, automatic judging, and result summarization. (ACL 2026 Findings)
- [2026-05-26] We have added the paper link (arXiv).
- [2026-05-23] We have organized and released the inference and evaluation scripts.
- [2026-05-23] We provide examples for both API-based inference and vLLM-based local inference.
- [2026-05-23] We have released the full VisReason dataset on Hugging Face.
- Add citation information.
- Add license information.
- Add continual benchmark updates
This repository hosts the stable official release. For the testing version, please visit YifanWang-Lingf/VisReason.
VisReason is a benchmark for evaluating vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. Unlike many STEM-oriented or knowledge-intensive visual reasoning benchmarks, VisReason is designed to test whether MLLMs can reason directly from visual evidence rather than relying mainly on language-mediated abstractions.
VisReason contains 1,505 carefully curated questions across 10 reasoning categories, covering perceptual, structural, and conceptual reasoning. The tasks include visual difference identification, 3D-spatial reasoning, game-board reasoning, and implicit rule inference from visual cues.
Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing large gaps between humans and current MLLMs and revealing that test-time reasoning strategies such as explicit CoT prompting provide limited gains without stronger visual grounding.
The full dataset description, field definitions, and download instructions are provided on Hugging Face.
VisReason/
data/
img_<class_number>/
datajson_label.<ext>
...
class_1.jsonl
...
datasets.json
results/
<model>/
class_<class_number>_results.json
class_<class_number>_judge.json
summary.json
utils/
datainfer/
bbox_utils.py
test_api_batch.py
test_vLLM.py
judge.py
README.md
pip install openai tqdm pydantic datasets rich pillowFor local vLLM inference, install and run vLLM separately in the environment that serves the model.
test_api_batch.py calls an OpenAI-compatible API endpoint. Set your credentials through environment variables:
export API_KEY="xxx"
export BASE_URL="xxx"Run inference:
# thinking models
python test_api_batch.py <model>
# instruct models
python test_api_batch.py <model> cot
python test_api_batch.py <model> nocotThe default prompt mode is nocot.
Start a vLLM OpenAI-compatible server, for example:
CUDA_VISIBLE_DEVICES=0,1 \
python -m vllm.entrypoints.openai.api_server \
--model /path/to/model \
--trust-remote-code \
--port 8008 \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.8 \
--served-model-name <model>Run inference:
# thinking models
python test_vLLM.py <model> --host localhost --port 8008
# instruct models
python test_vLLM.py <model> cot --host localhost --port 8008
python test_vLLM.py <model> nocot --host localhost --port 8008The default prompt mode is nocot.
Coming soon.
Inference results are saved under:
results/<model>/class_X_results.json
results/<model>_cot/class_X_results.json
Each result file is a JSON list. Each item keeps the original sample fields and adds:
{
"response": "model output"
}Run judging:
export API_KEY="xxx"
export BASE_URL="xxx"
python judge.py <judge_model> results/<model>The evaluator writes:
results/<model>/class_X_judge.json
results/<model>/summary.json
For class_1 and class_2, bounding-box predictions are evaluated with IoU at threshold 0.5. Other classes are evaluated by the judge model.
Set API credentials:
export API_KEY="xxx"
export BASE_URL="xxx"Run inference:
python test_api_batch.py o4-miniThe generated results will be saved to:
results/o4-mini/class_1_results.json
results/o4-mini/class_2_results.json
...
results/o4-mini/class_10_results.json
Run evaluation:
python judge.py gpt-5 results/o4-miniThe judged outputs and final summary will be saved to:
results/o4-mini/class_1_judge.json
results/o4-mini/class_2_judge.json
...
results/o4-mini/summary.json
Start the vLLM server:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
--model ../hf_models/Qwen3-VL-235B-A22B-Instruct \
--trust-remote-code \
--port 8008 \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--max-num-seqs 8 \
--gpu-memory-utilization 0.9 \
--served-model-name Qwen3-VL-235B-A22B-InstructRun inference:
python test_vLLM.py Qwen3-VL-235B-A22B-Instruct cot --host 12x.xx.xx.xx --port 8008The generated results will be saved to:
results/Qwen3-VL-235B-A22B-Instruct_cot/class_1_results.json
results/Qwen3-VL-235B-A22B-Instruct_cot/class_2_results.json
...
results/Qwen3-VL-235B-A22B-Instruct_cot/class_10_results.json
Run evaluation:
python judge.py gpt-5 results/Qwen3-VL-235B-A22B-Instruct_cotThe judged outputs and final summary will be saved to:
results/Qwen3-VL-235B-A22B-Instruct_cot/class_1_judge.json
results/Qwen3-VL-235B-A22B-Instruct_cot/class_2_judge.json
...
results/Qwen3-VL-235B-A22B-Instruct_cot/summary.json
VisReason reports accuracy at the class level and uses the average class accuracy as the final score.
For class_1 and class_2, predictions are evaluated as bounding-box localization tasks. The evaluator parses the predicted boxes from the model response, matches them with the ground-truth boxes, and counts a box as correct when its IoU is at least 0.5.
For the remaining classes, predictions are judged by an LLM-based evaluator. The judge compares the model response with the ground-truth answer and outputs whether the prediction is correct.
For each class, accuracy is computed as:
accuracy = number of correct samples / number of evaluated samples
The final score is the unweighted mean over all class accuracies:
average accuracy = mean(class_1_acc, class_2_acc, ..., class_10_acc)
The evaluator writes per-class scores and the final average score to:
results/<model>/summary.json
TODO
If you have any questions, please reach out to:
- Yifan Wang - wangyifan2026@ia.ac.cn