MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Haozhan Shen1,2,
Shilin Yan1†,
Hongwei Xue1‡,
Shuaiqi Lu1,
Xiaojun Tang1,
Guannan Zhang1,
Tiancheng Zhao3‡,
Jianwei Yin2
†Project Leader ‡Corresponding Author
1Accio Team, Alibaba Group 2Zhejiang University 3ZJU-BJ
2026.03.13🌟 We release MM-CondChain, the first benchmark for visually grounded deep compositional reasoning in MLLMs.
We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).
Key features of MM-CondChain:
- Multi-layer compositional reasoning: Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence.
- Programmatic verifiability: We propose a VPIR-based (Verifiable Programmatic Intermediate Representation) agentic synthesis pipeline that ensures each condition is mechanically verifiable.
- Paired hard negatives: The Composer automatically produces paired True-path and False-path instances that differ by exactly one flipped predicate.
- Three visual domains: Natural images, data charts, and GUI trajectories.
- Deterministic evaluation: All instances are formulated as multiple-choice questions with deterministic answers, enabling reproducible evaluation without LLM-as-judge.
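To make the chain-and-flip idea concrete, here is a toy Python sketch. It is not the VPIR pipeline or the Composer: the `facts` dictionary, the predicates, and the helper `first_failing_layer` are all hypothetical. It only illustrates how negating a single predicate changes where a multi-layer chain stops.

```python
from typing import Callable, Dict, List, Optional

Facts = Dict[str, bool]
Condition = Callable[[Facts], bool]


def first_failing_layer(chain: List[Condition], facts: Facts) -> Optional[int]:
    """Return the index of the first layer whose condition is false, or None if all hold."""
    for depth, condition in enumerate(chain):
        if not condition(facts):
            return depth
    return None


# Hypothetical visual facts for one image (illustrative only, not benchmark data).
facts = {"is_occluded": True, "wears_cap": True, "holds_rod": False}

# True path: every layer's compositional condition holds for `facts`.
true_chain: List[Condition] = [
    lambda f: f["is_occluded"] and f["wears_cap"],
    lambda f: f["wears_cap"] or f["holds_rod"],
    lambda f: not f["holds_rod"],
]

# False path: identical chain, except one predicate in layer 1 is flipped.
false_chain = list(true_chain)
false_chain[1] = lambda f: (not f["wears_cap"]) or f["holds_rod"]

print(first_failing_layer(true_chain, facts))   # None
print(first_failing_layer(false_chain, facts))  # 1
```

Because the two chains share every other predicate, a model can only tell them apart by checking the flipped condition against the image.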
Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1 averaged over the three domains, confirming that deep compositional reasoning remains a fundamental challenge.
| Domain | Images/Trajectories | Samples |
|---|---|---|
| Natural | 398 | 796 |
| Chart | 200 | 400 |
| GUI | 377 (3,421 frames) | 754 |
| Total | 975 | 1,950 |
Each image/trajectory yields one conditional chain, compiled into a paired True-path and False-path instance.
MM-CondChain/
├── README.md
├── data/
│   ├── natural.jsonl
│   ├── chart.jsonl
│   └── gui.jsonl
└── images/
    ├── natural/
    │   └── *.jpg
    ├── chart/
    │   └── *.png
    └── gui/
        └── <trajectory_id>/
            └── <trajectory_id>_*.png
Each JSONL file contains samples with the following fields:
{
  "id": "natural_001",
  "domain": "natural",
  "image": "images/natural/sa_24810.jpg",
  "true_path": {
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "F1"
  },
  "false_path": {
    "diverge_node": "qa_1",
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "A1"
  }
}

Note on image paths:
- For the Natural and Chart domains, `image` is a single image path (e.g., `images/natural/sa_24810.jpg`).
- For the GUI domain, `image` is a trajectory folder path (e.g., `images/gui/GENERAL-9532638838594693992`). To load GUI images, list all PNG files in the folder sorted by filename.
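The loading rules above can be sketched in Python as follows. This is a minimal sketch, not the project's loader: the function names are hypothetical, and it assumes the `image` field is relative to the dataset root (the folder containing `data/` and `images/`).

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def load_samples(jsonl_path: Path) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def resolve_images(sample: Dict, dataset_root: Path) -> List[Path]:
    """Return one path for natural/chart samples, or the sorted PNG frames for GUI."""
    target = dataset_root / sample["image"]  # assumption: path is relative to the root
    if sample["domain"] == "gui":
        # GUI samples point at a trajectory folder; order frames by filename.
        return sorted(target.glob("*.png"), key=lambda p: p.name)
    return [target]


def expand_paths(sample: Dict) -> List[Dict]:
    """Split one record into its True-path and False-path evaluation instances."""
    return [
        {
            "id": sample["id"],
            "path": name,
            "instruction": sample[name]["full_instruction"],
            "answer": sample[name]["correct_answer"],
        }
        for name in ("true_path", "false_path")
    ]


# Usage (assumes the dataset has been downloaded to ./MM-CondChain):
# root = Path("MM-CondChain")
# for sample in load_samples(root / "data" / "gui.jsonl"):
#     frames = resolve_images(sample, root)
#     instances = expand_paths(sample)
```

Filename sorting is lexicographic here, so it is worth checking the frame order for a trajectory before relying on it.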
pip install openai tqdm

OpenAI API:
export OPENAI_API_KEY="your-api-key"

Azure OpenAI:
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com"

vLLM: No API key required (or set `--api_key EMPTY`).
We provide an evaluation script that supports OpenAI API, Azure OpenAI, and vLLM-based open-source models.
For Proprietary Models (OpenAI API):
python -m eval.eval \
--api_type openai \
--model gpt-4o \
--domain natural \
--image_root /path/to/mm-condchain/images

For Open-Source Models (vLLM):
We strongly recommend deploying open-source models with vLLM. MM-CondChain's compositional instructions are long and complex, which can lead to lengthy generation sequences. vLLM's continuous batching and efficient memory management handle this much better than naive Transformers inference.
To install vLLM, please refer to the official installation guide.
# Step 1: Start vLLM server
vllm serve Qwen/Qwen3-VL-8B-Instruct \
--host 0.0.0.0 \
--port 8001 \
--tensor-parallel-size 2 \
--served-model-name qwen3-vl-8b-instruct
# Step 2: Run evaluation
python -m eval.eval \
--api_type vllm \
--base_url http://localhost:8001/v1 \
--model qwen3-vl-8b-instruct \
--domain natural \
--image_root /path/to/mm-condchain/images \
--stream

| Argument | Description |
|---|---|
| `--api_type` | API type: `openai`, `azure`, or `vllm` |
| `--model` | Model name (e.g., `gpt-4o`, `qwen3-vl-8b-instruct`) |
| `--domain` | Domain to evaluate: `natural`, `chart`, or `gui` |
| `--image_root` | Root directory for images |
| `--data_path` | (Optional) Path to JSONL file. Auto-inferred from `image_root/../data/{domain}.jsonl` if not provided |
| `--base_url` | vLLM server URL (required for `--api_type vllm`) |
| `--output` | Output JSON path (default: `./results/{model}_{domain}.json`) |
| `--workers` | Number of parallel workers (default: 8) |
| `--resume` | Resume from existing output file |
| `--stream` | Enable streaming (recommended for vLLM) |
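The two path defaults in the table can be illustrated with a short Python sketch. The evaluation script's own code is not shown here, so `infer_data_path` and `default_output_path` are hypothetical helpers that simply follow the documented patterns `image_root/../data/{domain}.jsonl` and `./results/{model}_{domain}.json`.

```python
from pathlib import Path
from typing import Optional


def infer_data_path(image_root: str, domain: str, data_path: Optional[str] = None) -> Path:
    """Use an explicit data path if given, else image_root/../data/{domain}.jsonl."""
    if data_path is not None:
        return Path(data_path)
    return Path(image_root).parent / "data" / f"{domain}.jsonl"


def default_output_path(model: str, domain: str) -> Path:
    """Build the documented default output location ./results/{model}_{domain}.json."""
    return Path("results") / f"{model}_{domain}.json"


print(infer_data_path("/path/to/mm-condchain/images", "natural"))
print(default_output_path("gpt-4o", "natural"))
```

With the layout shown earlier, pointing `--image_root` at the `images/` folder is therefore enough for the JSONL file to be located next to it.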
We report the following metrics:
- True-path Accuracy: Accuracy on True-path instances (all conditions hold)
- False-path Accuracy: Accuracy on False-path hard negatives (one condition flipped)
- Path F1: Harmonic mean of True-path and False-path accuracy
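As a minimal sketch of these definitions, the Python below computes exact-match accuracy and the harmonic mean. The function names and the example predictions are illustrative, not taken from the evaluation script; scores are expressed as percentages.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy in percent over multiple-choice labels."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def path_f1(true_acc: float, false_acc: float) -> float:
    """Harmonic mean of True-path and False-path accuracy."""
    if true_acc + false_acc == 0:
        return 0.0
    return 2 * true_acc * false_acc / (true_acc + false_acc)


# Illustrative labels only (the label format follows the sample record).
print(accuracy(["F1", "A1", "A1", "F1"], ["F1", "A1", "F1", "F1"]))  # 75.0
print(path_f1(60.0, 30.0))  # 40.0
```

The harmonic mean penalizes imbalance: a model that always follows the True path scores high True-path accuracy but near-zero False-path accuracy, and its Path F1 collapses accordingly.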
| Model | Natural F1 | Chart F1 | GUI F1 | Avg F1 |
|---|---|---|---|---|
| Gemini-3-Pro | 55.91 | 66.04 | 38.05 | 53.33 |
| GPT-5-0807 | 47.51 | 65.44 | 38.06 | 50.34 |
| Gemini-3-Flash | 47.19 | 61.96 | 35.78 | 48.31 |
| Qwen3-VL-235B-Thinking | 49.31 | 59.96 | 31.23 | 46.83 |
| Qwen3.5-397B-A17B | 38.97 | 58.55 | 40.19 | 45.90 |
If you find MM-CondChain helpful for your research, please consider citing our work:
@article{shen2026mm,
title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
author={Shen, Haozhan and Yan, Shilin and Xue, Hongwei and Lu, Shuaiqi and Tang, Xiaojun and Zhang, Guannan and Zhao, Tiancheng and Yin, Jianwei},
journal={arXiv preprint arXiv:2603.12266},
year={2026}
}

This dataset is released under the Apache 2.0 License.

