MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen^1,2, Shilin Yan^1†, Hongwei Xue^1‡, Shuaiqi Lu¹, Xiaojun Tang¹,
Guannan Zhang¹, Tiancheng Zhao^3‡, Jianwei Yin²

^†Project Leader ^‡Corresponding Author

¹Accio Team, Alibaba Group ²Zhejiang University ³ZJU-BJ

[🏠 Project Page] [📖 arXiv Paper] [💻 GitHub] [🏆 Leaderboard] [🤗 Dataset]

🔥 News

2026.03.13 🌟 We release MM-CondChain, the first benchmark for visually grounded deep compositional reasoning in MLLMs.

👀 MM-CondChain Overview

We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).

Key features of MM-CondChain:

Multi-layer compositional reasoning: Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence.
Programmatic verifiability: We propose a VPIR-based (Verifiable Programmatic Intermediate Representation) agentic synthesis pipeline that ensures each condition is mechanically verifiable.
Paired hard negatives: The Composer automatically produces paired True-path and False-path instances, where they differ by exactly one flipped predicate.
Three visual domains: Natural images, data charts, and GUI trajectories.
Deterministic evaluation: All instances are formulated as multiple-choice questions with deterministic answers, enabling reproducible evaluation without LLM-as-judge.

Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, confirming that deep compositional reasoning remains a fundamental challenge.

📊 Dataset Statistics

Domain	Images/Trajectories	Samples
Natural	398	796
Chart	200	400
GUI	377 (3,421 frames)	754
Total	975	1,950

Each image/trajectory yields one conditional chain, compiled into a paired True-path and False-path instance.

📁 Dataset Structure

MM-CondChain/
├── README.md
├── data/
│   ├── natural.jsonl
│   ├── chart.jsonl
│   └── gui.jsonl
└── images/
    ├── natural/
    │   └── *.jpg
    ├── chart/
    │   └── *.png
    └── gui/
        └── <trajectory_id>/
            └── <trajectory_id>_*.png

Each JSONL file contains samples with the following fields:

{
  "id": "natural_001",
  "domain": "natural",
  "image": "images/natural/sa_24810.jpg",
  "true_path": {
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "F1"
  },
  "false_path": {
    "diverge_node": "qa_1",
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "A1"
  }
}

Note on image paths:

For Natural and Chart domains, image is a single image path (e.g., images/natural/sa_24810.jpg).
For GUI domain, image is a trajectory folder path (e.g., images/gui/GENERAL-9532638838594693992). To load GUI images, list all PNG files in the folder sorted by filename.

🚀 Evaluation

Installation

pip install openai tqdm

Setup

OpenAI API:

export OPENAI_API_KEY="your-api-key"

Azure OpenAI:

export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com"

vLLM: No API key required (or set --api_key EMPTY).

Quick Start

We provide an evaluation script that supports OpenAI API, Azure OpenAI, and vLLM-based open-source models.

For Proprietary Models (OpenAI API):

python -m eval.eval \
    --api_type openai \
    --model gpt-4o \
    --domain natural \
    --image_root /path/to/mm-condchain/images

For Open-Source Models (vLLM):

We strongly recommend deploying open-source models with vLLM. MM-CondChain's compositional instructions are long and complex, which can lead to lengthy generation sequences. vLLM's continuous batching and efficient memory management handle this much better than naive Transformers inference.

To install vLLM, please refer to the official installation guide.

# Step 1: Start vLLM server
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --host 0.0.0.0 \
    --port 8001 \
    --tensor-parallel-size 2 \
    --served-model-name qwen3-vl-8b-instruct

# Step 2: Run evaluation
python -m eval.eval \
    --api_type vllm \
    --base_url http://localhost:8001/v1 \
    --model qwen3-vl-8b-instruct \
    --domain natural \
    --image_root /path/to/mm-condchain/images \
    --stream

CLI Arguments

Argument	Description
`--api_type`	API type: `openai`, `azure`, or `vllm`
`--model`	Model name (e.g., `gpt-4o`, `qwen3-vl-8b-instruct`)
`--domain`	Domain to evaluate: `natural`, `chart`, or `gui`
`--image_root`	Root directory for images
`--data_path`	(Optional) Path to JSONL file. Auto-inferred from `image_root/../data/{domain}.jsonl` if not provided
`--base_url`	vLLM server URL (required for `--api_type vllm`)
`--output`	Output JSON path (default: `./results/{model}_{domain}.json`)
`--workers`	Number of parallel workers (default: 8)
`--resume`	Resume from existing output file
`--stream`	Enable streaming (recommended for vLLM)

Metrics

We report the following metrics:

True-path Accuracy: Accuracy on True-path instances (all conditions hold)
False-path Accuracy: Accuracy on False-path hard negatives (one condition flipped)
Path F1: Harmonic mean of True-path and False-path accuracy

📈 Experimental Results

Model	Natural F1	Chart F1	GUI F1	Avg F1
Gemini-3-Pro	55.91	66.04	38.05	53.33
GPT-5-0807	47.51	65.44	38.06	50.34
Gemini-3-Flash	47.19	61.96	35.78	48.31
Qwen3-VL-235B-Thinking	49.31	59.96	31.23	46.83
Qwen3.5-397B-A17B	38.97	58.55	40.19	45.90

📖 Citation

If you find MM-CondChain helpful for your research, please consider citing our work:

@article{shen2026mm,
  title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
  author={Shen, Haozhan and Yan, Shilin and Xue, Hongwei and Lu, Shuaiqi and Tang, Xiaojun and Zhang, Guannan and Zhao, Tiancheng and Yin, Jianwei},
  journal={arXiv preprint arXiv:2603.12266},
  year={2026}
}

📜 License

This dataset is released under the Apache 2.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
assets		assets
eval		eval
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

🔥 News

👀 MM-CondChain Overview

📊 Dataset Statistics

📁 Dataset Structure

🚀 Evaluation

Installation

Setup

Quick Start

CLI Arguments

Metrics

📈 Experimental Results

📖 Citation

📜 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

🔥 News

👀 MM-CondChain Overview

📊 Dataset Statistics

📁 Dataset Structure

🚀 Evaluation

Installation

Setup

Quick Start

CLI Arguments

Metrics

📈 Experimental Results

📖 Citation

📜 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages