Skip to content

Accio-Lab/MM-CondChain

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 

Repository files navigation


MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen1,2, Shilin Yan1†, Hongwei Xue1‡, Shuaiqi Lu1, Xiaojun Tang1,
Guannan Zhang1, Tiancheng Zhao3‡, Jianwei Yin2

Project Leader Corresponding Author

1Accio Team, Alibaba Group 2Zhejiang University 3ZJU-BJ


🔥 News

  • 2026.03.13 🌟 We release MM-CondChain, the first benchmark for visually grounded deep compositional reasoning in MLLMs.

👀 MM-CondChain Overview

We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).

Key features of MM-CondChain:

  • Multi-layer compositional reasoning: Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence.
  • Programmatic verifiability: We propose a VPIR-based (Verifiable Programmatic Intermediate Representation) agentic synthesis pipeline that ensures each condition is mechanically verifiable.
  • Paired hard negatives: The Composer automatically produces paired True-path and False-path instances, where they differ by exactly one flipped predicate.
  • Three visual domains: Natural images, data charts, and GUI trajectories.
  • Deterministic evaluation: All instances are formulated as multiple-choice questions with deterministic answers, enabling reproducible evaluation without LLM-as-judge.

Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, confirming that deep compositional reasoning remains a fundamental challenge.

📊 Dataset Statistics

Domain Images/Trajectories Samples
Natural 398 796
Chart 200 400
GUI 377 (3,421 frames) 754
Total 975 1,950

Each image/trajectory yields one conditional chain, compiled into a paired True-path and False-path instance.

📁 Dataset Structure

MM-CondChain/
├── README.md
├── data/
│   ├── natural.jsonl
│   ├── chart.jsonl
│   └── gui.jsonl
└── images/
    ├── natural/
    │   └── *.jpg
    ├── chart/
    │   └── *.png
    └── gui/
        └── <trajectory_id>/
            └── <trajectory_id>_*.png

Each JSONL file contains samples with the following fields:

{
  "id": "natural_001",
  "domain": "natural",
  "image": "images/natural/sa_24810.jpg",
  "true_path": {
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "F1"
  },
  "false_path": {
    "diverge_node": "qa_1",
    "full_instruction": "If the fisherman wearing a baseball cap is ...",
    "pseudocode": "# the fisherman wearing a baseball cap\nif (is_occluded and ...) ...",
    "correct_answer": "A1"
  }
}

Note on image paths:

  • For Natural and Chart domains, image is a single image path (e.g., images/natural/sa_24810.jpg).
  • For GUI domain, image is a trajectory folder path (e.g., images/gui/GENERAL-9532638838594693992). To load GUI images, list all PNG files in the folder sorted by filename.

🚀 Evaluation

Installation

pip install openai tqdm

Setup

OpenAI API:

export OPENAI_API_KEY="your-api-key"

Azure OpenAI:

export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com"

vLLM: No API key required (or set --api_key EMPTY).

Quick Start

We provide an evaluation script that supports OpenAI API, Azure OpenAI, and vLLM-based open-source models.

For Proprietary Models (OpenAI API):

python -m eval.eval \
    --api_type openai \
    --model gpt-4o \
    --domain natural \
    --image_root /path/to/mm-condchain/images

For Open-Source Models (vLLM):

We strongly recommend deploying open-source models with vLLM. MM-CondChain's compositional instructions are long and complex, which can lead to lengthy generation sequences. vLLM's continuous batching and efficient memory management handle this much better than naive Transformers inference.

To install vLLM, please refer to the official installation guide.

# Step 1: Start vLLM server
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --host 0.0.0.0 \
    --port 8001 \
    --tensor-parallel-size 2 \
    --served-model-name qwen3-vl-8b-instruct

# Step 2: Run evaluation
python -m eval.eval \
    --api_type vllm \
    --base_url http://localhost:8001/v1 \
    --model qwen3-vl-8b-instruct \
    --domain natural \
    --image_root /path/to/mm-condchain/images \
    --stream

CLI Arguments

Argument Description
--api_type API type: openai, azure, or vllm
--model Model name (e.g., gpt-4o, qwen3-vl-8b-instruct)
--domain Domain to evaluate: natural, chart, or gui
--image_root Root directory for images
--data_path (Optional) Path to JSONL file. Auto-inferred from image_root/../data/{domain}.jsonl if not provided
--base_url vLLM server URL (required for --api_type vllm)
--output Output JSON path (default: ./results/{model}_{domain}.json)
--workers Number of parallel workers (default: 8)
--resume Resume from existing output file
--stream Enable streaming (recommended for vLLM)

Metrics

We report the following metrics:

  • True-path Accuracy: Accuracy on True-path instances (all conditions hold)
  • False-path Accuracy: Accuracy on False-path hard negatives (one condition flipped)
  • Path F1: Harmonic mean of True-path and False-path accuracy

📈 Experimental Results

Model Natural F1 Chart F1 GUI F1 Avg F1
Gemini-3-Pro 55.91 66.04 38.05 53.33
GPT-5-0807 47.51 65.44 38.06 50.34
Gemini-3-Flash 47.19 61.96 35.78 48.31
Qwen3-VL-235B-Thinking 49.31 59.96 31.23 46.83
Qwen3.5-397B-A17B 38.97 58.55 40.19 45.90

📖 Citation

If you find MM-CondChain helpful for your research, please consider citing our work:

@article{shen2026mm,
  title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
  author={Shen, Haozhan and Yan, Shilin and Xue, Hongwei and Lu, Shuaiqi and Tang, Xiaojun and Zhang, Guannan and Zhao, Tiancheng and Yin, Jianwei},
  journal={arXiv preprint arXiv:2603.12266},
  year={2026}
}

📜 License

This dataset is released under the Apache 2.0 License.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages