GenIR: Generative Visual Feedback for Mental Image Retrieval

Overview

This is the official repository for the paper GenIR: Generative Visual Feedback for Mental Image Retrieval (NeurIPS 2025). GenIR introduces a novel approach to Mental Image Retrieval (MIR), a realistic search scenario in which users iteratively refine their queries based on a mental image to retrieve the intended image from a database. Unlike traditional one-shot text-to-image retrieval, our method addresses the multi-round, interactive nature of real-world human search behavior.

Key Contributions

New Task Definition: We introduce Mental Image Retrieval (MIR), bridging the gap between benchmark performance and real-world search applications
Generative Visual Feedback: Our GenIR paradigm uses diffusion-based image generation to provide clear, interpretable visual feedback at each interaction round
Automated Dataset Generation: We develop a fully automated pipeline to create high-quality multi-round MIR datasets
Superior Performance: GenIR significantly outperforms existing interactive retrieval methods in MIR scenarios

Method

The Problem

Traditional vision-language models excel at text-to-image retrieval benchmarks but struggle with real-world search scenarios where:

Users search based on mental images (ranging from vague recollections to vivid mental representations)
Search is an iterative, multi-round process rather than one-shot
Users need clear, actionable feedback to refine their queries effectively

Our Solution

GenIR leverages diffusion-based image generation to:

Reify AI understanding at each interaction round through synthetic visual representations
Provide interpretable feedback that users can easily understand and act upon
Enable intuitive query refinement through visual rather than abstract verbal feedback

Architecture

[Mental Image] → [Query] → [GenIR System] → [Visual Feedback + Retrieved Images]
                    ↑                              ↓
                [Refined Query] ← [User Feedback] ← [Next Round]
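The loop in the diagram can be sketched in a few lines of Python. This is a toy illustration only, not the repository's implementation: images are modeled as sets of words, and `run_mir_session`, `toy_generate`, `toy_retrieve`, and `toy_refine` are hypothetical names standing in for the diffusion model, the image retriever, and the (real or simulated) user. It also assumes that retrieval is driven by the generated image, as the evaluation section suggests.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Toy stand-in: a "generated image" is just a set of words.
# None of the names below come from the GenIR codebase.
Feedback = FrozenSet[str]


@dataclass
class RoundResult:
    query: str
    feedback: Feedback
    retrieved: List[str]


def run_mir_session(
    initial_query: str,
    generate: Callable[[str], Feedback],
    retrieve: Callable[[Feedback], List[str]],
    refine: Callable[[str, Feedback], str],
    target_id: str,
    max_rounds: int = 5,
    top_k: int = 1,
) -> List[RoundResult]:
    """Run query -> visual feedback -> retrieval -> refinement rounds."""
    history: List[RoundResult] = []
    query = initial_query
    for _ in range(max_rounds):
        feedback = generate(query)            # visual feedback for this round
        ranked = retrieve(feedback)[:top_k]   # retrieve using the feedback
        history.append(RoundResult(query, feedback, ranked))
        if target_id in ranked:               # stop once the target is found
            break
        query = refine(query, feedback)       # user refines after seeing feedback
    return history


# Hypothetical three-image corpus; the user is thinking of img_b.
CORPUS = {
    "img_a": frozenset({"dog", "beach", "ball"}),
    "img_b": frozenset({"dog", "park", "frisbee"}),
    "img_c": frozenset({"cat", "sofa"}),
}
MENTAL_IMAGE = ["dog", "park", "frisbee"]


def toy_generate(query: str) -> Feedback:
    return frozenset(query.split())


def toy_retrieve(feedback: Feedback) -> List[str]:
    # Rank by word overlap with the feedback, breaking ties by image id.
    return sorted(CORPUS, key=lambda image_id: (-len(CORPUS[image_id] & feedback), image_id))


def toy_refine(query: str, feedback: Feedback) -> str:
    # Add the first detail of the mental image that the feedback is missing.
    for detail in MENTAL_IMAGE:
        if detail not in feedback:
            return f"{query} {detail}"
    return query


history = run_mir_session("dog", toy_generate, toy_retrieve, toy_refine, target_id="img_b")
for round_number, result in enumerate(history, start=1):
    print(round_number, result.query, result.retrieved)
```

In this toy run the vague query "dog" first retrieves `img_a`; after the user adds "park", the second round retrieves `img_b` and the session ends.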

Dataset

MSCOCO Dataset

We use the MSCOCO 2017 Unlabeled images dataset (123K images, 19GB) for our experiments.

Download from: COCO Dataset

After downloading, organize the dataset in the following structure:

data/
└── mscoco/
    └── unlabeled2017/
        └── [image files]

The required JSON files for queries and corpus are provided in the ChatIR folder.
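Before launching an experiment, it can help to confirm that the images landed in the expected folder. The following is a minimal sketch, assuming the layout shown above; `count_images` and `IMAGE_SUFFIXES` are hypothetical helpers and not part of this repository.

```python
from pathlib import Path

# Assumed set of image extensions; adjust if your copy of the dataset differs.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(data_root) -> int:
    """Count image files under <data_root>/mscoco/unlabeled2017."""
    image_dir = Path(data_root) / "mscoco" / "unlabeled2017"
    if not image_dir.is_dir():
        raise FileNotFoundError(f"Expected image folder not found: {image_dir}")
    return sum(
        1
        for path in image_dir.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )

# Example (run from the repository root after downloading):
#   print(count_images("data"))
```

A count far below the roughly 123K images mentioned above would suggest an incomplete download or extraction.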

Our Synthetic Dataset

We have created and uploaded our synthetic dataset generated by our GenIR framework to Hugging Face for reproducibility and easy access.

Access the Dataset

You can download our synthetic dataset from Hugging Face:

# Requires the huggingface_hub package: pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the full dataset snapshot into a local folder
snapshot_download(
    repo_id="dyang39/GenIR",
    repo_type="dataset",
    local_dir="./data/GenIR_dataset_MSCOCO"
)

Direct link: https://huggingface.co/datasets/dyang39/GenIR

Installation

This repository implements our proposed GenIR framework for mental image retrieval using multi-round caption refinement. The framework leverages generative models to improve retrieval performance through iterative refinement of image captions. While we primarily use Stable Diffusion 3.5 (SD3.5) and Gemma-3-4B in our experiments due to their ease of weight sharing, the framework is flexible and supports other generative models.

Clone & Submodule Setup

git clone <repo-url>
cd generative_ir
git submodule update --init --recursive

Note: The Infinity submodule has additional dependencies. If you plan to use the Infinity model, also run:
pip install -r Infinity/requirements.txt

Requirements

pip install -r requirements.txt

Usage

Running Experiments

We provide three different implementations:

GenIR (Ours): Synthetic image feedback using generative models (default: SD3.5 and Gemma-3-4B)

```
python genIR_CaptionImageRefinement.py
```

Baseline 1: Prediction feedback

```
python genIR_CaptionRefinment_VIsualPredictionFeedBack.py
```

Baseline 2: Textual feedback

```
python genIR_CaptionRefinment_TextOnly.py
```

Evaluation

GenIR Evaluation: Evaluate synthetic-image-to-real-image retrieval

```
python ChatIR/eval_img.py
```

Text-only Baseline Evaluation

```
python ChatIR/eval_textonly.py
```

Evaluation Protocol & Replicability

Test Set Size: For each dataset (MSCOCO, FFHQ, Flickr30k, and Clothing-ADC), we randomly sample 2,000 images from their respective validation or test sets.
Search Space: During evaluation, the retrieval system searches against the entire database of each dataset (e.g., evaluating against all 1M+ images for Clothing-ADC). This ensures a realistic and challenging large-scale retrieval scenario.
Compute-Efficient Replication: Empirically, we observed that the Hits@K metrics stabilize after approximately 500 samples, with no statistically significant change thereafter. For those replicating our framework or testing follow-up methods with limited compute, we recommend a minimum of 500 samples.
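The protocol above can be illustrated with a small, self-contained sketch. It assumes the simplest single-round reading of Hits@K (a query counts as a hit if its target image is among the top K results); the accounting in the paper and in `ChatIR/eval_img.py` may differ, for example by accumulating hits across rounds. The helpers `hits_at_k`, `running_hits_at_k`, and `sample_query_ids` are hypothetical and not part of this repository.

```python
import random
from typing import Dict, List, Sequence


def hits_at_k(rankings: Dict[str, Sequence[str]], targets: Dict[str, str], k: int) -> float:
    """Fraction of queries whose target appears in the top-k ranked results."""
    if not rankings:
        raise ValueError("rankings must not be empty")
    hits = sum(1 for qid, ranked in rankings.items() if targets[qid] in ranked[:k])
    return hits / len(rankings)


def running_hits_at_k(
    query_ids: Sequence[str],
    rankings: Dict[str, Sequence[str]],
    targets: Dict[str, str],
    k: int,
) -> List[float]:
    """Hits@K after 1, 2, ..., n samples; useful for checking when it stabilizes."""
    hits = 0
    curve = []
    for n, qid in enumerate(query_ids, start=1):
        hits += targets[qid] in rankings[qid][:k]
        curve.append(hits / n)
    return curve


def sample_query_ids(all_ids: Sequence[str], n: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random subset of query ids (assumed seed of 0)."""
    return random.Random(seed).sample(list(all_ids), n)


# Made-up rankings for four queries over a three-image corpus.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["c", "a", "b"],
    "q3": ["b", "c", "a"],
    "q4": ["a", "c", "b"],
}
targets = {"q1": "a", "q2": "a", "q3": "a", "q4": "b"}

for k in (1, 2, 3):
    print(f"Hits@{k} = {hits_at_k(rankings, targets, k):.2f}")
```

For a reduced-compute run, something like `sample_query_ids(all_ids, 500)` would select the evaluation subset, and plotting the output of `running_hits_at_k` shows whether the metric has flattened by that point.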

Results

GenIR demonstrates significant improvements over existing interactive retrieval methods in Mental Image Retrieval scenarios. Detailed experimental results and comparisons are available in our paper.

Acknowledgments

This work builds upon several previous works:

ChatIR - For the base implementation and evaluation framework
BLIP - For the vision-language model components
Infinity - For the generative model implementations
FLUX.1-dev - For the text-to-image generation capabilities

This is an academic research project. Feel free to use this code for research purposes.

Contact

For questions or discussions, please contact:

Diji Yang: dyang39@ucsc.edu
Minghao Liu: mliu40@ucsc.edu

Citation

If you find this work useful for your research, please cite:

@article{yang2025genir,
  title={GenIR: Generative Visual Feedback for Mental Image Retrieval},
  author={Yang, Diji and Liu, Minghao and Lo, Chung-Hsiang and Zhang, Yi and Davis, James},
  journal={arXiv preprint arXiv:2506.06220},
  year={2025}
}
