This is the official repository for the paper GenIR: Generative Visual Feedback for Mental Image Retrieval (NeurIPS 2025). GenIR introduces a novel approach to Mental Image Retrieval (MIR), a realistic search scenario in which users iteratively refine their queries based on mental images to retrieve intended images from a database. Unlike traditional one-shot text-to-image retrieval, our method addresses the multi-round, interactive nature of real-world human search behavior.
- New Task Definition: We introduce Mental Image Retrieval (MIR), bridging the gap between benchmark performance and real-world search applications
- Generative Visual Feedback: Our GenIR paradigm uses diffusion-based image generation to provide clear, interpretable visual feedback at each interaction round
- Automated Dataset Generation: We develop a fully automated pipeline to create high-quality multi-round MIR datasets
- Superior Performance: GenIR significantly outperforms existing interactive retrieval methods in MIR scenarios
Traditional vision-language models excel at text-to-image retrieval benchmarks but struggle with real-world search scenarios where:
- Users search based on mental images (ranging from vague recollections to vivid mental representations)
- Search is an iterative, multi-round process rather than one-shot
- Users need clear, actionable feedback to refine their queries effectively
GenIR leverages diffusion-based image generation to:
- Reify AI understanding at each interaction round through synthetic visual representations
- Provide interpretable feedback that users can easily understand and act upon
- Enable intuitive query refinement through visual rather than abstract verbal feedback
[Mental Image] → [Query] → [GenIR System] → [Visual Feedback + Retrieved Images]
                    ↑                                  ↓
             [Refined Query] ← [User Feedback] ← [Next Round]
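The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: `generate_image`, `retrieve`, and `refine_query` are hypothetical callables standing in for the diffusion model, the image retriever, and the (real or simulated) user, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple


def mental_image_retrieval(
    initial_query: str,
    generate_image: Callable[[str], object],
    retrieve: Callable[[object], List[str]],
    refine_query: Callable[[str, object, List[str]], str],
    target_id: str,
    max_rounds: int = 5,
    top_k: int = 10,
) -> Tuple[int, List[str]]:
    """Run query -> generated image -> retrieval -> refinement for up to max_rounds.

    Returns (0-based round in which the target entered the top-k, or -1 if it
    never did; the list of queries issued so far).
    """
    query = initial_query
    history: List[str] = []
    for round_idx in range(max_rounds):
        history.append(query)
        feedback_image = generate_image(query)   # visual feedback shown to the user
        ranked_ids = retrieve(feedback_image)    # retrieval driven by the generated image
        if target_id in ranked_ids[:top_k]:
            return round_idx, history
        # The user compares the feedback with their mental image and rewrites the query.
        query = refine_query(query, feedback_image, ranked_ids)
    return -1, history


# Toy stand-ins (hypothetical): the "image" is just the lowercased query text.
def toy_generate(query: str) -> str:
    return query.lower()


def toy_retrieve(image: str) -> List[str]:
    return ["img_dog"] if "dog" in image else ["img_cat"]


def toy_refine(query: str, image: str, ranked_ids: List[str]) -> str:
    return query + " with a dog"


found_round, queries = mental_image_retrieval(
    "a park", toy_generate, toy_retrieve, toy_refine, target_id="img_dog"
)
```

Passing the three components as callables keeps the loop independent of any particular generator or retriever, which matches the framework's stated flexibility with respect to the generative model.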
We use the MSCOCO 2017 Unlabeled images dataset (123K images, 19GB) for our experiments.
- Download from: COCO Dataset
- After downloading, organize the dataset in the following structure:
data/
└── mscoco/
    └── unlabeled2017/
        └── [image files]
- The required JSON files for queries and corpus are provided in the ChatIR folder
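Before running anything, it can help to confirm that the images landed in the expected place. The helper below is a small sketch for that purpose; `count_images` and the set of accepted file suffixes are assumptions made for illustration and are not part of the repository.

```python
from pathlib import Path

# Assumed image extensions; adjust if the dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(image_dir) -> int:
    """Return the number of image files directly inside image_dir.

    Raises FileNotFoundError if the directory does not exist, which usually
    means the dataset was unpacked into a different location.
    """
    image_dir = Path(image_dir)
    if not image_dir.is_dir():
        raise FileNotFoundError(f"Expected image directory not found: {image_dir}")
    return sum(
        1
        for path in image_dir.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `count_images("data/mscoco/unlabeled2017")` from the repository root should report roughly 123K files if the download and extraction completed.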
Our synthetic dataset, generated with the GenIR framework, is available on Hugging Face for reproducibility and easy access. To download it:
# Using huggingface_hub
pip install huggingface_hub

from huggingface_hub import snapshot_download

# Download the dataset
snapshot_download(
    repo_id="dyang39/GenIR",
    repo_type="dataset",
    local_dir="./data/GenIR_dataset_MSCOCO"
)

Direct link: https://huggingface.co/datasets/dyang39/GenIR
This repository implements our proposed GenIR framework for mental image retrieval using multi-round caption refinement. The framework leverages generative models to improve retrieval performance through iterative refinement of image captions. While we primarily use Stable Diffusion 3.5 (SD3.5) and Gemma-3-4B in our experiments due to their ease of weight sharing, the framework is flexible and supports other generative models.
git clone <repo-url>
cd generative_ir
git submodule update --init --recursive

Note: The Infinity submodule has additional dependencies. If you plan to use the Infinity model, also run:
pip install -r Infinity/requirements.txt

pip install -r requirements.txt

We provide three different implementations:
- GenIR (Ours): Fake image feedback using generative models (default: SD3.5 and Gemma-3-4B)
  python genIR_CaptionImageRefinement.py
- Baseline 1: Prediction feedback
  python genIR_CaptionRefinment_VIsualPredictionFeedBack.py
- Baseline 2: Textual feedback
  python genIR_CaptionRefinment_TextOnly.py

Evaluation scripts:
- GenIR Evaluation: Evaluate fake image to real image retrieval
  python ChatIR/eval_img.py
- Text-only Baseline Evaluation
  python ChatIR/eval_textonly.py
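Conceptually, "fake image to real image retrieval" is a nearest-neighbor search: the generated image is embedded and the real images in the corpus are ranked by similarity to it. The sketch below illustrates that idea with cosine similarity over precomputed embeddings. It is an assumption-laden illustration, not the code in `ChatIR/eval_img.py`: the function names are hypothetical, and the embeddings are assumed to come from some image encoder (for example a BLIP-style model) that is not shown here.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_corpus(
    query_emb: Sequence[float], corpus_embs: Dict[str, Sequence[float]]
) -> List[str]:
    """Return corpus image ids ordered from most to least similar to the query."""
    return sorted(
        corpus_embs,
        key=lambda image_id: cosine_similarity(query_emb, corpus_embs[image_id]),
        reverse=True,
    )


def target_rank(
    query_emb: Sequence[float],
    corpus_embs: Dict[str, Sequence[float]],
    target_id: str,
) -> int:
    """0-based position of the target image in the ranked corpus."""
    return rank_corpus(query_emb, corpus_embs).index(target_id)
```

A pure-Python loop like this is only practical for small corpora; at the scale described in the evaluation notes, a vectorized or approximate nearest-neighbor search would be the natural choice.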
- Test Set Size: For each dataset (MSCOCO, FFHQ, Flickr30k, and Clothing-ADC), we randomly sample 2,000 images from their respective validation or test sets.
- Search Space: During evaluation, the retrieval system searches against the entire database of each dataset (e.g., evaluating against all 1M+ images for Clothing-ADC). This ensures a realistic and challenging large-scale retrieval scenario.
- Compute-Efficient Replication: Empirically, we observed that the Hits@K metrics stabilize after approximately 500 samples, with no significant statistical changes thereafter. For those looking to replicate our framework or test follow-up methods with limited compute, a minimum of 500 samples is recommended.
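To make the Hits@K discussion concrete, here is a small sketch of the metric and of a subsampling check for when it stabilizes. It assumes each query's result is summarized by the 0-based rank of its target image; the function names are illustrative and are not taken from the evaluation scripts. It computes single-round Hits@K only; how hits are aggregated across rounds is not specified in this README.

```python
import random
from typing import List, Sequence


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose target image has a 0-based rank below k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for rank in ranks if rank < k) / len(ranks)


def hits_at_k_by_sample_size(
    ranks: Sequence[int], k: int, sizes: Sequence[int], seed: int = 0
) -> List[float]:
    """Hits@K on nested random subsamples of increasing size.

    The ranks are shuffled once with a fixed seed, then the first n entries
    are scored for each n in sizes, so larger samples contain smaller ones.
    """
    rng = random.Random(seed)
    shuffled = list(ranks)
    rng.shuffle(shuffled)
    return [hits_at_k(shuffled[:n], k) for n in sizes]
```

For a 2,000-query test set, calling `hits_at_k_by_sample_size(ranks, k=10, sizes=[100, 250, 500, 1000, 2000])` and comparing consecutive values is one simple way to see whether the metric has settled by about 500 samples on your own data.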
GenIR demonstrates significant improvements over existing interactive retrieval methods in Mental Image Retrieval scenarios. Detailed experimental results and comparisons are available in our paper.
This work builds upon several previous works:
- ChatIR - For the base implementation and evaluation framework
- BLIP - For the vision-language model components
- Infinity - For the generative model implementations
- FLUX.1-dev - For the text-to-image generation capabilities
This is an academic research project. Feel free to use this code for research purposes.
For questions or discussions, please contact:
- Diji Yang: dyang39@ucsc.edu
- Minghao Liu: mliu40@ucsc.edu
If you find this work useful for your research, please cite:
@article{yang2025genir,
  title={GenIR: Generative Visual Feedback for Mental Image Retrieval},
  author={Yang, Diji and Liu, Minghao and Lo, Chung-Hsiang and Zhang, Yi and Davis, James},
  journal={arXiv preprint arXiv:2506.06220},
  year={2025}
}