mikelmh025/generative_ir

GenIR: Generative Visual Feedback for Mental Image Retrieval

arXiv Project Page NeurIPS

Overview

This is the official repository for the paper GenIR: Generative Visual Feedback for Mental Image Retrieval (NeurIPS 2025). GenIR addresses Mental Image Retrieval (MIR), a realistic search scenario in which users iteratively refine their queries, guided by a mental image, to retrieve the intended image from a database. Unlike traditional one-shot text-to-image retrieval, the method targets the multi-round, interactive nature of real-world search.

Key Contributions

  • New Task Definition: We introduce Mental Image Retrieval (MIR), bridging the gap between benchmark performance and real-world search applications
  • Generative Visual Feedback: Our GenIR paradigm uses diffusion-based image generation to provide clear, interpretable visual feedback at each interaction round
  • Automated Dataset Generation: We develop a fully automated pipeline to create high-quality multi-round MIR datasets
  • Superior Performance: GenIR significantly outperforms existing interactive retrieval methods in MIR scenarios

Method

The Problem

Traditional vision-language models excel at text-to-image retrieval benchmarks but struggle with real-world search scenarios where:

  • Users search based on mental images (ranging from vague recollections to vivid mental representations)
  • Search is an iterative, multi-round process rather than one-shot
  • Users need clear, actionable feedback to refine their queries effectively

Our Solution

GenIR leverages diffusion-based image generation to:

  1. Reify the system's current understanding of the query at each interaction round as a synthetic image
  2. Provide interpretable feedback that users can easily understand and act upon
  3. Enable intuitive query refinement through visual rather than abstract verbal feedback

Architecture

[Mental Image] → [Query] → [GenIR System] → [Visual Feedback + Retrieved Images]
                    ↑                              ↓
                [Refined Query] ← [User Feedback] ← [Next Round]
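
The loop in the diagram can be sketched in a few lines. This is a toy illustration under stated assumptions, not the repository's implementation: `run_session`, `toy_generate`, and `toy_refine` are hypothetical names, the 2-D vectors stand in for image embeddings, and the "generated image" is reduced to a vector so the example runs without any model. In the real system a diffusion model would produce the feedback image and a user (or a language model) would refine the query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_database(query_vec, database):
    """Return database ids ordered from most to least similar to query_vec."""
    return sorted(database, key=lambda img_id: cosine(query_vec, database[img_id]), reverse=True)

def run_session(query, target_id, database, generate, refine, max_rounds=5, top_k=1):
    """Run rounds of generate -> retrieve -> refine until the target is in the top-k."""
    history = []
    for round_idx in range(max_rounds):
        feedback = generate(query)                  # stand-in for the generated image
        top = rank_database(feedback, database)[:top_k]
        history.append((query, top))
        if target_id in top:
            break
        query = refine(query, feedback, round_idx)  # stand-in for the user's refinement
    return history

# Toy stand-ins (assumptions): 2-D "embeddings" instead of real images.
VOCAB = {"dog": (1.0, 0.0), "beach": (0.0, 1.0)}
DATABASE = {"img_dog": (1.0, 0.1), "img_dog_beach": (1.0, 1.0), "img_beach": (0.1, 1.0)}
DETAILS = ["beach"]  # what the simulated user adds after seeing the feedback

def toy_generate(query):
    """Sum the vectors of known words; unknown words are ignored."""
    vecs = [VOCAB[w] for w in query.split() if w in VOCAB]
    return tuple(sum(axis) for axis in zip(*vecs)) if vecs else (0.0, 0.0)

def toy_refine(query, feedback, round_idx):
    """Append the next remembered detail, if any is left."""
    return f"{query} {DETAILS[round_idx]}" if round_idx < len(DETAILS) else query

history = run_session("dog", "img_dog_beach", DATABASE, toy_generate, toy_refine)
for query, top in history:
    print(query, "->", top)
```

In this toy run the first query ("dog") retrieves `img_dog`; after one refinement ("dog beach") the target `img_dog_beach` is ranked first and the session stops.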

Dataset

MSCOCO Dataset

We use the MSCOCO 2017 Unlabeled images dataset (123K images, 19GB) for our experiments.

  • Download from: COCO Dataset
  • After downloading, organize the dataset in the following structure:
    data/
    └── mscoco/
        └── unlabeled2017/
            └── [image files]
    
  • The required JSON files for queries and corpus are provided in the ChatIR folder
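
Before running experiments, it can help to confirm that the images landed where the layout above expects them. The helper below is a minimal sketch: `count_images` is a hypothetical name, not part of this repository, and the set of accepted file suffixes is an assumption.

```python
from pathlib import Path

# Assumed image suffixes; adjust if your copy of the dataset differs.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def count_images(root="data/mscoco/unlabeled2017"):
    """Count image files directly under `root`; raise if the folder is missing."""
    folder = Path(root)
    if not folder.is_dir():
        raise FileNotFoundError(f"Expected image folder at {folder}")
    return sum(
        1 for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `count_images()` from the repository root should report a number close to the 123K images mentioned above if the download and extraction completed.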

Our Synthetic Dataset

The synthetic dataset generated with the GenIR framework is available on Hugging Face for reproducibility and easy access.

Access the Dataset

You can download our synthetic dataset from Hugging Face:

# Install the client first (run in a shell, not in Python):
#   pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the dataset
snapshot_download(
    repo_id="dyang39/GenIR",
    repo_type="dataset",
    local_dir="./data/GenIR_dataset_MSCOCO"
)

Direct link: https://huggingface.co/datasets/dyang39/GenIR

Installation

This repository implements the GenIR framework for mental image retrieval, in which generative models improve retrieval performance through multi-round refinement of image captions. While our experiments primarily use Stable Diffusion 3.5 (SD3.5) and Gemma-3-4B because their weights are easy to obtain and share, the framework is flexible and supports other generative models.

Clone & Submodule Setup

git clone <repo-url>
cd generative_ir
git submodule update --init --recursive

Note: The Infinity submodule has additional dependencies. If you plan to use the Infinity model, also run:

pip install -r Infinity/requirements.txt

Requirements

pip install -r requirements.txt

Usage

Running Experiments

We provide three different implementations:

  • GenIR (Ours): Synthetic image feedback using generative models (default: SD3.5 and Gemma-3-4B)

    python genIR_CaptionImageRefinement.py
  • Baseline 1: Prediction feedback

    python genIR_CaptionRefinment_VIsualPredictionFeedBack.py 
  • Baseline 2: Textual feedback

    python genIR_CaptionRefinment_TextOnly.py 

Evaluation

  • GenIR Evaluation: Evaluate retrieval of real images from generated (synthetic) query images

    python ChatIR/eval_img.py 
  • Text-only Baseline Evaluation

    python ChatIR/eval_textonly.py 

Evaluation Protocol & Replicability

  • Test Set Size: For each dataset (MSCOCO, FFHQ, Flickr30k, and Clothing-ADC), we randomly sample 2,000 images from its validation or test set.
  • Search Space: During evaluation, the retrieval system searches the entire database of each dataset (e.g., all 1M+ images for Clothing-ADC). This ensures a realistic and challenging large-scale retrieval scenario.
  • Compute-Efficient Replication: Empirically, the Hits@K metrics stabilize after approximately 500 samples, with no statistically significant change thereafter. For those replicating our framework or testing follow-up methods with limited compute, we recommend using at least 500 samples.
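
For readers unfamiliar with the metric, the sketch below shows one common way to compute Hits@K and to draw a reproducible 500-query subset. It is an illustration under assumptions, not the code in `ChatIR/eval_img.py`: `hits_at_k` and `sample_queries` are hypothetical names, and the repository's evaluation scripts may define the metric differently (for instance, cumulatively across interaction rounds).

```python
import random

def hits_at_k(rankings, targets, k):
    """Fraction of queries whose target id appears in the top-k of its ranking."""
    if not rankings:
        return 0.0
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(rankings)

def sample_queries(query_ids, n=500, seed=0):
    """Draw a reproducible random subset of query ids (all of them if there are n or fewer)."""
    ids = list(query_ids)
    if len(ids) <= n:
        return ids
    return random.Random(seed).sample(ids, n)

# Toy example: four queries, each with a ranked list of database ids.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"], ["a", "c", "b"]]
targets = ["a", "a", "a", "b"]
print(hits_at_k(rankings, targets, 1))  # 0.25
print(hits_at_k(rankings, targets, 2))  # 0.5
```

Fixing the seed in `sample_queries` keeps the subset identical across runs, so results from different methods are compared on the same queries.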

Results

GenIR demonstrates significant improvements over existing interactive retrieval methods in Mental Image Retrieval scenarios. Detailed experimental results and comparisons are available in our paper.

Acknowledgments

This work builds upon several previous works:

  • ChatIR - For the base implementation and evaluation framework
  • BLIP - For the vision-language model components
  • Infinity - For the generative model implementations
  • FLUX.1-dev - For the text-to-image generation capabilities

This is an academic research project. Feel free to use this code for research purposes.

Contact

For questions or discussions, please contact:

Citation

If you find this work useful for your research, please cite:

@article{yang2025genir,
  title={GenIR: Generative Visual Feedback for Mental Image Retrieval},
  author={Yang, Diji and Liu, Minghao and Lo, Chung-Hsiang and Zhang, Yi and Davis, James},
  journal={arXiv preprint arXiv:2506.06220},
  year={2025}
}
