The above logo is generated by OpenAI
AGIC (Attention-Guided Image Captioning) is a framework designed to improve the relevance of image captions by leveraging a contextual relevance amplification mechanism, implemented through an attention-guided process. Inspired by recent research (Liu et al., 2025), AGIC uses attention patterns from vision transformers to amplify relevant image regions and generate more accurate and meaningful captions.
AGIC works in three main stages:
- Attention Weights Extraction:
  - Extracts attention maps from a pre-trained vision transformer to identify the most relevant image regions.
- Image Amplification:
  - Amplifies the original image features using the extracted attention weights, making relevant regions more prominent.
- Caption Generation:
  - Generates captions for the amplified image using a hybrid decoding strategy (beam search, Top-k, Top-p sampling, temperature scaling) to enhance diversity and fluency.
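The first two stages can be illustrated with a small, dependency-free sketch. It is not AGIC's actual code: the function names, the min-max normalization, the nearest-neighbour upsampling, and the rule `pixel * (1 + k * a)` with a hypothetical amplification factor `k` are all assumptions chosen for clarity. The scripts in AGIC/ define the real computation.

```python
from typing import List

Grid = List[List[float]]


def normalize(attn: Grid) -> Grid:
    """Min-max scale a patch-level attention map to the range [0, 1]."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Uniform attention carries no preference, so amplify nothing.
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]


def upsample(attn: Grid, height: int, width: int) -> Grid:
    """Nearest-neighbour upsampling from the patch grid to pixel resolution."""
    rows, cols = len(attn), len(attn[0])
    return [
        [attn[y * rows // height][x * cols // width] for x in range(width)]
        for y in range(height)
    ]


def amplify(image: Grid, attn: Grid, k: float = 1.0) -> Grid:
    """Scale each pixel by (1 + k * a), clipped to the 8-bit range.

    `k` is a hypothetical amplification factor; a = 0 leaves a pixel unchanged.
    """
    height, width = len(image), len(image[0])
    weights = upsample(normalize(attn), height, width)
    return [
        [min(255.0, image[y][x] * (1.0 + k * weights[y][x])) for x in range(width)]
        for y in range(height)
    ]


# A 4x4 grayscale image and a 2x2 attention map (one weight per patch).
image = [[100.0] * 4 for _ in range(4)]
attn = [[0.1, 0.9], [0.1, 0.1]]
amplified = amplify(image, attn, k=0.5)
```

In this toy example only the top-right quadrant, which receives the highest attention, is brightened; the remaining pixels are left unchanged.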
- Flickr8k (HuggingFace link)
- Flickr30k (HuggingFace link)
These datasets provide images and multiple human-annotated captions for benchmarking image captioning models.
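Because each image comes with several reference captions, evaluation code typically groups them per image. The sketch below assumes caption rows arrive as (image file name, caption) pairs and that the ground-truth file maps each image name to a list of captions. Both the row layout and the JSON layout are assumptions, so check the format the AGIC scripts actually expect before reusing it.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_captions(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect every reference caption under its image file name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for image_name, caption in rows:
        caption = caption.strip()
        if caption:  # skip empty annotations
            grouped[image_name].append(caption)
    return dict(grouped)


# Hypothetical rows for illustration only.
rows = [
    ("img_001.jpg", "A dog runs across a field."),
    ("img_001.jpg", "A brown dog is running on grass. "),
    ("img_002.jpg", "Two children play on a beach."),
]
gts = group_captions(rows)
gts_json = json.dumps(gts, indent=2)  # text that could be written to a JSON file
```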
All scripts are located in the AGIC/ directory. Each script can be run from the command line. You may need to adjust the default paths for your dataset and ground truth files.
Runs AGIC using the BLIP-2 model.
python AGIC/agic_blip2.py
- Default paths:
  - Images: /home/ubuntu/flickr8k/Images/
  - Ground truth: /content/gts.json
  - Output: /content/agic_8k.json
- To change paths, edit the variables in the main() function or pass them as arguments if you add argparse support.
Runs AGIC using the LLaVA model.
python AGIC/agic_llava.py
- Default paths:
  - Images: /home/ubuntu/flickr8k/Images/
  - Ground truth: /content/gts.json
  - Output: /content/agic_llava.json
Runs AGIC with ablation studies (different amplification factors and decoding strategies).
python AGIC/agic_ablation.py
- Default paths:
  - Images: /home/ubuntu/flickr8k/Images/
  - Ground truth: /content/gts.json
  - Output: /content/agic_ablation.json
Runs zero-shot captioning with BLIP-2, LLaVA, Qwen, and Fuyu models for comparison.
python AGIC/zeroshot_models.py
- Default paths:
  - Images: /home/ubuntu/flickr8k/Images/
  - Ground truth: /content/gts.json
  - Output: /content/zeroshot_captions.json
Note: All scripts can be modified to accept command-line arguments for paths and parameters by adding argparse if needed.
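A minimal sketch of what such argparse support could look like is shown here. The flag names (`--images`, `--gts`, `--output`) and the helper functions are hypothetical and do not exist in the scripts today; the defaults are the paths documented for agic_blip2.py.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Command-line options mirroring the default paths listed for agic_blip2.py."""
    parser = argparse.ArgumentParser(description="Run AGIC captioning.")
    # Flag names are hypothetical; defaults are the documented paths.
    parser.add_argument("--images", default="/home/ubuntu/flickr8k/Images/",
                        help="Directory containing the input images.")
    parser.add_argument("--gts", default="/content/gts.json",
                        help="Ground-truth captions file.")
    parser.add_argument("--output", default="/content/agic_8k.json",
                        help="Where to write the generated captions.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)


# Override only the output path; the other options keep their defaults.
args = parse_args(["--output", "results/agic_8k.json"])
```

Inside a script, `main()` would then read `args.images`, `args.gts`, and `args.output` instead of hard-coded variables.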
Install dependencies with:
pip install -r requirements.txt
Main packages:
- torch
- transformers
- torchvision
- pillow
- pandas
- tqdm
- pycocoevalcap
- The scripts also use json, os, glob, logging, dataclasses (Python 3.7+), and typing, which are part of the Python standard library and need no separate installation.
- For some models (e.g., BLIP-2, LLaVA, Qwen, Fuyu), you may need to install additional model-specific dependencies from HuggingFace.
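To confirm that the environment is ready before launching a long run, a quick check along the following lines can help. It is a sketch rather than part of the repository: the `REQUIRED` table and `missing_packages` helper are hypothetical names, and the only non-obvious mapping is that the pillow package is imported as `PIL`.

```python
from importlib.util import find_spec
from typing import Dict, List

# Maps pip package name -> importable module name (pillow installs as PIL).
REQUIRED: Dict[str, str] = {
    "torch": "torch",
    "transformers": "transformers",
    "torchvision": "torchvision",
    "pillow": "PIL",
    "pandas": "pandas",
    "tqdm": "tqdm",
    "pycocoevalcap": "pycocoevalcap",
}


def missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names whose module cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```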
AGIC extracts attention weights from a vision transformer, amplifies the image features using these weights, and generates captions using a hybrid decoding strategy. See the code and comments for mathematical details and implementation.
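The sampling side of the decoding strategy can be sketched in plain Python as follows. This illustrates only the standard techniques named above (temperature scaling, Top-k, and Top-p filtering); how AGIC combines them with beam search is not shown here, and the function names, the order of the filters, and the toy logits are assumptions.

```python
import math
import random
from typing import Dict, Optional


def filter_distribution(logits: Dict[str, float], temperature: float = 1.0,
                        top_k: int = 0, top_p: float = 1.0) -> Dict[str, float]:
    """Apply temperature scaling, then Top-k, then Top-p (nucleus) filtering.

    Returns a renormalized probability distribution over the surviving tokens.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: val / temperature for tok, val in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(val - peak) for tok, val in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    mass = sum(prob for _, prob in kept)
    return {tok: prob / mass for tok, prob in kept}


def sample_token(probs: Dict[str, float], rng: Optional[random.Random] = None) -> str:
    """Draw one token from the filtered distribution."""
    rng = rng or random.Random()
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token logits for a caption prefix such as "A dog".
logits = {"runs": 2.0, "sits": 1.0, "flies": -1.0, "sings": -3.0}
probs = filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9)
next_token = sample_token(probs, random.Random(0))
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on the most likely tokens, while larger values admit more varied word choices.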
- Flickr8k Dataset
- Flickr30k Dataset
- Liu et al., 2025. Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models
