Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning
Roger Ferrod, Luigi Di Caro and Dino Ienco
Published in Discovery Science (DS) 2024
Earth observation systems provide a continuous stream of satellite imagery for monitoring evolving landscapes. However, most current vision-language models in the Remote Sensing (RS) domain focus on static, single-date imagery and cannot account for temporal changes between multiple observations.
To address this gap, we introduce RSICRC, a novel multimodal model designed specifically for bi-temporal remote sensing image pairs. RSICRC jointly handles two tasks:
- Change Detection Captioning: Accurately describing the changes between two timestamps in natural language.
- Bi-temporal Text-Image Retrieval: Retrieving the correct before/after image pair based on a user's textual query.
Trained with contrastive learning on the LEVIR-CC dataset, the model combines both tasks in a single network: it adds text-image retrieval while preserving state-of-the-art captioning performance.
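As a rough illustration of the retrieval task, the sketch below ranks candidate image pairs by cosine similarity to a text query in a shared embedding space. It is a minimal sketch, not the released inference code: the function names (`l2_normalize`, `retrieve_top_k`) and the toy embedding values are hypothetical, and it assumes each before/after pair and each query has already been mapped to a fixed-size vector.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def retrieve_top_k(query_emb, pair_embs, k=3):
    """Return indices and scores of the k image pairs closest to the query."""
    q = l2_normalize(query_emb)        # shape (d,)
    p = l2_normalize(pair_embs)        # shape (n_pairs, d)
    scores = p @ q                     # cosine similarity per image pair
    order = np.argsort(-scores)[:k]    # highest similarity first
    return order, scores[order]

# Toy embeddings standing in for model outputs (hypothetical values).
pair_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.7, 0.7, 0.0]])
query_emb = np.array([0.9, 0.1, 0.0])

top_idx, top_scores = retrieve_top_k(query_emb, pair_embs, k=2)
```

In this toy setup the query is closest to the first pair, followed by the third, so `top_idx` lists them in that order.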
Our architecture is inspired by CoCa (Contrastive Captioners) but explicitly adapted to handle bi-temporal image pairs:
- Pretrained Remote Sensing Backbone: We utilize a ResNet-50 backbone fine-tuned from RemoteCLIP to independently extract features from the before and after images.
- Bi-temporal Encoder: A hierarchical self-attention block combines the temporal features into a single, unified visual representation.
- Decoupled Decoder: The transformer-based decoder is split into two parts: a unimodal module that encodes textual queries for contrastive learning and a multimodal cross-attention module that generates the final caption.
- False Negative Attraction (FNA): Adapting a captioning dataset (LEVIR-CC) for contrastive retrieval introduces false negatives: distinct image pairs whose captions are semantically identical but which the contrastive loss treats as negatives. Our False Negative Attraction strategy dynamically compares caption embeddings and treats semantically similar false negatives as positive examples, which significantly improves retrieval performance.
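The idea behind False Negative Attraction can be sketched as a contrastive loss with several positives per row. This is only one possible reading under stated assumptions: the cosine-similarity threshold, the temperature, the averaging over positives, and every function name here are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def fna_positive_mask(caption_embs, threshold=0.95):
    """Mark samples (i, j) as mutual positives when their captions are near-identical.

    The threshold value is an assumption chosen for illustration.
    """
    c = l2_normalize(caption_embs)
    mask = (c @ c.T) >= threshold
    np.fill_diagonal(mask, True)       # a sample is always its own positive
    return mask

def _multi_positive_nll(logits, pos_mask):
    """Row-wise negative log-likelihood averaged over all positives in the row."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)

def contrastive_loss_with_fna(image_embs, text_embs, pos_mask, temperature=0.07):
    """Symmetric image-to-text / text-to-image loss using the FNA positive mask."""
    logits = l2_normalize(image_embs) @ l2_normalize(text_embs).T / temperature
    loss_i2t = _multi_positive_nll(logits, pos_mask).mean()
    loss_t2i = _multi_positive_nll(logits.T, pos_mask.T).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: samples 0 and 2 share the same caption (hypothetical values).
text_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
image_embs = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]])

pos_mask = fna_positive_mask(text_embs)
loss = contrastive_loss_with_fna(image_embs, text_embs, pos_mask)
```

Passing `np.eye(3, dtype=bool)` as `pos_mask` recovers the standard one-positive contrastive loss, in which samples 0 and 2 would be pushed apart despite having identical captions.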
If you find our work or the pretrained weights helpful for your research, please consider citing our Discovery Science 2024 paper:
@inproceedings{rsicrc,
title={Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning},
author={Ferrod, Roger and Di Caro, Luigi and Ienco, Dino},
booktitle={International Conference on Discovery Science},
year={2024}
}