GitHub - yeshwanth95/Hash_and_search: Official repository of the CVPR 2026 (Oral) paper: "Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets".

Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

Oral @ CVPR 2026

Yeshwanth Kumar Adimoolam¹, Charalambos Poullis², Melinos Averkiou³
¹Cyprus University of Technology, ²Concordia University, ³CYENS CoE, Cyprus

[Demo Website] [Paper] [Video]

Updates

April 13, 2023 - We have released an interactive web interface to manually inspect the full extent of data leakage and duplication in the AICrowd Mapping Challenge v1 dataset. The web interface can be found at datainspector.app

Highlights

We propose an easy-to-adopt de-duplication and leakage detection pipeline for large-scale image datasets that utilizes collision detection of perceptual hashes of images.
We employ the proposed de-duplication pipeline to identify and eliminate instances of data duplication and leakage in the AICrowd Mapping Challenge dataset. Approximately 250k of the 280k training images were either exact or augmented duplicates.
We demonstrate significant overfitting in recent state-of-the-art methods, which potentially invalidates the results of several prior works that report on this dataset for the task of building footprint extraction.
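
The idea behind the first highlight, treating duplicates as hash collisions, can be sketched in a few lines. This is a minimal illustration and not the repository's code: the function name `find_collisions` and the example hash strings are hypothetical, and each image is assumed to already have a hash stored as a hex string.

```python
from collections import defaultdict


def find_collisions(hashes_by_file):
    """Group file names that share the same hash string.

    `hashes_by_file` maps file name -> hash string (a hypothetical layout).
    Only groups with more than one file, i.e. the collisions, are returned.
    """
    groups = defaultdict(list)
    for filename, hash_str in hashes_by_file.items():
        groups[hash_str].append(filename)
    return {h: sorted(files) for h, files in groups.items() if len(files) > 1}


# Made-up hashes for illustration only.
example = {
    "a.jpg": "ffd8e0c0a0908884",
    "b.jpg": "ffd8e0c0a0908884",
    "c.jpg": "0123456789abcdef",
}
print(find_collisions(example))
```

Because grouping is a single pass over a dictionary, the cost grows linearly with the number of images instead of requiring every pair of images to be compared.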

Installation

conda create -n hash_and_search python=3.10
conda activate hash_and_search
pip install -r requirements.txt

Alternatively, the following requirements can be installed manually:

ImageHash
numpy
Pillow
PyWavelets
scipy
tqdm

Compute Hashes

To compute p-hashes for images in a folder, run:

python compute_hashes.py <input_images_directory> <output_directory> <output_hashtable_filename>
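
For readers who want to see what this step involves, the sketch below builds a hash table for a folder using `imagehash.phash` from the ImageHash package listed in the requirements. It is an illustration under assumptions, not the contents of `compute_hashes.py`: the helper names `phash_file`, `build_hashtable` and `save_hashtable`, the extension filter, and the hash-to-file-names JSON layout are all hypothetical.

```python
import json
from pathlib import Path


def phash_file(path):
    """Return the perceptual hash of one image as a hex string."""
    # Imported lazily so the rest of the sketch works without these packages.
    import imagehash
    from PIL import Image

    with Image.open(path) as img:
        return str(imagehash.phash(img))


def build_hashtable(image_dir, hash_fn=phash_file, exts=(".jpg", ".jpeg", ".png")):
    """Map hash string -> list of file names for every image in `image_dir`."""
    table = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in exts:
            table.setdefault(hash_fn(path), []).append(path.name)
    return table


def save_hashtable(table, output_dir, filename):
    """Write the table as JSON (an assumed format) and return the output path."""
    out_path = Path(output_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(table, indent=2))
    return out_path
```

Keying the table by hash rather than by file name means images with identical hashes land in the same list as soon as the table is built.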

To compute p-hashes of augmented images in the dataset, run:

python compute_hashes_augmented.py <input_images_directory> <output_directory> <output_hashtable_filename>
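
Hashing augmented copies matters because a rotated or mirrored image generally produces a different hash from the original. The README does not list which augmentations the script applies, so the sketch below assumes 90-degree rotations and horizontal flips. To stay dependency-free it uses nested lists in place of images and a toy average hash in place of the p-hash; all function names are hypothetical.

```python
def rotate90(grid):
    """Rotate a 2D list of pixel values by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in grid]


def augmented_variants(grid):
    """Yield (name, grid) for the four rotations and their mirrored versions."""
    current = grid
    for angle in (0, 90, 180, 270):
        yield f"rot{angle}", current
        yield f"rot{angle}_flip", flip_horizontal(current)
        current = rotate90(current)


def average_hash(grid):
    """Toy stand-in for a perceptual hash: one bit per pixel, set if above the mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = "".join("1" if p > mean else "0" for p in pixels)
    return f"{int(bits, 2):0{(len(bits) + 3) // 4}x}"


def augmented_hashes(grid):
    """Map each augmentation name to the hash of the transformed image."""
    return {name: average_hash(variant) for name, variant in augmented_variants(grid)}
```

Storing the hash of every variant lets a later exact-match lookup catch an augmented duplicate even though its hash differs from that of the untransformed image.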

Compare Hashes

Once hashtables have been constructed for two image datasets, compare them to detect duplicates with the following command:

python compare_hashes.py <needles_hashtable> <haystack_hashtable> <output_filename>

This command writes a .json file that lists, for each image in the needles set, all of its duplicates in the haystack set.
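
The needles-in-haystack comparison can be sketched as a dictionary lookup. This is a hedged illustration rather than the logic of `compare_hashes.py`: it assumes each hashtable is a JSON object mapping a hash string to a list of file names and that duplicates are exact hash matches. The function names `find_duplicates` and `compare_hashtables` are hypothetical.

```python
import json


def find_duplicates(needles, haystack):
    """For each needle image, list haystack images with an identical hash.

    Both arguments map hash string -> list of file names (an assumed layout).
    Needle images without any match are left out of the result.
    """
    duplicates = {}
    for hash_str, needle_files in needles.items():
        matches = haystack.get(hash_str, [])
        if matches:
            for name in needle_files:
                duplicates[name] = list(matches)
    return duplicates


def compare_hashtables(needles_path, haystack_path, output_path):
    """Load two JSON hash tables, compare them, and write the result as JSON."""
    with open(needles_path) as f:
        needles = json.load(f)
    with open(haystack_path) as f:
        haystack = json.load(f)
    result = find_duplicates(needles, haystack)
    with open(output_path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```

Each lookup in the haystack table takes constant time on average, so the comparison scales with the size of the needles set instead of the product of the two set sizes.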

Visualise Duplicates

To inspect and visualise these duplicates between the needles and haystack sets, run:

python inspect_hashes.py
python json_to_html.py

These commands generate an HTML file that can be opened in any standard web browser. To view the HTML file:

Download the CrowdAI dataset train split images from here.
Place the train images in the same folder as the HTML file in the following directory structure: ./data/train/images/<place_images_here>.
```
└───data
    └───train
        └───images
            └───<place_images_here>
```
Open the HTML file in a standard web browser (e.g., Google Chrome).
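
The directory layout matters because a page like this would typically reference images by relative path. The sketch below shows one way such a page could be produced from the duplicates JSON. It is not the code in `json_to_html.py`: the function `duplicates_to_html`, the table markup, and the assumed input shape (needle file name mapped to a list of matching file names) are hypothetical.

```python
import html


def duplicates_to_html(duplicates, image_dir="./data/train/images"):
    """Render {needle name: [matching file names]} as a simple HTML page.

    Image paths are relative, so the page only displays pictures when the
    images sit under `image_dir` next to the HTML file.
    """
    rows = []
    for needle, matches in sorted(duplicates.items()):
        imgs = "".join(
            f'<img src="{html.escape(image_dir)}/{html.escape(m)}" width="150">'
            for m in matches
        )
        rows.append(f"<tr><td>{html.escape(needle)}</td><td>{imgs}</td></tr>")
    return "<html><body><table>" + "".join(rows) + "</table></body></html>"
```

Writing the returned string to a file beside the `data` folder gives a page that a browser can open directly, without a web server.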

Citation

If you find our work useful in your research, please consider citing:

@misc{adimoolam2026deduplication,
      title={Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets}, 
      author={Yeshwanth Kumar Adimoolam and Charalambos Poullis and Melinos Averkiou},
      year={2026},
      eprint={2304.02296},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2304.02296}, 
}

Acknowledgement

This repository benefits from
