
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Christian Schlarmann  •  Francesco Croce  •  Nicolas Flammarion  •  Matthias Hein

[Paper] [HuggingFace] [BibTeX]

FuseLIP is a multimodal embedding architecture that unifies text and image inputs through early fusion. Unlike traditional contrastive models that use separate encoders and rely on late fusion, FuseLIP employs a single transformer operating on a joint vocabulary of discrete image and text tokens. This enables deep cross-modal interaction and richer representations.
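To make the idea of a joint vocabulary concrete, here is a minimal sketch of one way discrete image and text tokens could share a single id space. It is not taken from the FuseLIP code: the vocabulary sizes, the offset scheme, the image-before-text ordering, and the function name `to_joint_sequence` are all assumptions chosen for illustration.

```python
# Illustrative sketch only: sizes, ids, and ordering are hypothetical,
# not the values or layout used by FuseLIP.
TEXT_VOCAB_SIZE = 1000   # hypothetical text tokenizer vocabulary size
IMAGE_VOCAB_SIZE = 512   # hypothetical image tokenizer codebook size


def to_joint_sequence(image_tokens, text_tokens):
    """Map both modalities into one id space and concatenate them."""
    for t in text_tokens:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text token {t} out of range")
    for t in image_tokens:
        if not 0 <= t < IMAGE_VOCAB_SIZE:
            raise ValueError(f"image token {t} out of range")
    # Shift image ids past the text range so the two id sets never collide.
    shifted_image = [TEXT_VOCAB_SIZE + t for t in image_tokens]
    # Assumed ordering: image tokens first, then text tokens.
    return shifted_image + list(text_tokens)


joint = to_joint_sequence(image_tokens=[3, 511, 0], text_tokens=[17, 42])
print(joint)  # [1003, 1511, 1000, 17, 42]
```

A single transformer with one embedding table of size `TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE` can then attend across both modalities from the first layer on, which is what distinguishes early fusion from combining two separately computed embeddings.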

Installation

To install the required packages, install Python 3.11 and run:

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia  # results differ slightly if PyTorch is installed via pip
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:./src"

Dataset Preparation

For training FuseLIP, please gather the following datasets:

For evaluation, additionally download the following datasets:

wget https://huggingface.co/datasets/TIGER-Lab/MMEB-eval/resolve/main/images.zip
unzip images.zip -d eval_images/

Then set the paths to the datasets in ./src/config.py.

Pretrained Models

We provide pretrained FuseLIP models that can be used for evaluation or fine-tuning. The models attain the following performance:

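All columns after "Model" are evaluation metrics (higher = better).

| Training Data | Model | Classification | VQA | Retrieval | Grounding | ImageNet | VG-Crop | OI-Crop | OI-Pos | TGIT |
|---|---|---|---|---|---|---|---|---|---|---|
| CC3M + MM | SigLIP-SSF | 21.5 | 12.7 | 13.0 | 74.8 | 8.8 | 52.0 | 55.2 | 45.4 | 57.3 |
| CC3M + MM | SigLIP-SMLF | 18.0 | 14.2 | 12.7 | 74.2 | 10.2 | 53.0 | 66.2 | 46.9 | 67.2 |
| CC3M + MM | SigLIP-BSF | 22.2 | 13.6 | 13.4 | 77.2 | 10.3 | 55.1 | 56.9 | 45.9 | 56.6 |
| CC3M + MM | SigLIP-BMLF | 19.5 | 14.8 | 13.9 | 76.9 | 12.2 | 55.4 | 68.4 | 47.4 | 69.4 |
| CC3M + MM | FuseLIP-S | 18.5 | 15.9 | 11.2 | 70.8 | 13.5 | 49.6 | 59.8 | 53.9 | 79.0 |
| CC3M + MM | FuseLIP-B | 23.3 | 17.5 | 15.0 | 82.4 | 18.1 | 55.8 | 68.1 | 70.8 | 94.3 |
| CC12M + MM | SigLIP-SSF | 30.4 | 16.2 | 23.8 | 74.2 | 21.4 | 57.1 | 60.1 | 47.1 | 66.0 |
| CC12M + MM | SigLIP-SMLF | 28.5 | 16.9 | 23.2 | 72.7 | 25.5 | 58.8 | 72.2 | 46.6 | 81.0 |
| CC12M + MM | SigLIP-BSF | 31.5 | 17.0 | 23.8 | 72.7 | 25.4 | 58.0 | 63.2 | 47.3 | 67.1 |
| CC12M + MM | SigLIP-BMLF | 30.3 | 16.8 | 23.2 | 73.4 | 28.8 | 61.5 | 74.0 | 48.9 | 78.1 |
| CC12M + MM | FuseLIP-S | 25.2 | 18.2 | 20.1 | 75.2 | 26.0 | 53.5 | 64.7 | 61.5 | 90.6 |
| CC12M + MM | FuseLIP-B | 31.2 | 19.8 | 26.2 | 82.3 | 32.7 | 61.5 | 71.3 | 68.9 | 94.2 |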

Models can be loaded as follows:

from fuse_clip.fuse_clip_utils import load_model
model, image_processor, text_tokenizer = load_model("chs20/FuseLIP-S-CC3M-MM", device="cuda")

Training

Training scripts are provided for different model variants and datasets and can be run as follows:

SigLIP-SSF, SigLIP-BSF:
SigLIP with score fusion, i.e. the image embedding and the text embedding are added element-wise

# CC3M:
./scripts/train-baseline-cc3m.sh sf [small | base]
# CC12M:
./scripts/train-baseline-cc12m.sh sf [small | base]
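As a rough illustration of score fusion, the sketch below adds an image embedding and a text embedding element-wise. It is framework-free and not taken from the training scripts; the function names are hypothetical, and re-normalizing the sum to unit length is an assumption that the baseline may or may not share.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def score_fusion(image_emb, text_emb):
    """Fuse two embeddings of equal dimension by element-wise addition."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    summed = [a + b for a, b in zip(image_emb, text_emb)]
    # Assumption: the fused vector is re-normalized before computing similarities.
    return l2_normalize(summed)


fused = score_fusion([1.0, 0.0], [0.0, 1.0])
print(fused)  # both components equal sqrt(0.5)
```

Because the two embeddings come from separate encoders and only meet in this sum, the modalities cannot interact before the final layer, which is the limitation early fusion is meant to address.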

SigLIP-SMLF, SigLIP-BMLF:
SigLIP with MagicLens-style fusion, i.e. image and text embeddings are merged by a small late-fusion module

# CC3M:
./scripts/train-baseline-cc3m.sh mlf [small | base]
# CC12M:
./scripts/train-baseline-cc12m.sh mlf [small | base]

FuseLIP-S:
Our proposed architecture with early fusion of discrete tokens

# CC3M:
./scripts/train-fuselip-cc3m.sh small
# CC12M:
./scripts/train-fuselip-cc12m.sh small

FuseLIP-B:
Our proposed architecture with early fusion of discrete tokens

# CC3M:
./scripts/train-fuselip-cc3m.sh base
# CC12M:
./scripts/train-fuselip-cc12m.sh base

Evaluation

Main evaluation

python src/fuse_eval/eval_all.py

SugarCrepe

To evaluate compositionality performance on SugarCrepe, run:

python src/fuse_eval/eval_sugarcrepe.py
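For orientation, SugarCrepe-style benchmarks pair each image with a correct caption and a hard-negative caption, and count an example as correct when the model scores the correct caption higher. The sketch below shows that accuracy computation on made-up similarity values; it does not reproduce `eval_sugarcrepe.py`, and the function name and the choice to count ties as incorrect are assumptions.

```python
def hard_negative_accuracy(scores):
    """Fraction of examples where the positive caption outscores the negative.

    scores: list of (sim_to_positive_caption, sim_to_negative_caption) pairs.
    """
    if not scores:
        raise ValueError("no examples given")
    # Assumption: a tie is counted as incorrect.
    correct = sum(1 for pos, neg in scores if pos > neg)
    return correct / len(scores)


# Hypothetical image-caption similarities, for illustration only.
example_scores = [(0.31, 0.22), (0.18, 0.25), (0.40, 0.12), (0.27, 0.27)]
accuracy = hard_negative_accuracy(example_scores)
print(accuracy)  # 0.5
```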

Generating VQA Data from Captions

To generate CC3M-VQA or CC12M-VQA yourself, run:

python scripts/generate_vqa_data.py [--model meta-llama/Llama-3.1-8B-Instruct] [--bs 128] [--cc12m]

The script uses all available GPUs; --bs sets the batch size per device.

Acknowledgement

This codebase gratefully forks from

Citation

If you find this project useful, please cite our paper:

@article{schlarmann2025fuselip,
	title = {FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens},
	author = {Christian Schlarmann and Francesco Croce and Nicolas Flammarion and Matthias Hein},
	year = 2025,
	journal = {arXiv preprint arXiv:2506.03096}
}
