Christian Schlarmann • Francesco Croce • Nicolas Flammarion • Matthias Hein
[Paper] [HuggingFace] [BibTeX]
FuseLIP is a multimodal embedding architecture that unifies text and image inputs through early fusion. Unlike traditional contrastive models that use separate encoders and rely on late fusion, FuseLIP employs a single transformer operating on a joint vocabulary of discrete image and text tokens. This enables deep cross-modal interaction and richer representations.
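To make the early-fusion idea concrete, the sketch below builds one token sequence from discrete image and text tokens by shifting image token ids past the text vocabulary. The vocabulary sizes, the offset scheme, the image-then-text ordering, and the function name `build_joint_sequence` are illustrative assumptions, not the repository's actual tokenization.

```python
from typing import List

# Assumed sizes, chosen only for illustration (not taken from FuseLIP).
TEXT_VOCAB_SIZE = 1000
IMAGE_VOCAB_SIZE = 512


def build_joint_sequence(image_tokens: List[int], text_tokens: List[int]) -> List[int]:
    """Map image and text token ids into one joint vocabulary and concatenate them.

    Text ids keep their range [0, TEXT_VOCAB_SIZE); image ids are shifted to
    [TEXT_VOCAB_SIZE, TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE).
    """
    for t in text_tokens:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text token id out of range: {t}")
    for i in image_tokens:
        if not 0 <= i < IMAGE_VOCAB_SIZE:
            raise ValueError(f"image token id out of range: {i}")

    # Shift image ids so they cannot collide with text ids.
    shifted_image_tokens = [TEXT_VOCAB_SIZE + i for i in image_tokens]

    # A single transformer would consume this joint sequence (hypothetical ordering).
    return shifted_image_tokens + list(text_tokens)


joint = build_joint_sequence(image_tokens=[3, 511], text_tokens=[7, 42, 999])
print(joint)  # [1003, 1511, 7, 42, 999]
```

Because both modalities share one id space, a single embedding table and a single transformer can attend across image and text tokens from the first layer onward, which is the contrast with late fusion of separately encoded embeddings.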
To install the required packages, install Python 3.11 and run:
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia # slightly different results when installing pytorch via pip
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:./src"

For training FuseLIP, please gather the following datasets:
- CC3M and CC12M: we download the datasets from HuggingFace and extract the images and captions. Cleaned CSV files are available here (obtained via scripts/cc_wds_to_csv.py).
- CC3M-VQA
  - we also supply CC12M-VQA, but it was not used in the paper
- HQ-Edit: will be downloaded automatically when starting training
- Visual Genome (VG): obtain the following files
For evaluation, please download the following datasets additionally:
- Open Images v7: test images
- MMEB: download images via
wget https://huggingface.co/datasets/TIGER-Lab/MMEB-eval/resolve/main/images.zip
unzip images.zip -d eval_images/
- ImageNet validation set
Then set the paths to the datasets in ./src/config.py.
We provide pretrained FuseLIP models that can be used for evaluation or fine-tuning. The models attain the following performance:
| Training Data | Model | Classification | VQA | Retrieval | Grounding | ImageNet | VG-Crop | OI-Crop | OI-Pos | TGIT |
|---|---|---|---|---|---|---|---|---|---|---|
| CC3M + MM | SigLIP-SSF | 21.5 | 12.7 | 13.0 | 74.8 | 8.8 | 52.0 | 55.2 | 45.4 | 57.3 |
| | SigLIP-SMLF | 18.0 | 14.2 | 12.7 | 74.2 | 10.2 | 53.0 | 66.2 | 46.9 | 67.2 |
| | SigLIP-BSF | 22.2 | 13.6 | 13.4 | 77.2 | 10.3 | 55.1 | 56.9 | 45.9 | 56.6 |
| | SigLIP-BMLF | 19.5 | 14.8 | 13.9 | 76.9 | 12.2 | 55.4 | 68.4 | 47.4 | 69.4 |
| | FuseLIP-S | 18.5 | 15.9 | 11.2 | 70.8 | 13.5 | 49.6 | 59.8 | 53.9 | 79.0 |
| | FuseLIP-B | 23.3 | 17.5 | 15.0 | 82.4 | 18.1 | 55.8 | 68.1 | 70.8 | 94.3 |
| CC12M + MM | SigLIP-SSF | 30.4 | 16.2 | 23.8 | 74.2 | 21.4 | 57.1 | 60.1 | 47.1 | 66.0 |
| | SigLIP-SMLF | 28.5 | 16.9 | 23.2 | 72.7 | 25.5 | 58.8 | 72.2 | 46.6 | 81.0 |
| | SigLIP-BSF | 31.5 | 17.0 | 23.8 | 72.7 | 25.4 | 58.0 | 63.2 | 47.3 | 67.1 |
| | SigLIP-BMLF | 30.3 | 16.8 | 23.2 | 73.4 | 28.8 | 61.5 | 74.0 | 48.9 | 78.1 |
| | FuseLIP-S | 25.2 | 18.2 | 20.1 | 75.2 | 26.0 | 53.5 | 64.7 | 61.5 | 90.6 |
| | FuseLIP-B | 31.2 | 19.8 | 26.2 | 82.3 | 32.7 | 61.5 | 71.3 | 68.9 | 94.2 |

All columns report evaluation metrics; higher is better.
Models can be loaded as follows:
from fuse_clip.fuse_clip_utils import load_model
model, image_processor, text_tokenizer = load_model("chs20/FuseLIP-S-CC3M-MM", device="cuda")

Training scripts are provided for different model variants and datasets and can be run as follows:
SigLIP-SSF, SigLIP-BSF:
SigLIP with score fusion, i.e. arithmetic addition of image embedding + text embedding
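As a rough illustration of score fusion, the sketch below adds an image embedding and a text embedding element-wise and compares the result to a candidate embedding by cosine similarity. The L2 normalization after the sum and the helper names (`l2_normalize`, `score_fusion`, `cosine_similarity`) are assumptions made for this example; the baseline in this repository may normalize or scale differently.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def score_fusion(image_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Fuse two embeddings by element-wise addition, then normalize (assumed step)."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    summed = [i + t for i, t in zip(image_emb, text_emb)]
    return l2_normalize(summed)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal dimension."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy 2-d embeddings standing in for encoder outputs.
fused_query = score_fusion(image_emb=[1.0, 0.0], text_emb=[0.0, 1.0])
print(cosine_similarity(fused_query, [1.0, 1.0]))   # close to 1: matches both inputs
print(cosine_similarity(fused_query, [1.0, -1.0]))  # close to 0: orthogonal to the sum
```

Since the two modalities interact only through this final sum, score fusion has no learned cross-modal parameters, which is what distinguishes it from the late-fusion module and from early fusion.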
# CC3M:
./scripts/train-baseline-cc3m.sh sf [small | base]
# CC12M:
./scripts/train-baseline-cc12m.sh sf [small | base]

SigLIP-SMLF, SigLIP-BMLF:
SigLIP with MagicLens fusion, i.e. merging image and text embeddings via a small late-fusion module
# CC3M:
./scripts/train-baseline-cc3m.sh mlf [small | base]
# CC12M:
./scripts/train-baseline-cc12m.sh mlf [small | base]

FuseLIP-S:
Our proposed architecture with early fusion of discrete tokens
# CC3M:
./scripts/train-fuselip-cc3m.sh small
# CC12M:
./scripts/train-fuselip-cc12m.sh small

FuseLIP-B:
Our proposed architecture with early fusion of discrete tokens
# CC3M:
./scripts/train-fuselip-cc3m.sh base
# CC12M:
./scripts/train-fuselip-cc12m.sh base

To run the main evaluation, use:
python src/fuse_eval/eval_all.py

To evaluate compositionality performance on SugarCrepe, run:
python src/fuse_eval/eval_sugarcrepe.py

To generate CC3M-VQA or CC12M-VQA yourself, run:
python scripts/generate_vqa_data.py [--model meta-llama/Llama-3.1-8B-Instruct] [--bs 128] [--cc12m]

This script uses all available GPUs, with the specified batch size applied per device.
This codebase gratefully forks from
If you find this project useful, please cite our paper:
@article{schlarmann2025fuselip,
title = {FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens},
author = {Christian Schlarmann and Francesco Croce and Nicolas Flammarion and Matthias Hein},
year = 2025,
journal = {arXiv preprint arXiv:2506.03096}
}