FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Christian Schlarmann • Francesco Croce • Nicolas Flammarion • Matthias Hein

FuseLIP is a multimodal embedding architecture that unifies text and image inputs through early fusion. Unlike traditional contrastive models that use separate encoders and rely on late fusion, FuseLIP employs a single transformer operating on a joint vocabulary of discrete image and text tokens. This enables deep cross-modal interaction and richer representations.
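
To make the idea of a joint vocabulary concrete, here is a minimal sketch; it is not the repository's implementation. It assumes a text tokenizer with `TEXT_VOCAB_SIZE` ids and a discrete image tokenizer with `IMAGE_VOCAB_SIZE` codes (both sizes, the helper name `fuse_tokens`, and the image-before-text ordering are placeholders chosen for illustration). Image codes are shifted past the text ids so that a single transformer could consume one combined sequence.

```python
# Illustrative only: the vocabulary sizes and the ordering of image and text
# tokens are assumptions, not values taken from FuseLIP.
TEXT_VOCAB_SIZE = 1000   # hypothetical number of text token ids
IMAGE_VOCAB_SIZE = 512   # hypothetical number of discrete image codes


def fuse_tokens(image_codes: list[int], text_ids: list[int]) -> list[int]:
    """Map image codes and text ids into one joint id space and concatenate."""
    for code in image_codes:
        if not 0 <= code < IMAGE_VOCAB_SIZE:
            raise ValueError(f"image code {code} out of range")
    for tok in text_ids:
        if not 0 <= tok < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id {tok} out of range")
    # Shift image codes past the text ids so the two ranges never collide.
    shifted_image = [TEXT_VOCAB_SIZE + code for code in image_codes]
    return shifted_image + text_ids


if __name__ == "__main__":
    joint = fuse_tokens(image_codes=[3, 511, 0], text_ids=[17, 42])
    print(joint)  # [1003, 1511, 1000, 17, 42]
```

Because every id in the fused sequence is unique to its modality, one embedding table of size `TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE` would suffice in this toy setup.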

Installation

Set up an environment with Python 3.11, then install the required packages:

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia  # slightly different results when installing pytorch via pip
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:./src"
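
As an optional sanity check after installation, the following sketch compares the installed package versions against the ones pinned above. It uses only the standard library's `importlib.metadata`. Whether a conda-installed build exposes its metadata this way is an assumption, and `check_versions` is a helper name introduced here for illustration, not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the conda command above.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'found <version>'."""
    status = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        # Builds may carry a local suffix such as '+cu124'; compare the public part.
        public = found.split("+")[0]
        status[name] = "ok" if public == wanted else f"found {found}"
    return status


if __name__ == "__main__":
    for pkg, state in check_versions(EXPECTED).items():
        print(f"{pkg}: {state}")
```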

Dataset Preparation

For training FuseLIP, please gather the following datasets:

- CC3M and CC12M: we download the datasets from HuggingFace and extract the images and captions. Cleaned CSV files are available here (obtained via scripts/cc_wds_to_csv.py).
- CC3M-VQA
  - We also supply CC12M-VQA, but it was not used in the paper.
- HQ-Edit: downloaded automatically when training starts
- Visual Genome (VG): obtain the following files
  - images
  - region descriptions
  - question answers

For evaluation, additionally download the following datasets:

- Open Images v7: test images
- MMEB: download images via

wget https://huggingface.co/datasets/TIGER-Lab/MMEB-eval/resolve/main/images.zip
unzip images.zip -d eval_images/

- ImageNet validation set

Then set the paths to the datasets in ./src/config.py.
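
Before starting a long run, it can help to verify that the configured locations exist. The sketch below is a standalone check and does not import ./src/config.py, since the variable names used there are not shown in this README. The dictionary keys, the example paths, and the helper `missing_paths` are all hypothetical; substitute the paths you actually configured.

```python
from pathlib import Path

# Hypothetical dataset locations; replace with the paths set in ./src/config.py.
DATASET_PATHS = {
    "cc3m": "/data/cc3m",
    "visual_genome": "/data/visual_genome",
    "mmeb_images": "eval_images",
    "imagenet_val": "/data/imagenet/val",
}


def missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names whose path does not exist on disk."""
    return [name for name, p in paths.items() if not Path(p).exists()]


if __name__ == "__main__":
    missing = missing_paths(DATASET_PATHS)
    if missing:
        print("Missing dataset paths:", ", ".join(missing))
    else:
        print("All dataset paths exist.")
```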

Pretrained Models

We provide pretrained FuseLIP models that can be used for evaluation or fine-tuning. The table below reports the performance of FuseLIP and the SigLIP baselines; higher is better for all metrics.

Training Data	Model	Classification	VQA	Retrieval	Grounding	ImageNet	VG-Crop	OI-Crop	OI-Pos	TGIT
CC3M + MM	SigLIP-S_SF	21.5	12.7	13.0	74.8	8.8	52.0	55.2	45.4	57.3
CC3M + MM	SigLIP-S_MLF	18.0	14.2	12.7	74.2	10.2	53.0	66.2	46.9	67.2
CC3M + MM	SigLIP-B_SF	22.2	13.6	13.4	77.2	10.3	55.1	56.9	45.9	56.6
CC3M + MM	SigLIP-B_MLF	19.5	14.8	13.9	76.9	12.2	55.4	68.4	47.4	69.4
CC3M + MM	FuseLIP-S	18.5	15.9	11.2	70.8	13.5	49.6	59.8	53.9	79.0
CC3M + MM	FuseLIP-B	23.3	17.5	15.0	82.4	18.1	55.8	68.1	70.8	94.3
CC12M + MM	SigLIP-S_SF	30.4	16.2	23.8	74.2	21.4	57.1	60.1	47.1	66.0
CC12M + MM	SigLIP-S_MLF	28.5	16.9	23.2	72.7	25.5	58.8	72.2	46.6	81.0
CC12M + MM	SigLIP-B_SF	31.5	17.0	23.8	72.7	25.4	58.0	63.2	47.3	67.1
CC12M + MM	SigLIP-B_MLF	30.3	16.8	23.2	73.4	28.8	61.5	74.0	48.9	78.1
CC12M + MM	FuseLIP-S	25.2	18.2	20.1	75.2	26.0	53.5	64.7	61.5	90.6
CC12M + MM	FuseLIP-B	31.2	19.8	26.2	82.3	32.7	61.5	71.3	68.9	94.2

Models can be loaded as follows:

from fuse_clip.fuse_clip_utils import load_model
model, image_processor, text_tokenizer = load_model("chs20/FuseLIP-S-CC3M-MM", device="cuda")
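
Once embeddings have been computed with the loaded model (the call that produces them is not documented in this README, so it is not shown here), retrieval with embedding models of this kind is commonly done by ranking candidates by cosine similarity. The sketch below illustrates that ranking step on plain Python lists standing in for real embeddings. `cosine` and `rank_candidates` are illustrative helpers, not functions from this repository.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def rank_candidates(query: list[float], candidates: list[list[float]]) -> list[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real embeddings.
    query = [1.0, 0.0, 0.0]
    candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
    print(rank_candidates(query, candidates))  # [1, 0, 2]
```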

Training

Training scripts are provided for different model variants and datasets and can be run as follows:

SigLIP-S_SF, SigLIP-B_SF:
SigLIP with score fusion, i.e. element-wise addition of the image embedding and the text embedding

# CC3M:
./scripts/train-baseline-cc3m.sh sf [small | base]
# CC12M:
./scripts/train-baseline-cc12m.sh sf [small | base]
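
For intuition, score fusion can be sketched in a few lines. The example below adds an image embedding and a text embedding element-wise; the final L2 normalization is an assumption made here so the fused vector can be compared by cosine similarity, and the helper names are illustrative rather than taken from the training code.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def score_fusion(image_emb: list[float], text_emb: list[float]) -> list[float]:
    """Fuse two embeddings by element-wise addition, then normalize (assumed)."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    summed = [a + b for a, b in zip(image_emb, text_emb)]
    return l2_normalize(summed)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real embeddings have many more dimensions.
    fused = score_fusion([3.0, 0.0], [0.0, 4.0])
    print(fused)  # [0.6, 0.8]
```

Note that the two encoders never see each other's input in this scheme; the modalities interact only through the final sum, which is what makes it a late-fusion baseline.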

SigLIP-S_MLF, SigLIP-B_MLF:
SigLIP with MagicLens fusion, i.e. merging the image and text embeddings via a small late-fusion module

# CC3M:
./scripts/train-baseline-cc3m.sh mlf [small | base]
# CC12M:
./scripts/train-baseline-cc12m.sh mlf [small | base]

FuseLIP-S:
Our proposed architecture with early fusion of discrete tokens

# CC3M:
./scripts/train-fuselip-cc3m.sh small
# CC12M:
./scripts/train-fuselip-cc12m.sh small

FuseLIP-B:
Our proposed architecture with early fusion of discrete tokens

# CC3M:
./scripts/train-fuselip-cc3m.sh base
# CC12M:
./scripts/train-fuselip-cc12m.sh base

Evaluation

Main evaluation

python src/fuse_eval/eval_all.py

SugarCrepe

To evaluate compositionality performance on SugarCrepe, run:

python src/fuse_eval/eval_sugarcrepe.py

Generating VQA Data from Captions

To generate CC3M-VQA or CC12M-VQA yourself, run:

python scripts/generate_vqa_data.py [--model meta-llama/Llama-3.1-8B-Instruct] [--bs 128] [--cc12m]

The script uses all available GPUs and applies the specified batch size on each device.
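
Since the batch size applies per device, the number of captions processed in parallel scales with the GPU count. A small worked example, assuming the `--bs 128` value shown in the command and a hypothetical 4-GPU machine (`captions_in_flight` is a helper introduced here for illustration):

```python
def captions_in_flight(bs_per_device: int, num_gpus: int) -> int:
    """Total captions processed in parallel when each GPU gets its own batch."""
    if bs_per_device <= 0 or num_gpus <= 0:
        raise ValueError("batch size and GPU count must be positive")
    return bs_per_device * num_gpus


if __name__ == "__main__":
    # --bs 128 on a hypothetical machine with 4 GPUs
    print(captions_in_flight(128, 4))  # 512
```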

Acknowledgement

This codebase gratefully forks from

Citation

If you find this project useful, please cite our paper:

@article{schlarmann2025fuselip,
	title = {FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens},
	author = {Christian Schlarmann and Francesco Croce and Nicolas Flammarion and Matthias Hein},
	year = 2025,
	journal = {arXiv preprint arXiv:2506.03096}
}
