Abstract
Deploying multimodal models in real-world scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder's feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of approximately 5% over standard fusion and approximately 2% over state-of-the-art methods.
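The "additive loss term" idea from the abstract can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes a diagonal-Gaussian entropy estimate, and the function names `gaussian_entropy` and `regularized_loss` and the `weight` value are hypothetical. It uses plain Python lists so it runs without any dependencies; in practice the same computation would operate on the encoders' feature tensors.

```python
import math


def gaussian_entropy(features):
    """Differential entropy (in nats) of a diagonal Gaussian fitted to a batch.

    `features` is a list of equal-length feature vectors, one per sample.
    ASSUMPTION: the diagonal-Gaussian estimator is chosen for illustration only.
    """
    n = len(features)
    dim = len(features[0])
    entropy = 0.0
    for d in range(dim):
        column = [row[d] for row in features]
        mean = sum(column) / n
        # Small epsilon keeps the logarithm finite when a dimension collapses.
        var = sum((x - mean) ** 2 for x in column) / n + 1e-6
        entropy += 0.5 * math.log(2 * math.pi * math.e * var)
    return entropy


def regularized_loss(task_loss, features_per_modality, weight=0.1):
    """Task loss minus a weighted sum of per-modality feature entropies.

    Subtracting the entropy means that minimizing the total loss
    maximizes the entropy of each encoder's features.
    ASSUMPTION: `weight` is a hypothetical hyperparameter value.
    """
    total_entropy = sum(gaussian_entropy(f) for f in features_per_modality.values())
    return task_loss - weight * total_entropy


if __name__ == "__main__":
    batch = {
        "rgb": [[0.0, 0.0], [2.0, 2.0], [4.0, -1.0]],
        "audio": [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # collapsed features
    }
    print(regularized_loss(1.0, batch))
```

In this sketch a modality whose features collapse to a single point contributes a strongly negative entropy, which raises the total loss, while a modality with spread-out features lowers it.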
The code was tested with Python 3.10.4 and torch 1.11.0+cu113.
Environment:
- mmcv-full 1.2.7
- mmaction2 0.13.0
- Download the Audio model (link), rename it to `vggsound_avgpool.pth.tar`, and place it under the `EPIC-rgb-flow-audio/pretrained_models` and `HAC-rgb-flow-audio/pretrained_models` directories.
- Download the SlowFast model for the RGB modality (link) and place it under the `pretrained_models` directories.
- Download the SlowOnly model for the Flow modality (link) and place it under the `pretrained_models` directories.
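Before training, it can help to confirm that the renamed audio checkpoint is in both places. The following is a small sketch of such a check; the helper name `missing_audio_checkpoints` is hypothetical and not part of the repository, and it only covers the audio checkpoint because the SlowFast and SlowOnly file names are not stated here.

```python
from pathlib import Path

# File and directory names taken from the download instructions above.
AUDIO_CHECKPOINT = "vggsound_avgpool.pth.tar"
PROJECT_DIRS = ("EPIC-rgb-flow-audio", "HAC-rgb-flow-audio")


def missing_audio_checkpoints(repo_root="."):
    """Return the expected audio checkpoint paths that do not exist yet.

    ASSUMPTION: `repo_root` is the top-level directory of the cloned repository.
    """
    root = Path(repo_root)
    expected = [
        root / project / "pretrained_models" / AUDIO_CHECKPOINT
        for project in PROJECT_DIRS
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_audio_checkpoints()
    for path in missing:
        print(f"missing: {path}")
    if not missing:
        print("audio checkpoint found in both pretrained_models directories")
```

Run it from the repository root; an empty result means the audio checkpoint is present under both `pretrained_models` directories.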
- EPIC-Kitchens: Download Audio files EPIC-KITCHENS-audio.zip. Follow the original EPIC-Kitchens extraction format.
- HAC: Download at link.
(See the original SimMMDG repository for the expected directory structure of each dataset.)
We provide scripts for both datasets that run our MER-DG approach alongside the standard Baseline Fusion and the state-of-the-art SimMMDG framework.
Each directory contains a unified run_experiments.sh script that handles model configuration and training. Before running:
- Edit `EPIC-rgb-flow-audio/run_experiments.sh` or `HAC-rgb-flow-audio/run_experiments.sh`.
- Point `DATAPATH=` to where you stored the datasets locally.
    cd EPIC-rgb-flow-audio
    bash run_experiments.sh

    cd HAC-rgb-flow-audio
    bash run_experiments.sh

By default, the scripts execute the following experiments sequentially:
- Baseline Fusion
- Baseline Fusion + MER-DG
- SimMMDG Baseline
- SimMMDG + MER-DG
To run only specific experiments, edit the script accordingly. Make sure wandb is configured for metric logging.
@inproceedings{yarici2026merdg,
title={MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization},
author={Yarici, Yavuz and AlRegib, Ghassan},
booktitle={2026 International Conference on Machine Learning (ICML)},
note={Accepted on April 30, 2026},
year={2026}
}

This codebase is adapted from the SimMMDG framework. We sincerely thank the authors for open-sourcing their code.
