MOD-ZOO: Multispectral Object Detection - A Unified Framework and Systematic Survey

This is the official repository for the preprint paper "Multispectral Object Detection: A Unified Framework and Systematic Survey".

This repository (MOD-ZOO) provides a comprehensive, continuously updated collection of resources (papers, code, datasets) for Multispectral Object Detection (MOD) across ground-based and remote sensing scenarios.

📑 Table of Contents

📢 News
📖 Abstract
🖼️ Unified Framework & Taxonomy
🗂️ Datasets & Benchmarks
- Ground-based Datasets
- Remote Sensing Datasets
📚 Paper List (The MOD Zoo)

📢 News

[2026/04] 🔥🔥 The preprint will be available soon!
[2026/04] 🔥🔥 Initial release of the MOD-ZOO repository, including taxonomy, datasets, and paper lists.

📖 Abstract

Multispectral Object Detection (MOD) has emerged as a critical methodology for overcoming the limitations of visible-light imaging, particularly under adverse conditions such as low illumination and inclement weather. By integrating complementary information across diverse spectral bands, MOD enables robust all-day, all-weather perception.

To survey the field systematically, we establish a unified four-stage mathematical framework that decomposes MOD into multispectral data input, feature learning, fusion schemes, and detection solutions.
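The four stages can be read as a composition of functions. The sketch below is a minimal illustration only: the stage names (`load_inputs`, `learn_features`, `fuse`, `detect`), the list-based "images", and the simple arithmetic are all assumptions made for this README and stand in for the learned components described in the paper.

```python
from typing import Dict, List

Image = List[float]  # toy stand-in for a flattened single-band image


def load_inputs(rgb: Image, tir: Image) -> Dict[str, Image]:
    """Stage 1: multispectral data input (a paired RGB/TIR sample)."""
    if len(rgb) != len(tir):
        raise ValueError("paired inputs must have the same size")
    return {"rgb": rgb, "tir": tir}


def learn_features(inputs: Dict[str, Image]) -> Dict[str, Image]:
    """Stage 2: per-modality feature learning (here: min-max normalisation)."""
    feats = {}
    for name, img in inputs.items():
        lo, hi = min(img), max(img)
        scale = (hi - lo) or 1.0  # avoid division by zero on flat images
        feats[name] = [(v - lo) / scale for v in img]
    return feats


def fuse(feats: Dict[str, Image], w_rgb: float = 0.5) -> Image:
    """Stage 3: fusion scheme (here: a fixed weighted sum of both modalities)."""
    return [w_rgb * r + (1.0 - w_rgb) * t
            for r, t in zip(feats["rgb"], feats["tir"])]


def detect(fused: Image, threshold: float = 0.5) -> List[int]:
    """Stage 4: detection (here: indices whose fused response exceeds a threshold)."""
    return [i for i, v in enumerate(fused) if v > threshold]


def run_pipeline(rgb: Image, tir: Image) -> List[int]:
    """Chain the four stages from raw inputs to detections."""
    return detect(fuse(learn_features(load_inputs(rgb, tir))))


if __name__ == "__main__":
    print(run_pipeline([0, 0, 10, 0], [0, 5, 10, 0]))
```

Each stage only depends on the output of the previous one, so a different feature extractor or fusion rule can be swapped in without touching the rest of the chain.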

🖼️ Unified Framework & Taxonomy

Building upon the concepts introduced above, the following figures visualize the structural breakdown of our survey.

Figure 1 illustrates the detailed data flow of the unified mathematical framework, mapping the progression from raw multispectral inputs to final detection outputs.
Figure 2 expands this framework into a fine-grained hierarchical taxonomy. It categorizes recent state-of-the-art methods by the strategies they use to overcome core cross-modal challenges.

This taxonomy directly dictates the organization of the paper list in the following sections.

Figure 1. A unified four-stage framework and systematic taxonomy of MOD.

Figure 2. Hierarchical structural decomposition and taxonomy of the MOD landscape.

🗂️ Datasets & Benchmarks

An overview of representative MOD datasets spanning ground-based and remote sensing scenarios.

Figure 3. Electromagnetic spectrum mapping and visual comparisons.

Ground-based Datasets

(Legend: Pairs = Image Pairs, Res. = Resolution, Plat. = Platform (Surv. = Surveillance, Multi. = Multiple), Cls = Classes, Den. = Density, A/O = Alignment / Occlusion)

Dataset	Venue	Modality	Pairs	Res.	Plat.	Cls	Den.	A/O	Link
KAIST	CVPR'15	RGB-TIR	95.3K	640x480	Driving	1	0.62	✅/✅	Link
CVC-14	Sensors'16	RGB-TIR	8.5K	640x512	Driving	1	0.80	❌/❌	Link
FLIR-aligned	ICIP'20	RGB-TIR	5.1K	640x512	Driving	3	7.92	✅/✅	Link
LLVIP	ICCV'21	RGB-TIR	16.8K	1080x720	Surv.	1	2.51	✅/❌	Link
M³FD	CVPR'22	RGB-TIR	4.2K	1024x768	Multi.	6	8.19	✅/❌	Link
SMOD	TMM'25	RGB-TIR	8.6K	640x512	Driving	4	3.62	✅/✅	Link
MFAD	TCSVT'25	RGB-TIR	12.1K	1280x960	Driving	6	7.13	✅/❌	Link

Remote Sensing Datasets

(Legend: Pairs = Image Pairs, Res. = Resolution, Plat. = Platform (Sat. = Satellite), Cls = Classes, Den. = Density, A/O = Alignment / Occlusion; in the Modality column, R = RGB)

Dataset	Venue	Modality	Pairs	Res.	Plat.	Cls	Den.	A/O	Link
VEDAI	JVCI'16	R-NIR	1.2K	1024x1024	UAV	9	2.93	✅/❌	Link
DroneVehicle	TCSVT'21	R-TIR	28.4K	840x712	UAV	1	16.7	❌/❌	Link
DronePerson	ISPRS'23	R-TIR	6.1K	640x512	UAV	1	11.6	✅/❌	Link
DVTOD	TIV'24	R-TIR	2.1K	1920x1080	UAV	3	2.82	❌/❌	Link
OdinMJ	GRSM'24	R-TIR	23K	640x512	UAV	1	1.98	✅/✅	Link
RGBT-Tiny	TPAMI'25	R-TIR	~47.5K	640x512	UAV	7	12.9	✅/❌	Link
SpaceNet6-OTD	TGRS'22	R-SAR	820	900x900	Sat.	1	22.0	✅/❌	Link
OGSOD-1.0	TGRS'23	R-SAR	14.6K	256x256	Sat.	3	2.62	✅/❌	Link
OGSOD-2.0	ICGIP'25	R-SAR	23.4K	256x256	Sat.	4	3.24	✅/❌	Link
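When choosing a benchmark, it can help to treat the tables as data. The sketch below is a minimal example under assumptions: the helper names (`parse_pairs`, `aligned_datasets`) are hypothetical, and only a handful of rows are copied from the tables above (dataset, pairs, platform, A/O).

```python
from typing import List, Tuple

Row = Tuple[str, str, str, str]  # (dataset, pairs, platform, alignment/occlusion)

# A few rows copied from the tables above.
ROWS: List[Row] = [
    ("KAIST", "95.3K", "Driving", "✅/✅"),
    ("CVC-14", "8.5K", "Driving", "❌/❌"),
    ("LLVIP", "16.8K", "Surv.", "✅/❌"),
    ("DroneVehicle", "28.4K", "UAV", "❌/❌"),
    ("RGBT-Tiny", "~47.5K", "UAV", "✅/❌"),
    ("SpaceNet6-OTD", "820", "Sat.", "✅/❌"),
]


def parse_pairs(text: str) -> int:
    """Convert a pair count such as '95.3K', '~47.5K' or '820' to an integer."""
    text = text.strip().lstrip("~")
    if text.upper().endswith("K"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


def aligned_datasets(rows: List[Row], min_pairs: int = 0) -> List[str]:
    """Return names of datasets marked as aligned with at least min_pairs pairs."""
    names = []
    for name, pairs, _platform, a_o in rows:
        aligned = a_o.split("/")[0] == "✅"
        if aligned and parse_pairs(pairs) >= min_pairs:
            names.append(name)
    return names


if __name__ == "__main__":
    print(aligned_datasets(ROWS, min_pairs=10_000))
```

The same pattern extends to the other columns (resolution, classes, density) if the full tables are exported as tab-separated text.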


📚 Paper List (The MOD Zoo)

We categorize representative methods according to our proposed taxonomy.

1. Feature Learning (Mitigating Representation Challenges)

This section addresses fundamental representation challenges: Modality Misalignment, Modality Imbalance, Modality Redundancy, and Modality Asymmetry.

Modality Misalignment

This challenge manifests in two primary forms: Spatial Misalignment and Semantic Misalignment.

Figure 4. Modality Misalignment
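To make spatial misalignment concrete: if the same object sits at slightly different pixel positions in the two modalities, a box annotated in one modality only partially covers the object in the other. The sketch below illustrates this with intersection-over-union (IoU). It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and a pure translation offset; the helper names are hypothetical and it is not the method of any paper listed here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def shift(box: Box, dx: float, dy: float) -> Box:
    """Translate a box by (dx, dy) pixels, simulating a cross-modal offset."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    rgb_box = (0.0, 0.0, 10.0, 10.0)
    for dx in (0.0, 2.0, 5.0, 10.0):
        tir_box = shift(rgb_box, dx, 0.0)  # same object, offset in the thermal image
        print(dx, round(iou(rgb_box, tir_box), 3))
```

Even a modest offset relative to the object size lowers the overlap quickly, which is one reason small or distant objects are especially sensitive to weakly aligned image pairs.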

Venue	Methods	Title	Modality	Source
AAAI'26	IGIANet	Igianet: Illumination guided implicit alignment network for infrared-visible uav detection	RGB-TIR	Paper
TMM'25	DeformCAT	Deformable cross-attention transformer for weakly aligned rgb-t pedestrian detection	RGB-TIR	Paper/Code
TCSVT'25	SeaDATE	Seadate: Remedy dual-attention transformer with semantic alignment via contrast learning for multimodal object detection	RGB-TIR	Paper
CVPR'24	OAFA	Weakly misalignment-free adaptive feature alignment for uavs-based multimodal object detection	RGB-TIR	Paper
ECCV'24	DAMSDet	Damsdet: Dynamic adaptive multispectral detection transformer with competitive query selection and adaptive feature fusion	RGB-TIR	Paper/Code
ICIP'24	L-CMAF	Revisiting misalignment in multispectral pedestrian detection	RGB-TIR	Paper
TIV'24	YOLO-Adaptor	Yolo-adaptor: A fast adaptive one-stage detector for non-aligned visible-infrared object detection	RGB-TIR	Paper
MM'23	AANet	Attentive alignment network for multispectral pedestrian detection	RGB-TIR	Paper
MM'23	CALNet	Multispectral object detection via cross-modal conflict-aware learning	RGB-TIR	Paper/Code
TITS'23	MFPT	Multi-modal feature pyramid transformer for rgb-infrared object detection	RGB-TIR	Paper
ECCV'22	TSFADet	Translation, scale and rotation: cross-modal alignment meets rgb-infrared vehicle detection	RGB-TIR	Paper
ICCV'19	AR-CNN	Weakly aligned cross-modal learning for multispectral pedestrian detection	RGB-TIR	Paper/Code

Modality Imbalance

Venue	Methods	Title	Modality	Source
TCSVT'25	MSCoTDet	Mscotdet: Language-driven multi-modal fusion for improved multi-spectral pedestrian detection	RGB-TIR	Paper
TGRS'25	DKDNet	Diffusion mechanism and knowledge distillation object detection in multimodal remote sensing imagery	RGB-SAR	Paper
InfFus'25	EMOD	Efficient multispectral object detection with attentive feature aggregation leveraging zero-shot implicit illumination guidance	RGB-TIR	Paper/Code
ICCV'25	M²D-LIF	Rethinking multi-modal object detection from the perspective of mono-modality feature learning	RGB-TIR	Paper/Code
TITS'24	MS-DETR	MS-DETR: multispectral pedestrian detection transformer with loosely coupled fusion and modality-balanced optimization	RGB-TIR	Paper/Code
IROS'24	DCSANet	Dcsanet: Dual cross-channel and spatial attention make RGB-T object detection better	RGB-TIR	Paper
CVPR'24	CMM	Causal mode multiplexer: A novel framework for unbiased multispectral pedestrian detection	RGB-TIR	Paper/Code
ECCV'22	MBNet	Improving multispectral pedestrian detection by addressing modality imbalance problems	RGB-TIR	Paper/Code

Modality Redundancy

Venue	Methods	Title	Modality	Source
NeuCom'24	DHFNet	Dhfnet: Decoupled hierarchical fusion network for RGB-T dense prediction tasks	RGB-TIR	Paper
RS'22	RISNet	Improving rgb-infrared object detection by reducing cross-modality redundancy	RGB-TIR	Paper
PR'22	YOLOFusion	Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery	RGB-NIR	Paper/Code

Modality Asymmetry

Venue	Methods	Title	Modality	Source
MM'25	UniRGB-IR	Unirgb-ir: A unified framework for visible-infrared semantic tasks via adapter tuning	RGB-TIR	Paper/Code
ECCV'24	ModTr	Modality translation for object detection adaptation without forgetting prior knowledge	TIR	Paper/Code
CVPR'24	D3T	D3t: Distinctive dual-domain teacher zigzagging across rgb-thermal gap for domain-adaptive object detection	TIR	Paper/Code
MM'23	TIRDet	Tirdet: Mono-modality thermal infrared object detection based on prior thermal-to-visible translation	TIR	Paper/Code
TCSVT'22	DCRL-PDN	Deep cross-modal representation learning and distillation for illumination-invariant pedestrian detection	RGB	Paper
AAAI'22	VPD	Towards versatile pedestrian detector with multisensory-matching and multispectral recalling memory	RGB-TIR	Paper
ECCV'20	TC-Det	Task-conditioned domain adaptation for pedestrian detection in thermal imagery	TIR	Paper/Code
CVPRW'19	UMAD	Unsupervised domain adaptation for multispectral pedestrian detection	RGB-TIR	Paper/Code
CVPR'17	CMT-CNN	Learning cross-modal deep representations for robust pedestrian detection	RGB-TIR	Paper/Code

2. Fusion Scheme

Methods in this section are categorized along two axes: Fusion Stage Design and Fusion Function Construction.

Figure 5. Fusion Stage Design.

Figure 6. Fusion Function Construction.
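One common way to read these two axes is that the fusion stage is where in the network the modalities are combined, and the fusion function is how they are combined. The sketch below is a toy illustration under that reading. The `extract` function is a stand-in for a backbone, and the early/late split and the three fusion functions are assumptions chosen for illustration, not the taxonomy's exact categories.

```python
from typing import Callable, List

Vector = List[float]
FusionFn = Callable[[Vector, Vector], Vector]


def extract(x: Vector) -> Vector:
    """Toy feature extractor standing in for a backbone: squares each value."""
    return [v * v for v in x]


# --- Fusion functions: how two inputs are combined ---
def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum."""
    return [x + y for x, y in zip(a, b)]


def elementwise_max(a: Vector, b: Vector) -> Vector:
    """Element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def concat(a: Vector, b: Vector) -> Vector:
    """Concatenation along the feature dimension."""
    return a + b


# --- Fusion stages: where the combination happens ---
def early_fusion(rgb: Vector, tir: Vector, fn: FusionFn) -> Vector:
    """Combine raw inputs first, then extract features once."""
    return extract(fn(rgb, tir))


def late_fusion(rgb: Vector, tir: Vector, fn: FusionFn) -> Vector:
    """Extract features per modality, then combine the results."""
    return fn(extract(rgb), extract(tir))


if __name__ == "__main__":
    rgb, tir = [1.0, 2.0], [3.0, 1.0]
    print(early_fusion(rgb, tir, add))
    print(late_fusion(rgb, tir, add))
    print(late_fusion(rgb, tir, elementwise_max))
```

With a non-linear extractor, the same fusion function gives different results depending on the stage at which it is applied, which is why the two axes are treated separately.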

Venue	Methods	Title	Modality	Source
TIP'26	AFFNet	Adaptive fine-grained fusion network for multimodal UAV object detection	RGB-TIR	Paper
InfFus'26	MSFF	Multispectral state-space feature fusion: Bridging shared and cross-parametric interactions for object detection	RGB-TIR	Paper/Code
InfFus'26	COMO	COMO: cross-mamba interaction and offset-guided fusion for multimodal object detection	RGB-TIR	Paper/Code
TII'25	RetinexDet	Retinexdet: Enhancing multispectral object detection via retinex state space duality and wavelet-based frequency adaptive fusion	RGB-TIR	Paper
TGRS'25	MPFF	Aerial image object detection based on rgb-infrared multibranch progressive fusion	RGB-TIR	Paper
TGRS'25	DHANet	Dhanet: Dual-stream hierarchical interaction networks for multimodal drone object detection	RGB-TIR	Paper/Code
TGRS'25	DMM	DMM: disparity-guided multispectral mamba for oriented object detection in remote sensing	RGB-TIR	Paper/Code
PR'25	MSTF	Multispectral transformer fusion via exploiting similarity and complementarity for robust pedestrian detection	RGB-TIR	Paper
TMM'25	Fusion-Mamba	Fusion-mamba for cross-modality object detection	RGB-TIR	Paper/Code
MM'25	CSSFDet	Contextually-guided state space fusion for misaligned multi-spectral object detection	RGB-TIR	Paper
MM'25	SemFusion	Sam-guided semantic knowledge fusion for visible-infrared object detection	RGB-TIR	Paper/Code
ICCV'25	WaveMamba	Wavemamba: Wavelet-driven mamba fusion for rgb-infrared object detection	RGB-TIR	Paper
ICCV'25	M-SpecGene	M-specgene: Generalized foundation model for rgbt multispectral vision	RGB-TIR	Paper/Code
TNNLS'24	LRAF-Net	Lraf-net: Long-range attention fusion network for visible-infrared object detection	RGB-TIR	Paper
TNNLS'24	TFDet	Tfdet: Target-aware fusion for RGB-T pedestrian detection	RGB-TIR	Paper/Code
ECCV'24	MMPedestron	When pedestrian detection meets multi-modal learning: Generalist model and benchmark dataset	Multi	Paper/Code
NeurIPS'24	E2E-MFD	E2e-mfd: Towards end-to-end synchronous multimodal fusion detection	RGB-TIR	Paper/Code
TMM'23	CMPD	Confidence-aware fusion using dempster-shafer theory for multispectral pedestrian detection	RGB-TIR	Paper/Code
TCSVT'22	UA-CMDet	Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning	RGB-TIR	Paper/Code
InfFus'19	CIAN	Cross-modality interactive attention network for multispectral pedestrian detection	RGB-TIR	Paper/Code
PR'19	IAF R-CNN	Illumination-aware faster r-cnn for robust multispectral pedestrian detection	RGB-TIR	Paper/Code

3. Detection Solutions (Task-Specific)

This section groups detection solutions by application challenge: Small Object Detection, Robust Object Detection (robust perception under adverse conditions), and Adversarial Attack & Defense.

Small Object Detection

Venue	Methods	Title	Modality	Source
TIM'25	AMSDet	Adaptive modality selection drone-based RGBT detector for tiny targets	RGB-TIR	Paper
TGRS'23	SuperYOLO	Superyolo: Super resolution assisted object detection in multimodal remote sensing imagery	RGB-NIR	Paper/Code
ISPRS'23	QFDet	Drone-based rgbt tiny person detection	RGB-TIR	Paper/Code
BMVC'20	ASMPD	Anchor-free small-scale multispectral pedestrian detection	RGB-TIR	Paper/Code
ISPRS'19	HMFFN	Box-level segmentation supervised deep neural networks for accurate and real-time multispectral pedestrian detection	RGB-TIR	Paper

Robust Object Detection

Venue	Methods	Title	Modality	Source
TCSVT'25	CFMW	CFMW: cross-modality fusion mamba for robust object detection under adverse weather	RGB-TIR	Paper/Code
PRL'25	RRD	Learning a robust rgb-thermal detector for extreme modality imbalance	RGB-TIR	Paper
RAL'25	HA-MLPD	Hybrid attention for robust RGB-T pedestrian detection in real-world conditions	RGB-TIR	Paper
MMUL'25	VL-ACFDet	Vision-language-guided adaptive cross-modal fusion for multispectral object detection under adverse weather conditions	RGB-TIR	Paper
TGRS'24	LF-MDet	Low-rank multimodal remote sensing object detection with frequency filtering experts	RGB-TIR	Paper/Code
ECCV'22	ProbEn	Multimodal object detection via probabilistic ensembling	RGB-TIR	Paper/Code

Adversarial Attack & Defense

Venue	Methods	Title	Modality	Source
MM'25	CDUPatch	Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible-infrared detectors	RGB-TIR	Paper
TPAMI'24	UAPatch	Unified adversarial patch for visible-infrared cross-modal attacks in the physical world	RGB-TIR	Paper/Code
AAAI'23	MIC	Multispectral invisible coating: Laminated visible-thermal physical attack against multispectral object detectors using transparent low-e films	RGB-TIR	Paper
ICASSP'23	SRG-ASRP	Similarity relation preserving cross-modal learning for multispectral pedestrian detection against adversarial attacks	RGB-TIR	Paper

(Note: We welcome pull requests to update this list with the latest SOTA papers!)

Contact

Please contact us at fqy2017@gmail.com for any questions.
