This is the official repository for the preprint paper "Multispectral Object Detection: A Unified Framework and Systematic Survey".
This repository (MOD-ZOO) provides a comprehensive, continuously updated collection of resources (papers, code, datasets) for Multispectral Object Detection (MOD) across ground-based and remote sensing scenarios.
- News
- Abstract
- Unified Framework & Taxonomy
- Datasets & Benchmarks
- Paper List (The MOD Zoo)
- [2026/04] The preprint will be available soon!
- [2026/04] Initial release of the MOD-ZOO repository, including taxonomy, datasets, and paper lists.
Multispectral Object Detection (MOD) has emerged as a critical methodology to overcome the limitations of visible-light imaging, particularly under adverse conditions such as low illumination and inclement weather. By integrating complementary information across diverse spectral bands, MOD ensures robust all-day and all-weather perception.
To survey the field systematically, we establish a unified four-stage mathematical framework that decomposes MOD into multispectral data input, feature learning, fusion schemes, and detection solutions.
Building upon the concepts introduced above, the following figures visualize the structural breakdown of our survey.
- Figure 1 illustrates the detailed data flow of the unified mathematical framework, mapping the progression from raw multispectral inputs to final detection outputs.
- Figure 2 expands this framework into a fine-grained hierarchical taxonomy. It categorizes recent state-of-the-art methods by the strategy each one uses to overcome core cross-modal challenges.
This taxonomy directly dictates the organization of the paper list in the following sections.
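
The four-stage decomposition described above (data input, feature learning, fusion, detection) can be illustrated with a toy pipeline. This is only a minimal sketch, not the framework from the paper: the class `MODPipeline` and the helpers `load_pair`, `min_max_features`, `max_fusion`, and `threshold_detector` are hypothetical names, and flat lists of floats stand in for real image tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[float]      # toy stand-in for one flattened spectral band
Features = List[float]   # toy stand-in for a feature map


@dataclass
class MODPipeline:
    """Hypothetical container for the four stages named in the text."""
    load: Callable[[Sequence[float], Sequence[float]], Tuple[Image, Image]]
    extract: Callable[[Image], Features]
    fuse: Callable[[Features, Features], Features]
    detect: Callable[[Features], List[int]]

    def __call__(self, rgb: Sequence[float], tir: Sequence[float]) -> List[int]:
        x_rgb, x_tir = self.load(rgb, tir)                        # 1. data input
        f_rgb, f_tir = self.extract(x_rgb), self.extract(x_tir)   # 2. feature learning
        fused = self.fuse(f_rgb, f_tir)                           # 3. fusion scheme
        return self.detect(fused)                                 # 4. detection


def load_pair(rgb: Sequence[float], tir: Sequence[float]) -> Tuple[Image, Image]:
    # Assumes the pair is already spatially aligned and equally sized.
    if len(rgb) != len(tir):
        raise ValueError("paired inputs must have the same length")
    return list(rgb), list(tir)


def min_max_features(x: Image) -> Features:
    # Per-modality min-max normalization as a placeholder for a learned backbone.
    lo, hi = min(x), max(x)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in x]


def max_fusion(a: Features, b: Features) -> Features:
    # Element-wise maximum: keep the stronger response of the two modalities.
    return [max(u, v) for u, v in zip(a, b)]


def threshold_detector(f: Features, thresh: float = 0.5) -> List[int]:
    # Report the positions whose fused response exceeds the threshold.
    return [i for i, v in enumerate(f) if v > thresh]


pipeline = MODPipeline(load_pair, min_max_features, max_fusion, threshold_detector)

# A dark visible-light patch carries no signal; the thermal patch has a warm spot at index 1.
detections = pipeline(rgb=[0.1, 0.1, 0.1, 0.1], tir=[10.0, 40.0, 12.0, 11.0])
print(detections)
```

In this toy case the visible band is uniform, so the detection at position 1 comes entirely from the thermal band, which is the complementary behavior the abstract describes for low-illumination scenes. Swapping any single stage (for example, a different `fuse` callable) leaves the other three untouched.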
An overview of representative MOD datasets spanning ground-based and remote sensing scenarios.
(Legend: Pairs = Image Pairs, Res. = Resolution, Plat. = Platform (Surv. = Surveillance, Multi. = Multiple), Cls = Class, Den. = Density, A/O = Alignment / Occlusion)
| Dataset | Venue | Modality | Pairs | Res. | Plat. | Cls | Den. | A/O | Link |
|---|---|---|---|---|---|---|---|---|---|
| KAIST | CVPR'15 | RGB-TIR | 95.3K | 640x480 | Driving | 1 | 0.62 | β /β | Link |
| CVC-14 | Sensors'16 | RGB-TIR | 8.5K | 640x512 | Driving | 1 | 0.80 | β/β | Link |
| FLIR-aligned | ICIP'20 | RGB-TIR | 5.1K | 640x512 | Driving | 3 | 7.92 | β /β | Link |
| LLVIP | ICCV'21 | RGB-TIR | 16.8K | 1080x720 | Surv. | 1 | 2.51 | β /β | Link |
| M³FD | CVPR'22 | RGB-TIR | 4.2K | 1024x768 | Multi. | 6 | 8.19 | β /β | Link |
| SMOD | TMM'25 | RGB-TIR | 8.6K | 640x512 | Driving | 4 | 3.62 | β /β | Link |
| MFAD | TCSVT'25 | RGB-TIR | 12.1K | 1280x960 | Driving | 6 | 7.13 | β /β | Link |
(Legend: Pairs = Image Pairs, Res. = Resolution, Plat. = Platform (Sat. = Satellite), Cls = Class, Den. = Density, A/O = Alignment / Occlusion)
| Dataset | Venue | Modality | Pairs | Res. | Plat. | Cls | Den. | A/O | Link |
|---|---|---|---|---|---|---|---|---|---|
| VEDAI | JVCI'16 | R-NIR | 1.2K | 1024x1024 | UAV | 9 | 2.93 | β /β | Link |
| DroneVehicle | TCSVT'21 | R-TIR | 28.4K | 840x712 | UAV | 1 | 16.7 | β/β | Link |
| DronePerson | ISPRS'23 | R-TIR | 6.1K | 640x512 | UAV | 1 | 11.6 | β /β | Link |
| DVTOD | TIV'24 | R-TIR | 2.1K | 1920x1080 | UAV | 3 | 2.82 | β/β | Link |
| OdinMJ | GRSM'24 | R-TIR | 23K | 640x512 | UAV | 1 | 1.98 | β /β | Link |
| RGBT-Tiny | TPAMI'25 | R-TIR | ~47.5K | 640x512 | UAV | 7 | 12.9 | β /β | Link |
| SpaceNet6-OTD | TGRS'22 | R-SAR | 820 | 900x900 | Sat. | 1 | 22.0 | β /β | Link |
| OGSOD-1.0 | TGRS'23 | R-SAR | 14.6K | 256x256 | Sat. | 3 | 2.62 | β /β | Link |
| OGSOD-2.0 | ICGIP'25 | R-SAR | 23.4K | 256x256 | Sat. | 4 | 3.24 | β /β | Link |
We categorize representative methods according to our proposed taxonomy.
This section addresses fundamental representation challenges: Modality Misalignment, Modality Imbalance, Modality Redundancy, and Modality Asymmetry.
This challenge manifests in two primary forms: Spatial Misalignment and Semantic Misalignment.
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| AAAI'26 | IGIANet | Igianet: Illumination guided implicit alignment network for infrared-visible uav detection | RGB-TIR | Paper |
| TMM'25 | DeformCAT | Deformable cross-attention transformer for weakly aligned rgb-t pedestrian detection | RGB-TIR | Paper/Code |
| TCSVT'25 | SeaDATE | Seadate: Remedy dual-attention transformer with semantic alignment via contrast learning for multimodal object detection | RGB-TIR | Paper |
| CVPR'24 | OAFA | Weakly misalignment-free adaptive feature alignment for uavs-based multimodal object detection | RGB-TIR | Paper |
| ECCV'24 | DAMSDet | Damsdet: Dynamic adaptive multispectral detection transformer with competitive query selection and adaptive feature fusion | RGB-TIR | Paper/Code |
| ICIP'24 | L-CMAF | Revisiting misalignment in multispectral pedestrian detection | RGB-TIR | Paper |
| TIV'24 | YOLO-Adaptor | Yolo-adaptor: A fast adaptive one-stage detector for non-aligned visible-infrared object detection | RGB-TIR | Paper |
| MM'23 | AANet | Attentive alignment network for multispectral pedestrian detection | RGB-TIR | Paper |
| MM'23 | CALNet | Multispectral object detection via cross-modal conflict-aware learning | RGB-TIR | Paper/Code |
| TITS'23 | MFPT | Multi-modal feature pyramid transformer for rgb-infrared object detection | RGB-TIR | Paper |
| ECCV'22 | TSFADet | Translation, scale and rotation: cross-modal alignment meets rgb-infrared vehicle detection | RGB-TIR | Paper |
| ICCV'19 | AR-CNN | Weakly aligned cross-modal learning for multispectral pedestrian detection | RGB-TIR | Paper/Code |
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| TCSVT'25 | MSCoTDet | Mscotdet: Language-driven multi-modal fusion for improved multi-spectral pedestrian detection | RGB-TIR | Paper |
| TGRS'25 | DKDNet | Diffusion mechanism and knowledge distillation object detection in multimodal remote sensing imagery | RGB-SAR | Paper |
| InfFus'25 | EMOD | Efficient multispectral object detection with attentive feature aggregation leveraging zero-shot implicit illumination guidance | RGB-TIR | Paper/Code |
| ICCV'25 | M²D-LIF | Rethinking multi-modal object detection from the perspective of mono-modality feature learning | RGB-TIR | Paper/Code |
| TITS'24 | MS-DETR | MS-DETR: multispectral pedestrian detection transformer with loosely coupled fusion and modality-balanced optimization | RGB-TIR | Paper/Code |
| IROS'24 | DCSANet | Dcsanet: Dual cross-channel and spatial attention make RGB-T object detection better | RGB-TIR | Paper |
| CVPR'24 | CMM | Causal mode multiplexer: A novel framework for unbiased multispectral pedestrian detection | RGB-TIR | Paper/Code |
| ECCV'20 | MBNet | Improving multispectral pedestrian detection by addressing modality imbalance problems | RGB-TIR | Paper/Code |
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| NeuCom'24 | DHFNet | Dhfnet: Decoupled hierarchical fusion network for RGB-T dense prediction tasks | RGB-TIR | Paper |
| RS'22 | RISNet | Improving rgb-infrared object detection by reducing cross-modality redundancy | RGB-TIR | Paper |
| PR'22 | YOLOFusion | Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery | RGB-NIR | Paper/Code |
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| MM'25 | UniRGB-IR | Unirgb-ir: A unified framework for visible-infrared semantic tasks via adapter tuning | RGB-TIR | Paper/Code |
| ECCV'24 | ModTr | Modality translation for object detection adaptation without forgetting prior knowledge | TIR | Paper/Code |
| CVPR'24 | D3T | D3t: Distinctive dual-domain teacher zigzagging across rgb-thermal gap for domain-adaptive object detection | TIR | Paper/Code |
| MM'23 | TIRDet | Tirdet: Mono-modality thermal infrared object detection based on prior thermal-to-visible translation | TIR | Paper/Code |
| TCSVT'22 | DCRL-PDN | Deep cross-modal representation learning and distillation for illumination-invariant pedestrian detection | RGB | Paper |
| AAAI'22 | VPD | Towards versatile pedestrian detector with multisensory-matching and multispectral recalling memory | RGB-TIR | Paper |
| ECCV'20 | TC-Det | Task-conditioned domain adaptation for pedestrian detection in thermal imagery | TIR | Paper/Code |
| CVPRW'19 | UMAD | Unsupervised domain adaptation for multispectral pedestrian detection | RGB-TIR | Paper/Code |
| CVPR'17 | CMT-CNN | Learning cross-modal deep representations for robust pedestrian detection | RGB-TIR | Paper/Code |
Categorized by Fusion Stage Design and Fusion Function Construction.
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| TIP'26 | AFFNet | Adaptive fine-grained fusion network for multimodal UAV object detection | RGB-TIR | Paper |
| InfFus'26 | MSFF | Multispectral state-space feature fusion: Bridging shared and cross-parametric interactions for object detection | RGB-TIR | Paper/Code |
| InfFus'26 | COMO | COMO: cross-mamba interaction and offset-guided fusion for multimodal object detection | RGB-TIR | Paper/Code |
| TII'25 | RetinexDet | Retinexdet: Enhancing multispectral object detection via retinex state space duality and wavelet-based frequency adaptive fusion | RGB-TIR | Paper |
| TGRS'25 | MPFF | Aerial image object detection based on rgb-infrared multibranch progressive fusion | RGB-TIR | Paper |
| TGRS'25 | DHANet | Dhanet: Dual-stream hierarchical interaction networks for multimodal drone object detection | RGB-TIR | Paper/Code |
| TGRS'25 | DMM | DMM: disparity-guided multispectral mamba for oriented object detection in remote sensing | RGB-TIR | Paper/Code |
| PR'25 | MSTF | Multispectral transformer fusion via exploiting similarity and complementarity for robust pedestrian detection | RGB-TIR | Paper |
| TMM'25 | Fusion-Mamba | Fusion-mamba for cross-modality object detection | RGB-TIR | Paper/Code |
| MM'25 | CSSFDet | Contextually-guided state space fusion for misaligned multi-spectral object detection | RGB-TIR | Paper |
| MM'25 | SemFusion | Sam-guided semantic knowledge fusion for visible-infrared object detection | RGB-TIR | Paper/Code |
| ICCV'25 | WaveMamba | Wavemamba: Wavelet-driven mamba fusion for rgb-infrared object detection | RGB-TIR | Paper |
| ICCV'25 | M-SpecGene | M-specgene: Generalized foundation model for rgbt multispectral vision | RGB-TIR | Paper/Code |
| TNNLS'24 | LRAF-Net | Lraf-net: Long-range attention fusion network for visible-infrared object detection | RGB-TIR | Paper |
| TNNLS'24 | TFDet | Tfdet: Target-aware fusion for RGB-T pedestrian detection | RGB-TIR | Paper/Code |
| ECCV'24 | MMPedestron | When pedestrian detection meets multi-modal learning: Generalist model and benchmark dataset | Multi | Paper/Code |
| NeurIPS'24 | E2E-MFD | E2e-mfd: Towards end-to-end synchronous multimodal fusion detection | RGB-TIR | Paper/Code |
| TMM'23 | CMPD | Confidence-aware fusion using dempster-shafer theory for multispectral pedestrian detection | RGB-TIR | Paper/Code |
| TCSVT'22 | UA-CMDet | Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning | RGB-TIR | Paper/Code |
| InfFus'19 | CIAN | Cross-modality interactive attention network for multispectral pedestrian detection | RGB-TIR | Paper/Code |
| PR'19 | IAF R-CNN | Illumination-aware faster r-cnn for robust multispectral pedestrian detection | RGB-TIR | Paper/Code |
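
The two axes named for this section, fusion stage design and fusion function construction, can be illustrated with a toy contrast. The text here does not spell out the individual categories, so this is a sketch under one common reading from the fusion literature: "stage" is where the modalities are combined (at the input versus at the decision level), and "function" is how they are combined (here, a fixed weighted sum). The names `score`, `early_fusion`, `late_fusion`, and `w_rgb` are hypothetical and do not come from any listed paper.

```python
from typing import Sequence


def score(values: Sequence[float]) -> float:
    """Toy detector head (assumption): mean activation used as a confidence score."""
    return sum(values) / len(values)


def early_fusion(rgb: Sequence[float], tir: Sequence[float]) -> float:
    # Fusion stage = input level: stack both modalities, then run a single detector.
    return score(list(rgb) + list(tir))


def late_fusion(rgb: Sequence[float], tir: Sequence[float], w_rgb: float = 0.5) -> float:
    # Fusion stage = decision level: one detector per modality, then mix the scores.
    # Fusion function = weighted sum; w_rgb is an illustrative hand-set weight.
    return w_rgb * score(rgb) + (1.0 - w_rgb) * score(tir)


rgb = [0.0, 0.0]              # dark visible-light patch, no response
tir = [0.5, 1.0, 1.0, 0.5]    # warm object in the thermal patch

print(early_fusion(rgb, tir))              # single score over the stacked input
print(late_fusion(rgb, tir))               # equal trust in both modalities
print(late_fusion(rgb, tir, w_rgb=0.25))   # trust thermal more, e.g. at night
```

The same inputs give different confidences depending on where and how the modalities are combined, which is why the two axes are tracked separately. The methods in the table above replace the fixed weight with learned or input-dependent fusion functions.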
This section categorizes detection solutions based on specific application challenges: Small Object Detection, Robust Perception Under Adverse Conditions, and Adversarial Attacks.
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| TIM'25 | AMSDet | Adaptive modality selection drone-based RGBT detector for tiny targets | RGB-TIR | Paper |
| TGRS'23 | SuperYOLO | Superyolo: Super resolution assisted object detection in multimodal remote sensing imagery | RGB-NIR | Paper/Code |
| ISPRS'23 | QFDet | Drone-based rgbt tiny person detection | RGB-TIR | Paper/Code |
| BMVC'20 | ASMPD | Anchor-free small-scale multispectral pedestrian detection | RGB-TIR | Paper/Code |
| ISPRS'19 | HMFFN | Box-level segmentation supervised deep neural networks for accurate and real-time multispectral pedestrian detection | RGB-TIR | Paper |
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| TCSVT'25 | CFMW | CFMW: cross-modality fusion mamba for robust object detection under adverse weather | RGB-TIR | Paper/Code |
| PRL'25 | RRD | Learning a robust rgb-thermal detector for extreme modality imbalance | RGB-TIR | Paper |
| RAL'25 | HA-MLPD | Hybrid attention for robust RGB-T pedestrian detection in real-world conditions | RGB-TIR | Paper |
| MMUL'25 | VL-ACFDet | Vision-language-guided adaptive cross-modal fusion for multispectral object detection under adverse weather conditions | RGB-TIR | Paper |
| TGRS'24 | LF-MDet | Low-rank multimodal remote sensing object detection with frequency filtering experts | RGB-TIR | Paper/Code |
| ECCV'22 | ProbEn | Multimodal object detection via probabilistic ensembling | RGB-TIR | Paper/Code |
| Venue | Methods | Title | Modality | Source |
|---|---|---|---|---|
| MM'25 | CDUPatch | Cdupatch: Color-driven universal adversarial patch attack for dual-modal visible-infrared detectors | RGB-TIR | Paper |
| TPAMI'24 | UAPatch | Unified adversarial patch for visible-infrared cross-modal attacks in the physical world | RGB-TIR | Paper/Code |
| AAAI'23 | MIC | Multispectral invisible coating: Laminated visible-thermal physical attack against multispectral object detectors using transparent low-e films | RGB-TIR | Paper |
| ICASSP'23 | SRG-ASRP | Similarity relation preserving cross-modal learning for multispectral pedestrian detection against adversarial attacks | RGB-TIR | Paper |
Please contact us at fqy2017@gmail.com for any questions.






