• Text-guided lightweight multimodal image fusion with heterogeneous encoders

• Abstract: To meet the fusion-efficiency and perceptual-performance demands of infrared and visible imaging on resource-constrained unmanned aerial vehicle (UAV) platforms, this paper proposes a text-guided lightweight multimodal image fusion network with heterogeneous encoders. The network employs a lightweight dual-branch heterogeneous encoding architecture in which the two branches represent infrared and visible information complementarily: the infrared branch emphasizes thermal targets and edge responses, while the visible branch focuses on modeling texture and detail. This design avoids the feature redundancy and performance bottlenecks commonly associated with homogeneous encoders. To strengthen the collaborative representation of multimodal features, a lightweight cross-modal attention fusion module is introduced that jointly models attention across the channel and spatial dimensions, reinforcing complementary interactions between modalities. Furthermore, semantic features extracted by the pre-trained vision–language model CLIP provide explicit semantic priors: through hierarchical feature-level modulation, the weights of infrared and visible features are adjusted dynamically, improving the semantic consistency and environmental adaptability of the fused images. Systematic comparative experiments were conducted on three public multimodal image datasets (TNO, LLVIP, and M3FD) against nine representative image fusion algorithms. The results show that the proposed network achieves superior performance on multiple mainstream evaluation metrics, including mutual information and structural similarity, and that its fused images surpass existing methods in detail clarity, edge-structure consistency, and target discernibility. Ablation studies further show that the model's inference time is roughly 50% lower than that of baseline methods, yielding higher efficiency without significant performance degradation. Beyond the quantitative evaluations, qualitative experiments driven by textual instructions demonstrate strong semantic responsiveness and content adaptability. In low-light enhancement tasks, the model markedly improves the brightness and visibility of fused images and highlights thermal sources from the infrared input, although in daytime scenes the same instruction can over-brighten background regions, reflecting the model's selective sensitivity to semantic inputs. In overexposure-correction tasks, the model preserves thermal-contrast features from the infrared image and suppresses interference from overexposed visible regions, producing fusion results dominated by infrared characteristics. When the infrared image has low contrast, the model instead enhances texture and detail from the visible image, yielding results closer in appearance to the visible modality. In noise-robustness tasks, when the visible image is corrupted by noise, the model preferentially draws on the stable structural information of the infrared image to reconstruct the fused output, effectively mitigating noise-induced degradation and demonstrating strong anti-interference capability.

      In summary, the proposed model integrates heterogeneous dual-branch encoding with cross-modal attention and semantic guidance to improve fusion quality and adaptability on resource-limited UAV platforms. Experimental results confirm that it can adjust its fusion strategy dynamically according to different semantic inputs, enhancing the consistency and task relevance of the fused images, while striking a favorable balance between computational efficiency and fusion performance that suits practical deployment in complex environments. Illustrative sketches of the fusion and text-guidance components are given below.
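The cross-modal attention fusion module described in the abstract jointly models channel and spatial attention over the two modalities. Below is a minimal sketch of how such a block could look, assuming PyTorch; the squeeze-and-excitation channel gate, the CBAM-style spatial gate, and the 1x1 output projection are illustrative assumptions, not the paper's released design.

```python
# Minimal sketch of a lightweight cross-modal attention fusion block.
# Layer sizes and wiring are illustrative guesses, not the paper's code.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Fuses infrared and visible feature maps via channel + spatial attention."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze-and-excitation-style gating over the
        # concatenated modalities (hypothetical design choice).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single 7x7 conv over pooled descriptors,
        # as in CBAM-style modules.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # 1x1 projection back to `channels` keeps the block lightweight.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_ir, f_vis], dim=1)        # (B, 2C, H, W)
        x = x * self.channel_gate(x)               # channel-wise reweighting
        avg_map = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)      # (B, 1, H, W)
        x = x * self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return self.project(x)                     # (B, C, H, W)
```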

       
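The abstract also describes CLIP text features steering the fusion through hierarchical feature-level modulation. The sketch below, again assuming PyTorch, realizes one plausible form of that idea: a precomputed CLIP text embedding (e.g., obtained via `clip.tokenize` and `model.encode_text` from the openai/CLIP package) is mapped to per-channel softmax weights over the infrared and visible branches. The gating scheme and the `TextGuidedModulation` name are hypothetical.

```python
# Minimal sketch of text-guided modality weighting; the softmax gate is an
# assumed realization of the "dynamic weight adjustment" named in the abstract.
import torch
import torch.nn as nn

class TextGuidedModulation(nn.Module):
    """Reweights infrared vs. visible features from a CLIP text embedding."""

    def __init__(self, channels: int, text_dim: int = 512):
        super().__init__()
        # Map the text embedding to one gate vector per modality.
        self.to_gates = nn.Linear(text_dim, channels * 2)

    def forward(self, f_ir, f_vis, text_emb):
        # text_emb: (B, text_dim), e.g. a CLIP sentence embedding cast to float32.
        gates = self.to_gates(text_emb).view(-1, 2, f_ir.shape[1])  # (B, 2, C)
        w = torch.softmax(gates, dim=1)            # per-channel modality weights
        w_ir = w[:, 0].unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        w_vis = w[:, 1].unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # An instruction shifts weight toward the branch that serves it best,
        # e.g. toward infrared features under a noise-robustness instruction.
        return w_ir * f_ir + w_vis * f_vis
```

In use, an instruction such as "enhance low-light visibility" would be encoded once by the text encoder and the resulting embedding applied at several decoder levels, which is one way to realize the hierarchical modulation the abstract describes.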
