FusionFormer-X: Hierarchical Self-Attentive Multimodal Transformer for HSI-LiDAR Remote Sensing Scene Understanding
Abstract
The fusion of complementary modalities has become a central theme in remote sensing (RS), particularly in leveraging Hyperspectral Imaging (HSI) and Light Detection and Ranging (LiDAR) data for more accurate scene classification. In this paper, we introduce FusionFormer-X, a novel transformer-based architecture that systematically unifies multi-resolution heterogeneous data for RS tasks. FusionFormer-X is specifically designed to address the challenges of modality discrepancy, spatial-spectral alignment, and fine-grained feature representation. First, we employ convolutional tokenization modules to transform raw HSI and LiDAR inputs into semantically rich patch embeddings while preserving spatial locality. Next, we propose a Hierarchical Multi-Scale Multi-Head Self-Attention (H-MSMHSA) mechanism, which performs cross-modal interaction in a coarse-to-fine manner, enabling robust alignment between high-spectral-resolution HSI and lower-spatial-resolution LiDAR data. We validate our framework on public RS benchmarks including Trento and MUUFL, demonstrating superior classification performance over current state-of-the-art multimodal fusion models. These results underscore the potential of FusionFormer-X as a foundational backbone for high-fidelity multimodal remote sensing understanding.
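To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of one way the two stages summarized above could be wired together in PyTorch: a convolutional tokenizer per modality followed by coarse-to-fine cross-modal multi-head attention. The class names, layer sizes, number of scales, and the use of average pooling to coarsen the LiDAR key/value tokens are all illustrative assumptions; the abstract does not specify the internals of H-MSMHSA.

```python
# Minimal sketch, assuming a simple pooling-based hierarchy; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvTokenizer(nn.Module):
    """Turn a raw image cube into a sequence of patch embeddings with a strided conv stem."""

    def __init__(self, in_channels: int, embed_dim: int, patch_size: int = 4):
        super().__init__()
        # Strided convolution keeps spatial locality inside each token.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> tokens: (B, N, D), N = (H / patch_size) * (W / patch_size)
        return self.proj(x).flatten(2).transpose(1, 2)


class CoarseToFineCrossAttention(nn.Module):
    """Cross-modal attention applied at progressively finer LiDAR token resolutions."""

    def __init__(self, embed_dim: int, num_heads: int = 4, scales=(4, 2, 1)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(embed_dim, num_heads, batch_first=True) for _ in scales
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, hsi_tokens: torch.Tensor, lidar_tokens: torch.Tensor) -> torch.Tensor:
        out = hsi_tokens
        for scale, attn in zip(self.scales, self.attn):
            if scale > 1:
                # Coarsen key/value tokens by average pooling along the sequence axis.
                kv = F.avg_pool1d(
                    lidar_tokens.transpose(1, 2), kernel_size=scale, stride=scale
                ).transpose(1, 2)
            else:
                kv = lidar_tokens
            fused, _ = attn(query=out, key=kv, value=kv)
            out = self.norm(out + fused)  # residual update; keys get finer each pass
        return out


class FusionClassifier(nn.Module):
    """Tokenize each modality, fuse with hierarchical cross-attention, then classify."""

    def __init__(self, hsi_bands: int, lidar_bands: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.hsi_tok = ConvTokenizer(hsi_bands, embed_dim)
        self.lidar_tok = ConvTokenizer(lidar_bands, embed_dim)
        self.fusion = CoarseToFineCrossAttention(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, hsi: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        tokens = self.fusion(self.hsi_tok(hsi), self.lidar_tok(lidar))
        return self.head(tokens.mean(dim=1))  # mean-pool tokens before the classifier


if __name__ == "__main__":
    # Hypothetical band counts and patch size chosen only for the demo.
    model = FusionClassifier(hsi_bands=63, lidar_bands=1, embed_dim=64, num_classes=11)
    hsi = torch.randn(2, 63, 16, 16)    # small HSI patch
    lidar = torch.randn(2, 1, 16, 16)   # co-registered LiDAR/DSM patch
    print(model(hsi, lidar).shape)      # torch.Size([2, 11])
```

In this reading, the early attention passes see heavily pooled LiDAR tokens and can only align modalities at a region level, while the final pass attends over the full-resolution token grid, which is one plausible way to realize the coarse-to-fine cross-modal interaction the abstract describes.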